Could we also rename the workflow? `bundle-artifacts` was kind of okay for a step in the `git-artifacts` workflow (even if I was never really happy with the non-descriptive name), but if we extract this into its own stand-alone workflow, I think we'll want a better name. How about `tag-git`?
@dscho, very excited to mention that this is now working on all three architectures with their own, dedicated SDKs 🚀🎉😄

- `tag-git`: successful run (on `ubuntu-latest`!!!) ✅
- `git-artifacts` x86_64: successful run ✅
- `git-artifacts` i686: successful run ✅
- `git-artifacts` aarch64: successful run ✅ (except for `installer` and `mingit-busybox`, for which I'll need to provide fixes in `build-extra`)

One issue I encountered is that arm64 runners cannot be deployed in parallel due to duplicate deployment names (`azure-arm-template`). Let's set the deployment name explicitly, which should hopefully prevent these collisions.
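A minimal sketch of what setting the name explicitly could look like, assuming the VM is deployed via the Azure CLI (the actual deployment mechanism in this repo may differ; the variable names here are placeholders):

```sh
# Derive a unique deployment name per workflow run instead of relying on
# the template's default name, so parallel deployments don't collide.
az deployment group create \
	--resource-group "$AZURE_RESOURCE_GROUP" \
	--name "azure-arm-template-$GITHUB_RUN_ID" \
	--template-file azure-arm-template.json \
	--parameters vmName="$VM_NAME"
```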
Once the first "official" GfW arm64 version is released, let's also update the Azure VM post-deployment script so that it downloads the `arm64` version instead of `amd64`. This will benefit some CI workflows (or parts of them) that don't use the `git-sdk-arm64` but rely on the system-installed Git.
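For illustration, a rough sketch of that download switch, assuming the arm64 installer will be published under the same `Git-<version>-<suffix>.exe` naming scheme as the existing release assets (the actual post-deployment script and its architecture detection may look quite different, e.g. it might be PowerShell):

```sh
# Pick the arm64 installer on ARM64 machines, the amd64 one elsewhere.
version=2.39.2
case "$(uname -m)" in
aarch64|arm64) suffix=arm64;;
*) suffix=64-bit;;
esac
curl -fLO "https://github.com/git-for-windows/git/releases/download/v$version.windows.1/Git-$version-$suffix.exe"
```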
> (except for `installer` and `mingit-busybox` for which I'll need to provide fixes in `build-extra`)
Ok, let's not support `mingit-busybox` on arm64 for now. It's based on the package in `MINGW-packages/mingw-w64-busybox`, which in turn builds `git-for-windows/w32-busybox`. That has configs for `mingw32` and `mingw64`, but I'm not seeing anything `clang`-related there. I tried adding a `clangarm64_defconfig`, but it looks like that will need quite a few more fixes. Someone should probably first try to build `clang64` before even considering `clangarm64`. Even the author of the upstream repo hasn't been able to create a working build using the Clang toolchain yet. Shall I just add some logic to completely skip the `mingit-busybox` artifact if we're on arm64? I don't have the bandwidth right now to dive into this further.
Regarding the `installer` artifact: the artifact itself can already be built, but the part that fails is `./please.sh bundle_pdbs --arch=aarch64`. That method currently only supports `i686`/`x86_64`, so it will need a bit of refactoring to support `aarch64` as well.
> Shall I just add some logic to completely skip the `mingit-busybox` artifact if we're on arm64?

I think that's the most sensible course of action.
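Roughly what such a guard could look like in the build script, as a sketch (the variable names are hypothetical, not the actual ones used by the workflow):

```sh
# Skip the mingit-busybox artifact gracefully on arm64 instead of failing.
if test aarch64 = "$ARCHITECTURE" && test mingit-busybox = "$ARTIFACT"
then
	echo "Skipping $ARTIFACT: not yet supported on $ARCHITECTURE" >&2
	exit 0
fi
```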
> Regarding the `installer` artifact: the artifact itself can already be built, but the part that fails is `./please.sh bundle_pdbs --arch=aarch64`. That method currently only supports `i686`/`x86_64`, so it will need a bit of refactoring to support `aarch64` as well.

Right. It looks as if that might be the last big obstacle, though, amirite?
> Right. It looks as if that might be the last big obstacle, though, amirite?

Yes!!! 😍 Will provide a PR later today so we can push all of this over the finish line. So exciting!
@dscho here we go!!!! 🚀 🔥 🎉

- `git-artifacts` x86_64: successful run ✅
- `git-artifacts` i686: successful run ✅
- `git-artifacts` aarch64: successful run ✅

Small note: while the `arm64` runners are created in parallel now, your Azure subscription seems to hit a quota of CPU cores 😅 I guess that just means that too many VMs are active at the same time. The limit is 20, so two 8-core VMs would already be the maximum. We could use 4-core machines instead, so that 5 VMs can be active at the same time. WDYT?
> here we go!!!! 🚀 🔥 🎉
>
> `git-artifacts` x86_64: successful run ✅
> `git-artifacts` i686: successful run ✅
> `git-artifacts` aarch64: successful run ✅

Yay! Time to merge this PR, then?

> your Azure subscription seems to hit a quota of CPU cores 😅 I guess that just means that too many VMs are active at the same time. The limit is 20, so two 8-core VMs would already be the maximum. We could use 4-core machines instead, so that 5 VMs can be active at the same time.
That sounds very sensible to me.
Hooray!
Yay! Anything I can still help with to get the arm64 binaries into one of the upcoming releases?
Two minor things from my side:

- The `git-artifacts` pipeline can be removed from the main repo, as it's no longer needed there.
- I have some pending documentation updates regarding the earlier `git-extra` changes. Looks like it's not possible to show that in some UI, but you can clone https://github.com/git-for-windows/git.wiki.git and look for the `update-git-extra-docs` branch. 😊
> I have some pending documentation updates regarding the earlier `git-extra` changes. Looks like it's not possible to show that in some UI, but you can clone https://github.com/git-for-windows/git.wiki.git and look for the `update-git-extra-docs` branch. 😊
Merged & pushed 😃
@dscho how is the `tag-git` pipeline supposed to work for existing tags (e.g. v2.39.2.windows.1)? I tried both the commit ref and the tag on that pipeline, but both failed (here and here). I can't kick off a `git-artifacts` pipeline for `aarch64` before having run `bundle-artifacts`. It should probably just take the existing tag and create the `bundle-artifacts` artifact, right?
@dennisameling since building Git for Windows/ARM64 is probably not going to be a recurrent thing, now that `tag-git` and `git-artifacts` are working, how about simply branching off, say, `git-2.39.2-aarch64`, and modifying the workflow as a one-off? I.e.

- replace the `workflow_dispatch:` section by a single `push:`
- add an `env` section to hard-code the values that would have been `workflow_dispatch` inputs, as well as the `runs-on:` directives (search for `github.event.inputs`)
- remove the `Get bundle-artifacts download URL` and `Download bundle-artifacts zip` steps and replace them with a step that creates a `bundle-artifacts/` directory containing:
  - `build-extra.bundle` (probably as simple as `git --git-dir=/usr/src/build-extra/.git bundle create bundle-artifacts/build-extra.bundle main^..main`)
  - `git.bundle` (maybe as simple as `git --git-dir=/usr/src/git bundle create bundle-artifacts/git.bundle origin/main^..v2.39.2.windows.1`, but maybe there is no clone yet and you might have to jump through hoops like `git init --bare .tmp.git && git --git-dir .tmp.git fetch --filter=blob:none --depth=1 https://github.com/git-for-windows/git refs/tags/v2.39.2.windows.1:refs/tags/v2.39.2.windows.1`)
  - `display_version`, `next_version` and `ver` (I think they contain the same text, you can call `echo "2.39.2.windows.1" >bundle-artifacts/display_version`)

Then push that branch to `git-for-windows/git-for-windows-automation` and let it build the artifacts (see the sketch below). Does that sound like a good plan to you?
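Putting the bundle-creating step together, roughly (a sketch assembled from the commands in the list above, untested; see the follow-up below for the exact contents of the version files):

```sh
mkdir bundle-artifacts &&

# build-extra: bundle just the tip of `main`
git --git-dir=/usr/src/build-extra/.git bundle create \
	bundle-artifacts/build-extra.bundle main^..main &&

# git: bundle the existing release tag
git --git-dir=/usr/src/git bundle create \
	bundle-artifacts/git.bundle origin/main^..v2.39.2.windows.1 &&

# version files
echo "2.39.2.windows.1" >bundle-artifacts/display_version
```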
FWIW I just had a look at the artifacts of the latest regular Git for Windows build, i.e. one with a publicly visible "Git artifacts" run. And I was slightly wrong about the contents of that `.zip` file:

- `ver` contains the text `2.39.0.2`
- `display_version` contains the text `2.39.0(2)`
- `next_version` contains the text `v2.39.0.windows.2`
- the bundles do look like I expected them to:

```
$ git ls-remote build-extra.bundle
ba4d2eed1e7c4af68bd473a0e9e7bf7ee539de34 refs/heads/main

$ git ls-remote git.bundle
0fd2e6d703178f7d631e46e914a66d55982f357c refs/tags/v2.39.0.windows.2
```
I think we can create those files in said branch "simulating an unpacked bundle-artifacts.zip", get the workflow to build the Windows/ARM64 artifacts and then add them to the release page of Git for Windows v2.39.2.
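Transposing those observed contents to v2.39.2 would then presumably mean something like this (an assumption based on the pattern above, not verified):

```sh
echo "2.39.2" >bundle-artifacts/ver &&
echo "2.39.2" >bundle-artifacts/display_version &&
echo "v2.39.2.windows.1" >bundle-artifacts/next_version
```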
Oh, and for the sake of keeping track of history, I would then "fake-merge" that branch once the artifacts were built correctly (i.e. `git merge -s ours <remote-branch>`).
@dennisameling after thinking about this, I believe that it makes most sense to add that functionality to the workflow instead of keeping it only in a topic branch (and then essentially dropping it via `-s ours`). I.e. I want to introduce a new `input` that takes the name of an existing tag, and another one for the `build-extra` revision to use. I'll get to it immediately and report back.
@dennisameling I experienced the same hang that you reported here. I connected via RDP and investigated a little bit. I did not come terribly far, as it was late at night, and eventually I had to give up and forgot to write down the details.

But here is what I remember: there were two `makepkg-mingw` processes, one being the child of the other, and when I attached to the child process using `gdb`, I saw four threads, but somehow `gdb` was unhappy with the frame pointer and therefore I could not see any stack traces.
Since the `makepkg-mingw` program is actually a shell script, I am fairly certain that the child process (which was reported by `wmic process` to have the exact same command line as its parent process) was essentially a subshell, probably executing `git rev-parse --short HEAD`.
On a hunch that the subshell was stuck waiting for something after doing its job, e.g. a `signal` or something similar, I asked `gdb` to force-stop ("kill"; I don't want to know who found such a nomenclature sensible, *shudder*) that child process, after the workflow had run for around 20 minutes and hung for a good while. And don't you know, this "fixed" it, and the run continued! You can see it here (you have to click on the sprocket wheel next to the "Search logs" text box and enable "Show timestamps"): the `pkgver` step was started at Wed, 15 Feb 2023 22:57:53 GMT and should not have taken more than a couple of seconds, yet the workflow run only continued at Wed, 15 Feb 2023 23:39:18 GMT (i.e. over 40 minutes later), after I had stopped that child process forcefully.
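In other words, the force-stop amounted to roughly this (a reconstruction; `<win32-pid>` stands for the Win32 ProcessId of the stuck child process):

```sh
# attach to the hung child process, then force-stop it
$ gdb.exe -p <win32-pid>
(gdb) kill
(gdb) quit
```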
Unfortunately, I had forgotten that the `git-artifacts` workflow insists on making a `MINGW-packages` commit (which it cannot do because we're not actually modifying `PKGBUILD`), and the build failed at that point, even though it had successfully built the Pacman packages.
It's hanging again. Let me document better what I did. First of all, I RDP'ed into the machine, then opened a Git Bash. In that Bash, I looked for the hanging process by first calling `wmic process where 'ExecutablePath like "%git%"' get ProcessId,ParentProcessId,CommandLine`, seeing that it's probably the signtool script, and then calling:

```
$ wmic process where 'CommandLine like "%sign%"' get ProcessId,ParentProcessId,CommandLine
CommandLine                                                                       ParentProcessId  ProcessId
D:\git-sdk-arm64-build-installers\usr\bin\make.exe -f ../mingw-w64-git.mak sign-executables 4408 6100
D:\git-sdk-arm64-build-installers\usr\bin\sh.exe -c "eval git --git-dir=D:/a/_work/git-for-windows-automation/git-for-windows-automation/.git signtool headless-git.exe git-daemon.exe git-http-backend.exe git-imap-send.exe git-sh-i18n--envsubst.exe git-shell.exe git-http-fetch.exe git-http-push.exe git-remote-http.exe git-remote-https.exe git-remote-ftp.exe git-remote-ftps.exe \ contrib/credential/wincred/git-credential-wincred.exe git.exe \ cmd/git{,-gui,k}.exe compat-bash.exe git-{bash,cmd,wrapper}.exe" 1872 2684
D:\git-sdk-arm64-build-installers\usr\bin\sh.exe -c "eval git --git-dir=D:/a/_work/git-for-windows-automation/git-for-windows-automation/.git signtool headless-git.exe git-daemon.exe git-http-backend.exe git-imap-send.exe git-sh-i18n--envsubst.exe git-shell.exe git-http-fetch.exe git-http-push.exe git-remote-http.exe git-remote-https.exe git-remote-ftp.exe git-remote-ftps.exe \ contrib/credential/wincred/git-credential-wincred.exe git.exe \ cmd/git{,-gui,k}.exe compat-bash.exe git-{bash,cmd,wrapper}.exe" 2684 5676
D:\git-sdk-arm64-build-installers\clangarm64\bin\git.exe --git-dir=D:/a/_work/git-for-windows-automation/git-for-windows-automation/.git signtool headless-git.exe git-daemon.exe git-http-backend.exe git-imap-send.exe git-sh-i18n--envsubst.exe git-shell.exe git-http-fetch.exe git-http-push.exe git-remote-http.exe git-remote-https.exe git-remote-ftp.exe git-remote-ftps.exe contrib/credential/wincred/git-credential-wincred.exe git.exe cmd/git.exe cmd/git-gui.exe cmd/gitk.exe compat-bash.exe git-bash.exe git-cmd.exe git-wrapper.exe 5676 5200
sh -c "sh \"/usr/src/build-extra/signtool.sh\" \"$@\"" "sh \"/usr/src/build-extra/signtool.sh\"" headless-git.exe git-daemon.exe git-http-backend.exe git-imap-send.exe git-sh-i18n--envsubst.exe git-shell.exe git-http-fetch.exe git-http-push.exe git-remote-http.exe git-remote-https.exe git-remote-ftp.exe git-remote-ftps.exe contrib/credential/wincred/git-credential-wincred.exe git.exe cmd/git.exe cmd/git-gui.exe cmd/gitk.exe compat-bash.exe git-bash.exe git-cmd.exe git-wrapper.exe 5200 5024
C:\Windows\System32\Wbem\wmic.exe process where "CommandLine like \"%sign%\"" get ProcessId,ParentProcessId,CommandLine 10112 10016
```
So indeed, there were quite a few, and the process with ID 5024 seemed to be the bottom-most one. To verify, I called:

```
$ wmic process where 'ParentProcessId=5024' get ProcessId,ParentProcessId,CommandLine
No Instance(s) Available.
```
Okay. Now on to debugging. For that, I need `gdb.exe`. So I cloned the entire SDK via `git -C /d/ clone --depth=1 https://github.com/git-for-windows/git-sdk-arm64` in the Git Bash. After that, I opened a Git SDK Bash by pressing the Windows key, typing `D:\git-sdk-arm64\git-bash.exe` and hitting Enter.

Unfortunately, that does not work, because of the OpenSSL situation where Git for Windows stays with v1.1.1 as long as possible while MSYS2 has already switched to v3:

```
$ gdb.exe
D:/git-sdk-arm64/usr/bin/gdb.exe: error while loading shared libraries: msys-python3.11.dll: cannot open shared object file: No such file or directory
```
So I downloaded MSYS2's `python` package via `curl -LO https://repo.msys2.org/msys/x86_64/python-3.11.2-1-x86_64.pkg.tar.zst` and then installed it via `pacman -U ./python-3.11.2-1-x86_64.pkg.tar.zst`. This let me start `gdb.exe bash.exe 5024` (that stanza asks `gdb` to load the `bash.exe` binary and attach to the process with ID 5024).
In that GDB session, I then tried to see the current threads:

```
(gdb) info thread
  Id   Target Id          Frame
  1    Thread 5024.0x10fc error return ../../gdb-11.1/gdb/windows-nat.c:609 was 31
0x0000000000000000 in ?? ()
  2    Thread 5024.0x1978 0x00007fff30ad6a64 in ?? ()
  3    Thread 5024.0x18a8 0x00007fff30ad69c4 in ?? ()
  4    Thread 5024.0x6a8  error return ../../gdb-11.1/gdb/windows-nat.c:609 was 31
0x0000000000000000 in ?? ()
  5    Thread 5024.0x1468 0x00007fff30ad69c4 in ?? ()
* 6    Thread 5024.0x2058 0x00007fff30ae2374 in ?? ()
```
Not very informative, so let's try to see the current thread's stack trace:

```
(gdb) bt
#0  0x00007fff30ae2374 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
```
That's it.
My next attempt involved downloading Sysinternals' Process Explorer. That gave me more information. Apparently there is one thread with this stack trace:

```
ntoskrnl.exe!ZwUpdateWnfStateData+0xaf60
ntoskrnl.exe!KeUpdateThreadTag+0xdc98
ntoskrnl.exe!KiDispatchInterrupt+0x7298
ntoskrnl.exe!KeWaitForMutexObject+0x440
ntoskrnl.exe!KeUpdateThreadTag+0xa754
ntoskrnl.exe!KiDeliverApc+0x18c
ntoskrnl.exe!KeUpdateThreadTag+0xdd1c
ntoskrnl.exe!KiDispatchInterrupt+0x7298
ntoskrnl.exe!KeWaitForMutexObject+0x440
ntoskrnl.exe!ObWaitForSingleObject+0xb4
ntoskrnl.exe!NtWaitForSingleObject+0x34
ntoskrnl.exe!ZwUpdateWnfStateData+0xadf0
ntoskrnl.exe!ZwUpdateWnfStateData+0xa9e8
!ZwWaitLowEventPair+0x158284
!ZwWaitLowEventPair+0x19d190
!ZwWaitLowEventPair+0x1950a8
!ZwWaitLowEventPair+0x1a9ad0
!ZwWaitLowEventPair+0x290f80
!ZwWaitLowEventPair+0x29ff14
!_assert+0x41c6
```
So that looks as if it called `_assert()`. And another one with this stack trace:

```
ntoskrnl.exe!ZwUpdateWnfStateData+0xaf60
ntoskrnl.exe!KeUpdateThreadTag+0xdc98
ntoskrnl.exe!KiDispatchInterrupt+0x7298
ntoskrnl.exe!KeWaitForMutexObject+0x440
ntoskrnl.exe!FsRtlNotifyVolumeEventEx+0x4470
ntoskrnl.exe!NtQueryVolumeInformationFile+0xfe0
ntoskrnl.exe!NtReadFile+0xb0
ntoskrnl.exe!ZwUpdateWnfStateData+0xadf0
ntoskrnl.exe!ZwUpdateWnfStateData+0xa9e8
!ZwWaitLowEventPair+0x158324
!ZwWaitLowEventPair+0x178bc8
!PackageSidFromProductId+0x2318c0
!PackageSidFromProductId+0x26a360
!sigfillset+0x2d61
```
This looks as if it was inside the `sigfillset()` function. But that does not make too much sense, because that function does not call any other function.
The other threads' stack traces either cannot be accessed or look like they're completely outside of Bash's and the MSYS2 runtime's source code:

```
ntoskrnl.exe!ZwUpdateWnfStateData+0xaf60
ntoskrnl.exe!KeUpdateThreadTag+0xdc98
ntoskrnl.exe!KiDispatchInterrupt+0x7298
ntoskrnl.exe!KeWaitForMutexObject+0x440
ntoskrnl.exe!KeUpdateThreadTag+0xa754
ntoskrnl.exe!KiDeliverApc+0x18c
ntoskrnl.exe!KeUpdateThreadTag+0xdd1c
ntoskrnl.exe!KiDispatchInterrupt+0x7298
ntoskrnl.exe!KeDelayExecutionThread+0x614
ntoskrnl.exe!PsWow64IsMachineSupported+0x2d04
ntoskrnl.exe!ZwUpdateWnfStateData+0xadf0
ntoskrnl.exe!ZwUpdateWnfStateData+0xa9e8
!ZwWaitLowEventPair+0x1616f4
!ZwWaitLowEventPair+0x1eea48
!ZwWaitLowEventPair+0x1edea8
!ZwWaitLowEventPair+0x1ebe40
!ZwWaitLowEventPair+0x18495c
!ZwWaitLowEventPair+0x187a54
!ZwWaitLowEventPair+0x1ac658
!ZwWaitLowEventPair+0x1abc70
!ZwWaitLowEventPair+0x1ad458
!uaw_wcsrchr+0x81c70
!ZwWaitLowEventPair+0x1a9bc8
```
I'm out of ideas what to investigate on that side, so I downloaded Sysinternals' `handle` utility to see which handles are open in that process, but that also fails to shed any light on the issue:

```
$ ./handle64a.exe -p 5024

Nthandle v5.0 - Handle viewer
Copyright (C) 1997-2022 Mark Russinovich
Sysinternals - www.sysinternals.com

180: Section       \BaseNamedObjects\msys-2.0S5-c2a8dd8be8845dc5\shared.5
184: Section       \BaseNamedObjects\msys-2.0S5-c2a8dd8be8845dc5\S-1-5-20.1
1DC: File  (RWD)   D:\git-sdk-arm64-build-installers\usr\src\MINGW-packages\mingw-w64-git\src\git
1EC: Section       \BaseNamedObjects\msys-2.0S5-c2a8dd8be8845dc5\cygpid.10728
2C0: Section       \BaseNamedObjects\msys-2.0S5-c2a8dd8be8845dc5\cygpid.10728
```
So for now I'll just call it yet another unexplained hang and try to let the workflow run continue, at least, by force-stopping that process: `wmic process where ProcessID=5024 delete`.
And sure enough, the workflow run continues...
Darn, darn, darn. Again I had to RDP into the runner and force-stop the `signtool` process in order to let it complete (toggle the timestamps to see the 24-minute gap where there should have been less than 10 seconds).

This is not good. In this shape, we cannot publish a Git for Windows/ARM64. If it hangs this often for us, it will hang for plenty of users, too.
And it's hanging again, and also here. I worked around this by RDP'ing into the machines and running essentially:

```sh
# Collect the PIDs of build-SDK processes that have no child processes
# (i.e. the bottom-most, likely-hung ones)...
candidates= &&
for pid in $(
	wmic process where 'ExecutablePath like "%arm64-build-installers%"' get ProcessId |
	sed 1d
)
do
	# skip the column header and any non-numeric output lines
	test "z${pid#[0-9]}" != "z$pid" || continue
	# a process qualifies if querying its children comes up empty
	case "$(wmic process where parentprocessid=$pid get processid 2>&1 >/dev/null)" in
	"No Instance(s) Available"*)
		candidates="$candidates $pid"
		;;
	esac
done &&
# ...then force-stop the last candidate
wmic process where processid=${candidates##* } delete
```
I reached out to `#cygwin-devel` on libera.chat, and @lazka pointed me to https://github.com/msys2/msys2-autobuild/issues/62, noting:

> it only happening for those two things and not in say autotools/configure/make seems to suggest it's not a general problem (those two things being gpgme signature verification, and install-info)

And @jturney said:

> "Not a known issue, but now we know about it.." 😉

> @jeremyd2019 tried to debug our hangs at one point

yeah, I couldn't get anywhere though, none of the debuggers I tried to attach were able to get the thread context for the 'main' thread. there was also another thread running which seemed to be for cygwin signal handling, which seemed to be happily waiting. I gave up trying to debug it and instead added workarounds to avoid situations where it seemed to happen. Now that somebody else is reproducing it, maybe it can be debugged.
Regarding your `_assert` and `sigfillset` observations, notice the offsets from the symbols. It seems unlikely that those are actually the functions involved, just the nearest symbols it could find.
> Regarding your `_assert` and `sigfillset` observations, notice the offsets from the symbols. It seems unlikely that those are actually the functions involved, just the nearest symbols it could find.

Right, I feared as much.
> Now that somebody else is reproducing it, maybe it can be debugged.

I fear that we'll have to get a much more reliable reproducer first.
Maybe the following will reproduce the hang reliably enough: calling the MINGW executable `osslsigncode.exe` from the MSYS2 Bash script `signtool.sh`, which in turn is called as a Git alias from the MINGW executable `git.exe`, which is called from either an MSYS2 Bash script or MSYS2 `make.exe`.

At least that is what hung pretty reliably in those CI runs for me; in the last run it was every single time that call chain happened. But earlier runs did not hang after code-signing (even if they failed later on, for independent reasons). The hang was at the end of the inner-most shell script, just after all the code-signing was done.
Since you're RDP'ing into the machine anyway, could you maybe collect a full memory dump of a hanging process using procdump? Maybe that could shed some more light on what it's waiting for 🤔
> ... calling the MINGW executable `osslsigncode.exe` from the MSYS2 Bash script `signtool.sh`, which in turn is called as a Git alias from the MINGW executable `git.exe` ...

Hmm, so `git.exe` calls `signtool.sh`, which signs `git.exe`, produces `git.signed.exe` (according to the script), and then moves it in place of the original? Sounds like the script tries to overwrite `git.exe` while it's in the middle of executing the script. Or is it hanging while signing a different executable?
@Alovchin91 it's different `git.exe` files. The running one is `/clangarm64/bin/git.exe`, and the signed one is `/usr/src/MINGW-packages/src/git/git.exe`. So that's not the problem.
> Since you're RDP'ing into the machine anyway, could you maybe collect a full memory dump of a hanging process using procdump? Maybe that could shed some more light on what it's waiting for 🤔

@Alovchin91 I am curious what you suggest to do with that dump. Sure, I can collect that information, but what does that buy us that the previous investigation (#1 and #2) didn't reveal yet?
@dennisameling @jeremyd2019 did any of you manage to reproduce this in a non-CI setting, i.e. when starting a command interactively? I tried and failed to cause such a hang using interactive commands in an Azure VM, and would love to do that to accelerate the investigation.
> @Alovchin91 I am curious what you suggest to do with that dump. Sure, I can collect that information, but what does that buy us that the previous investigation (#1 and #2) didn't reveal yet?

That would give some starting point to those who would like to help but cannot reproduce. From your very next sentence I understand that reproducing the issue is not a trivial task 😅 I've asked around on Mastodon whether somebody (from Microsoft?) could help. That is, of course, only if you're looking for any help 😊
The hangs that I experienced with pacman (both signature verification and info database updates) would happen frequently while updating/installing packages interactively. I know I did try the process dump option in Task Manager, but that dump was also 'missing' the context for the main thread.

From my anecdotal observations, this seems more likely to occur on slower machines (it happened all the time on the Raspberry Pi, but was rarer on the QC710), but perhaps also more likely with more cores? (I was playing with a 60-core VM and I think it was more likely there than on the 8-core QC710.)

I am set up to run a KVM virtual machine on a Raspberry Pi; I can try to reproduce the hangs I got with pacman and try things, if anybody has ideas (I can try procdump and post that somewhere if somebody thinks it would help, for example).

My workarounds for the pacman hangs I saw, which seem quite effective for building packages for the msys2 project: https://github.com/jeremyd2019/winautoconfig/blob/544a3c019fcbd8955e91af3d4a86e12e3d1b6a49/msys2-runner-setup/setupscripts/setup.cmd#L28-L31
> The hangs that I experienced with pacman (both signature verification and info database updates) would happen frequently while updating/installing packages interactively.

@jeremyd2019 did you find a minimal reproducer there, i.e. ideally a command that only verifies the signature but does not necessarily update any package?
Also, do you happen to know whether the signature verification uses a MINGW program (e.g. `/clangarm64/bin/gpg.exe`)? In my cases, the hang always happened in the MSYS process associated with spawning a non-MSYS program...
@Alovchin91 the thing that makes me most dubious about the `procdump` suggestion is that we already attached to the process using `gdb.exe`, and that neither `procdump` nor any other regular Win32 tool has any idea about the MSYS2 runtime or about GCC's DWARF debug symbols, and they are therefore even less likely than `gdb.exe` to show any useful information.
And FWIW there is already somebody from Microsoft helping this here ticket: me.
> @jeremyd2019 did you find a minimal reproducer there, i.e. ideally a command that only verifies the signature but does not necessarily update any package?

No. When I tried, I ended up reproducing a different hang that also happened on x64; that one could be debugged and was fixed in Cygwin, so it wasn't a complete loss, but after that fix my minimal reproducer no longer reproduced the hang.
> Also, do you happen to know whether the signature verification uses a MINGW program (e.g. `/clangarm64/bin/gpg.exe`)? In my cases, the hang always happened in the MSYS process associated with spawning a non-MSYS program...

No, it uses MSYS2's gpg, via gpgme.
> @Alovchin91 the thing that makes me most dubious about the `procdump` suggestion is that we already attached to the process using `gdb.exe`, and that neither `procdump` nor any other regular Win32 tool has any idea about the MSYS2 runtime or about GCC's DWARF debug symbols, and they are therefore even less likely than `gdb.exe` to show any useful information.

I tried gdb, windbg, the Task Manager process dump thing, even local kernel debugging, but was never able to get user-mode context for that thread.

I think I heard that WinDbgX (the new version in the store) now supports DWARF symbols, but I haven't verified that myself, and I don't know whether it would understand the split debug info that msys2-runtime uses. (Does that even work right with 3.4? I thought I saw some additional stripping happening in msys2-runtime go by at one point.)
> When I tried, I ended up reproducing a different hang that also happened on x64; that one could be debugged and was fixed in Cygwin, so it wasn't a complete loss, but after that fix my minimal reproducer no longer reproduced the hang.

@jeremyd2019 that's interesting. It probably means that the hangs are related; at least it could give us a better idea which code path to look at. Would you happen to know the commit hash of the fix in Cygwin?
> I think I heard that WinDbgX (the new version in the store) now supports DWARF symbols, but I haven't verified that myself, and I don't know whether it would understand the split debug info that msys2-runtime uses.

Indeed. That's why I think attaching with `gdb.exe` is our best bet to get any useful stack trace.

But I'm starting to believe that this is a dead end; we might never get a useful stack trace, because the issue is more likely that something in MSYS2's tear-down code fails to tear down the signal event handler thread, and that thread then blocks the correct and expected termination of the process.

So I fear that we'll need to work with heavy instrumentation ("debug `small_printf()` messages littered all over the tear-down code path") and then compare the logs of a hanging to a non-hanging invocation of whatever minimal reproducer we can come up with.
> Would you happen to know the commit hash of the fix in Cygwin?

@jeremyd2019 this might actually help address the hangs I observed, as Git for Windows is still on the Cygwin v3.3.* train.
https://github.com/msys2/msys2-runtime/commit/0ce992c1e40f843d5f264e87908a6decfd6681d1. This was actually a crash, but the code was in such a state that it resulted in a hang in the exception handlers.
@jeremyd2019 thank you. Unfortunately, my hope that this could help Git for Windows is dispelled by the fact that this commit made it into v3.3.4, and we're on v3.3.6 already, so we already have that fix...
Update: I managed to reproduce the hang interactively, in my Azure VM, not on the first attempt, but on the second. With a self-signed certificate and the `/clangarm64/bin/git.exe` of git-for-windows/git-sdk-arm64, I ran this:

```sh
git -c alias.signtool='!sh "/usr/src/build-extra/signtool.sh"' signtool $PWD/g.exe
```

where `/usr/src/build-extra/signtool.sh` comes from here and `g.exe` was the copy of a just-built `git.exe` (and by now already signed, due to the first attempt).
The self-signed certificate was created via:

```powershell
$cert = New-SelfSignedCertificate -DnsName www.yourwebsite.com -Type CodeSigning -CertStoreLocation Cert:\CurrentUser\My
$CertPassword = ConvertTo-SecureString -String "my_passowrd" -Force -AsPlainText
Export-PfxCertificate -Cert (Get-ChildItem Cert:\CurrentUser\My -CodeSigningCert)[0] -FilePath ".sig\codesign.p12" -Password $CertPassword
"my_passowrd" | Out-File -NoNewline -Encoding ascii ".sig\codesign.pass"
```
In a second MinTTY, when I run `ps`, I see this line:

```
2129 1 2128 8176 pty0 197108 20:16:51 /usr/bin/sh <defunct>
```

The full output of that `ps` invocation is:

```
 PID  PPID  PGID WINPID TTY  UID    STIME    COMMAND
2021     1  2021  10056 ?    197108 20:14:27 /usr/bin/mintty
2129     1  2128   8176 pty0 197108 20:16:51 /usr/bin/sh <defunct>
2133  2132  2133   4132 pty1 197108 20:17:21 /usr/bin/bash
2132     1  2132   9120 ?    197108 20:17:20 /usr/bin/mintty
2217  2133  2217   1320 pty1 197108 20:28:19 /usr/bin/ps
2022  2021  2022   8444 pty0 197108 20:14:27 /usr/bin/bash
2128  2022  2128   6172 pty0 197108 20:16:51 /clangarm64/bin/git
```
Note that the `pty0` processes belong to the hanging session; the `pty1` processes are the ones running in the second MinTTY window where I ran `ps`.

The really strange thing about this is that no Win32 process has ProcessId 8176 (remember: the MSYS2/Cygwin runtime maintains its own pid, which is different from the actual Win32 ProcessId): there is no such process when I look via `wmic process where parentprocessid=8176 get ProcessId,ParentProcessId,CommandLine`.
The parent process (MSYS2/Cygwin pid 2128, corresponding to Win32 ProcessId 6172) refers to the `git.exe` process that I called, i.e. that `sh` should be the one that ran the `signtool.sh` script. But when I look for such a process, I see a different Win32 ProcessId:

```
$ wmic process where 'CommandLine like "%signtool%"' get ProcessId,ParentProcessId,CommandLine
CommandLine                                                                       ParentProcessId  ProcessId
C:\git-sdk-arm64\clangarm64\bin\git.exe -c "alias.signtool=!sh \"/usr/src/build-extra/signtool.sh\"" signtool C:/git-sdk-arm64/usr/src/MINGW-packages/mingw-w64-git/g.exe 5652 6172
sh -c "sh \"/usr/src/build-extra/signtool.sh\" \"$@\"" "sh \"/usr/src/build-extra/signtool.sh\"" C:/git-sdk-arm64/usr/src/MINGW-packages/mingw-w64-git/g.exe 6172 6692
C:\Windows\System32\Wbem\wmic.exe process where "CommandLine like \"%signtool%\"" get ProcessId,ParentProcessId,CommandLine 6244 3288
```
The Win32 ProcessId of that `sh` process is 6692, and there is no line in the `ps` output corresponding to that process.

Another strange thing: the `ps` output says that the parent process of that `git.exe` process has Win32 ProcessId 8444, but the `wmic` output suggests that the parent process has Win32 ProcessId 5652. So I looked for that process, and it seems that it is running, with its parent being the process with Win32 ProcessId 8444:

```
$ wmic process where processid=5652 get ProcessId,ParentProcessId,CommandLine
CommandLine                                   ParentProcessId  ProcessId
C:\git-sdk-arm64\usr\bin\bash.exe --login -i  8444             5652
```
> And FWIW there is already somebody from Microsoft helping this here ticket: me.

Oh, I didn't know that bit, sorry 😅 I was just trying to be helpful 😬
I've been thinking about `procdump` because it can create a full dump; AFAIK Task Manager creates a minidump. My understanding is that a full dump can be analysed by any tool afterwards. But that was just a suggestion, and I don't think it's useful anymore 😀

Anyway, sorry for the distraction.
Well, here's a full procdump of a hung bash.exe, if anyone wants to try to learn anything from it.

```
WARNING: Teb 0 pointer is NULL - defaulting to 00000000`7ffde000
WARNING: 00000000`7ffde000 does not appear to be a TEB
WARNING: Thread 1bd0 context retrieval failure during dump writing, Win32 error 0n31
```

Seems familiar... So yes, this still reproduces with 3.4 (this is msys2-runtime-3.4.5-1).
FWIW I could reproduce the hang on my Azure VM only once, and then no more :-(
You might try putting the system under load while trying to reproduce it (to replicate my 'slower machines' observation). That was a factor when I was trying to come up with a reproducer (which wound up finding that other bug): I was basically stressing fork/exec, hoping it would eventually hang.
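For example, something as simple as the following loop in a second terminal might generate enough spawn pressure (a guess at a load generator based on the call chain described earlier, not a verified reproducer):

```sh
# hammer the MSYS2-spawns-MINGW code path to increase the likelihood of the hang
while :; do /clangarm64/bin/git.exe version >/dev/null || break; done
```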
Successful run: https://github.com/git-for-windows/git-for-windows-automation/actions/runs/3985052314