git-for-windows / git-for-windows-automation

A few GitHub workflows and support code to help with Git for Windows' day-to-day tasks

Port `git-artifacts` pipeline to `git-for-windows-automation` #22

Closed dennisameling closed 1 year ago

dennisameling commented 1 year ago

Successful run: https://github.com/git-for-windows/git-for-windows-automation/actions/runs/3985052314

dscho commented 1 year ago

Could we also rename the workflow? bundle-artifacts was kind of okay for a step in the git-artifacts workflow (even if I was never really happy with the non-descriptive name), but if we extract this into its own stand-alone workflow, I think we'll want a better name.

How about tag-git?

dennisameling commented 1 year ago

@dscho, very excited to mention that this is now working on all three architectures with their own, dedicated SDKs 🚀🎉😄

One issue I encountered is that arm64 runners cannot be deployed in parallel, because they all end up with the same default deployment name (azure-arm-template). Let's set the deployment name explicitly, which should hopefully prevent these issues.
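
For illustration, roughly along these lines (just a sketch; the resource group, template file and parameter names are placeholders, not necessarily what we actually use):

az deployment group create \
  --name "gfw-ci-arm64-runner-$GITHUB_RUN_ID" \
  --resource-group "$AZURE_RESOURCE_GROUP" \
  --template-file azure-arm-template.json \
  --parameters vmName="gfw-ci-arm64-$GITHUB_RUN_ID"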

Once the first "official" GfW arm64 version is released, let's also update the Azure VM post-deployment script so that it downloads the arm64 version instead of amd64. This will benefit some CI workflows (or parts of those) that don't use the git-sdk-arm64 but rely on the system-installed Git.

dennisameling commented 1 year ago

(except for installer and mingit-busybox for which I'll need to provide fixes in build-extra)

OK, let's not support mingit-busybox on arm64 for now. It's based on the package in MINGW-packages/mingw-w64-busybox, which in turn builds git-for-windows/w32-busybox. That has configs for mingw32 and mingw64, but I'm not seeing anything clang-related there. I tried adding clangarm64_defconfig, but it looks like that will need quite a few more fixes. Someone should probably first try to build clang64 before even considering clangarm64. Even the author of the upstream repo hasn't been able to create a working build using the Clang toolchain yet. Shall I just add some logic to completely skip the mingit-busybox artifact if we're on arm64? I don't have the bandwidth right now to dive into this further.
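
Roughly like this, I imagine (only a sketch; the variable name and the artifact list are made up for illustration):

case "$ARCHITECTURE" in
aarch64)
    # BusyBox cannot be built with the clangarm64 toolchain yet; skip that artifact
    artifacts="installer portable archive mingit"
    ;;
*)
    artifacts="installer portable archive mingit mingit-busybox"
    ;;
esac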

Regarding the installer artifact: the artifact itself can be built already, but the part that fails is ./please.sh bundle_pdbs --arch=aarch64. That method currently only supports i686/x86_64, so it will need a bit of refactoring to support aarch64 as well.
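
The refactoring probably boils down to teaching the architecture handling about the clangarm64 prefix, something like this (a sketch only; the actual variable names in please.sh may differ):

case "$arch" in
i686)    prefix=mingw32;;
x86_64)  prefix=mingw64;;
aarch64) prefix=clangarm64;;
*)       echo "Unsupported architecture: $arch" >&2; exit 1;;
esac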

dscho commented 1 year ago

Shall I just add some logic to completely skip the mingit-busybox artifact if we're on arm64?

I think that's the most sensible course of action.

Regarding the installer artifact: the artifact itself can be built already, but the part that fails is ./please.sh bundle_pdbs --arch=aarch64. That method currently only supports i686/x86_64, so it will need a bit of refactoring to support aarch64 as well.

Right. It looks as if that might be the last big obstacle, though, amirite?

dennisameling commented 1 year ago

Right. It looks as if that might be the last big obstacle, though, amirite?

Yes!!! 😍 Will provide a PR later today so we can push all of this over the finish line. So exciting!

dennisameling commented 1 year ago

Yes!!! 😍 Will provide a PR later today so we can push all of this over the finish line. So exciting!

Here we go!

dennisameling commented 1 year ago

@dscho here we go!!!! 🚀 🔥 🎉

Small note: while the arm64 runners are created in parallel now, your Azure subscription seems to hit its CPU core quota 😅 I guess that just means that too many VMs are active at the same time. The limit is 20 cores, so two 8-core VMs would already be the maximum. We could use the 4-core machines instead, so that up to five VMs can be active at the same time. WDYT?
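
For reference, the remaining vCPU quota can be checked with something like this (the region is an assumption on my part):

$ az vm list-usage --location westus2 --output table | grep -i "Regional vCPUs"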

dscho commented 1 year ago

here we go!!!! 🚀 🔥 🎉

Yay! Time to merge this PR, then?

your Azure subscription seems to hit its CPU core quota 😅 I guess that just means that too many VMs are active at the same time. The limit is 20 cores, so two 8-core VMs would already be the maximum. We could use the 4-core machines instead, so that up to five VMs can be active at the same time.

That sounds very sensible to me.

dennisameling commented 1 year ago

Yay! Time to merge this PR, then?

Sounds good! 🚀

That sounds very sensible to me.

Okay. Here we go!

dscho commented 1 year ago

Hooray!

dennisameling commented 1 year ago

Yay! Anything I can still help with to get the arm64 binaries into one of the upcoming releases?

Two minor things from my side:

dscho commented 1 year ago

  • I have some pending documentation updates regarding the earlier git-extra changes. It looks like there's no UI to view that branch, but you can clone https://github.com/git-for-windows/git.wiki.git and look for the update-git-extra-docs branch. 😊

Merged & pushed 😃

dennisameling commented 1 year ago

@dscho how is the tag-git pipeline supposed to work for existing tags (e.g. v2.39.2.windows.1)? I tried both the commit ref and the tag on that pipeline, but both failed (here and here). I can't kick off a git-artifacts pipeline for aarch64 before having run bundle-artifacts. It should probably just take the existing tag and create the bundle-artifacts artifact, right?

dscho commented 1 year ago

@dennisameling since building Git for Windows/ARM64 is probably not going to be a recurrent thing, now that tag-git and git-artifacts are working, how about simply branching off, say, git-2.39.2-aarch64, and modifying the workflow as a one-off? I.e.

Then push that branch to git-for-windows/git-for-windows-automation and let it build the artifacts. Does that sound like a good plan to you?

dscho commented 1 year ago

FWIW I just had a look at the artifacts of the latest regular Git for Windows build, i.e. one with a publicly visible "Git artifacts" run. And I was slightly wrong about the contents of that .zip file:

I think we can create those files in said branch "simulating an unpacked bundle-artifacts.zip", get the workflow to build the Windows/ARM64 artifacts and then add them to the release page of Git for Windows v2.39.2.

Oh, and for the sake of keeping track of history, I would then "fake-merge" that branch once the artifacts were built correctly (i.e. git merge -s ours <remote-branch>).
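
Spelled out, that fake-merge would look roughly like this (the branch names are just examples):

git checkout main
git merge -s ours --no-ff -m "Fake-merge the one-off aarch64 build branch" origin/git-2.39.2-aarch64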

dscho commented 1 year ago

@dennisameling after thinking about this, I believe that it makes most sense to add that functionality to the workflow instead of keeping it only in a topic branch (and then essentially dropping it via -s ours). I.e. I want to introduce a new input that takes the name of an existing tag and another one for the build-extra revision to use. I'll get to it immediately and report back.

dscho commented 1 year ago

I'll get to it immediately and report back.

Here is the PR, and here is the workflow run that I just started to build the Git for Windows v2.39.2 artifacts targeting ARM64.

dscho commented 1 year ago

@dennisameling I experienced the same hang that you reported here. I connected via RDP and investigated a little bit. I did not come terribly far, as it was late at night and eventually I had to give up and forgot to write down the details.

But here is what I remember: there were two makepkg-mingw processes, one being the child of the other, and when I attached to the child process using gdb I saw four threads, but somehow gdb was unhappy with the frame pointer and therefore I could not see any stack traces.

Since the makepkg-mingw program is actually a shell script, I am fairly certain that the child process (which was reported by wmic process to have the exact same command-line as its parent process) was essentially a subshell, probably executing git rev-parse --short HEAD.

On a hunch that the subshell was stuck waiting for something after doing its job, e.g. a signal or something similar, I asked gdb to force-stop ("kill", I don't want to know who found that nomenclature sensible, shudder) that child process, after the workflow had run for around 20 minutes and hung for a good while. And don't you know, this "fixed" it, and the run continued! You can see it here (you have to click on the sprocket wheel next to the "Search logs" text box and enable "Show timestamps"): the pkgver step was started at Wed, 15 Feb 2023 22:57:53 GMT and should not have taken more than a couple of seconds, yet the workflow run only continued at Wed, 15 Feb 2023 23:39:18 GMT (i.e. over 40 minutes later), after I had stopped that child process forcefully.

Unfortunately, I had forgotten that the git-artifacts workflow insists on making a MINGW-packages commit (which it cannot do because we're not actually modifying PKGBUILD), and the build failed at that point, even if it had successfully built the Pacman packages.
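
A guard along these lines in the workflow would probably have avoided that particular failure (a sketch only; the repository path is an assumption):

if git -C /usr/src/MINGW-packages/mingw-w64-git diff --quiet HEAD -- PKGBUILD
then
    echo "PKGBUILD unchanged; skipping the commit"
else
    git -C /usr/src/MINGW-packages/mingw-w64-git commit -m "mingw-w64-git: update PKGBUILD" -- PKGBUILD
fi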

dscho commented 1 year ago

It's hanging again. Let me document better what I did. First of all, I RDP'ed into the machine, then opened a Git Bash. In that Bash, I looked for the process that would be hanging by first calling wmic process where 'ExecutablePath like "%git%"' get ProcessId,ParentProcessId,CommandLine, seeing that it's probably the signtool script, and then calling:

$ wmic process where 'CommandLine like "%sign%"' get ProcessId,ParentProcessId,CommandLine
CommandLine                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ParentProcessId  ProcessId
D:\git-sdk-arm64-build-installers\usr\bin\make.exe -f ../mingw-w64-git.mak sign-executables                                                                                                                                                                                                                                                                                                                                                                                                                                                            4408             6100
D:\git-sdk-arm64-build-installers\usr\bin\sh.exe -c "eval git --git-dir=D:/a/_work/git-for-windows-automation/git-for-windows-automation/.git signtool headless-git.exe git-daemon.exe git-http-backend.exe git-imap-send.exe git-sh-i18n--envsubst.exe git-shell.exe git-http-fetch.exe git-http-push.exe git-remote-http.exe git-remote-https.exe git-remote-ftp.exe git-remote-ftps.exe \    contrib/credential/wincred/git-credential-wincred.exe git.exe \ cmd/git{,-gui,k}.exe compat-bash.exe git-{bash,cmd,wrapper}.exe"                          1872             2684
D:\git-sdk-arm64-build-installers\usr\bin\sh.exe -c "eval git --git-dir=D:/a/_work/git-for-windows-automation/git-for-windows-automation/.git signtool headless-git.exe git-daemon.exe git-http-backend.exe git-imap-send.exe git-sh-i18n--envsubst.exe git-shell.exe git-http-fetch.exe git-http-push.exe git-remote-http.exe git-remote-https.exe git-remote-ftp.exe git-remote-ftps.exe \    contrib/credential/wincred/git-credential-wincred.exe git.exe \ cmd/git{,-gui,k}.exe compat-bash.exe git-{bash,cmd,wrapper}.exe"                          2684             5676
D:\git-sdk-arm64-build-installers\clangarm64\bin\git.exe --git-dir=D:/a/_work/git-for-windows-automation/git-for-windows-automation/.git signtool headless-git.exe git-daemon.exe git-http-backend.exe git-imap-send.exe git-sh-i18n--envsubst.exe git-shell.exe git-http-fetch.exe git-http-push.exe git-remote-http.exe git-remote-https.exe git-remote-ftp.exe git-remote-ftps.exe contrib/credential/wincred/git-credential-wincred.exe git.exe cmd/git.exe cmd/git-gui.exe cmd/gitk.exe compat-bash.exe git-bash.exe git-cmd.exe git-wrapper.exe  5676             5200
sh -c "sh \"/usr/src/build-extra/signtool.sh\" \"$@\"" "sh \"/usr/src/build-extra/signtool.sh\"" headless-git.exe git-daemon.exe git-http-backend.exe git-imap-send.exe git-sh-i18n--envsubst.exe git-shell.exe git-http-fetch.exe git-http-push.exe git-remote-http.exe git-remote-https.exe git-remote-ftp.exe git-remote-ftps.exe contrib/credential/wincred/git-credential-wincred.exe git.exe cmd/git.exe cmd/git-gui.exe cmd/gitk.exe compat-bash.exe git-bash.exe git-cmd.exe git-wrapper.exe                                                   5200             5024
C:\Windows\System32\Wbem\wmic.exe process where "CommandLine like \"%sign%\"" get ProcessId,ParentProcessId,CommandLine                                                                                                                                                                                                                                                                                                                                                                                                                                10112            10016

So indeed, there were quite a few, and the process with ID 5024 seemed to be the bottom-most one. To verify, I called:

$ wmic process where 'ParentProcessId=5024' get ProcessId,ParentProcessId,CommandLine
No Instance(s) Available.

Okay. Now on to debugging. For that, I need gdb.exe. So I cloned the entire SDK by running git -C /d/ clone --depth=1 https://github.com/git-for-windows/git-sdk-arm64 in the Git Bash. After that, I opened a Git SDK Bash by pressing the Windows key and then typing D:\git-sdk-arm64\git-bash.exe Enter.

Unfortunately, that does not work, because of the OpenSSL situation where Git for Windows stays with v1.1.1 as long as possible while MSYS2 has already switched to v3:

$ gdb.exe
D:/git-sdk-arm64/usr/bin/gdb.exe: error while loading shared libraries: msys-python3.11.dll: cannot open shared object file: No such file or directory

So I downloaded MSYS2's python package via curl -LO https://repo.msys2.org/msys/x86_64/python-3.11.2-1-x86_64.pkg.tar.zst and then installed it via pacman -U ./python-3.11.2-1-x86_64.pkg.tar.zst. This let me start gdb.exe bash.exe 5024 (that invocation asks gdb to load the bash.exe binary for its symbols and to attach to the process with ID 5024).

In that GDB session, I then tried to see the current threads:

(gdb) info thread
  Id   Target Id          Frame
  1    Thread 5024.0x10fc error return ../../gdb-11.1/gdb/windows-nat.c:609 was 31
0x0000000000000000 in ?? ()
  2    Thread 5024.0x1978 0x00007fff30ad6a64 in ?? ()
  3    Thread 5024.0x18a8 0x00007fff30ad69c4 in ?? ()
  4    Thread 5024.0x6a8  error return ../../gdb-11.1/gdb/windows-nat.c:609 was 31
0x0000000000000000 in ?? ()
  5    Thread 5024.0x1468 0x00007fff30ad69c4 in ?? ()
* 6    Thread 5024.0x2058 0x00007fff30ae2374 in ?? ()

Not very informative, so let's try to see the current thread's stack trace:

(gdb) bt
#0  0x00007fff30ae2374 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

That's it.

My next attempt involved downloading Sysinternals' Process Explorer. That gave me more information. Apparently there is one thread with this stack trace:

ntoskrnl.exe!ZwUpdateWnfStateData+0xaf60
ntoskrnl.exe!KeUpdateThreadTag+0xdc98
ntoskrnl.exe!KiDispatchInterrupt+0x7298
ntoskrnl.exe!KeWaitForMutexObject+0x440
ntoskrnl.exe!KeUpdateThreadTag+0xa754
ntoskrnl.exe!KiDeliverApc+0x18c
ntoskrnl.exe!KeUpdateThreadTag+0xdd1c
ntoskrnl.exe!KiDispatchInterrupt+0x7298
ntoskrnl.exe!KeWaitForMutexObject+0x440
ntoskrnl.exe!ObWaitForSingleObject+0xb4
ntoskrnl.exe!NtWaitForSingleObject+0x34
ntoskrnl.exe!ZwUpdateWnfStateData+0xadf0
ntoskrnl.exe!ZwUpdateWnfStateData+0xa9e8
!ZwWaitLowEventPair+0x158284
!ZwWaitLowEventPair+0x19d190
!ZwWaitLowEventPair+0x1950a8
!ZwWaitLowEventPair+0x1a9ad0
!ZwWaitLowEventPair+0x290f80
!ZwWaitLowEventPair+0x29ff14
!_assert+0x41c6

So that looks as if it called _assert(). And another one with this stack trace:

ntoskrnl.exe!ZwUpdateWnfStateData+0xaf60
ntoskrnl.exe!KeUpdateThreadTag+0xdc98
ntoskrnl.exe!KiDispatchInterrupt+0x7298
ntoskrnl.exe!KeWaitForMutexObject+0x440
ntoskrnl.exe!FsRtlNotifyVolumeEventEx+0x4470
ntoskrnl.exe!NtQueryVolumeInformationFile+0xfe0
ntoskrnl.exe!NtReadFile+0xb0
ntoskrnl.exe!ZwUpdateWnfStateData+0xadf0
ntoskrnl.exe!ZwUpdateWnfStateData+0xa9e8
!ZwWaitLowEventPair+0x158324
!ZwWaitLowEventPair+0x178bc8
!PackageSidFromProductId+0x2318c0
!PackageSidFromProductId+0x26a360
!sigfillset+0x2d61

This looks as if it was inside the sigfillset() function. But that does not make too much sense because that function does not call any other function.

The other threads' stack traces either cannot be accessed or look like they're completely outside of Bash's and the MSYS2 runtime's source code:

ntoskrnl.exe!ZwUpdateWnfStateData+0xaf60
ntoskrnl.exe!KeUpdateThreadTag+0xdc98
ntoskrnl.exe!KiDispatchInterrupt+0x7298
ntoskrnl.exe!KeWaitForMutexObject+0x440
ntoskrnl.exe!KeUpdateThreadTag+0xa754
ntoskrnl.exe!KiDeliverApc+0x18c
ntoskrnl.exe!KeUpdateThreadTag+0xdd1c
ntoskrnl.exe!KiDispatchInterrupt+0x7298
ntoskrnl.exe!KeDelayExecutionThread+0x614
ntoskrnl.exe!PsWow64IsMachineSupported+0x2d04
ntoskrnl.exe!ZwUpdateWnfStateData+0xadf0
ntoskrnl.exe!ZwUpdateWnfStateData+0xa9e8
!ZwWaitLowEventPair+0x1616f4
!ZwWaitLowEventPair+0x1eea48
!ZwWaitLowEventPair+0x1edea8
!ZwWaitLowEventPair+0x1ebe40
!ZwWaitLowEventPair+0x18495c
!ZwWaitLowEventPair+0x187a54
!ZwWaitLowEventPair+0x1ac658
!ZwWaitLowEventPair+0x1abc70
!ZwWaitLowEventPair+0x1ad458
!uaw_wcsrchr+0x81c70
!ZwWaitLowEventPair+0x1a9bc8

I'm out of ideas as to what else to investigate on that side, so I downloaded Sysinternals' handle utility to see what handles are open in that process, but that also fails to shed any light on the issue:

$ ./handle64a.exe -p 5024

Nthandle v5.0 - Handle viewer
Copyright (C) 1997-2022 Mark Russinovich
Sysinternals - www.sysinternals.com

  180: Section       \BaseNamedObjects\msys-2.0S5-c2a8dd8be8845dc5\shared.5
  184: Section       \BaseNamedObjects\msys-2.0S5-c2a8dd8be8845dc5\S-1-5-20.1
  1DC: File  (RWD)   D:\git-sdk-arm64-build-installers\usr\src\MINGW-packages\mingw-w64-git\src\git
  1EC: Section       \BaseNamedObjects\msys-2.0S5-c2a8dd8be8845dc5\cygpid.10728
  2C0: Section       \BaseNamedObjects\msys-2.0S5-c2a8dd8be8845dc5\cygpid.10728

So for now I'll just call it yet another unexplained hang and try to let the workflow run continue, at least, by force-stopping that process: wmic process where ProcessID=5024 delete.

And sure enough, the workflow run continues...

dscho commented 1 year ago

Darn, darn, darn. Again I had to RDP into the runner and force-stop the signtool process in order to let it complete (toggle the timestamps to see the 24-minute gap where there should have been less than 10 seconds).

This is not good. In this shape, we cannot publish a Git for Windows/ARM64. If it hangs for us so often, it will hang for plenty of users, too.

dscho commented 1 year ago

And it's hanging again, and also here. I worked around this by RDP'ing into the machines and running essentially:

candidates= &&
# iterate over the Win32 ProcessIds of everything running from the build SDK
for pid in $(
    wmic process where 'ExecutablePath like "%arm64-build-installers%"' get ProcessId |
    sed 1d
)
do
    # skip anything that does not look like a numeric process ID
    test "z${pid#[0-9]}" != "z$pid" || continue
    # wmic prints "No Instance(s) Available." (on stderr) when the process has no
    # children, i.e. when it is a leaf process; remember those as candidates
    case "$(wmic process where parentprocessid=$pid get processid 2>&1 >/dev/null)" in
     "No Instance(s) Available"*)
         candidates="$candidates $pid"
         ;;
    esac
done &&
# force-stop the last leaf process that was found
wmic process where processid=${candidates##* } delete

dscho commented 1 year ago

I reached out to #cygwin-devel on libera.chat, and @lazka pointed me to https://github.com/msys2/msys2-autobuild/issues/62, noting:

it only happening for those two things and not in say autotools/configure/make seems to suggest it's not a general problem (those two things being gpgme signature verification, and install-info)

And @jturney said:

"Not a known issue, but now we know about it.." 😉

lazka commented 1 year ago

@jeremyd2019 tried to debug our hangs at one point

jeremyd2019 commented 1 year ago

Yeah, I couldn't get anywhere, though: none of the debuggers I tried to attach were able to get the thread context for the 'main' thread. There was also another thread running, which seemed to be for Cygwin signal handling and which seemed to be happily waiting. I gave up trying to debug it and instead added workarounds to avoid situations where it seemed to happen. Now that somebody else is reproducing it, maybe it can be debugged.

Regarding your _assert and sigfillset observations, notice the offsets from the symbols. It seems unlikely that those are actually the functions involved, just the nearest symbols it could find.

dscho commented 1 year ago

Regarding your _assert and sigfillset observations, notice the offsets from the symbols. It seems unlikely that those are actually the functions involved, just the nearest symbols it could find.

Right, I feared so much.

Now that somebody else is reproducing it, maybe it can be debugged.

I fear that we'll have to get a much more reliable reproducer first.

Maybe the following will reproduce the hang reliably enough: calling the MINGW executable osslsigncode.exe from the MSYS2 Bash script signtool.sh, which in turn is called as a Git alias from the MINGW executable git.exe that is called from either an MSYS2 Bash script or MSYS2 make.exe.

At least that is what hung pretty reliably in those CI runs for me; in the last run, it hung every single time that call chain happened. But earlier runs did not hang after code-signing (even if they failed later on, for independent reasons). The hang was at the end of the inner-most shell script, just after all the code-signing was done.

Alovchin91 commented 1 year ago

Since you're RDP'ing into the machine anyway, could you maybe collect a full memory dump of a hanging process using procdump? Maybe that could shed some more light on what it's waiting for 🤔

Alovchin91 commented 1 year ago

... calling the MINGW executable osslsigncode.exe from the MSYS2 Bash script signtool.sh, which in turn is called as a Git alias from the MINGW executable git.exe ...

Hmm, so git.exe calls signtool.sh, which signs git.exe, produces git.signed.exe (according to the script), and then moves it in place of the original? Sounds like the script tries to overwrite git.exe while it's in the process of executing the script. Or is it hanging while signing a different executable?

dscho commented 1 year ago

@Alovchin91 it's different git.exe files. The running one is /clangarm64/bin/git.exe, and the signed one is /usr/src/MINGW-packages/src/git/git.exe. So that's not the problem.

dscho commented 1 year ago

Since you're RDP'ing into the machine anyway, could you maybe collect a full memory dump of a hanging process using procdump? Maybe that could shed some more light on what it's waiting for 🤔

@Alovchin91 I am curious what you suggest to do with that dump. Sure, I can collect that information, but what does that buy us that the previous investigation (#1 and #2) didn't reveal yet?

@dennisameling @jeremyd2019 did any of you manage to reproduce this in a non-CI setting, i.e. when starting a command interactively? I tried and failed to cause such a hang using interactive commands in an Azure VM, and would love to do that to accelerate the investigation.

Alovchin91 commented 1 year ago

@Alovchin91 I am curious what you suggest to do with that dump. Sure, I can collect that information, but what does that buy us that the previous investigation (#1 and #2) didn't reveal yet?

That would give some starting point to those who would like to help but cannot reproduce. From your very next sentence I understand that reproducing the issue is not a trivial task 😅 I've asked around on Mastodon whether somebody (from Microsoft?) could help. That is, of course, only if you're looking for any help 😊

jeremyd2019 commented 1 year ago

The hangs that I experienced with pacman (both signature verification and info database updates) would happen frequently while updating/installing packages interactively. I know I did try the process dump option in task manager, but that dump was also 'missing' the context for the main thread.

From my anecdotal observations, I think this seems more likely to occur on slower machines (it happened all the time on the Raspberry Pi, but was rarer on the QC710), but perhaps also more likely with more cores? (I was playing with a 60-core VM and I think it was more likely there than on the 8-core QC710.)

I am set up to run a KVM virtual machine on a Raspberry Pi, so I can try to reproduce the hangs I got with pacman and try things, if anybody has any ideas (I can try procdump and post the dump somewhere if somebody thinks it would help, for example).

My workarounds for the pacman hangs I saw, which seem quite effective for building packages for the MSYS2 project: https://github.com/jeremyd2019/winautoconfig/blob/544a3c019fcbd8955e91af3d4a86e12e3d1b6a49/msys2-runner-setup/setupscripts/setup.cmd#L28-L31

dscho commented 1 year ago

The hangs that I experienced with pacman (both signature verification and info database updates) would happen frequently while updating/installing packages interactively.

@jeremyd2019 did you find a minimal reproducer there, i.e. ideally a command that only verifies the signature but does not necessarily update any package?

Also, do you happen to know whether the signature verification happens to use a MINGW program to perform that verification (e.g. /clangarm64/bin/gpg.exe)? In my cases, the hang always happened in the MSYS process associated with spawning a non-MSYS program...
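
One quick way to tell the two flavors apart, for what it's worth, is to check whether the executable links against msys-2.0.dll (just a sketch; adjust the path to the gpg in question):

$ ldd "$(command -v gpg)" | grep -qi msys-2.0.dll && echo MSYS || echo native/MINGW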

dscho commented 1 year ago

@Alovchin91 the thing that makes me most dubious about the procdump suggestion is that we already attached to the process, using gdb.exe, and that neither procdump nor any other regular Win32 tools have any idea about the MSYS2 runtime or about GCC's DWARF debug symbols and are therefore even less likely than gdb.exe to show any useful information.

And FWIW there is already somebody from Microsoft helping this here ticket: me.

jeremyd2019 commented 1 year ago

@jeremyd2019 did you find a minimal reproducer there, i.e. ideally a command that only verifies the signature but does not necessarily update any package?

No. When I tried, I ended up reproducing a different hang that also happened on x64; that one could be debugged and was fixed in Cygwin, so it wasn't a complete loss, but after that fix my minimal reproducer no longer reproduced the hang.

Also, do you happen to know whether the signature verification happens to use a MINGW program to perform that verification (e.g. /clangarm64/bin/gpg.exe)? In my cases, the hang always happened in the MSYS process associated with spawning a non-MSYS program...

No, it uses msys2 gpg, via gpgme.

jeremyd2019 commented 1 year ago

@Alovchin91 the thing that makes me most dubious about the procdump suggestion is that we already attached to the process, using gdb.exe, and that neither procdump nor any other regular Win32 tools have any idea about the MSYS2 runtime or about GCC's DWARF debug symbols and are therefore even less likely than gdb.exe to show any useful information.

I tried gdb, windbg, the task manager process dump thing, even local kernel debugging, but was never able to get user mode context for that thread.

I think I heard that windbgx (the new version in the store) now supports DWARF symbols, but I haven't verified that myself, and I don't know if it would understand the split debug info that msys2-runtime uses. (does that even work right with 3.4? I thought I saw some additional stripping happening in msys2-runtime go by at one point)

dscho commented 1 year ago

When I tried, I ended up reproducing a different hang that also happened on x64; that one could be debugged and was fixed in Cygwin, so it wasn't a complete loss, but after that fix my minimal reproducer no longer reproduced the hang.

@jeremyd2019 that's interesting. It probably means that the hangs are related; at least it could give us a better idea of which code path to look at. Would you happen to know the commit hash of the fix in Cygwin?

I think I heard that windbgx (the new version in the store) now supports DWARF symbols, but I haven't verified that myself, and I don't know if it would understand the split debug info that msys2-runtime uses.

Indeed. That's why I think attaching with gdb.exe is our best bet to get any useful stack trace.

But I'm starting to believe that this is a dead end: we might never get a useful stack trace, because the issue is more likely that something in MSYS2's tear-down code is failing to tear down the signal event handler thread, and that thread is blocking the correct and expected termination of the process.

So I fear that we'll need to work with heavy instrumentation ("debug small_printf() messages littered all over the tear-down code path") and then compare the logs of a hanging to a non-hanging invocation of any minimal reproducer that we can come up with.

dscho commented 1 year ago

Would you happen to know the commit hash of the fix in Cygwin?

@jeremyd2019 this might actually help address the hangs I observed, as Git for Windows is still on the Cygwin v3.3.* train.

jeremyd2019 commented 1 year ago

https://github.com/msys2/msys2-runtime/commit/0ce992c1e40f843d5f264e87908a6decfd6681d1. This was actually a crash, but the code was in such a state that it resulted in a hang in the exception handlers.

dscho commented 1 year ago

@jeremyd2019 thank you. Unfortunately my hope that this could help Git for Windows is dispelled by the fact that this commit made it into v3.3.4, and we're on v3.3.6 already, so we have that fix...

dscho commented 1 year ago

Update: I managed to reproduce the hang interactively, in my Azure VM, not on the first attempt, but the second. With a self-signed certificate and the /clangarm64/bin/git.exe of git-for-windows/git-sdk-arm64, I ran this:

git -c alias.signtool='!sh "/usr/src/build-extra/signtool.sh"' signtool $PWD/g.exe

where /usr/src/build-extra/signtool.sh comes from here and g.exe was the copy of a just-built git.exe (by now already signed, due to the first attempt).

The self-signed certificate was created via:

$cert = New-SelfSignedCertificate -DnsName www.yourwebsite.com -Type CodeSigning -CertStoreLocation Cert:\CurrentUser\My
$CertPassword = ConvertTo-SecureString -String "my_password" -Force -AsPlainText
Export-PfxCertificate -Cert (Get-ChildItem Cert:\CurrentUser\My -CodeSigningCert)[0] -FilePath ".sig\codesign.p12" -Password $CertPassword
"my_password" | Out-File -NoNewline -Encoding ascii ".sig\codesign.pass"

In a second MinTTY, when I run ps, I see this line:

     2129       1    2128       8176  pty0      197108 20:16:51 /usr/bin/sh <defunct>

The full output of that ps invocation is:

      PID    PPID    PGID     WINPID   TTY         UID    STIME COMMAND
     2021       1    2021      10056  ?         197108 20:14:27 /usr/bin/mintty
     2129       1    2128       8176  pty0      197108 20:16:51 /usr/bin/sh <defunct>
     2133    2132    2133       4132  pty1      197108 20:17:21 /usr/bin/bash
     2132       1    2132       9120  ?         197108 20:17:20 /usr/bin/mintty
     2217    2133    2217       1320  pty1      197108 20:28:19 /usr/bin/ps
     2022    2021    2022       8444  pty0      197108 20:14:27 /usr/bin/bash
     2128    2022    2128       6172  pty0      197108 20:16:51 /clangarm64/bin/git

Note that the pty0 processes refer to the hanging session; the pty1 processes are the ones running in the second MinTTY window where I ran ps.

The really strange thing about this is that no Win32 process has ProcessId 8176 (remember: the MSYS2/Cygwin runtime maintains its own pid, which is different from the actual Win32 ProcessId); there is no such process when I look via wmic process where parentprocessid=8176 get ProcessId,ParentProcessId,CommandLine.
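
As an aside, the MSYS2 pid can also be mapped to its Win32 ProcessId via procfs, e.g. for the still-running git process (this should print 6172, matching the WINPID column above, if I remember the procfs semantics correctly):

$ cat /proc/2128/winpid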

The parent process (MSYS2/Cygwin pid 2128, corresponding to the Win32 ProcessId 6172) refers to the git.exe process that I called, i.e. that sh should be the one that ran the signtool.sh script. But when I look for such a process, I see a different Win32 ProcessId:

$ wmic process where 'CommandLine like "%signtool%"' get ProcessId,ParentProcessId,CommandLine
CommandLine                                                                                                                                                                ParentProcessId  ProcessId
C:\git-sdk-arm64\clangarm64\bin\git.exe -c "alias.signtool=!sh \"/usr/src/build-extra/signtool.sh\"" signtool C:/git-sdk-arm64/usr/src/MINGW-packages/mingw-w64-git/g.exe  5652             6172
sh -c "sh \"/usr/src/build-extra/signtool.sh\" \"$@\"" "sh \"/usr/src/build-extra/signtool.sh\"" C:/git-sdk-arm64/usr/src/MINGW-packages/mingw-w64-git/g.exe               6172             6692
C:\Windows\System32\Wbem\wmic.exe process where "CommandLine like \"%signtool%\"" get ProcessId,ParentProcessId,CommandLine                                                6244             3288

The Win32 ProcessId of that sh process is 6692, and there is no line in the ps output corresponding to that process.

Another strange thing is that the ps output says that the parent process of that git.exe process has Win32 ProcessId 8444, but the wmic output suggests that the parent process has Win32 ProcessId 5652. So I looked for that process, and it seems that it is running, with its parent process being the one with Win32 ProcessId 8444 and with the same command line as that one:

$ wmic process where processid=5652 get ProcessId,ParentProcessId,CommandLine
CommandLine                                   ParentProcessId  ProcessId
C:\git-sdk-arm64\usr\bin\bash.exe --login -i  8444             5652

Alovchin91 commented 1 year ago

And FWIW there is already somebody from Microsoft helping this here ticket: me.

Oh, I didn't know that bit, sorry 😅 I was just trying to be helpful 😬

I've been thinking about procdump because it can create a full dump; AFAIK Task Manager creates a minidump. My understanding is that a full dump can be analysed by any tool afterwards. But that was just a suggestion and I don't think it's useful anymore 😀

Anyway, sorry for the distraction.

jeremyd2019 commented 1 year ago

Well, here's a full procdump of a hung bash.exe, if anyone wants to try to learn anything from it.

bash.exe_230218_214907.dmp.gz

WARNING: Teb 0 pointer is NULL - defaulting to 00000000`7ffde000
WARNING: 00000000`7ffde000 does not appear to be a TEB
WARNING: Thread 1bd0 context retrieval failure during dump writing, Win32 error 0n31

Seems familiar... So yes, this still reproduces with 3.4 (this is msys2-runtime-3.4.5-1).

dscho commented 1 year ago

FWIW I could reproduce the hang on my Azure VM only once, and then no more :-(

jeremyd2019 commented 1 year ago

You might try putting the system under load while trying to reproduce it (to replicate my 'slower machines' observation). Load was a factor when I was trying to come up with a reproducer (which wound up finding that other bug): I was basically stressing fork/exec, hoping it would eventually hang.
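
Something as simple as this, run in a couple of MinTTY windows in parallel, might already be enough of a stressor (just a sketch):

i=0
while test $i -lt 10000
do
    i=$((i+1))
    # each iteration forks the shell and execs a short-lived external program
    /usr/bin/true || break
done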