git-for-windows / git-for-windows-automation

A few GitHub workflows and support code to help with Git for Windows' day-to-day tasks
10 stars 9 forks source link

build-and-deploy: use `update-via-pacman.ps1` instead of a home-grown `pacman -Syyu` #61

Closed dscho closed 6 months ago

dscho commented 6 months ago

Needs https://github.com/git-for-windows/build-extra/pull/539 to be merged first.

The intricacies of updating the Git for Windows SDKs are not quite trivial. For one, you have to make sure to call pacman another time if "system packages" such as msys2-runtime have been updated. Also, in the end mingw-w64-git-extra has to be installed so that that package's post-install script can adjust a couple of system files in accordance with Git for Windows' requirements.

These intricacies are documented and automated in the update-via-pacman.ps1 script that is run as part of the sync GitHub workflows that are installed in the git-sdk-* repositories.

It only makes sense to use the same automation when building Git for Windows' packages, so let's do this.

This fixes https://github.com/git-for-windows/git-for-windows-automation/issues/29

rimrul commented 5 months ago

@dscho I think this change is consistently giving us the hang from https://github.com/msys2/msys2-autobuild/issues/62.

Both your deploy job for gnutls and my deploy job for openssl got stuck on

bash -lc "pacman -S --overwrite=* --noconfirm mingw-w64-clang-aarch64-git-extra".

rimrul commented 5 months ago

And that roughly coincides with my last attempt to debug the hanging deployments for https://github.com/git-for-windows/MINGW-packages/pull/105 that got stuck on pacman -Sy --noconfirm tmate openssh after we did pacman -Syyu --noconfirm

rimrul commented 4 months ago

@dscho I think this change is consistently giving us the hang from msys2/msys2-autobuild#62.

It only hangs there most of the time. We have one run now that did not hang. https://github.com/git-for-windows/git-for-windows-automation/actions/runs/7725404458/job/21059537359

EDIT: It did, but got unstuck by manual intervention. See https://github.com/git-for-windows/MINGW-packages/pull/109#issuecomment-1919065643

rimrul commented 4 months ago

Looking into this some more, pacman seems to be getting stuck at some point before https://gitlab.archlinux.org/pacman/pacman/-/blame/master/lib/libalpm/add.c#L94

Maybe --debug and/or --verbose in https://github.com/git-for-windows/git-sdk-arm64/blob/main/update-via-pacman.ps1#L146 could help investigate this further.

dscho commented 4 months ago

Looking into this some more, pacman seems to be getting stuck at some point before https://gitlab.archlinux.org/pacman/pacman/-/blame/master/lib/libalpm/add.c#L94

Yes, and it's a child process (marked with an identical command-line) of pacman -S --overwrite=* --noconfirm mingw-w64-clang-aarch64-git-extra. This child process has three threads after attaching with gdb, but attaching already comes with a problem or two:

[...]
(No debugging symbols found in /usr/bin/pacman.exe)
Attaching to program: /usr/bin/pacman.exe, process 5276
info thread
error return ../../gdb-13.2/gdb/windows-nat.c:1352 was 31: A device attached to the system is not functioning.
[New Thread 5276.0x163c]
[New Thread 5276.0x2634]
Error while mapping shared library sections:
`/c/Windows/SYSTEM32/ntdll.dll': not in executable format: file format not recognized
Error while mapping shared library sections:
`/c/Windows/System32/xtajit64.dll': not in executable format: file format not recognized
Error while mapping shared library sections:
`/c/Windows/System32/KERNEL32.DLL': not in executable format: file format not recognized
Error while mapping shared library sections:
`/c/Windows/System32/KERNELBASE.dll': not in executable format: file format not recognized
Error while mapping shared library sections:
`/c/Windows/SYSTEM32/apphelp.dll': not in executable format: file format not recognized
Error while mapping shared library sections:
`/c/Windows/SYSTEM32/CRYPT32.dll': not in executable format: file format not recognized
Error while mapping shared library sections:
`/c/Windows/System32/ucrtbase.dll': not in executable format: file format not recognized
Error while mapping shared library sections:
`/c/Windows/System32/advapi32.dll': not in executable format: file format not recognized
Error while mapping shared library sections:
`/c/Windows/System32/msvcrt.dll': not in executable format: file format not recognized
Error while mapping shared library sections:
`/c/Windows/System32/sechost.dll': not in executable format: file format not recognized
Error while mapping shared library sections:
`/c/Windows/System32/bcrypt.dll': not in executable format: file format not recognized
Error while mapping shared library sections:
`/c/Windows/System32/RPCRT4.dll': not in executable format: file format not recognized
Error while mapping shared library sections:
`/c/Windows/SYSTEM32/CRYPTBASE.DLL': not in executable format: file format not recognized
Error while mapping shared library sections:
`/c/Windows/System32/bcryptPrimitives.dll': not in executable format: file format not recognized

The problems continue:

(gdb) info thread
  Id   Target Id                Frame
  1    Thread 5276.0x3e4        error return ../../gdb-13.2/gdb/windows-nat.c:767 was 31: A device attached to the system is not functioning.
0x0000000000000000 in ?? ()
  2    Thread 5276.0x163c "sig" 0x00007ffe36029a64 in ?? ()
* 3    Thread 5276.0x2634       0x00007ffe36035374 in ?? ()

This happens only the first time, though, a subsequent info thread shows this:

(gdb) info thread
  Id   Target Id                Frame
  1    Thread 5276.0x3e4        0x0000000000000000 in ?? ()
  2    Thread 5276.0x163c "sig" 0x00007ffe36029a64 in ?? ()
* 3    Thread 5276.0x2634       0x00007ffe36035374 in ?? ()

The stack traces are useless:

(gdb) thread 2
[Switching to thread 2 (Thread 5276.0x163c)]
#0  0x00007ffe36029a64 in ?? ()
(gdb) bt
#0  0x00007ffe36029a64 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) thread 1
[Switching to thread 1 (Thread 5276.0x3e4)]
#0  0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x0
(gdb) thread 3
[Switching to thread 3 (Thread 5276.0x2634)]
#0  0x00007ffe36035374 in ?? ()
(gdb) bt
#0  0x00007ffe36035374 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Trying to attach gdb to the parent process (the one with the same command-line in wmic's output) also does not help:

Reading symbols from /usr/bin/pacman.exe...
(No debugging symbols found in /usr/bin/pacman.exe)
Can't attach to process 0 (error 2: The system cannot find the file specified.)
/c/Windows/system32/1064: No such file or directory.
(gdb) info thread
No threads.

Maybe --debug and/or --verbose in https://github.com/git-for-windows/git-sdk-arm64/blob/main/update-via-pacman.ps1#L146 could help investigate this further.

Good idea! I was able to reproduce the hang (or at least a hang, it might be for a different reason) when running the update-via-pacman.ps1 script via SSH. And when I do that, and try to run the script again, it hangs in the first pacman -Syyu call already. When I add --debug and --verbose, this reliably runs into a timeout while trying to synchronize the Pacman databases. Removing the yy lets it get a bit furthere, where it hangs at the GPGME version 1.23.2 line until I terminate that hanging process, then the output continues:

[...]
debug: pacman v6.0.2 - libalpm v13.0.2
debug: config: attempting to read file /etc/pacman.conf
debug: config: new section 'options'
debug: config: HoldPkg: pacman
debug: config: Architecture: auto
debug: config: arch: x86_64
debug: config: SigLevel: Required
debug: config: LocalFileSigLevel: Optional
debug: config: new section 'git-for-windows-aarch64'
debug: config: new section 'clangarm64'
debug: config file /etc/pacman.conf, line 78: including /etc/pacman.d/mirrorlist.mingw
debug: config: new section 'git-for-windows'
debug: config: new section 'git-for-windows-mingw32'
debug: config: new section 'mingw32'
debug: config file /etc/pacman.conf, line 87: including /etc/pacman.d/mirrorlist.mingw
debug: config: new section 'mingw64'
debug: config file /etc/pacman.conf, line 90: including /etc/pacman.d/mirrorlist.mingw
debug: config: new section 'ucrt64'
debug: config file /etc/pacman.conf, line 93: including /etc/pacman.d/mirrorlist.mingw
debug: config: new section 'clang32'
debug: config file /etc/pacman.conf, line 96: including /etc/pacman.d/mirrorlist.mingw
debug: config: new section 'clang64'
debug: config file /etc/pacman.conf, line 99: including /etc/pacman.d/mirrorlist.mingw
debug: config: new section 'msys'
debug: config file /etc/pacman.conf, line 102: including /etc/pacman.d/mirrorlist.msys
debug: config: finished parsing /etc/pacman.conf
debug: setup_libalpm called
debug: option 'logfile' = /var/log/pacman.log
debug: option 'gpgdir' = /etc/pacman.d/gnupg/
debug: option 'hookdir' = /etc/pacman.d/hooks/
debug: option 'cachedir' = /var/cache/pacman/pkg/
debug: registering sync database 'git-for-windows-aarch64'
debug: database path for tree git-for-windows-aarch64 set to /var/lib/pacman/sync/git-for-windows-aarch64.db
debug: GPGME version: 1.23.2
debug: GPGME engine info: file=/usr/bin/gpg, home=/etc/pacman.d/gnupg/
debug: checking signature for /var/lib/pacman/sync/git-for-windows-aarch64.db
debug: 1 signatures returned
debug: fingerprint: 91883E11E83DC29D14104DB4EDD44359093056EE
debug: summary: valid
debug: summary: green
debug: status: Success
debug: timestamp: 1706706098
debug: exp_timestamp: 0
[...]

The output continues only very slowly, though, which may or may not be of relevance here. It feels as if something is putting weights on the cart, so to say, I could imagine that there are lots of waiting for objects with lots of timeouts.

Eventually, the execution does reach that last pacman call, the one reinstalling mingw-w64-git-extra, and the familiar hang occurs again, and the output looks as following at the time:

[...]
debug: adding new server URL to database 'clang32': https://mirrors.piconets.webwerks.in/msys2-mirror/mingw/clang32
debug: adding new server URL to database 'clang32': https://www2.futureware.at/~nickoe/msys2-mirror/mingw/clang32
debug: adding new server URL to database 'clang32': https://mirrors.sjtug.sjtu.edu.cn/msys2/mingw/clang32
debug: adding new server URL to database 'clang32': https://mirrors.bit.edu.cn/msys2/mingw/clang32
debug: adding new server URL to database 'clang32': https://mirrors.aliyun.com/msys2/mingw/clang32
debug: adding new server URL to database 'clang32': https://mirror.iscas.ac.cn/msys2/mingw/clang32
debug: adding new server URL to database 'clang32': https://mirrors.cloud.tencent.com/msys2/mingw/clang32
debug: registering sync database 'clang64'
debug: database path for tree clang64 set to /var/lib/pacman/sync/clang64.db
debug: checking signature for /var/lib/pacman/sync/clang64.db
debug: 1 signatures returned
debug: fingerprint: 5F944B027F7FE2091985AA2EFA11531AA0AA7F57
debug: summary: valid
debug: summary: green
debug: status: Success
debug: timestamp: 1707857723
debug: exp_timestamp: 0
debug: validity: full; reason: Success
debug: key: 5F944B027F7FE2091985AA2EFA11531AA0AA7F57, Christoph Reiter (MSYS2 development key) <reiter.christoph@gmail.com>, owner_trust unknown, disabled 0
debug: signature is valid
debug: signature is fully trusted
debug: setting usage of 15 for clang64 repository  
debug: adding new server URL to database 'clang64': https://mirror.msys2.org/mingw/clang64
debug: adding new server URL to database 'clang64': https://repo.msys2.org/mingw/clang64
debug: adding new server URL to database 'clang64': https://mirror.umd.edu/msys2/mingw/clang64
debug: adding new server URL to database 'clang64': https://mirror.yandex.ru/mirrors/msys2/mingw/clang64
debug: adding new server URL to database 'clang64': https://download.nus.edu.sg/mirror/msys2/mingw/clang64
debug: adding new server URL to database 'clang64': https://ftp.acc.umu.se/mirror/msys2.org/mingw/clang64
debug: adding new server URL to database 'clang64': https://ftp.nluug.nl/pub/os/windows/msys2/builds/mingw/clang64
debug: adding new server URL to database 'clang64': https://ftp.osuosl.org/pub/msys2/mingw/clang64
debug: adding new server URL to database 'clang64': https://mirror.internet.asn.au/pub/msys2/mingw/clang64
debug: adding new server URL to database 'clang64': https://mirror.selfnet.de/msys2/mingw/clang64
debug: adding new server URL to database 'clang64': https://mirrors.dotsrc.org/msys2/mingw/clang64
debug: adding new server URL to database 'clang64': https://mirrors.bfsu.edu.cn/msys2/mingw/clang64
debug: adding new server URL to database 'clang64': https://mirrors.tuna.tsinghua.edu.cn/msys2/mingw/clang64
debug: adding new server URL to database 'clang64': https://mirrors.ustc.edu.cn/msys2/mingw/clang64
debug: adding new server URL to database 'clang64': https://mirror.nju.edu.cn/msys2/mingw/clang64
debug: adding new server URL to database 'clang64': https://repo.extreme-ix.org/msys2/mingw/clang64
debug: adding new server URL to database 'clang64': https://mirror.clarkson.edu/msys2/mingw/clang64
debug: adding new server URL to database 'clang64': https://quantum-mirror.hu/mirrors/pub/msys2/mingw/clang64
debug: adding new server URL to database 'clang64': https://mirror.archlinux.tw/MSYS2/mingw/clang64
debug: adding new server URL to database 'clang64': https://fastmirror.pp.ua/msys2/mingw/clang64
debug: adding new server URL to database 'clang64': https://mirror.ufro.cl/msys2/mingw/clang64
debug: adding new server URL to database 'clang64': https://ftp.cc.uoc.gr/mirrors/msys2/mingw/clang64
debug: adding new server URL to database 'clang64': https://mirror.jmu.edu/pub/msys2/mingw/clang64
debug: adding new server URL to database 'clang64': https://mirrors.piconets.webwerks.in/msys2-mirror/mingw/clang64
debug: adding new server URL to database 'clang64': https://www2.futureware.at/~nickoe/msys2-mirror/mingw/clang64
debug: adding new server URL to database 'clang64': https://mirrors.sjtug.sjtu.edu.cn/msys2/mingw/clang64
debug: adding new server URL to database 'clang64': https://mirrors.bit.edu.cn/msys2/mingw/clang64
debug: adding new server URL to database 'clang64': https://mirrors.aliyun.com/msys2/mingw/clang64
debug: adding new server URL to database 'clang64': https://mirror.iscas.ac.cn/msys2/mingw/clang64
debug: adding new server URL to database 'clang64': https://mirrors.cloud.tencent.com/msys2/mingw/clang64
debug: registering sync database 'msys'
debug: database path for tree msys set to /var/lib/pacman/sync/msys.db
debug: checking signature for /var/lib/pacman/sync/msys.db
debug: 1 signatures returned
debug: fingerprint: 5F944B027F7FE2091985AA2EFA11531AA0AA7F57
debug: summary: valid
debug: summary: green
debug: status: Success
debug: timestamp: 1707857792
debug: exp_timestamp: 0
debug: validity: full; reason: Success

I also prefixed that pacman call with strace -fo /tmp/a1 and the contents of that log file look like this at the time of the hang:

[...]
  225 22098986 [main] pacman 2458 dofork: entering
 2426 22101412 [main] pacman 2458 sig_send: sendsig 0x1B4, pid 2458, signal -72, its_me 1
  211 22101623 [main] pacman 2458 sig_send: wakeup 0x8AC
  237 22101860 [main] pacman 2458 sig_send: Waiting for pack.wakeup 0x8AC
  220 22102080 [sig] pacman 2458 wait_sig: signalling pack.wakeup 0x8AC
  202 22102282 [main] pacman 2458 sig_send: returning 0x0 from sending signal -72
  225 22102507 [main] pacman 2458 frok::parent: priority class 32
  319 22102826 [main] pacman 2458 frok::parent: stack - bottom 0x800000000, top 0x7FFFFC000, addr 0x7FFE00000, guardsize 0x3000
  218 22103044 [main] pacman 2458 frok::parent: CreateProcessW (C:\git-sdk-arm64-full\usr\bin\pacman.exe, C:\git-sdk-arm64-full\usr\bin\pacman.exe, 0, 0, 1, 0x420, 0, 0, 0x7FFFFC1D0, 0x7FFFFC170)
 3966 22107010 [main] pacman 2458 frok::parent: forked pid 10052
  195 22107205 [main] pacman 2458 child_info::sync: n 2, waiting for subproc_ready(0x8B4) and child process(0x880)
20344 22127549 [main] pacman 2458 child_info::sync: pid 10052, WFMO returned 0, exit_code 0x103, res 1
  306 22127855 [main] pacman 2458 open_shared: name cygpid.2529, shared 0x1A4260000 (wanted 0x1A4260000), h 0x7D4, m 5, created 0
  330 22128185 [main] pacman 2458 time: 1707863011 = time(0x0)
  222 22128407 [main] pacman 2458 proc_subproc: args: 1, -15984
  206 22128613 [main] pacman 2458 proc_subproc: returning 1
  202 22128815 [main] pacman 2458 frok::parent: child is alive (but stopped)
  276 22129091 [main] pacman 2458 child_copy: stack - hp 0x880 low 0x7FFFFC290, high 0x800000000, res 1
  197 22129288 [main] pacman 2458 child_copy: done
  216 22129504 [main] pacman 2458 resume_child: signalled child
  191 22129695 [main] pacman 2458 child_info::sync: n 2, waiting for subproc_ready(0x8B4) and child process(0x880)
  266 22129961 [main] pacman 2458 child_info::sync: pid 2529, WFMO returned 0, exit_code 0x103, res 1
  200 22130161 [main] pacman 2458 proc_subproc: args: 2, -15984
  605 22130766 [waitproc] pacman 2458 SetThreadName: SetThreadDescription() failed. 0000006D 10000000
  184 22130950 [main] pacman 2458 pinfo::wait: created tracking thread for pid 2529, winpid 0x2744, rd_proc_pipe 0x8AC
  195 22131145 [main] pacman 2458 proc_subproc: added pid 2529 to proc table, slot 0
  196 22131341 [main] pacman 2458 proc_subproc: returning 1
  201 22131542 [main] pacman 2458 resume_child: signalled child
  206 22131748 [main] pacman 2458 sig_send: sendsig 0x1B4, pid 2458, signal -73, its_me 1
  191 22131939 [main] pacman 2458 sig_send: wakeup 0x8A4
  204 22132143 [main] pacman 2458 sig_send: Waiting for pack.wakeup 0x8A4
  217 22132360 [sig] pacman 2458 wait_sig: signalling pack.wakeup 0x8A4
  202 22132562 [main] pacman 2458 sig_send: returning 0x0 from sending signal -73
 2183 22134745 [main] pacman 2458 dofork: 2529 = fork()
  179 22134924 [main] pacman 2458 wait4: calling proc_subproc, pid 2529, options 0
  215 22135139 [main] pacman 2458 proc_subproc: args: 5, -7728
  211 22135350 [main] pacman 2458 proc_subproc: wval->pid 2529, wval->options 0
  209 22135559 [main] pacman 2458 checkstate: child_procs count 1
  168 22135727 [main] pacman 2458 stopped_or_terminated: considering pid 2529, pgid 2458, w->pid 2529
  181 22135908 [main] pacman 2458 checkstate: no matching terminated children found
  194 22136102 [main] pacman 2458 checkstate: returning -1
  148 22136250 [main] pacman 2458 proc_subproc: only found non-terminated children
  202 22136452 [main] pacman 2458 proc_subproc: finished processing terminated/stopped child
  198 22136650 [main] pacman 2458 proc_subproc: returning 1
--- Process 10624 (pid: 2458) thread 8604 exited with status 0x0
--- Process 10624 (pid: 2458) thread 10372 exited with status 0x0
--- Process 10624 (pid: 2458) thread 10800 exited with status 0x0
111878466 134015116 [] pacman 2458 _cygtls::remove: wait 0
--- Process 10624 (pid: 2458) thread 5476 exited with status 0x0
  319 134015435 [] pacman 2458 _cygtls::remove: wait 0

For the record, the relevant processes at that time are, according to wmic:

$ wmic process where "commandline like '%pacm%'" get commandline,parentprocessid,processid
CommandLine                                                                                                                                                                            ParentProcessId  ProcessId  
C:\git-sdk-arm64-full\usr\bin\pacman.exe -Syyu --overwrite=* --noconfirm                                                                                                               2792             10672      
"C:\git-sdk-arm64-full\usr\bin\bash.exe" -lc "strace -fo /tmp/a1 pacman --verbose --debug -S --overwrite=\* --noconfirm mingw-w64-clang-aarch64-git-extra"                             5920             5660       
C:\git-sdk-arm64-full\usr\bin\strace.exe -fo C:/Users/self-runner-debug/AppData/Local/Temp/a1 pacman --verbose --debug -S --overwrite=* --noconfirm mingw-w64-clang-aarch64-git-extra  5660             10524      
pacman --verbose --debug -S --overwrite=* --noconfirm mingw-w64-clang-aarch64-git-extra                                                                                                10524            10624      
pacman --verbose --debug -S --overwrite=* --noconfirm mingw-w64-clang-aarch64-git-extra                                                                                                10624            10052      
C:\Windows\System32\Wbem\wmic.exe process where "commandline like '%pacm%'" get commandline,parentprocessid,processid                                                                  8476             8076       

According to ps they are:

$ ps
      PID    PPID    PGID     WINPID   TTY         UID    STIME COMMAND
     2458       1    2458      10624  cons0     197108 22:23:08 /usr/bin/pacman
     2529    2458    2458      10052  cons0     197108 22:23:31 /usr/bin/pacman
     1901    1900    1901      11156  pty1      197108 22:06:16 /usr/bin/bash
     1900       1    1900       2136  ?         197108 22:06:15 /usr/bin/mintty
     2587    1901    2587        568  pty1      197108 22:36:18 /usr/bin/ps
     2118       1    2064      10672  cons0     197108 22:10:44 /usr/bin/pacman
     1840    1839    1840      11040  pty0      197108 21:42:04 /usr/bin/bash
     1839       1    1839      10928  ?         197108 21:42:04 /usr/bin/mintty
     2441       1    2322      10524  cons0     197108 22:23:07 /usr/bin/strace

After wmic process where processid=10052 delete, the process continues (sadly, the SSH session seems to have gotten stuck due to a long time without activity, therefore I do not see any more output after that debug: validity: full; reason: Success line. The strace output continues, though, and ends with the somewhat concerning exit status 0xc000013a (which corresponds to STATUS_CONTROL_C_EXIT, even though there was no Ctrl+C nor a ConsoleCtrlEvent):

[...]
--- Process 10624 (pid: 2458) thread 4908 exited with status 0xc000013a
--- Process 10624 (pid: 2458) thread 6492 exited with status 0xc000013a
--- Process 10624 (pid: 2458) thread 10472 exited with status 0xc000013a
--- Process 10624 (pid: 2458) exited with status 0xc000013a

Here is another potentially interesting part of the strace output, starting with the line that is shown at the time of the hang (i.e. until that wmic process ... delete call):

[...]
  319 134015435 [] pacman 2458 _cygtls::remove: wait 0
--- Process 10624 (pid: 2458) thread 10512 exited with status 0x0
851693793 985709228 [waitproc] pacman 2458 pinfo::maybe_set_exit_code_from_windows: pid 2529, exit value - old 0x0, windows 0x0, MSYS 0x8000000
  303 985709531 [waitproc] pacman 2458 sig_send: sendsig 0x1B4, pid 2458, signal 20, its_me 1
  222 985709753 [waitproc] pacman 2458 sig_send: Not waiting for sigcomplete.  its_me 1 signal 20
  170 985709923 [sig] pacman 2458 sigpacket::process: signal 20 processing
  173 985710096 [waitproc] pacman 2458 sig_send: returning 0x0 from sending signal 20
  186 985710282 [sig] pacman 2458 init_cygheap::find_tls: sig 20
  216 985710498 [waitproc] pacman 2458 proc_waiter: exiting wait thread for pid 2529
  160 985710658 [sig] pacman 2458 sigpacket::process: using tls 0x7FFFFCE00
  270 985710928 [sig] pacman 2458 sigpacket::process: signal 20 default is currently ignore
  159 985711087 [sig] pacman 2458 sigpacket::process: returning 1
  182 985711269 [sig] pacman 2458 proc_subproc: args: 4, 0
  202 985711471 [sig] pacman 2458 proc_subproc: looking for processes to reap, count 1
  169 985711640 [sig] pacman 2458 checkstate: child_procs count 1
  161 985711801 [sig] pacman 2458 stopped_or_terminated: considering pid 2529, pgid 2458, w->pid 2529
  191 985711992 [sig] pacman 2458 remove_proc: removing chld_procs[0], pid 2529, child_procs count 1
  183 985712175 [main] pacman 2458 wait4: 0 = cygwait (...)
  167 985712342 [main] pacman 2458 wait4: 2529 = wait4(2529, 0x0, 0, 0x0)
  207 985712549 [sig] pacman 2458 checkstate: returning 1
  208 985712757 [sig] pacman 2458 proc_subproc: released waiting thread
  384 985713141 [sig] pacman 2458 proc_subproc: finished processing terminated/stopped child
  411 985713552 [sig] pacman 2458 proc_subproc: returning 1
  211 985713763 [main] pacman 2458 close: close(8)
  175 985713938 [main] pacman 2458 fhandler_pipe::release_select_sem: close(PIPEW) release 1
  209 985714147 [main] pacman 2458 fhandler_base::close: closing 'pipe:[365072230784]' handle 0x894
  189 985714336 [main] pacman 2458 close: 0 = close(8)
  545 985714881 [main] pacman 2458 close: close(10)
  192 985715073 [main] pacman 2458 fhandler_pipe::release_select_sem: close(PIPEW) release 1
  184 985715257 [main] pacman 2458 fhandler_base::close: closing 'pipe:[369367198080]' handle 0x84C
  193 985715450 [main] pacman 2458 close: 0 = close(10)
  504 985715954 [main] pacman 2458 close: close(12)
  190 985716144 [main] pacman 2458 fhandler_pipe::release_select_sem: close(PIPEW) release 1
  163 985716307 [main] pacman 2458 fhandler_base::close: closing 'pipe:[373662165376]' handle 0x850
  167 985716474 [main] pacman 2458 close: 0 = close(12)
 1685 985718159 [main] pacman 2458 pselect: pselect (12, 0x7FFFFC500, 0x7FFFFC4F0, 0x7FFFFC4E0, 0x7FFFFC490, 0x0)
  224 985718383 [main] pacman 2458 pselect: to->tv_sec 1, to->tv_nsec 0, us 1000000
  478 985718861 [main] pacman 2458 dtable::select_read: pipe:[365072230784] fd 7
  366 985719227 [main] pacman 2458 dtable::select_read: pipe:[369367198080] fd 9
  320 985719547 [main] pacman 2458 dtable::select_read: pipe:[373662165376] fd 11
  179 985719726 [main] pacman 2458 select: sel.always_ready 0
  506 985720232 [pipesel] pacman 2458 SetThreadName: SetThreadDescription() failed. 0000006D 10000000
  194 985720426 [main] pacman 2458 select_stuff::wait: m 4, us 1000000, wmfo_timeout -1
  166 985720592 [pipesel] pacman 2458 peek_pipe: read: pipe:[373662165376], n -1
  173 985720765 [pipesel] pacman 2458 peek_pipe: read: pipe:[369367198080], ready for read: avail 1
  202 985720967 [pipesel] pacman 2458 peek_pipe: read: pipe:[365072230784], ready for read: avail 1
  182 985721149 [main] pacman 2458 select_stuff::wait: wait_ret 2, m = 4.  verifying
  235 985721384 [main] pacman 2458 set_bits: me 0xA00041370, testing fd 11 (pipe:[373662165376])
[...]

This seems to happen with or without enable_pcon. And with the latest console-related fixes backported into Git for Windows' latest MSYS2 runtime revision.

mati865 commented 4 months ago

Binutils doesn't support AArch64 Windows yet but LLDB should work if you still need a debugger.

dscho commented 4 months ago

@mati865 thank you for the advice! In this instance, I am still working on the x86_64 MSYS2 runtime, even if it is running in an ARM64 VM: There is no aarch64 MSYS2 runtime as of time of writing.

rimrul commented 4 months ago

They might still have a point that LLDB could be helpfull. According to this blog post most of the System32 DLLs are probably ARM64X, which could explain the not in executable format: file format not recognized messages.

rimrul commented 4 months ago

They might still have a point that LLDB could be helpfull. According to this blog post most of the System32 DLLs are probably ARM64X, which could explain the not in executable format: file format not recognized messages.

And I just tested the linked script to check for CHPEv2 DLLs. After some slight modifications to work with lief 0.14.1 it reports all the DLLs that produced the not in executable format: file format not recognized message as being ARM64X.

rimrul commented 4 months ago

Looking at https://github.com/llvm/llvm-project/commit/48feef277a24b1b9c0ff33267a91e70d9584012e, LLDB probably wouldn't know what to do with these DLLs here either.

mati865 commented 4 months ago

Oh, sorry. I had forgot it's still x86_64 when I saw AArch64 libs in the output. @rimrul is right, it's possible that LLDB will work better but unlikely.

I don't know how the emulation works but if's close enough to real x86_64 gdb should also work if you somehow convince it to find x86_64 libs.

dscho commented 4 months ago

Hmm. Y'all got a good point. For shiggles, I ran a simple program in gdb to see the stacktrace. It worked. Then I switched to the "sig" thread, and it gave me the same bogus stacktrace. But now the kicker: switching back to the original thread, it also gave me a bogus stacktrace! So I think debugging with gdb is simply not going to work.

Sadly, WinDbg seems to have problems attaching, and Visual Studio (while it can attach) produces stacktraces that look quite dubious, highly likely mixing up function symbols within msys-2.0.dll. So I guess it'll be back to good ole' printf()-style debugging.

mati865 commented 4 months ago

Visual Studio doesn't really work with DWARF debug information. Supposedly, PDB support in binutils was fixed last year, so you would need to rebuild the library and add --pdb=some/path.pdb or -Wl,--pdb=some/path.pdb. However, this might be quite a bit of work for a dubious benefit.

jeremyd2019 commented 4 months ago

I hope you have better luck than I have trying to track this down. My latest attempt was to backport a change that didn't attempt to even run the FAST_CWD stuff on ARM64, because of a recent report that it managed to happen to get confused and make a "false positive" match on some ARM code (msys2/msys2-runtime@4e77fa9b8bf4). After that, I tried to reproduce the issue on both i686 (on Windows 10 ARM64) and x86_64 (on Windows 11 ARM64) msys2. I am seeing hangs on i686, but they make no sense... once I saw it hang in CloseHandle, and last couple of times it seems to be hanging in TerminateThread. Both are related to pinfo apparently during process teardown however, maybe that is a clue.

What I've had to do to get some sort of reasonable stack traces: attach with windbgx from the store (it says there's a new version available that's not in preview, but I haven't tried that yet). It times out on initial breakin. Then I hit the "break" toolbar button, and do .dump /ma c:\pacman.dmp. I then use \msys2\clangarm64\bin\lldb --core pacman.dmp and rely on lldb's ability to read DWARF symbols. Attempting to actually attach directly with gdb or lldb results either inability to break in (gdb) or no usable stack traces (lldb).

I did just see that I reproduced the hang on x86_64 on windows 11 arm64, but windbgx is giving me the old GetContextState failed, 0x8007001f error so no new info there.

jeremyd2019 commented 4 months ago

In case it helps your investigation, here's what I'm currently seeing on i686:

(lldb) bt
* thread #1, stop reason = Exception 0x80000007 encountered at address 0x000000
  * frame #0: 0x00010002
    frame #1: 0x75a4d3b4 KERNELBASE.dll
    frame #2: 0x76ef3cd4 kernel32.dll
    frame #3: 0x610064a3 msys-2.0.dll`cygthread::terminate_thread(this=0x611fd330)
    frame #4: 0x610f402f msys-2.0.dll`proc_terminate()
    frame #5: 0x610c909b msys-2.0.dll`_ZN5pinfo4exitEm@8(this=0x6121e0a0, n=0)
    frame #6: 0x61009c2e msys-2.0.dll`_Z7do_exiti@4(status=0)
    frame #7: 0x61009d25 msys-2.0.dll`_exit(n=0)
    frame #8: 0x61156d4d msys-2.0.dll`__sigfe
(lldb) frame 4
invalid command 'frame 4'.
(lldb) up
frame #1: 0x75a4d3b4 KERNELBASE.dll
->  0x75a4d3b4: pushal
    0x75a4d3b5: addb   %bh, %al
    0x75a4d3b7: aaa
    0x75a4d3b8: andb   %al, (%eax)
(lldb)
frame #2: 0x76ef3cd4 kernel32.dll
->  0x76ef3cd4: sti
    0x76ef3cd5: addl   (%eax), %eax
    0x76ef3cd7: subb   -0x2f6efffd(%edi), %bh
    0x76ef3cdd: addb   %al, (%eax)
(lldb)
frame #3: 0x610064a3 msys-2.0.dll`cygthread::terminate_thread(this=0x611fd330)
(lldb) p *this
error: msys-2.0.dll [0x00000000005aed62]: DIE has DW_AT_ranges(DW_FORM_sec_offset 0x00000000000075be) attribute, but range extraction failed (invalid range list offset 0x75be), please file a bug and attach the file at the start of this error message
(ygthread) {
  inuse = 1
  id = 5964
  h = 0x000007a8
error: msys-2.0.dll [0x00000000005aed62]: DIE has DW_AT_ranges(DW_FORM_sec_offset 0x00000000000075be) attribute, but range extraction failed (invalid range list offset 0x75be), please file a bug and attach the file at the start of this error message
(lldb)   ev = 0x00000004
  thread_sync = 0x000000c4
  stack_ptr = 0x0acccca0
  __name = 0x6124bde3 "waitproc"
  func = 0x610cc5b3 (msys-2.0.dll`::proc_waiter(void *) at 1196:1)
  arglen = 0
  arg = 0x6121e390
  is_freerange = false
  notify_detached = 0x00000000
}
(lldb)
error: msys-2.0.dll [0x00000000005aed62]: DIE has DW_AT_ranges(DW_FORM_sec_offset 0x00000000000075be) attribute, but range extraction failed (invalid range list offset 0x75be), please file a bug and attach the file at the start of this error message
(lldb) (cygthread) {
  inuse = 1
  id = 5964
  h = 0x000007a8
  ev = 0x00000004
  thread_sync = 0x000000c4
  stack_ptr = 0x0acccca0
  __name = 0x6124bde3 "waitproc"
  func = 0x610cc5b3 (msys-2.0.dll`::proc_waiter(void *) at 1196:1)
  arglen = 0
  arg = 0x6121e390
  is_freerange = false
  notify_detached = 0x00000000
}
(lldb) up
frame #4: 0x610f402f msys-2.0.dll`proc_terminate()
(lldb) p chld_procs
(child_procs) {
  _count = 1
  _procs = {
   ...
  }
  _procs_2 = nullptr
}
(lldb) p *(pinfo*)chld_procs._procs
(pinfo) {
  pinfo_minimal = (h = 0x000007ac, hProcess = 0x000007bc, rd_proc_pipe = 0x000007a4)
  destroy = false
  winpid_hdl = 0x00000000
  procinfo = 0x00100000
  waiter_ready = false
  wait_thread = 0x611fd330
}
(lldb) p *((pinfo*)chld_procs._procs)->wait_thread
error: msys-2.0.dll [0x00000000005aed62]: DIE has DW_AT_ranges(DW_FORM_sec_offset 0x00000000000075be) attribute, but range extraction failed (invalid range list offset 0x75be), please file a bug and attach the file at the start of this error message
(lldb) (cygthread) {
  inuse = 1
  id = 5964
  h = 0x000007a8
  ev = 0x00000004
  thread_sync = 0x000000c4
  stack_ptr = 0x0acccca0
  __name = 0x6124bde3 "waitproc"
  func = 0x610cc5b3 (msys-2.0.dll`::proc_waiter(void *) at 1196:1)
  arglen = 0
  arg = 0x6121e390
  is_freerange = false
  notify_detached = 0x00000000
}

Where it disappears into kernel32/kernelbase is calling TerminateThread. Examining the stack in windbg shows it passing 0x7a8, which matches the handle in the cygthread. Examining the handle with !handle in windbg shows a valid handle for a thread, but the thread in question is not showing up in the debugger so I assume it is already dead. The handles in pinfo_minimal also appear to be valid and of the correct type, again using !handle.

I was just able to reproduce the first hang I saw, which was in pinfo::release calling CloseHandle. The handle in question, according to !handle, is an event (which doesn't make sense actually), but all of the handle members of the pinfo this are 0/NULL. It is also a member of the chld_procs array. based on the pointer index 1. At the time of making the dump file, the chld_procs _count is 0 also.

jeremyd2019 commented 4 months ago

Previously, looking at the code, with constructs like

  HANDLE close_h;
  if (h)
    {
      close_h = h;
      h = NULL;
      ForceCloseHandle1 (close_h, pinfo_shared_handle);
    }

I was thinking maybe there are expected to be multiple threads doing this, in which case maybe they want something like InterlockedExchangePointer. But I don't know how the TerminateThread one would play into things. Maybe another thread was closing the handle at the same time? !handle doesn't seem to indicate anything funny about the thread handle, but it might be looking at old data.

jeremyd2019 commented 4 months ago

Seen the CloseHandle variant a few more times, one time !handle said it could not duplicate the handle, another it was a WaitCompletionPacket. I have to think the handle is garbage, but the question is if it was garbage at the time or if other thread(s) changed things around in the meantime while I was waiting to notice it was hung.

jeremyd2019 commented 3 months ago

I tried InterlockedExchangePointer in pinfo::release (and pinfo::_pinfo_release) but still got the same hang calling CloseHandle.

The backtrace from lldb does not go very far, but from what I can tell trying to symbolicate the backtrace from windbg (which has some extra garbage in it) this also occurs from do_exit -> pinfo::exit -> proc_terminate, calling pinfo::release instead of cygthread::terminate_thread, which are both called from the same loop over chld_procs.

dscho commented 3 months ago

I don't know... I tried that CloseHandle() route, by doing this:

diff --git a/winsup/cygwin/local_includes/debug.h b/winsup/cygwin/local_includes/debug.h
index 858f8d5dcd..59e581dafb 100644
--- a/winsup/cygwin/local_includes/debug.h
+++ b/winsup/cygwin/local_includes/debug.h
@@ -9,7 +9,7 @@ details. */

 #define being_debugged() (IsDebuggerPresent ())

-#ifndef DEBUGGING
+#if 1 // #ifndef DEBUGGING
 # define cygbench(s)
 # define ForceCloseHandle CloseHandle
 # define ForceCloseHandle1(h, n) CloseHandle (h)

That forces all the CloseHandle(), ForceCloseHandle1(), and other similar calls, to go through close_handle() which sets the HANDLE_FLAG_PROTECT_FROM_CLOSE flag, i.e. it prevents the handles from being closed.

Then I ran my reproducer. It worked once! So I was excited. But the next attempt hung in the very same way, twice. I got it unstuck both times by running these PowerShell statements:

$p = (Get-CimInstance Win32_Process |`
  Where-Object {$_.CommandLine -like '*pacman*'} |`
  Select-Object CommandLine, ParentProcessId, ProcessId |`
  Group-Object CommandLine |`
  Where-Object {$_.Count -gt 1} |`
  Select-Object -ExpandProperty Group)
if ($p.Length -eq 2 -and $p[0].processId -eq $p[1].parentProcessId) {
  Stop-Process -Force -Id $p[1].processId
}

@jeremyd2019 I wonder whether you know what this child process is that is hanging, where it is spawned and what its purpose is?

jeremyd2019 commented 3 months ago

I don't know what the child process is for. Perhaps it is the fork-ed process before it calls exec? (for me, I reproduce this in the path where pacman is using gpgme to validate signatures. it sounds like you reproduce this with non-cygwin processes involved)

The CloseHandle area was one of two places I saw a hang - the other was in TerminateThread, trying to terminate the wait_thread from proc_terminate. Perhaps that is where you were seeing additional hangs. I wish some debugger would show what is going on in the x86_64 process. I was surprised I saw anything trying to debug the i686 version, hope it wasn't a red herring.