madler / pigz

A parallel implementation of gzip for modern multi-processor, multi-core machines.
http://zlib.net/pigz/

pigz hanging waiting on lock #116

Open AZaugg opened 9 months ago

AZaugg commented 9 months ago

I am seeing an issue where pigz gets stuck. I have cpio piping data to pigz:

root@m [ /proc/2018279/fd ]# ls -l
total 0
lr-x------ 1 root root 64 Feb  6 07:05 0 -> 'pipe:[881736186]'
l-wx------ 1 root root 64 Feb  6 07:05 1 -> /var/tmp/dracut.IsQD2b/initramfs.img
l-wx------ 1 root root 64 Feb  6 07:05 2 -> /dev/null
root@m [ /proc/2018279/fd ]# cd /proc/2018278/fd
root@m [ /proc/2018278/fd ]# ls -l
total 0
lr-x------ 1 root root 64 Feb  6 07:05 0 -> 'pipe:[881736185]'
l-wx------ 1 root root 64 Feb  6 07:05 1 -> 'pipe:[881736186]'
l-wx------ 1 root root 64 Feb  6 07:05 2 -> /dev/null
lr-x------ 1 root root 64 Feb  6 07:05 3 -> /var/tmp/dracut.IsQD2b/initramfs/usr/bin/less

On the pigz side I can see:

root@m [ /proc/2018279 ]# strace -p 2018279 -f
strace: Process 2018279 attached with 19 threads
[pid 2018297] futex(0x59452f6ce6e0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 2018296] futex(0x59452f6ce6e0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 2018295] futex(0x59452f6ce6e0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 2018294] futex(0x59452f6ce6e0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 2018293] futex(0x59452f6ce6e0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 2018292] futex(0x59452f6ce6e0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 2018291] futex(0x59452f6ce6e0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 2018290] futex(0x59452f6ce6e0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 2018289] futex(0x59452f6ce6e0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 2018288] futex(0x59452f6ce6e0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 2018287] futex(0x59452f6ce6e0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 2018286] futex(0x59452f6ce6e0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 2018285] futex(0x59452f6ce6e0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 2018284] futex(0x59452f6ce6e0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 2018283] futex(0x59452f6ce6e0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 2018282] futex(0x59452f6ce6e0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 2018281] futex(0x59452f6ce6e0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 2018280] futex(0x5945315b1114, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2018279] futex(0x59452f6ce6e0, FUTEX_WAIT_PRIVATE, 2, NULL

It is stuck on a lock. Looking at the kernel stack:

root@m[ /proc/2018279 ]# cat stack
[<0>] futex_wait_queue_me+0xa2/0x100
[<0>] futex_wait+0x105/0x250
[<0>] do_futex+0x1a2/0xaf0
[<0>] __x64_sys_futex+0x78/0x1e0
[<0>] do_syscall_64+0x5c/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x67/0xd1

Has anyone seen pigz get stuck like this?

madler commented 9 months ago

What operating system?

AZaugg commented 9 months ago

Azure Linux, kernel 5.15.125.1-2, glibc-2.35-6, pigz-2.6-2

I should add that, looking at the core, I see:

(gdb) info threads
  Id   Target Id                           Frame
* 1    Thread 0x73b25af1a700 (LWP 2814192) 0x000073b25afa605f in __lll_lock_wait () from /lib/libc.so.6
  2    Thread 0x73b25af19640 (LWP 2814193) 0x000073b25afa5e8a in __futex_abstimed_wait_common () from /lib/libc.so.6
  3    Thread 0x73b25a6d6640 (LWP 2814194) 0x000073b25afa605f in __lll_lock_wait () from /lib/libc.so.6
  4    Thread 0x73b259eaa640 (LWP 2814195) 0x000073b25afa605f in __lll_lock_wait () from /lib/libc.so.6
  5    Thread 0x73b25965d640 (LWP 2814196) 0x000073b25afa605f in __lll_lock_wait () from /lib/libc.so.6
  6    Thread 0x73b258e31640 (LWP 2814197) 0x000073b25afa605f in __lll_lock_wait () from /lib/libc.so.6
  7    Thread 0x73b243fff640 (LWP 2814198) 0x000073b25afa605f in __lll_lock_wait () from /lib/libc.so.6
  8    Thread 0x73b2437fe640 (LWP 2814199) 0x000073b25afa605f in __lll_lock_wait () from /lib/libc.so.6
  9    Thread 0x73b242ffd640 (LWP 2814200) 0x000073b25afa605f in __lll_lock_wait () from /lib/libc.so.6
  10   Thread 0x73b2427fc640 (LWP 2814201) 0x000073b25afa605f in __lll_lock_wait () from /lib/libc.so.6
  11   Thread 0x73b241ffb640 (LWP 2814202) 0x000073b25afa605f in __lll_lock_wait () from /lib/libc.so.6
  12   Thread 0x73b2417fa640 (LWP 2814203) 0x000073b25afa605f in __lll_lock_wait () from /lib/libc.so.6
  13   Thread 0x73b240ff9640 (LWP 2814204) 0x000073b25afa605f in __lll_lock_wait () from /lib/libc.so.6
  14   Thread 0x73b22bfff640 (LWP 2814205) 0x000073b25afa605f in __lll_lock_wait () from /lib/libc.so.6
  15   Thread 0x73b22b7fe640 (LWP 2814206) 0x000073b25afa605f in __lll_lock_wait () from /lib/libc.so.6
  16   Thread 0x73b22affd640 (LWP 2814207) 0x000073b25afa605f in __lll_lock_wait () from /lib/libc.so.6
  17   Thread 0x73b22a7fc640 (LWP 2814208) 0x000073b25afa605f in __lll_lock_wait () from /lib/libc.so.6
  18   Thread 0x73b229ffb640 (LWP 2814209) 0x000073b25afa605f in __lll_lock_wait () from /lib/libc.so.6

madler commented 9 months ago

I have not seen this exactly before, but there have been two reports on SuSE systems of a hang due to a pthread bug in that system, which is why I asked about your OS.

There is this report of a pthread bug in glibc that could impact pigz due to its use of condition waits. If you look at those messages, the most recent one, from just last month, asks whether a fix to glibc has been made. It sounds like it has not.

Your problem may be related to that, or it may be something else. These sorts of reports are very rare, so it is difficult to conclude anything.

It seems that pthread is a difficult thing to write correctly.
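
For readers unfamiliar with the condition waits mentioned above, here is a minimal sketch of the usual pthread pattern. It is not pigz's code: the `gate` structure and the `waiter` and `run_gate_demo` functions are hypothetical names made up for illustration. Waiters sleep in `pthread_cond_wait` until a predicate becomes true, and the signalling thread changes the predicate under the mutex before broadcasting. The predicate loop protects against spurious wakeups, but if the library fails to deliver a wakeup at all, a correctly written waiter like this one would still sleep forever, which is the kind of hang a pthread bug could cause.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical shared state (not taken from pigz): a flag guarded by a mutex. */
struct gate {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             open;    /* predicate the waiters test */
    int             passed;  /* number of waiters that got through */
};

static void *waiter(void *arg)
{
    struct gate *g = arg;

    pthread_mutex_lock(&g->lock);
    /* Re-check the predicate in a loop. This handles spurious wakeups,
       but not a wakeup that is never delivered. */
    while (!g->open)
        pthread_cond_wait(&g->cond, &g->lock);
    g->passed++;
    pthread_mutex_unlock(&g->lock);
    return NULL;
}

/* Start n waiters, open the gate, and return how many waiters got through.
   Returns -1 if n is out of range. */
int run_gate_demo(int n)
{
    struct gate g;
    pthread_t tid[64];
    int i, ret;

    if (n < 0 || n > 64)
        return -1;

    ret = pthread_mutex_init(&g.lock, NULL);
    assert(ret == 0);
    ret = pthread_cond_init(&g.cond, NULL);
    assert(ret == 0);
    g.open = 0;
    g.passed = 0;

    for (i = 0; i < n; i++) {
        ret = pthread_create(&tid[i], NULL, waiter, &g);
        assert(ret == 0);
    }

    pthread_mutex_lock(&g.lock);
    g.open = 1;                       /* change the predicate under the lock */
    pthread_cond_broadcast(&g.cond);  /* then wake every waiter */
    pthread_mutex_unlock(&g.lock);

    for (i = 0; i < n; i++) {
        ret = pthread_join(tid[i], NULL);
        assert(ret == 0);
    }

    pthread_cond_destroy(&g.cond);
    pthread_mutex_destroy(&g.lock);
    return g.passed;
}
```

Calling `run_gate_demo(4)` from a program built with `-pthread` should return once all four waiters have passed the gate. On a correct pthread implementation it never hangs, because the predicate is only changed while the mutex is held.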

rtissera commented 7 months ago

I can report the same issue on Debian 11, kernel 6.6.13 (backports), Beelink SER5 Pro (Ryzen 7 5800H, 32 GB RAM, NVMe SSD).