openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.56k stars 1.74k forks source link

io_uring worker stuck and unkillable writing to a ZFS file using a fixed buffer #16133

Open talex5 opened 6 months ago

talex5 commented 6 months ago

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 22.04.4 LTS
Kernel Version Linux ubuntu 5.15.0-102-generic
Architecture x86_64
OpenZFS Version zfs-2.1.5-1ubuntu6~22.04.4

Describe the problem you're observing

Writing two bytes to a file in series using io_uring with a fixed buffer never completes. The uring worker goes into an infinite loop and the process cannot be killed with kill -9.

Describe how to reproduce the problem

Here is a simple test-case: https://github.com/ocaml-multicore/ocaml-uring/issues/113

Running this causes the problem for me every time, and two other people saw the problem with an earlier more complicated test (https://github.com/ocaml-multicore/eio/pull/715#issuecomment-2043925492 and https://github.com/ocaml-multicore/eio/pull/715#issuecomment-2066311366).

Another way to reproduce it is to run the tests for the eio_main package with a ZFS home directory:

$ sudo apt install opam
$ opam init --empty
$ opam switch create 5.1.1
$ opam install -t eio_main
[...]
∗ installed eio_linux.1.0
[ hangs ]

Include any warning/errors/backtraces from the system logs

The logs don't show anything initially, but later a warning about a stuck process appears:

Apr 25 13:57:01 ubuntu kernel: INFO: task main.exe:972 blocked for more than 120 seconds.
Apr 25 13:57:01 ubuntu kernel:       Tainted: P           O      5.15.0-102-generic #112-Ubuntu
Apr 25 13:57:01 ubuntu kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 25 13:57:01 ubuntu kernel: task:main.exe        state:D stack:    0 pid:  972 ppid:   927 flags:0x00004006
Apr 25 13:57:01 ubuntu kernel: Call Trace:
Apr 25 13:57:01 ubuntu kernel:  <TASK>
Apr 25 13:57:01 ubuntu kernel:  __schedule+0x24e/0x590
Apr 25 13:57:01 ubuntu kernel:  ? update_curr+0x6f/0x1e0
Apr 25 13:57:01 ubuntu kernel:  schedule+0x69/0x110
Apr 25 13:57:01 ubuntu kernel:  schedule_timeout+0x105/0x140
Apr 25 13:57:01 ubuntu kernel:  __wait_for_common+0xae/0x150
Apr 25 13:57:01 ubuntu kernel:  ? usleep_range_state+0x90/0x90
Apr 25 13:57:01 ubuntu kernel:  wait_for_completion+0x24/0x30
Apr 25 13:57:01 ubuntu kernel:  io_wq_put_and_exit+0xee/0x2f0
Apr 25 13:57:01 ubuntu kernel:  io_uring_cancel_generic.cold+0x119/0x1d8
Apr 25 13:57:01 ubuntu kernel:  ? wait_woken+0x70/0x70
Apr 25 13:57:01 ubuntu kernel:  __io_uring_cancel+0x14/0x20
Apr 25 13:57:01 ubuntu kernel:  do_exit+0xa2/0x3c0
Apr 25 13:57:01 ubuntu kernel:  do_group_exit+0x3b/0xb0
Apr 25 13:57:01 ubuntu kernel:  get_signal+0x150/0x900
Apr 25 13:57:01 ubuntu kernel:  arch_do_signal_or_restart+0xde/0x100
Apr 25 13:57:01 ubuntu kernel:  ? fput+0x13/0x20
Apr 25 13:57:01 ubuntu kernel:  ? __do_sys_io_uring_enter+0x10d/0x540
Apr 25 13:57:01 ubuntu kernel:  ? fput+0x13/0x20
Apr 25 13:57:01 ubuntu kernel:  exit_to_user_mode_loop+0xc4/0x160
Apr 25 13:57:01 ubuntu kernel:  exit_to_user_mode_prepare+0xa0/0xb0
Apr 25 13:57:01 ubuntu kernel:  syscall_exit_to_user_mode+0x27/0x50
Apr 25 13:57:01 ubuntu kernel:  ? __x64_sys_io_uring_enter+0x22/0x30
Apr 25 13:57:01 ubuntu kernel:  do_syscall_64+0x69/0xc0
Apr 25 13:57:01 ubuntu kernel:  ? __x64_sys_io_uring_enter+0x22/0x30
Apr 25 13:57:01 ubuntu kernel:  ? do_syscall_64+0x69/0xc0
Apr 25 13:57:01 ubuntu kernel:  ? __rseq_handle_notify_resume+0x2d/0xc0
Apr 25 13:57:01 ubuntu kernel:  ? exit_to_user_mode_loop+0x10d/0x160
Apr 25 13:57:01 ubuntu kernel:  ? exit_to_user_mode_prepare+0x96/0xb0
Apr 25 13:57:01 ubuntu kernel:  ? syscall_exit_to_user_mode+0x35/0x50
Apr 25 13:57:01 ubuntu kernel:  ? __x64_sys_io_uring_enter+0x22/0x30
Apr 25 13:57:01 ubuntu kernel:  ? do_syscall_64+0x69/0xc0
Apr 25 13:57:01 ubuntu kernel:  ? irqentry_exit_to_user_mode+0x17/0x20
Apr 25 13:57:01 ubuntu kernel:  ? irqentry_exit+0x1d/0x30
Apr 25 13:57:01 ubuntu kernel:  ? exc_page_fault+0x89/0x170
Apr 25 13:57:01 ubuntu kernel:  entry_SYSCALL_64_after_hwframe+0x62/0xcc
Apr 25 13:57:01 ubuntu kernel: RIP: 0033:0x560a5bbfa030
Apr 25 13:57:01 ubuntu kernel: RSP: 002b:00007fff3e2c0198 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
Apr 25 13:57:01 ubuntu kernel: RAX: fffffffffffffffc RBX: 0000000000000000 RCX: 0000560a5bbfa030
Apr 25 13:57:01 ubuntu kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000003
Apr 25 13:57:01 ubuntu kernel: RBP: 00007fff3e2c01c0 R08: 0000000000000000 R09: 0000000000000008
Apr 25 13:57:01 ubuntu kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 0000560a5d2971a0
Apr 25 13:57:01 ubuntu kernel: R13: 00007fff3e2c01d0 R14: 0000000000000000 R15: 0000000000000000
Apr 25 13:57:01 ubuntu kernel:  </TASK>

pidstat -t 1 shows:

10:40:58      UID      TGID       TID    %usr %system  %guest   %wait    %CPU   CPU  Command
10:40:59     1000      1027         -    0.00   99.02    0.00    0.00   99.02     0  main.exe
10:40:59     1000         -      1048    0.00   98.04    0.00    0.98   98.04     0  |__iou-wrk-1027

perf record -g shows:

   - zpl_iter_write                                                                                                  ▒
      - 98.63% zfs_write                                                                                             ▒
         + 27.21% dmu_tx_assign                                                                                      ▒
         + 26.45% dmu_tx_commit                                                                                      ▒
         + 19.34% dmu_write_uio_dbuf                                                                                 ▒
         + 11.55% dmu_tx_hold_write_by_dnode                                                                         ▒
         + 5.20% dmu_tx_create                                                                                       ▒
         + 3.83% dmu_tx_hold_sa                                                                                      ▒
           0.82% zfs_clear_setid_bits_if_necessary   
sempervictus commented 6 months ago

io_uring has been heavily revised since 5.15 - not exactly the most stable/secure codebase in the Linux kernel:

$ git diff v5.15.143..v6.8.7 io_uring/|wc -l
30094

which is roughly on par with all the changes seen to ipv4:

$ git diff v5.15.143..v6.8.7 net/ipv4/|wc -l
30558

for something that's supposed to provide fairly straightforward functionality and interface. Is this really a ZFS bug or is that version of uring doing something it shouldn't?

rincebrain commented 6 months ago

If the behavior is broken only on ZFS, then I wouldn't bet against ZFS doing something wrong.

That said, it's also on 2.1.5 plus whatever Ubuntu patched in this week, I'd suggest trying with 2.1.15 or 2.2.3 as a data point, lest we end up redoing a fix that's already done.

talex5 commented 6 months ago

@avsm reported seeing the same problem with Linux 6.8: https://github.com/ocaml-multicore/eio/pull/715#issuecomment-2066311366

If you can't reproduce it, I'll try with the new Ubuntu 24.04 next week.

talex5 commented 5 months ago

I upgraded to 23.10:

user@ubuntu:~$ zfs version
zfs-2.2.0-0ubuntu1~23.10.3
zfs-kmod-2.2.0-0ubuntu1~23.10.2

It still fails in the same way:

   - zpl_iter_write                                                                                               ▒
      - 98.98% zfs_write                                                                                          ▒
         + 29.49% dmu_tx_assign                                                                                   ▒
         + 21.62% dmu_tx_commit                                                                                   ▒
         + 18.81% dmu_write_uio_dbuf                                                                              ▒
         + 13.33% dmu_tx_hold_write_by_dnode                                                                      ▒
         + 5.92% dmu_tx_create                                                                                    ▒
         + 4.88% dmu_tx_hold_sa                                                                                   ▒
           0.54% zfs_clear_setid_bits_if_necessary

And then to 24.04, which also fails:

user@ubuntu:~$ uname -a
Linux ubuntu 6.8.0-31-generic #31-Ubuntu SMP PREEMPT_DYNAMIC Sat Apr 20 00:40:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

user@ubuntu:~$ zfs version
zfs-2.2.2-0ubuntu9
zfs-kmod-2.2.2-0ubuntu9
      - 98.21% zfs_write                                                                                          ▒
         + 27.80% dmu_tx_assign                                                                                   ▒
         + 20.84% dmu_write_uio_dbuf                                                                              ▒
         + 20.79% dmu_tx_commit                                                                                   ▒
         + 13.45% dmu_tx_hold_write_by_dnode                                                                      ▒
         + 5.63% dmu_tx_create                                                                                    ▒
         + 4.43% dmu_tx_hold_sa                                                                                   ▒
           0.71% zfs_clear_setid_bits_if_necessary