Open talex5 opened 6 months ago
io_uring has been heavily revised since 5.15 - not exactly the most stable/secure codebase in the Linux kernel:
$ git diff v5.15.143..v6.8.7 io_uring/|wc -l
30094
which is roughly on par with all the changes seen to ipv4:
$ git diff v5.15.143..v6.8.7 net/ipv4/|wc -l
30558
for something that's supposed to provide fairly straightforward functionality and interface. Is this really a ZFS bug or is that version of uring doing something it shouldn't?
If the behavior is broken only on ZFS, then I wouldn't bet against ZFS doing something wrong.
That said, it's also on 2.1.5 plus whatever Ubuntu patched in this week. I'd suggest trying with 2.1.15 or 2.2.3 as a data point, lest we end up redoing a fix that's already done.
@avsm reported seeing the same problem with Linux 6.8: https://github.com/ocaml-multicore/eio/pull/715#issuecomment-2066311366
If you can't reproduce it, I'll try with the new Ubuntu 24.04 next week.
I upgraded to 23.10:
user@ubuntu:~$ zfs version
zfs-2.2.0-0ubuntu1~23.10.3
zfs-kmod-2.2.0-0ubuntu1~23.10.2
It still fails in the same way:
- zpl_iter_write
- 98.98% zfs_write
+ 29.49% dmu_tx_assign
+ 21.62% dmu_tx_commit
+ 18.81% dmu_write_uio_dbuf
+ 13.33% dmu_tx_hold_write_by_dnode
+ 5.92% dmu_tx_create
+ 4.88% dmu_tx_hold_sa
0.54% zfs_clear_setid_bits_if_necessary
And then to 24.04, which also fails:
user@ubuntu:~$ uname -a
Linux ubuntu 6.8.0-31-generic #31-Ubuntu SMP PREEMPT_DYNAMIC Sat Apr 20 00:40:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
user@ubuntu:~$ zfs version
zfs-2.2.2-0ubuntu9
zfs-kmod-2.2.2-0ubuntu9
- 98.21% zfs_write
+ 27.80% dmu_tx_assign
+ 20.84% dmu_write_uio_dbuf
+ 20.79% dmu_tx_commit
+ 13.45% dmu_tx_hold_write_by_dnode
+ 5.63% dmu_tx_create
+ 4.43% dmu_tx_hold_sa
0.71% zfs_clear_setid_bits_if_necessary
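The report describes a uring worker spinning in the kernel, so it can help to see which thread of the stuck process is actually consuming CPU. The following is a minimal sketch, assuming a Linux system with procfs mounted at /proc; it reads the documented `tid (comm) state ...` layout of `/proc/<pid>/task/<tid>/stat`. The function name `thread_states` is hypothetical, and the expectation that io_uring workers show up as threads with names starting `iou-wrk` is an assumption about recent kernels, not something confirmed in this thread.

```python
import os
import sys


def thread_states(pid: int) -> list[tuple[int, str, str, int]]:
    """Return (tid, name, state, cpu_ticks) for every thread of `pid`.

    Hypothetical helper: reads /proc/<pid>/task/<tid>/stat (Linux only).
    """
    rows = []
    task_dir = f"/proc/{pid}/task"
    for tid in sorted(int(t) for t in os.listdir(task_dir)):
        try:
            with open(f"{task_dir}/{tid}/stat", errors="replace") as f:
                stat = f.read()
        except (FileNotFoundError, ProcessLookupError):
            continue  # the thread exited while we were scanning
        # The thread name is wrapped in parentheses and may itself contain
        # spaces or ')', so split on the first '(' and the last ')'.
        left, right = stat.index("("), stat.rindex(")")
        name = stat[left + 1:right]
        fields = stat[right + 2:].split()
        state = fields[0]  # e.g. R (running), S (sleeping), D (disk sleep)
        # utime and stime are fields 14 and 15 of the full line, which are
        # indexes 11 and 12 once the first two fields are stripped.
        cpu_ticks = int(fields[11]) + int(fields[12])
        rows.append((tid, name, state, cpu_ticks))
    return rows


if __name__ == "__main__":
    # Assumption: the PID of the stuck process is passed as the first argument;
    # with no argument the script inspects itself.
    target = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    for tid, name, state, ticks in thread_states(target):
        print(f"{tid:>8} {state} {ticks:>10} {name}")
```

Running it twice a few seconds apart and comparing `cpu_ticks` shows which thread is accumulating CPU time; a thread that stays in state `R` with a steadily rising count would be consistent with the infinite loop described in the report.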
System information
Describe the problem you're observing
Writing two bytes to a file in series using io_uring with a fixed buffer never completes. The uring worker goes into an infinite loop and the process cannot be killed with kill -9.
Describe how to reproduce the problem
Here is a simple test-case: https://github.com/ocaml-multicore/ocaml-uring/issues/113
Running this causes the problem for me every time, and two other people saw the problem with an earlier more complicated test (https://github.com/ocaml-multicore/eio/pull/715#issuecomment-2043925492 and https://github.com/ocaml-multicore/eio/pull/715#issuecomment-2066311366).
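As a data point on whether the write pattern itself is a problem for the filesystem, the same pattern can be tried without io_uring. The following is a minimal sketch, not a reproduction of the bug: it uses plain `os.pwrite` calls instead of io_uring fixed-buffer writes, and it assumes that "two bytes in series" means two single-byte writes at consecutive offsets. The function name `write_bytes_in_series`, the `TEST_DIR` environment variable and the file name are hypothetical.

```python
import os
import tempfile


def write_bytes_in_series(path: str, data: bytes = b"ab") -> bytes:
    """Write each byte of `data` with its own pwrite call, then read it back."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for offset in range(len(data)):
            # One single-byte synchronous write per call, at the next offset.
            written = os.pwrite(fd, data[offset:offset + 1], offset)
            if written != 1:
                raise OSError(f"short write at offset {offset}")
        return os.pread(fd, len(data), 0)
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Assumption: TEST_DIR points at a directory on the ZFS dataset under test;
    # without it the script falls back to the system temporary directory.
    target_dir = os.environ.get("TEST_DIR", tempfile.gettempdir())
    path = os.path.join(target_dir, "two-bytes.data")
    print(write_bytes_in_series(path))
    os.remove(path)
```

If this baseline completes on the affected dataset while the io_uring test-case hangs, that narrows the problem to the io_uring write path rather than to single-byte sequential writes in general.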
Another way to reproduce it is to run the tests for the eio_main package with a ZFS home directory:
Include any warning/errors/backtraces from the system logs
The logs don't show anything initially, but later a warning about a stuck process appears:
pidstat -t 1 shows:
perf record -g shows: