openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

parallel msync deadlock #12702

Open shaan1337 opened 2 years ago

shaan1337 commented 2 years ago

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Ubuntu |
| Distribution Version | 20.04 |
| Kernel Version | 5.11.0-1021-gcp (Google Cloud) |
| Architecture | x86_64 |
| OpenZFS Version | zfs-0.8.3-1ubuntu12.12, zfs-kmod-2.0.2-1ubuntu5.1 |

| Type | Version/Name |
| --- | --- |
| Distribution Name | Ubuntu |
| Distribution Version | 20.04 |
| Kernel Version | 5.4.0-1045-aws (AWS EC2) |
| Architecture | x86_64 |
| OpenZFS Version | zfs-2.1.1-1, zfs-kmod-2.1.1-1 |

Describe the problem you're observing

While attempting to reproduce https://github.com/openzfs/zfs/issues/12662, I came across a reproducible deadlock triggered by calling msync in parallel on the same file.

Although the stack trace looks similar to #12662, I'm not sure whether it's the same issue or whether the two are related at all. Both wait for a page to leave the writeback state, but in #12662 the wait is only temporary. In this case it appears to be a true deadlock, and the system needs to be rebooted.

Describe how to reproduce the problem

repro.zip

$ gcc repro.c -o repro -lpthread
$ ./repro

You may need to run the application a few times for the issue to occur. If it's not happening, you can also replace `for(int i=0;i<10;i++){` with `for(;;){` and it should occur more predictably.
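For readers without access to repro.zip, here is a minimal sketch of the kind of workload described above: several threads dirty a shared mapping of one file and each calls msync on it. This is not the attached repro.c. The function name `parallel_msync`, the mapping size, the assumed 4 KiB page stride, and the thread cap are all hypothetical choices made for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define MAP_SIZE    (1 << 20) /* 1 MiB mapping; size is an assumption */
#define PAGE_STRIDE 4096      /* assumed page size */
#define MAX_THREADS 64        /* arbitrary cap for this sketch */

struct worker_arg {
    char *map;      /* shared mapping of the file */
    int id;         /* worker index; selects which byte of each page to dirty */
    int iterations; /* number of write + msync rounds */
    int failed;     /* set to 1 if msync reports an error */
};

static void *worker(void *p)
{
    struct worker_arg *a = p;
    for (int i = 0; i < a->iterations; i++) {
        /* Dirty one byte in every page so each page needs writeback. */
        for (size_t off = (size_t)a->id; off < MAP_SIZE; off += PAGE_STRIDE)
            a->map[off] = (char)i;
        /* All workers sync the same range of the same file concurrently. */
        if (msync(a->map, MAP_SIZE, MS_SYNC) != 0) {
            perror("msync");
            a->failed = 1;
            break;
        }
    }
    return NULL;
}

/* Returns 0 if every msync call succeeded, -1 on any error. */
int parallel_msync(const char *path, int nthreads, int iterations)
{
    /* Each worker writes a distinct byte offset within a page. */
    assert(MAX_THREADS <= PAGE_STRIDE);
    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;

    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return -1; }
    if (ftruncate(fd, MAP_SIZE) != 0) { perror("ftruncate"); close(fd); return -1; }

    char *map = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); close(fd); return -1; }

    pthread_t tids[MAX_THREADS];
    struct worker_arg args[MAX_THREADS];
    int started = 0, rc = 0;
    for (; started < nthreads; started++) {
        args[started] = (struct worker_arg){ map, started, iterations, 0 };
        if (pthread_create(&tids[started], NULL, worker, &args[started]) != 0) {
            rc = -1;
            break;
        }
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        if (args[i].failed)
            rc = -1;
    }
    munmap(map, MAP_SIZE);
    close(fd);
    return rc;
}
```

A hypothetical driver would call something like `parallel_msync("writer.chk", 4, 10)` from `main` with the file placed on the ZFS dataset under test, and the program would be built with `gcc -pthread`. On an unaffected filesystem the call should simply return; whether this sketch triggers the hang reported here has not been verified.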

Include any warning/errors/backtraces from the system logs

The application hangs and, after a few minutes, the following appears in the dmesg output: dmesg.log

An attempt to read the file writer.chk hangs as well.

shaan1337 commented 2 years ago

The issue also happens with the latest ZFS version (zfs-2.1.1-1/zfs-kmod-2.1.1-1).

shaan1337 commented 2 years ago

I've turned off writeback throttling (WBT) on the disks, as suggested by @rincebrain in #12662, and the deadlock still occurs.

stale[bot] commented 1 year ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

behlendorf commented 1 year ago

Reopening until we can verify this has been resolved with the provided reproducer.