After some more investigation, I've made a few changes to the initial issue. It definitely looks related to ZFS since:
Could try #12284 and see if it makes life any better.
Could also try twiddling WBT off - there have been issues with huge amounts of IO contention eventually deadlocking with that, and I don't know whether the fix for that made it into stable.
@rincebrain thanks!
We're not using an SLOG on these machines, so it doesn't appear that #12284 would apply in our case. We'll try to turn off write-back throttling to see if this fixes it.
We've turned off WBT (with echo 0 > /sys/block/sdb/queue/wbt_lat_usec) and we can see that the issue still occurs.
I've finally got a consistent repro: repro.working.zip
It requires changing the kernel's dirty page writeback frequency to make it much more reproducible:
sudo sysctl -w "vm.dirty_expire_centisecs=1"
sudo sysctl -w "vm.dirty_background_bytes=1"
sudo sysctl -w "vm.dirty_writeback_centisecs=1"
It appears that the delay occurs when both an application and the kernel are writing back the same dirty page. The relevant part of the code is: https://github.com/openzfs/zfs/blob/269b5dadcfd1d5732cf763dddcd46009a332eae4/module/os/linux/zfs/zfs_vnops_os.c#L3528
~/repro
[2021-11-18 10:31:31] slow msync: 2303.270000 ms
[2021-11-18 10:31:47] slow msync: 4050.328000 ms
[2021-11-18 10:31:57] slow msync: 2838.579000 ms
[2021-11-18 10:32:02] slow msync: 4380.869000 ms
[2021-11-18 10:32:22] slow msync: 4681.770000 ms
[2021-11-18 10:32:28] slow msync: 2799.192000 ms
[2021-11-18 10:32:33] slow msync: 3692.178000 ms
[2021-11-18 10:32:38] slow msync: 5001.400000 ms
[2021-11-18 10:32:43] slow msync: 3201.233000 ms
...
The page stays stuck in the writeback state until the transaction group containing that write closes. As a workaround, the ZFS transaction group timeout can be reduced, but it will cause more frequent writes to stable storage (set to 1 second below):
echo 1 > /sys/module/zfs/parameters/zfs_txg_timeout
System information
Describe the problem you're observing
When running EventStoreDB (https://github.com/EventStore/EventStore) on a Google Cloud VM, we occasionally see writes taking several seconds in the logs (it happens 1-2 times per day with a constant write load):
In the above case, it took ~1.5 seconds to complete a write but sometimes it can be up to ~15 seconds.
After doing an strace, I've come to the conclusion that msync is the culprit. We have a small memory-mapped file (8 bytes large) which is updated and flushed very regularly (every ~2 milliseconds).
We don't seem to see the same problem on other cloud platforms (AWS, Azure) although they should be running in a similar environment. EDIT: The issue also seems to happen on AWS/Azure.
After doing an ftrace on msync (__x64_sys_msync), I've got the following output which shows that the delay is incurred in wait_on_page_writeback:
I've done an additional ftrace, but this time only on wait_on_page_bit(), and I've got this output:
It looks like the kernel is waiting for ZFS to complete the page write to stable storage.
Describe how to reproduce the problem
SLOW QUEUE messages appear after 1 day in the EventStoreDB logs.
I'm attempting a minimal repro with a C application which repeatedly flushes an 8-byte memory-mapped file, and I've seen the issue happen once, so it confirms that the issue is not with EventStoreDB (I'm trying to get the repro to work consistently).
Include any warning/errors/backtraces from the system logs
Nothing in the syslogs. The machine is quite powerful as well and has plenty of free memory.