Boris-Barboris commented 4 years ago

System information

Type	Version/Name
Distribution Name	Arch Linux
Distribution Version	rolling
Linux Kernel	Linux 5.2.7-arch1-1-ARCH #3 SMP, PREEMPT_VOLUNTARY
Architecture	x86-64
ZFS Version	0.8.2
SPL Version

Describe the problem you're observing

I am observing unfair sync write scheduling and severe userspace process io starvation in certain situation. It appears to be that fsync call on a file with a lot of unwritten dirty data will stall the system and cause FIFO-like sync write order, where no other processes get their share untill the dirty data is flushed. On my home system this causes severe stalls when the guest VM with cache=writeback virtio-scsi disk decides to sync the scsi barrier while having a lot of dirty data in the hypervisor's RAM. All other hypervisor writers block completely and userspace starts chimping out with various timeouts and locks. It effectively acts as a DoS.

Describe how to reproduce the problem

1). Prepare a reasonably-default dataset.

zfs get all rootpool1/arch1/varcache
NAME                      PROPERTY              VALUE                  SOURCE
rootpool1/arch1/varcache  type                  filesystem             -
rootpool1/arch1/varcache  creation              Thu Aug 15 23:37 2019  -
rootpool1/arch1/varcache  used                  13.4G                  -
rootpool1/arch1/varcache  available             122G                   -
rootpool1/arch1/varcache  referenced            13.4G                  -
rootpool1/arch1/varcache  compressratio         1.08x                  -
rootpool1/arch1/varcache  mounted               yes                    -
rootpool1/arch1/varcache  quota                 none                   default
rootpool1/arch1/varcache  reservation           none                   default
rootpool1/arch1/varcache  recordsize            128K                   default
rootpool1/arch1/varcache  mountpoint            /var/cache             local
rootpool1/arch1/varcache  sharenfs              off                    default
rootpool1/arch1/varcache  checksum              on                     default
rootpool1/arch1/varcache  compression           lz4                    inherited from rootpool1
rootpool1/arch1/varcache  atime                 off                    inherited from rootpool1
rootpool1/arch1/varcache  devices               on                     default
rootpool1/arch1/varcache  exec                  on                     default
rootpool1/arch1/varcache  setuid                on                     default
rootpool1/arch1/varcache  readonly              off                    default
rootpool1/arch1/varcache  zoned                 off                    default
rootpool1/arch1/varcache  snapdir               hidden                 default
rootpool1/arch1/varcache  aclinherit            restricted             default
rootpool1/arch1/varcache  createtxg             920                    -
rootpool1/arch1/varcache  canmount              on                     default
rootpool1/arch1/varcache  xattr                 sa                     inherited from rootpool1
rootpool1/arch1/varcache  copies                1                      default
rootpool1/arch1/varcache  version               5                      -
rootpool1/arch1/varcache  utf8only              off                    -
rootpool1/arch1/varcache  normalization         none                   -
rootpool1/arch1/varcache  casesensitivity       sensitive              -
rootpool1/arch1/varcache  vscan                 off                    default
rootpool1/arch1/varcache  nbmand                off                    default
rootpool1/arch1/varcache  sharesmb              off                    default
rootpool1/arch1/varcache  refquota              none                   default
rootpool1/arch1/varcache  refreservation        none                   default
rootpool1/arch1/varcache  guid                  394889745699357232     -
rootpool1/arch1/varcache  primarycache          all                    default
rootpool1/arch1/varcache  secondarycache        all                    default
rootpool1/arch1/varcache  usedbysnapshots       0B                     -
rootpool1/arch1/varcache  usedbydataset         13.4G                  -
rootpool1/arch1/varcache  usedbychildren        0B                     -
rootpool1/arch1/varcache  usedbyrefreservation  0B                     -
rootpool1/arch1/varcache  logbias               latency                default
rootpool1/arch1/varcache  objsetid              660                    -
rootpool1/arch1/varcache  dedup                 off                    default
rootpool1/arch1/varcache  mlslabel              none                   default
rootpool1/arch1/varcache  sync                  standard               default
rootpool1/arch1/varcache  dnodesize             legacy                 default
rootpool1/arch1/varcache  refcompressratio      1.08x                  -
rootpool1/arch1/varcache  written               13.4G                  -
rootpool1/arch1/varcache  logicalused           14.5G                  -
rootpool1/arch1/varcache  logicalreferenced     14.5G                  -
rootpool1/arch1/varcache  volmode               default                default
rootpool1/arch1/varcache  filesystem_limit      none                   default
rootpool1/arch1/varcache  snapshot_limit        none                   default
rootpool1/arch1/varcache  filesystem_count      none                   default
rootpool1/arch1/varcache  snapshot_count        none                   default
rootpool1/arch1/varcache  snapdev               hidden                 default
rootpool1/arch1/varcache  acltype               off                    default
rootpool1/arch1/varcache  context               none                   default
rootpool1/arch1/varcache  fscontext             none                   default
rootpool1/arch1/varcache  defcontext            none                   default
rootpool1/arch1/varcache  rootcontext           none                   default
rootpool1/arch1/varcache  relatime              off                    default
rootpool1/arch1/varcache  redundant_metadata    all                    default
rootpool1/arch1/varcache  overlay               on                     local
rootpool1/arch1/varcache  encryption            off                    default
rootpool1/arch1/varcache  keylocation           none                   default
rootpool1/arch1/varcache  keyformat             none                   default
rootpool1/arch1/varcache  pbkdf2iters           0                      default
rootpool1/arch1/varcache  special_small_blocks  0                      default

2). Prepare 2 terminal tabs, cd to this dataset mount point. In them prepare the following fio commands: "big-write"

fio --name=big-write --ioengine=sync --rw=write --bs=32k --direct=1 --size=2G --numjobs=1 --end_fsync=1

and "small-write"

fio --name=small-write --ioengine=sync --rw=write --bs=128k --direct=1 --size=128k --numjobs=1 --end_fsync=1

3). Let them run once to prepare the necessary benchmark files. In the meantime observe the iostat on the pool:

zpool iostat -qv rootpool1 0.1
                                                         capacity     operations     bandwidth    syncq_read    syncq_write   asyncq_read  asyncq_write   scrubq_read   trimq_write
pool                                                    alloc   free   read  write   read  write   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
------------------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
rootpool1                                                273G   171G      0  7.15K      0   598M      0      0      0      0      0      0    213      8      0      0      0      0
  mirror                                                 273G   171G      0  7.15K      0   597M      0      0      0      0      0      0    213      8      0      0      0      0
    ata-PNY_CS900_480GB_SSD_PNY111900057103045C0-part2      -      -      0  3.57K      0   300M      0      0      0      0      0      0    105      4      0      0      0      0
    ata-Patriot_Burst_9128079B175300025792-part2            -      -      0  3.56K      0   297M      0      0      0      0      0      0    108      4      0      0      0      0
------------------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
...
                                                         capacity     operations     bandwidth    syncq_read    syncq_write   asyncq_read  asyncq_write   scrubq_read   trimq_write
pool                                                    alloc   free   read  write   read  write   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
------------------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
rootpool1                                                274G   170G      0  7.19K      0   565M      0      0      0      7      0      0      0      0      0      0      0      0
  mirror                                                 274G   170G      0  7.18K      0   565M      0      0      0      7      0      0      0      0      0      0      0      0
    ata-PNY_CS900_480GB_SSD_PNY111900057103045C0-part2      -      -      0  3.59K      0   282M      0      0      0      3      0      0      0      0      0      0      0      0
    ata-Patriot_Burst_9128079B175300025792-part2            -      -      0  3.59K      0   282M      0      0      0      4      0      0      0      0      0      0      0      0
------------------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----

Note that when fio issues 2G of async writes it calls fsync at the very end, wich moves them from async to sync class.

4). When fios are finished, do the following: start "big-write" and then after 2-3 seconds (when "Jobs: 1" appears) start the "small-write". Note that the small 128K write will never finish before the 2G one, and the second fio remains blocked until the first one finishes.

Include any warning/errors/backtraces from the system logs

Boris-Barboris commented 4 years ago

Lowering zfs_dirty_data_max significantly (to 100-200M values from default 3G) mitigates the problem for me, but with 50% performance drop.

Boris-Barboris commented 4 years ago

After some code investigation the problem appears to be too deeply ingrained in the write path. There are multiple problems that cause this:

Writer task_id is effectively lost on VFS layer (vnops). Apart from zio priority, there is no QoS-related data tracking for dmu itxses and zios. While current()'s task_id can be added to these structures, there is no data to track any type of QoS at this moment.
DMU throttle that implements the zfs_dirty_data_max logic and bounds the total size of dirty dbufs of a transaction group is unfair. Assuming very busy system with one full txg already syncing, single process can eat up all ram and will be as throttled as the interactive processes that arrive later, after single ddoser's write flow. Transaction allocation is effectively as fair as cpu scheduler, relying on thread priorities. Even worse, unless their files are open with O_SYNC flag, latency-sentive applications that rely on multiple writes + fsync still go through DMU throttle and receive a severe penalty to their write speed.
ZIL is serialized per-dataset. I repeated the experiment from the OP on two separate datasets and the second small fio indeed managed to complete before the first big one in some cases. The speed of the second fio was still abyssimal, as nothing in the SPA/ZIO guarantees fairness to zils. Such FIFO ordering is incompatible with any form of bandwidth sharing, and this is the original cause of my problems: ZIL processes 2G of first fio and only then 128K of small fio.
SPA allocation throttle and vdev schedulers are oblivious to the source of zios they handle. They are not fair to processes nor to any other entities, and only look at graph ordering of zios and their priority class.

I am afraid my workaround is currently the only viable option for acceptable latency under overwhelming fsync load. ZFS is nor designed nor built to be bandwidth-fair to consumer entities.

stale[bot] commented 3 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

Boris-Barboris commented 3 years ago

It's a design issue, so I guess a fix probability is effectively zero. Still, a desirable feature in both desktop and multi-tenant headless environments. Let's hear out developers on the subject of complexity and then close/wontfix it.

gdevenyi commented 3 years ago

Probably related to #11140

devZer0 commented 2 years ago

apparently, we had this in the past and maybe i was wrong that it was resolved?

https://github.com/openzfs/zfs/issues/4603

anyhow, what about https://github.com/openzfs/zfs/pull/11929#issuecomment-829710720 and https://github.com/openzfs/zfs/pull/11912 ?

Boris-Barboris commented 2 years ago

apparently, we had this in the past and maybe i was wrong that it was resolved?

4603

anyhow, what about #11929 (comment) and #11912 ?

I can still reproduce it on 0.8.4:

main big write:
WRITE: bw=62.0MiB/s (65.0MB/s), 62.0MiB/s-62.0MiB/s (65.0MB/s-65.0MB/s), io=2048MiB (2147MB), run=33055-33055msec
starved small write:
WRITE: bw=5384B/s (5384B/s), 5384B/s-5384B/s (5384B/s-5384B/s), io=128KiB (131kB), run=24344-24344msec

dd example from #4603 seems incorrect, I did not see any non-zero numbers in syncq_write column of zpool iostat while running it.

The solutions/comments you linked revolve around queue depth limitation and reducing the latency at the cost of bandwidth reduction. This will not replace a theoretical writer-aware fair scheduler, that could fix the latency without making a write queue universally-shallow.

devZer0 commented 2 years ago

@stale stale bot closed this on 25 Nov 2020

4603 was open for 4 years with no activity and then closed by stale bot. not the best way to handle issues, and this is a real one.

our whole proxmox (with local zfs storage) migration from xenserver stalls because of this for weeks now and all our effort up to now may go poof if we don't get this resolved.

but besides the fsync stall there may be other stalling issues in kvm. i put this bugreport for reference, as it is at least related: https://bugzilla.kernel.org/show_bug.cgi?id=199727

i also think this is a significant one, rendering kvm on top of zfs really unusable when you have fsync centric workloads.

thanks for reporting the details about it and for your analysis !

Boris-Barboris commented 2 years ago

our whole proxmox (with local zfs storage) migration from xenserver stalls because of this for weeks now and all our effort up to now may go poof if we don't get this resolved.

Yeah, if you're multi-tenant (many VMs) you'll have better luck with boring qcow2s on ext/xfs+raid.

devZer0 commented 2 years ago

no option. i want zfs snapshots and replication and i care about my data.

recklessnl commented 2 years ago

@behlendorf wondering what your thoughts on this issue are. I'm on Proxmox as well and have occasionally noticed the same thing as @Boris-Barboris and @devZer0 have.

devZer0 commented 2 years ago

this is a real bummer.

this following output delays happen for example by simply copying a file inside a virtual machine

you can clearly see that sync io from ioping is getting completely starved inside the VM. I guess i have never seen a single IO need 5.35min for completion.

[root@gitlab backups]# ioping -WWWYy ioping.dat 4 KiB >>> ioping.dat (xfs /dev/dm-3): request=43 time=87.7 ms 4 KiB >>> ioping.dat (xfs /dev/dm-3): request=44 time=65.9 ms 4 KiB >>> ioping.dat (xfs /dev/dm-3): request=45 time=24.9 ms (fast) 4 KiB >>> ioping.dat (xfs /dev/dm-3): request=46 time=25.4 ms (fast) 4 KiB >>> ioping.dat (xfs /dev/dm-3): request=47 time=42.9 ms 4 KiB >>> ioping.dat (xfs /dev/dm-3): request=48 time=16.3 s (slow) 4 KiB >>> ioping.dat (xfs /dev/dm-3): request=49 time=5.35 min (slow) 4 KiB >>> ioping.dat (xfs /dev/dm-3): request=50 time=99.5 ms (fast) 4 KiB >>> ioping.dat (xfs /dev/dm-3): request=51 time=311.5 ms (fast) 4 KiB >>> ioping.dat (xfs /dev/dm-3): request=52 time=322.6 ms (fast) 4 KiB >>> ioping.dat (xfs /dev/dm-3): request=53 time=49.9 ms (fast)

this long starvation causing the following kernel message :

[87720.075195] INFO: task xfsaild/dm-3:871 blocked for more than 120 seconds. [87720.075320] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [87720.075384] xfsaild/dm-3 D ffff8c48f43605e0 0 871 2 0x00000000 [87720.075387] Call Trace: [87720.075405] [] schedule+0x29/0x70 [87720.075458] [] xfs_log_force+0x157/0x2e0 [xfs] [87720.075468] [] ? wake_up_state+0x20/0x20 [87720.075483] [] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs] [87720.075494] [] xfsaild+0x190/0x780 [xfs] [87720.075504] [] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs] [87720.075507] [] kthread+0xd1/0xe0 [87720.075509] [] ? insert_kthread_work+0x40/0x40 [87720.075513] [] ret_from_fork_nospec_begin+0x21/0x21 [87720.075515] [] ? insert_kthread_work+0x40/0x40

i'm not completely sure if this is a zfs problem alone, as with "zpool iostat -w hddpool 1" i would expect to see the outstanding IO from ioping (which hangs for minuts) in syncq_wait queue, but in 137s row, there is no single IO shown. is there a way to make this visible on ZFS layer ?

hddpool latency ---------- 1ns 3ns 7ns 15ns 31ns 63ns 127ns 255ns 511ns 1us 2us 4us 8us 16us 32us 65us 131us 262us 524us 1ms 2ms 4ms 8ms 16ms 33ms 67ms 134ms 268ms 536ms 1s 2s 4s 8s 17s 34s 68s 137s ----------------- total_wait disk_wait syncq_wait asyncq_wait read write read write read write read write scrub trim ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 32 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 5 0 0 0 6 0 0 0 11 0 11 0 0 0 5 0 0 0 17 0 17 0 0 0 5 0 0 0 5 0 8 0 0 0 0 0 0 0 6 0 3 0 0 0 0 0 0 0 2 0 4 0 0 0 0 0 0 0 8 0 6 0 0 0 3 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 2 0 0 0 5 0 18 0 0 0 6 0 0 0 18 0 39 0 0 0 3 0 0 0 5 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 32 0 0 0 32 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ---------------------------------------------------------------

devZer0 commented 2 years ago

i know this can be mitigated to some point with adding a SLOG, but we have a ssd and a hdd mirror or raidz on each hypervisor server and adding another enterprise ssd just to make the hdd's run without a problem feels a little bit ugly, as you could better switch to "ssd only" system then.

KernelBypass commented 1 year ago

Any updates? Are there plans to resolve this in the future versions?

ShadowJonathan commented 1 year ago

Why hasn't this been escalated as a serious issue? Performance before features imo

RubenKelevra commented 1 year ago

So it's kinda a design limitation: Normally a filesystem is offering a mount and is accessing a disk. The fairness here is provided by the IO-scheduler which is attributing the individual requests to the processes issuing these.

ZFS however isn't working that way. The actual IO to the disks are issued by ZFS processes. The scheduler therefore cannot "see" which application is behind individual IO.

In addition ZFS has its own scheduler built-in and thus an IO scheduler below isn't considered helpful, as the IO gets optimized for low latency by making sure they get issued in an order which completes individual requests as fast as possible.

The scheduler is also sorting the requests to complete synchronous reads first and synchronous writes with second priority, followed by asynchronous reads and then asynchronous writes.

These priorities are not super strict however: The amount of concurrent queues for each type of the described IO classes are tuned up and down, based on the outstanding requests. The tuneable you modified is adjusting the limit of how many writes can be cached.

Lowering this value increases the amount of writing threads earlier, as the thresholds are percentages. In addition ZFS starts to throttle the incoming IO by applications, by introducing a sleep time if this cache gets fuller.

So there are a couple of things you could try to lower the impact of issues you're seeing:

Reduce parallel IO jobs per vdev

check if your disks can keep up with the amount of concurrency:

Open atop as root with atop 1.
Have a look on the header for your disk(s) running the pool
Check how large the latency for the IO spikes on them (last value on the right)

If it's often above say 15ms (on SSDs)/50ms (HDDs) the disk has trouble keeping up with the amount of concurrent IO.

The tuneable zfs_vdev_max_active allows you to lower the maximum amount of concurrent read/write jobs.

Earlier throttling and adjusting the delay introduced for throttling

Instead of lowering the maximum amount of "dirty" data for async writes its better IMHO to adjust on what percentage ZFS starts throttling the writes accepted by applications.

The threshold can be configured with zfs_dirty_data_sync_percent.

In addition zfs_delay_scale needs to be adjusted. It should be set to 1 billion divided by the IOPS possible on your pool (says the docs).

It's probably best to adjust that to a mix of random and sequential IO tested on one disk. Depending on your pool layout you need to multiply that:

If your pool contains multiple mirrors or zraids, you need to multiply the single disk IO with the amount of mirror- or raidz-groups.
If you use multiple disks in a "plain" configuration, just extending the pool over multiple disks, multiply it with the amount of disks.

@ShadowJonathan wrote:

Why hasn't this been escalated as a serious issue? Performance before features imo

It has, there's a feature request open to implement the missing feature: Balancing IO between processes and creating IONice levels as well, so background IO can be marked as such.

See #14151

devZer0 commented 6 months ago

hello, does this new feature of sync parallism help adressing this problem ?

https://www.phoronix.com/news/OpenZFS-Sync-Parallelism

openzfs / zfs

Lack of fairness of sync writes #10110

System information

Describe the problem you're observing

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

4603

4603 was open for 4 years with no activity and then closed by stale bot. not the best way to handle issues, and this is a real one.

Reduce parallel IO jobs per vdev

Earlier throttling and adjusting the delay introduced for throttling