Open zenaan opened 4 years ago
See also:
- Feature Request - Adopt OracleZFS's merged Data-Metadata ARC model #3946
- ZoL not really willing to evict caches from RAM when needed #6740
- arc_reclaim pegs CPU and blocks I/O #7559
- Low zfs list performance #8898
- Observed arc evictions to arc_c_min on large memory system #9548
- zfs unnecessary reads to cache? 0.8.2-1, 4.18.0-80.el8.x86_64 SMP mod_unload modversions #9557
- 100% CPU load from arc_prune #9966
- Reproducible system crashes when half of installed ram is assigned to hugepagesz #10048
Sounds like a maintenance nightmare. I look forward to your PR.
The ARC is IMHO the best feature and most important differentiator of OpenZFS from other storage systems. I agree that double caching is bad, but I'd much rather have the ability to prevent OpenZFS content from being cached by the Linux page cache. That would make more sense to me: the page cache has abysmal performance in most multitenant environments I have ever seen, whereas OpenZFS always saves the day and lets those workloads achieve >90% hit rates >90% of the time (honestly, it's more like >98% hit rates >98% of the time, even better). Something the page cache can only dream of :)

I'm just not sure this is possible with current Linux - AFAIK it isn't - so Linux is the place where it makes the most sense to go and do something about this issue.
I think it would be helpful if this issue could be broken down into smaller components which can actually be addressed instead of an all-or-nothing request to get rid of ARC entirely.
Do we have any performance data confirming that double caching is a problem outside of mmap? I have never seen the page cache using huge amounts of memory on my personal ZFS systems. Some data:
- ZFS-on-root desktop: 11.4G ARC, 896M page cache
- ZFS data, xfs root server: 3.8G ARC, 4.5G page cache
At first glance, it looks like the page cache is mostly being filled by non-ZFS filesystems rather than ARC contents. Avoiding extra copies is great, but this issue doesn't make clear whether the double caching is widespread or specific to mmap.
There is also a separate problem brought up here: the ARC's responsiveness to memory pressure in other parts of the system. From the freebsd issue, it sounds like this could be solved if the kernel provided a means to signal that the ARC should be evicted before attempting to page out to swap, e.g. some way to register: "I have x GB of low-priority memory; when memory pressure increases, tell me to drop some." @snajpa is probably correct; this needs to be implemented in Linux first. It may also already exist but be hidden behind a GPL-only export.
Edit: The freebsd issue also mentions per-vdev write buffering, which would be very nice. A less granular per-pool buffer might be more appropriate, though. SSD and HDD pools have very different performance characteristics, and NVMe only widens the gap.
On Linux, ZFS already implements the "shrinker" API to be notified of memory pressure indicating that caches need to shrink. (I suppose it's possible there are issues with that API's usage in the Linux kernel, or with ZFS's implementation of it.)
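For readers unfamiliar with that interface, the Linux shrinker mechanism is essentially the "register low-priority memory, get told to drop some" callback described above. A kernel-side sketch of how a cache owner hooks into it looks roughly like this (the `my_cache_*` helpers are hypothetical; also note that around kernel 6.0 `register_shrinker()` gained a name argument, so the exact signature varies by kernel version):

```c
#include <linux/mm.h>
#include <linux/shrinker.h>

/* Tell the VM how many objects this cache could free right now. */
static unsigned long my_cache_count(struct shrinker *sh,
                                    struct shrink_control *sc)
{
        return my_cache_reclaimable_objects();   /* hypothetical helper */
}

/* The VM is under pressure: free up to sc->nr_to_scan objects and
   report how many were actually freed. */
static unsigned long my_cache_scan(struct shrinker *sh,
                                   struct shrink_control *sc)
{
        return my_cache_evict(sc->nr_to_scan);   /* hypothetical helper */
}

static struct shrinker my_cache_shrinker = {
        .count_objects = my_cache_count,
        .scan_objects  = my_cache_scan,
        .seeks         = DEFAULT_SEEKS,
};

/* At module init (pre-6.0 kernels; newer ones take a name argument): */
/* register_shrinker(&my_cache_shrinker); */
```

The open question in this thread is not whether such a hook exists - it does, and ZFS uses it - but whether the VM invokes it early and aggressively enough relative to paging out to swap.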
As this is approaching 2 years old, is it (still?) considered to be a feasible ask? (N.B. that the cited work is dated, and was done on Solaris, not Linux or FreeBSD.)
https://github.com/openzfs/zfs/issues/10516#issuecomment-651621619
See also:
@zenaan, a hint: if you make bullet points for the linked items, and use the URLs alone, then GitHub might automatically show each item's title and status.
The ARC and the Linux page cache duplicate one another with respect to cached data.
The ARC has various memory management issues associated with it, due to being a bolt-on rather than integrated with Linux's page cache.
The ARC also has performance and tuning issues because of the above.
OpenZFS is 'bolted on' to a number of kernels - illumos, BSD, Linux, etc.
A first step toward rationalizing the ZFS ARC issues may be to make it pluggable/optional.
This would require at least some thought and cleanup of the "ARC API", if we can call it that.
Once the existing API is pluggable/optional, it should provide a "code environment" in which experiments are much easier:
- performance comparisons, ARC vs no ARC
- API experiments and evolution
- custom per-OS and/or per-deployment "ARC module"s
- etc.
Other than this, the requirement that OpenZFS remain cross-OS compatible appears too burdensome to inspire the effort needed to improve it, e.g. see here:
http://freebsd.1045724.x6.nabble.com/ZFS-ARC-and-mmap-page-cache-coherency-question-td6110500.html
(See usual URLs such as https://pthree.org/2012/12/07/zfs-administration-part-iv-the-adjustable-replacement-cache/)