openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

plugable ARC #10516

Open zenaan opened 4 years ago

zenaan commented 4 years ago

The ARC and the Linux page cache duplicate one another with respect to the data they cache.

The ARC has various memory-management issues associated with it, because it is bolted on rather than integrated with Linux's page cache.

ARC also has performance and tuning issues because of the above.

OpenZFS is 'bolted on' to a number of kernels: illumos, BSD, Linux, etc.

A first step toward rationalizing the ZFS ARC issues may be to make it pluggable/optional.

This would require at least some thought and cleanup of the "ARC" API, if we can call it that.

Once the existing API is pluggable/optional, this should provide a "code environment" in which experiments are much easier; a rough sketch of what such an interface could look like follows.
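
Purely as illustration (none of these names exist in OpenZFS today; this only sketches the shape of the proposed indirection, not an actual design):

```c
/*
 * Hypothetical sketch of a pluggable cache interface.  All names here are
 * invented for illustration; they are not existing OpenZFS symbols.
 */
#include <stddef.h>
#include <stdint.h>

typedef struct zfs_cache_ops {
    int    (*cache_init)(void **handle, size_t max_bytes);
    void   (*cache_fini)(void *handle);

    /* Look up a cached block by key; return 0 on a hit, nonzero on a miss. */
    int    (*cache_read)(void *handle, uint64_t key, void *buf, size_t len);

    /* Insert (or refresh) a block after it has been read from disk. */
    int    (*cache_insert)(void *handle, uint64_t key,
                           const void *buf, size_t len);

    /* Called under memory pressure: free up to `bytes`, return bytes freed. */
    size_t (*cache_evict)(void *handle, size_t bytes);
} zfs_cache_ops_t;

/*
 * A build- or load-time switch could then select the classic ARC, a thin
 * pass-through that leans on the kernel page cache, or no caching at all.
 */
extern const zfs_cache_ops_t zfs_arc_ops;        /* today's ARC behavior  */
extern const zfs_cache_ops_t zfs_pagecache_ops;  /* hypothetical          */
extern const zfs_cache_ops_t zfs_nocache_ops;    /* reads always hit disk */
```

With the ARC reduced to one implementation behind an interface like this, alternative caching strategies could be swapped in and compared without touching the rest of the code.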

Beyond this, the requirement for OpenZFS to remain compatible across operating systems appears too burdensome to inspire the effort to improve the ARC; see, for example:

http://freebsd.1045724.x6.nabble.com/ZFS-ARC-and-mmap-page-cache-coherency-question-td6110500.html

Lionel Cons wrote on Jul 06, 2016, in "Re: ZFS ARC and mmap/page cache coherency question":

So what Oracle did (based on work done by SUN for Opensolaris) was to:

  1. Modify ZFS to prevent ANY double/multi caching [this is considered a design defect]
  2. Introduce a new VM subsystem which scales a lot better and provides hooks for [1] so there are never two or more copies of the same data in the system

Given that this was a huge, paid, multiyear effort its not likely going to happen that the design defects in opensource ZFS will ever go away.

Lionel

(See the usual references, such as https://pthree.org/2012/12/07/zfs-administration-part-iv-the-adjustable-replacement-cache/.)

zenaan commented 4 years ago

See also:

Feature Request - Adopt OracleZFS's merged Data-Metadata ARC model #3946

ZoL not really willing to evict caches from RAM when needed #6740

arc_reclaim pegs CPU and blocks I/O #7559

Low zfs list performance #8898

Observed arc evictions to arc_c_min on large memory system #9548

zfs unnecessary reads to cache? 0.8.2-1, 4.18.0-80.el8.x86_64 SMP mod_unload modversions #9557

100% CPU load from arc_prune #9966

Reproducible system crashes when half of installed ram is assigned to hugepagesz #10048

arc_prune threads end up consuming 50% or more CPU and leads to reduced I/O write performance on a SSD pool. #10222

ARC not accounted as MemAvailable in /proc/meminfo #10255

High unaccounted memory usage #10302

bghira commented 3 years ago

Sounds like a maintenance nightmare. I look forward to your PR.

snajpa commented 3 years ago

The ARC is IMHO the best feature and the most important differentiator of OpenZFS from other storage systems. I agree that double caching is bad, but I'd much rather have the ability to prevent OpenZFS content from being cached in the Linux page cache at all. That would make more sense to me, as the page cache has abysmal performance in most multitenant environments I have ever seen, whereas OpenZFS saves the day and lets those workloads achieve >90% hit rates >90% of the time (well, honestly, it's more like >98% hit rates >98% of the time, even better). Something the page cache can only dream of :)

I'm just not sure this is possible with current Linux - AFAIK it isn't - so Linux is where it makes the most sense to go and do something about this issue.

IsaacVaughn commented 3 years ago

I think it would be helpful if this issue could be broken down into smaller components which can actually be addressed instead of an all-or-nothing request to get rid of ARC entirely.

Do we have any performance data confirming that double caching is a problem outside of mmap? I have never seen the page cache using huge amounts of memory on my personal ZFS systems. Some data:

- Desktop, ZFS on root: 11.4G ARC, 896M page cache
- Server, ZFS data pool with xfs root: 3.8G ARC, 4.5G page cache

At first glance, it looks like the page cache is mostly being filled by non-ZFS filesystems rather than by ARC contents. Avoiding extra copies is great, but this issue doesn't make it clear whether the problem is widespread or specific to mmap.
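
For anyone who wants to gather comparable numbers, here is a rough user-space sketch, assuming a ZFS-on-Linux system that exposes /proc/spl/kstat/zfs/arcstats (note that the "Cached" figure in /proc/meminfo also includes tmpfs/shmem pages, so it slightly overstates the file-backed page cache):

```c
/* Print ARC size (from /proc/spl/kstat/zfs/arcstats) next to the kernel's
 * page-cache size (the "Cached" line in /proc/meminfo), both in MiB. */
#include <stdio.h>

/* arcstats rows look like "size  4  17045397504" (name, type, data in bytes). */
static unsigned long long arc_size_bytes(void)
{
    FILE *f = fopen("/proc/spl/kstat/zfs/arcstats", "r");
    char line[256];
    unsigned long long val = 0;

    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "size %*u %llu", &val) == 1)
            break;
    fclose(f);
    return val;
}

/* meminfo rows look like "Cached:  1234567 kB". */
static unsigned long long page_cache_kib(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[256];
    unsigned long long val = 0;

    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "Cached: %llu", &val) == 1)
            break;
    fclose(f);
    return val;
}

int main(void)
{
    printf("ARC size:   %llu MiB\n", arc_size_bytes() >> 20);
    printf("Page cache: %llu MiB\n", page_cache_kib() >> 10);
    return 0;
}
```

Running something like this before and after a workload would at least show how much of the page cache sits alongside the ARC on a given system.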

There is also a separate problem brought up here: the ARC's responsiveness to memory pressure from other parts of the system. From the FreeBSD thread, it sounds like this could be solved if the kernels provided a means to signal that the ARC should be evicted before attempting to page out to swap, e.g. some way to register "I have x GB of low-priority memory; when memory pressure increases, tell me to drop some." @snajpa is probably correct; this needs to be implemented in Linux first. It also might already exist, but be hidden behind a GPL export.

Edit: The FreeBSD thread also mentions per-vdev write buffering, which would be very nice. A less granular per-pool buffer might be more appropriate, though. SSD and HDD pools have very different performance characteristics, and NVMe is only making the problem worse.

codyps commented 3 years ago

> There is also a separate problem brought up here: the ARC's responsiveness to memory pressure from other parts of the system. From the FreeBSD thread, it sounds like this could be solved if the kernels provided a means to signal that the ARC should be evicted before attempting to page out to swap, e.g. some way to register "I have x GB of low-priority memory; when memory pressure increases, tell me to drop some." @snajpa is probably correct; this needs to be implemented in Linux first. It also might already exist, but be hidden behind a GPL export.

On Linux, ZFS already implements the "shrinker" API, which lets it be notified of memory pressure that indicates caches need to shrink. (I suppose it's possible there are issues with that API's usage in the Linux kernel, or with ZFS's implementation of it.)
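
For reference, a minimal sketch of that kernel-side mechanism using the classic shrinker API (the registration call changed in recent kernels, and the real OpenZFS code drives this through its SPL compatibility layer; the "cache" below is just a counter standing in for real buffers):

```c
/*
 * Minimal sketch of the classic Linux shrinker registration (pre-6.7 API;
 * newer kernels use shrinker_alloc()/shrinker_register()).  The cache here
 * is a bare counter standing in for real buffers.
 */
#include <linux/module.h>
#include <linux/shrinker.h>
#include <linux/atomic.h>
#include <linux/kernel.h>

static atomic_long_t my_cache_objects = ATOMIC_LONG_INIT(0);

/* Tell the VM how many objects we could give back right now. */
static unsigned long my_count(struct shrinker *s, struct shrink_control *sc)
{
    return atomic_long_read(&my_cache_objects);
}

/* Drop up to sc->nr_to_scan objects and report how many actually went. */
static unsigned long my_scan(struct shrinker *s, struct shrink_control *sc)
{
    unsigned long have = atomic_long_read(&my_cache_objects);
    unsigned long nr = min(have, sc->nr_to_scan);

    /* A real cache would free buffers here; we only adjust the counter. */
    atomic_long_sub(nr, &my_cache_objects);
    return nr ? nr : SHRINK_STOP;
}

static struct shrinker my_shrinker = {
    .count_objects = my_count,
    .scan_objects  = my_scan,
    .seeks         = DEFAULT_SEEKS,
};

static int __init my_init(void)
{
    /* Kernels >= 6.0 take an extra name argument here. */
    return register_shrinker(&my_shrinker);
}

static void __exit my_exit(void)
{
    unregister_shrinker(&my_shrinker);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");
```

Since this hook already exists, unresponsiveness under pressure would point at how these callbacks are tuned or weighted rather than at a missing notification mechanism, which is what the parenthetical above is getting at.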

ednadolski-ix commented 8 months ago

As this is approaching 2 years old, is it (still?) considered to be a feasible ask? (N.B. that the cited work is dated, and was done on Solaris, not Linux or FreeBSD.)