openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

L1 MFU data should be accessed more than twice to be written to L2ARC #16648

Open tkittich opened 1 month ago

tkittich commented 1 month ago

Describe the feature you would like to see added to OpenZFS

This is similar to #16499 but should be easier to implement. With l2arc_mfuonly = 2 (L2-cache all metadata (MRU + MFU) but only MFU data), L1 MFU data should have to be really frequently used (e.g. accessed more than twice) before being written to L2ARC. Otherwise, less useful data could waste L2ARC write bandwidth and displace potentially more useful data (e.g. metadata) in L2ARC.
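A minimal sketch of the proposed eligibility rule, in C. This is not the actual OpenZFS code; all type and function names here are invented for illustration, and the threshold of two accesses is the one proposed above:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch, not the actual OpenZFS code: the proposed rule for
 * l2arc_mfuonly = 2 -- cache all metadata (MRU + MFU), but only cache MFU
 * *data* once it has been accessed more than twice.
 */

typedef enum { ARC_STATE_MRU_SK, ARC_STATE_MFU_SK } arc_state_sk_t;

typedef struct {
	arc_state_sk_t	state;		/* MRU or MFU */
	bool		is_metadata;	/* metadata vs. user data */
	uint64_t	access_count;	/* hypothetical per-header hit count */
} arc_hdr_sk_t;

/* Proposed threshold: data must be accessed more than twice. */
static const uint64_t l2arc_data_access_min = 2;

static bool
l2arc_write_eligible_sk(const arc_hdr_sk_t *hdr)
{
	if (hdr->is_metadata)
		return (true);			/* all metadata is eligible */
	if (hdr->state != ARC_STATE_MFU_SK)
		return (false);			/* MRU data is never cached */
	return (hdr->access_count > l2arc_data_access_min);
}
```

The point of the predicate is that MRU data never qualifies, metadata always does, and MFU data qualifies only past the access threshold.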

How will this feature improve OpenZFS?

The L2ARC hit rate should be well above 50% with this feature and l2arc_mfuonly = 2. That is, most metadata and the most frequently used data should eventually end up in the L2ARC. With most metadata in L2ARC, the hit rate should already reach 50%.

Additional context

Perhaps l2arc_mfuonly = 10, 20, 30 could be used to specify how many times MFU data must be accessed before being copied into L2ARC. But it would be best if this could be auto-tuned. For example, L2ARC should keep most metadata and admit less and less frequently accessed MFU data as the L2ARC grows.
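One way the auto-tuning idea could look, sketched in C. The function name, the size ratios, and the threshold values are all invented for illustration; the only point is the shape of the policy, where a larger L2ARC relative to the cacheable MFU data lowers the access-count bar:

```c
#include <stdint.h>

/*
 * Hypothetical auto-tuning sketch (names and break points invented, not
 * from OpenZFS): lower the access-count threshold as the L2ARC grows
 * relative to the MFU data that could fill it, so a large cache admits
 * less frequently accessed data while a small cache stays picky.
 */
static uint64_t
l2arc_data_access_threshold_sk(uint64_t l2arc_size, uint64_t mfu_data_size)
{
	if (mfu_data_size == 0 || l2arc_size >= 8 * mfu_data_size)
		return (2);	/* plenty of room: minimal threshold */
	if (l2arc_size >= 2 * mfu_data_size)
		return (10);	/* moderate room */
	return (30);		/* small L2ARC: only very hot data */
}
```

A real implementation would presumably derive the break points from observed hit rates rather than fixed ratios, but even a step function like this captures "keep most metadata, relax for data as the cache grows".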

amotin commented 1 month ago

I think it should start from a different side. First, we should separate the case where the L2ARC is empty from the cases of overwrite; in the first case we can write faster and be less picky. Second, we should actually make L2ARC able to write faster (closer to the target), since now we often cannot reach the target rate without tuning some parameters to extremes. After that we can actually look at which data are more useful. I don't like hard-coding the threshold. It should be adaptive -- we should write the data we consider most important. Maybe we could introduce some sort of self-learning algorithm that analyzes L2ARC hits for why they were written, but that might be too far forward.

tkittich commented 1 month ago

we should actually make L2ARC able to write faster (closer to the target), since now we often cannot reach the target rate without tuning some parameters to extremes.

I've made a simple flamegraph, and it seems l2arc_write_buffers() takes most of the time in l2arc_feed(). From reading the code, my guess is that looping through the four ARC sublists takes a lot of time, especially when the ARC is large. To remove that looping (scanning for eligible meta/data blocks), perhaps eligible blocks could be checked and listed during arc_access() instead; the actual writing would still happen in l2arc_write_buffers(). That would break checking just the last l2arc_headroom from the tail, but that might be OK since most L2ARC users probably use a persistent L2ARC (headroom = 0). All of this looks way too complicated for me though. ^^"

[Flamegraph screenshot: Screenshot 2024-10-25 154422]
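The candidate-list idea above could be sketched roughly as follows in C. None of this exists in OpenZFS; the types and function names are invented, and the real ARC header type is stood in for by a `void *`:

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical sketch (all names invented): instead of l2arc_feed()
 * scanning the ARC sublists, arc_access() would push an eligible header
 * onto a per-device candidate list, and l2arc_write_buffers() would
 * simply drain that list.
 */
typedef struct l2cand {
	void		*hdr;	/* the ARC header to consider for L2ARC */
	struct l2cand	*next;
} l2cand_t;

typedef struct {
	l2cand_t	*head;
	size_t		count;
} l2cand_list_t;

/* Would run in the arc_access() hot path, so it must stay O(1). */
static void
l2cand_enqueue(l2cand_list_t *list, void *hdr)
{
	l2cand_t *c = malloc(sizeof (*c));
	if (c == NULL)
		return;		/* best effort: dropping a candidate is safe */
	c->hdr = hdr;
	c->next = list->head;
	list->head = c;
	list->count++;
}

/* Would run in l2arc_write_buffers(): pop the next candidate, or NULL. */
static void *
l2cand_dequeue(l2cand_list_t *list)
{
	l2cand_t *c = list->head;
	if (c == NULL)
		return (NULL);
	list->head = c->next;
	list->count--;
	void *hdr = c->hdr;
	free(c);
	return (hdr);
}
```

As the comment thread notes, the real cost question is whether maintaining this list in the access hot path is cheaper than scanning sublists at feed time; the sketch only shows the mechanism, not the answer.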

amotin commented 1 month ago

With writing faster I did not mean it is CPU-bound, but that it should be able to scan multiple sublists evenly, and maybe use markers similar to the eviction code so it does not rescan the same headers multiple times. After that we would be much less limited by headroom, since the scan should be very cheap, and the only question would be whether we want all evictable ARC to be duplicated in L2ARC. Keeping an additional list of candidates would likely be more expensive.
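The marker idea can be sketched in simplified C. This is not the OpenZFS multilist code; plain arrays stand in for the sublists, and the names are invented. The point is that a per-sublist resume cursor lets the feed thread scan all sublists evenly, round-robin, without re-examining headers it has already seen:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical sketch of marker-based scanning (arrays stand in for the
 * real multilists; all names invented): keep a resume cursor per sublist
 * so each pass picks up where the last one stopped.
 */
#define	NUM_SUBLISTS_SK	4

typedef struct {
	size_t	cursor[NUM_SUBLISTS_SK]; /* next index to scan per sublist */
} l2arc_scan_state_sk_t;

/*
 * Scan up to 'budget' headers, one per sublist per round, resuming each
 * sublist at its cursor. Returns the number of headers examined.
 */
static size_t
l2arc_scan_sk(l2arc_scan_state_sk_t *st,
    const size_t sublist_len[NUM_SUBLISTS_SK], size_t budget)
{
	size_t scanned = 0;
	bool progress = true;

	while (scanned < budget && progress) {
		progress = false;
		for (int i = 0; i < NUM_SUBLISTS_SK && scanned < budget; i++) {
			if (st->cursor[i] < sublist_len[i]) {
				/* ...examine sublist[i][cursor[i]] here... */
				st->cursor[i]++;
				scanned++;
				progress = true;
			}
		}
	}
	return (scanned);
}
```

Because the cursors persist across calls, repeated invocations cover each header exactly once, which is what would make the scan cheap enough to stop relying on l2arc_headroom. The real eviction code's markers additionally have to stay valid while other threads insert and remove headers, which this array sketch glosses over.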