Split L2ARC into metadata L2ARC and data L2ARC

tkittich commented 1 month ago

Describe the feature would like to see added to OpenZFS

To cache more metadata in L2ARC, perhaps L2ARC should be split into non-overlapping metadata L2ARC and data L2ARC. Data buffers would go to data L2ARC, and metadata buffers would go to metadata L2ARC.

How will this feature improve OpenZFS?

The goal is to increase L2ARC hit rate by storing most metadata in L2ARC. With this feature, l2arc_mfuonly = 2, and persistent L2ARC, most metadata should eventually be in L2ARC. That should increase the L2ARC hit rate to be way more than 50% (assuming 1 metadata hit per 1 data hit). The size of metadata L2ARC should be automatically and dynamically set to be large enough to store most metadata of the pool. Data L2ARC can take the rest of the cache device. This would essentially make a metadata L2ARC somewhat similar to a special vdev.

Additional context

With l2arc_mfuonly = 2 and #16499 or #16648, there's still no guarantee that most metadata would get stored in L2ARC. Data buffers could get written into L2ARC so fast that they would overwrite metadata. Setting aside enough space for metadata in L2ARC would make the guarantee.

amotin commented 1 month ago

While it should not be difficult to implement it for different vdevs, we would mostly have to specify via vdev property which L2ARC is which and accept more hardware and administrative overhead, which I am not happy about. Implementing it within one device might be slightly more invasive, but should be easier to manage, though setting limits manually might be tricky. And I generally don't believe in required manual tuning in production scale. I think we should first focus on existing issues of the L2ARC implementation I listed several times and then think how to better balance data and metadata writes after that.

tkittich commented 1 month ago

I prefer using one device for both L2ARCs as well. The metadata side could be dynamically adjusted, e.g. 0.1% - 0.3%, according to the pool size.

amotin commented 1 month ago

Generally for metadata I think we should look more towards special vdevs, or a combination of special and normal vdevs to get both speed and reliability at low cost, as one of open PRs around here intents. L2ARC is not good for actively updated metadata and some metadata such as DDT, BRT, spacemaps, dnodes and even directories, etc may be updated a lot.

2TAC commented 1 month ago

L2ARC has at least two advantages compared to special vdevs: Loss of cache devices don't lead to full data loss and you don't need to do a backup/restore in order to see speed improvements for the existing data. Other than that I think special vdevs are great, I just wish I didn't have to move pentabytes of data back and forth to move the associated metadata.

amotin commented 1 month ago

@2TAC For the loss of device could help https://github.com/openzfs/zfs/pull/16185, which I mentioned above, or some form of it. As for backup/restore -- things like that should better be planned in advance, but if it is too late for that, then yea, L2ARC might work in some cases. If you add special and L2ARC same time, the last will try to handle things temporarily while special warms up and takes over.

openzfs / zfs