openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

L2ARC shall not lose valid pool metadata #10957

Open zfsuser opened 3 years ago

zfsuser commented 3 years ago

Describe the feature you would like to see added to OpenZFS

Requirements:

Idea:

How will this feature improve OpenZFS?

Additional context

Condition:

Remarks:

Tunables:

Observables:

amotin commented 3 years ago

To me this sounds like additional complication with no obvious benefit. ZFS already has a small non-evictable metadata cache in RAM for the most important pool metadata. On top of that, normal ARC and L2ARC operation should ensure that (meta-)data accessed at least occasionally is cached. If for some reason you need all of your metadata to reside on SSDs, just add a special metadata vdev to your pool; that will be much more efficient from all perspectives than using the L2ARC. The L2ARC should be used for cases where you cannot predict the active data set in advance, and in that context making some (meta-)data more special than the rest, even if it is accessed only rarely, is a step in the wrong direction.
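
For reference, adding such a special allocation class vdev is a one-liner (pool and device names below are placeholders; note that with raidz top-level vdevs a special vdev cannot be removed again later):

```
# Add a mirrored special vdev that will hold pool metadata (and optionally
# small file blocks) instead of the spinning disks.
zpool add tank special mirror /dev/disk/by-id/nvme-ssd0 /dev/disk/by-id/nvme-ssd1

# Optionally also store small data blocks on the special vdev, per dataset.
zfs set special_small_blocks=32K tank/dataset
```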

From a purely mechanical standpoint, I think there will be a problem with checksum verification. Since the L2ARC header in RAM does not store the checksum, the code reloading blocks from L2ARC into ARC won't be able to verify it unless there is an actual read request carrying the full block pointer.

zfsuser commented 3 years ago

The motivation is the wish to have an L2ARC which stores data and metadata, but prioritizes metadata: basically behaving as with secondarycache=metadata, while also storing data on an opportunistic basis. Have your cake and eat it too.

All of this without requiring a complete redesign of the L2ARC, and without requiring separate partitions for data and metadata together with a secondarycache property configurable per L2ARC top-level vdev instead of once per pool, which in the end would most likely result in ineffective use of the physical L2ARC vdev.

In the end the idea is to keep the L2ARC as it is, and simply avoid losing perfectly valid pool metadata when its storage area in the persistent L2ARC is overwritten. The idea is not to store the complete pool metadata in the L2ARC, although that could happen depending on L2ARC size, tunables and access patterns.

Special vdevs are very interesting, but they require interface ports and drive slots. And since their redundancy should be no less than that of the data disks of the pool, a raidz2 pool would require the ability to house and connect roughly three additional drives. While that is no issue for big iron, for SOHO systems it is quite often not possible.

Keeping rarely accessed metadata in the L2ARC should not be an issue. The L2ARC just has to be bigger than ~0.1% (128 KiB blocksize) to ~3% (4 KiB blocksize) of the pool size, and/or a tunable like vfs.zfs.l2arc.meta_limit_percent would have to be set to a value below 100%. Such a tunable would ensure that enough of the L2ARC remains available for randomly accessed (non-meta)data.
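
As a back-of-the-envelope illustration of those percentages (the pool size below is an arbitrary example, not a measurement):

```
# Rough metadata footprint for a hypothetical 100 TiB pool:
#   ~0.1% of pool size with 128 KiB blocks, up to ~3% with 4 KiB blocks.
POOL_TIB=100
echo "128 KiB blocks: ~$(echo "$POOL_TIB * 1024 * 0.001" | bc) GiB of metadata"
echo "4 KiB blocks:   ~$(echo "$POOL_TIB * 1024 * 0.03" | bc) GiB of metadata"
```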

Regarding your point about the ZFS mechanics, do I understand your explanation correctly?

Normally a block is read from the L2ARC by following a pointer stored in its parent block/buffer, which also contains the checksum of the L2ARC block? So if we tried to just read back L2ARC blocks, we would have no parent block and would therefore be missing the checksum needed to verify that the block was not corrupted?

Doesn't this problem also apply to reading back the persistent L2ARC? Was it solved with the log blocks? If yes, couldn't we use those log blocks to check that the data is uncorrupted?

richardelling commented 3 years ago

FYI, in Solaris 11, the metadata/data separation has been removed entirely. Can we be sure keeping the complexity of separate metadata/data caching is worth the trouble?

amotin commented 3 years ago

Normally a block is read from the L2ARC by following a pointer stored in its parent block/buffer, which also contains the checksum of the L2ARC block? So if we tried to just read back L2ARC blocks, we would have no parent block and would therefore be missing the checksum needed to verify that the block was not corrupted?

Right. The checksum of an L2ARC block is identical to the normal block checksum, since the block is stored with the same compression/encryption, just in a different place. It does not require separate storage.

Doesn't this problem also apply to reading back the persistent L2ARC? Was it solved with the log blocks? If yes, couldn't we use those log blocks to check that the data is uncorrupted?

Persistent L2ARC does not reload the data into ARC; it only reconstructs the previous L2ARC headers on pool import. The log blocks have their own checksums, which do not cover the actual data blocks. Any possible corruption is detected later, when a read is attempted by an application, in which case the read is just silently redirected to the main storage.
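
On Linux this can be observed via the arcstats kstats (a quick sketch, assuming the usual counter names; the exact set varies by OpenZFS version):

```
# Persistent L2ARC rebuild only restores headers; see the rebuild counters.
grep -E '^l2_rebuild_(success|bufs|log_blks)' /proc/spl/kstat/zfs/arcstats

# Checksum failures on later L2ARC reads show up here; such reads fall back
# to the main pool vdevs.
grep -E '^l2_(cksum_bad|io_error)' /proc/spl/kstat/zfs/arcstats
```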

zfsuser commented 3 years ago

Due to the smaller size of metadata, the same amount of L2ARC space will hold more metadata blocks than data blocks, and therefore have a higher hit probability. Also (if I have not misunderstood the discussion), having data in the (L2)ARC is not really helpful if the corresponding metadata is not also cached and would have to be read from spinning rust. Getting rid of the separation would result in simpler code, but metadata would lose its VIP handling and users would lose a mechanism for adapting their pool to their needs. In my opinion, until somebody performs an in-depth analysis which indisputably shows that the pros of removing the separation outweigh the cons, including the risk of introducing errors while rewriting the ZFS code, the implemented separation of metadata/data caching is clearly worth it.

Interesting, so the persistent L2ARC only reads back and checks the L2ARC headers, and the L2ARC blocks themselves are only checked when accessed on a cache hit.

As all data read from persistent media shall be verified against its checksum, an implementation of this feature seems to require:

shodanshok commented 3 years ago

FYI, in Solaris 11, the metadata/data separation has been removed entirely. Can we be sure keeping the complexity of separate metadata/data caching is worth the trouble?

I think so: correct use of the metadata property can make a very big difference when traversing datasets with millions of files. For example, I have an rsnapshot machine where the ARC caches both data and metadata, while the L2ARC caches metadata only. The performance improvement when iterating over these files (i.e. by rsync), compared to a similarly configured XFS, really is massive. Using secondarycache=metadata was a significant improvement over the default secondarycache=all setting.
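
For context, that setup boils down to something like the following (pool/dataset names are placeholders):

```
# Cache data and metadata in RAM, but only metadata on the L2ARC device.
zfs set primarycache=all tank/backups
zfs set secondarycache=metadata tank/backups

# Verify the effective settings.
zfs get primarycache,secondarycache tank/backups
```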

So I would really like to maintain the data/metadata separation we have now.

devZer0 commented 3 years ago

yes, please !

i think removing the l2arc metadata/data differentiation would be damn stupid.

i have two systems where i'm now under pressure to address the "too much runtime spent in metadata access" problem.

the first system is a backup server where we use rsync + sanoid zfs rotating snapshots, containing tens of millions of files which rarely change.

we could add a special vdev for metadata/small files, but i dislike the idea of buying enterprise class, mirrored ssd for nothing but speeding up metadata access, as we would be distributing the backup data across hdd and ssd. i do not want to "stripe" our company's backup across different types of disks which depend on each other for proper function. i have been an admin for a long time, and adding a special vdev for a backup pool gives me a sense of subliminal discomfort. i think it's the wrong way to go. and, even worse, we would need to rework the whole pool, as there is no method to push existing metadata to the special vdev afterwards. we would need to take the system out of production for several days for that...

same goes for proxmox backup server, which is similar to borgbackup regarding data storage. the proxmox documentation even recommends using ssd for the entire backup pool (doh!) or at least adding a special vdev for metadata acceleration. pruning and garbage collection are metadata intensive workloads, and some "tiny" backup datastore with about 1.5TB of data will not function properly without adding ssd, as the runtime for prune and gc already goes through the roof with that "little" data...

i think it's absurd to use a special vdev for metadata (i.e. put the original data there) given that we have "primarycache=all" & "secondarycache=metadata", which is meant exactly for addressing these types of problems, i.e. speeding up metadata read access by adding el-cheapo consumer grade ssd as a read cache. so, if they die, you lose nothing but performance.... and they are trivial to replace (no resilver...) - besides the "cache device removal hangs zfs/zpool" bug
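
for reference, attaching and detaching such a cache device is trivial (pool name and device path are just placeholders):

```
# add a consumer-grade ssd as L2ARC; no resilver, no pool-wide dependency.
zpool add tank cache /dev/disk/by-id/ata-cheap-ssd

# if it dies or needs replacing, simply remove it again.
zpool remove tank /dev/disk/by-id/ata-cheap-ssd
```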

i have seen this discussed often and repeatedly, and i'm really curious why it's still not being addressed - see this old discussion for example: https://illumos.topicbox.com/groups/zfs/T8729ed10fa3d42db-Mae35bc26ef8372ad4203ddaf

malventano commented 3 years ago

This may all be tangentially related to an issue I've been tracking, where even with parameters tuned to keep metadata in the ARC (not L2ARC), metadata continues to be prematurely purged when enough data passes through the ARC: https://github.com/openzfs/zfs/issues/10508

devZer0 commented 3 years ago

https://github.com/openzfs/zfs/issues/12028

devZer0 commented 3 years ago

to give another comment on this: i have added an l2arc to the 2 systems mentioned above, and with secondarycache=metadata the runtime for rsync, and for proxmox backup server garbage collection and verify, has improved considerably since then.

i don't see why the zfs cache keeps losing metadata over and over again instead. it's precious cached data and it should be preferred/preserved.

grahamperrin commented 3 years ago

OpenZFS: All about the cache vdev or L2ARC | Klara Inc. (2020-06-26)

malventano commented 3 years ago

To me this sounds like additional complication with no obvious benefit. ZFS already has a small non-evictable metadata cache in RAM for the most important pool metadata. On top of that, normal ARC and L2ARC operation should ensure that (meta-)data accessed at least occasionally is cached.

The 'obvious benefits' are evident in that TrueNAS specifically sets arc_meta_min to a higher-than-default (multi-GB) value to try to prioritize metadata preservation in the ARC. This still fails in the face of high data throughput. I'm frankly surprised that you don't see the benefit here, given how significantly it impacts TrueNAS use cases. If a user transfers a few hundred GB of files off their NAS, they should not then find that a find operation which previously took seconds now takes tens of minutes to complete. Those few GB of metadata take far more time/IOPS to repopulate than a bulk sequential transfer - that data has no business displacing the metadata, given the lopsided consequences of purging one over the other. So yes, the benefits are clear; now if only the implementation weren't broken as it currently is.
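
On a plain OpenZFS box of that era, the equivalent knob is the zfs_arc_meta_min module parameter (the 4 GiB value below is just an example, and as noted above it only sets a floor rather than reliably preventing eviction under heavy data throughput):

```
# Ask the ARC to try to keep at least ~4 GiB of metadata before evicting it.
echo $((4 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_meta_min

# Persist the setting across reboots.
echo "options zfs zfs_arc_meta_min=$((4 * 1024 * 1024 * 1024))" >> /etc/modprobe.d/zfs.conf
```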

Ryushin commented 3 years ago

We have a very large 1.6 PiB system using 232 hard drives in 21 raidz2 VDEVs, with 512 GiB of RAM. We have ten 15.4 TB NVMe drives that are partitioned into five 20 GiB mirrors for SLOG and ten 10 TiB L2ARC partitions for cache (the rest is left empty for garbage collection), giving us 100 GiB of SLOG and 100 TiB of L2ARC. We change on average about 10 TiB of data each day, so the L2ARC can cache about ten days' worth of data. Our dataset uses a 1M recordsize. The system has two bonded 100 Gb Ethernet connections serving dozens of users who are connected via 10 Gb.

We have 4.7 million files on this system. With a cold L2ARC, file system traversal takes 65 minutes. After it's cached in ARC, it takes 21 seconds. On a nightly basis, I run a "find /storage > /dev/null" to traverse the entire dataset, which takes 54 seconds as the metadata is pulled from L2ARC back into ARC. This is really a band-aid; there should be an option to keep metadata from being evicted from the L2ARC.
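
A minimal sketch of scheduling such a nightly warm-up (the path and time are of course site-specific):

```
# /etc/cron.d/zfs-metadata-warmup: traverse the dataset every night at 03:00
# so the metadata is pulled from L2ARC back into ARC before it gets evicted.
0 3 * * * root find /storage > /dev/null 2>&1
```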

We went with L2ARC devices instead of special VDEVs as it is more flexible for our needs. First, it does not tie the pool to a specific piece of hardware: we can just move the JBODs to a different server that does not have access to the NVMe drives in case of some kind of system failure. Second, we have prefetch turned on and have tuned the L2ARC for our needs with amazing results; it is common for us to see 90+% L2ARC hits during the day.
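
That kind of prefetch and fill-rate tuning is usually done via module parameters such as the following (the values shown are illustrative examples, not this system's actual settings):

```
# Allow prefetched (streaming) buffers to be written to the L2ARC as well.
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch

# Raise the maximum amount fed to the L2ARC per feed interval, e.g. 256 MiB.
echo $((256 * 1024 * 1024)) > /sys/module/zfs/parameters/l2arc_write_max
```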

I cannot see a single downside to having the ZFS L2ARC prioritize keeping metadata over other data. In most cases, this eliminates the need for special VDEVs with another potential point of failure. In our case, we were originally looking at using six 15.4 TB NVMe SSDs in two 3-way mirrors as special vdevs. What a waste of NVMe, and they would have been permanently tied to that pool once we did that. Instead, we decided to first test using those six NVMe drives for L2ARC, and the results were so dramatic that we upped it to ten drives.

shodanshok commented 3 years ago

@Ryushin can I suggest trying the new l2arc_mfuonly tunable? It should avoid polluting the L2ARC with read-once data.
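
On Linux it is a module parameter; on FreeBSD the corresponding sysctl should be vfs.zfs.l2arc.mfuonly:

```
# Only accept buffers evicted from the MFU lists into the L2ARC.
echo 1 > /sys/module/zfs/parameters/l2arc_mfuonly

# Check the current value.
cat /sys/module/zfs/parameters/l2arc_mfuonly
```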

mqudsi commented 2 years ago

@shodanshok does that also prevent metadata from a one-time enumeration from being entered into the L2ARC?

shodanshok commented 2 years ago

@mqudsi metadata caching in L2ARC is controlled by the secondarycache dataset property. l2arc_mfuonly affects both data and metadata, as it simply sets the L2ARC to accept evictions from the MFU lists only.

mqudsi commented 2 years ago

Right, so if you have both primarycache=all and secondarycache=all and want to prioritize keeping metadata over data in the L2ARC (but still want to use the L2ARC for data, just at a lower priority), then with l2arc_mfuonly=1 there's a chance that a common workaround like the one @Ryushin posted (find /storage > /dev/null) will fail to guarantee that, after it has run and there is then contention for ARC space, the L2ARC contains the cached metadata for all files, right?