linux-nfs / nfsd


Better NFSD support for btrfs subvolumes #30

Open chucklever opened 7 months ago

chucklever commented 7 months ago

This was bugzilla.linux-nfs.org 389

[NeilBrown 2022-06-18 01:56:42 UTC] In POSIX an inode is identified by identifying a filesystem, and then an inode within that filesystem using a 64-bit inode number. These two pieces of information can be used to determine whether two inodes are the same (though not for looking up an inode). The filesystem is identified by a major/minor number pair. This same pair can be used on Linux as an index into /proc/self/mountinfo and other places to find extra information about the filesystem.
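For illustration, this is the ordinary userspace form of that rule (plain POSIX, nothing btrfs- or NFS-specific): a program decides whether two paths name the same object by comparing both fields returned by stat().

```c
/*
 * Minimal illustration of the POSIX identity rule described above:
 * two paths refer to the same inode if and only if both st_dev (the
 * filesystem's major/minor pair) and st_ino match.
 */
#include <stdio.h>
#include <sys/stat.h>

static int same_inode(const char *a, const char *b)
{
	struct stat sa, sb;

	if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
		return -1;		/* could not stat one of them */
	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

int main(int argc, char **argv)
{
	if (argc == 3)
		printf("same inode: %d\n", same_inode(argv[1], argv[2]));
	return 0;
}
```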

btrfs subverts this. Possibly other filesystems such as bcachefs do too. btrfs uses three levels for identification: filesystem, subvolume, inode. This is useful for creating snapshots - only the subvolume number needs to be changed.

In order for POSIX applications to be able to see that files in different snapshots are different, btrfs allocates a new major/minor number for each active subvolume and reports this major/minor to stat() family syscalls. This pretense (that the file is in a different filesystem) is problematic in various ways. For example, the temporary major/minor cannot be found in /proc/self/mountinfo.

This all affects NFS export because nfsd does not "see" the fake major/minor, so clients can see multiple different inodes in a filesystem having the same inode number. In particular, the root of a subvolume always has inode 256. So if an exported btrfs filesystem has several subvolumes, the roots of all of them will report the same dev/ino to stat(). If one is a parent of another (likely), "find" will detect a loop and refuse to descend.

The only credible fix is to mix the subvolume identifier (64 bits) with the inode number (64 bits) to produce the inode number that nfsd presents to clients (64 bits). Obviously there is no perfect solution. btrfs never reuses inode numbers or subvolume numbers, so while there are likely to be fewer than 64 interesting bits in total, this becomes less likely as filesystems age. It does not appear to be possible to simply decide to use, for example, 40 bits for the inode number and 24 bits for the subvol number.

Probably the best solution is to use a strong hash to mix the 128 bits into 64. Collisions are clearly still possible, but extremely unlikely. That contrasts with the current situation where collisions can trivially be demonstrated in ways that have detrimental effects. Note that most collisions would go unnoticed: comparing inode numbers is rare. A collision of a parent with a child is particularly problematic and easy to reproduce.
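As a rough sketch of what such mixing could look like (the helper names are hypothetical, and the constants are just the splitmix64 finalizer, used here only as a stand-in for whatever "strong hash" is eventually agreed on):

```c
#include <stdint.h>

/* splitmix64 finalizer -- an example of a "strong" 64-bit mix only */
static uint64_t mix64(uint64_t x)
{
	x ^= x >> 30;
	x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27;
	x *= 0x94d049bb133111ebULL;
	x ^= x >> 31;
	return x;
}

/*
 * Hypothetical helper: collapse (subvolume id, inode number) into the
 * 64-bit fileid reported to NFS clients.  nfsd would use this only for
 * filehandles carrying the opt-in flag discussed below, so established
 * filehandles would keep seeing the raw inode number.
 */
static uint64_t btrfs_nfs_fileid(uint64_t subvol, uint64_t ino)
{
	return mix64(mix64(subvol) ^ ino);
}
```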

Simply changing nfsd to report a different inode number (a hash) could cause confusion for any client that already had the filesystem mounted, which would be best avoided. To avoid this we could encode in the filehandle (the 4th byte is unused) whether a mixed inode number should be used. Thus established filehandles would not see the inode number suddenly change.

To fix this, we need: 1/ an agreement on how to mix the subvol and inode numbers into 64 bits; 2/ a filehandle extension to decide when to use the raw inode number and when to use the mixed one.

See also https://lwn.net/Articles/866582/ and https://lwn.net/Articles/866709/

chucklever commented 7 months ago

[NeilBrown 2022-06-21 04:20:21 UTC] Extra note: An alternate approach for squashing the three levels of identification that btrfs uses down to the two levels that POSIX and NFS use is to merge the subvol identifier into the filesystem identifier. Doing this would still risk false sharing (two different subvols of different filesystems could get the same fsid) but as we have 128 bits, that is substantially less likely.
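A very rough sketch of that alternative, assuming the 128-bit NFSv4 fsid is treated as two 64-bit halves and reusing the illustrative mix64() from the earlier sketch (all names here are hypothetical):

```c
#include <stdint.h>

/* Same splitmix64-style example mixer as in the earlier sketch. */
static uint64_t mix64(uint64_t x)
{
	x ^= x >> 30;
	x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27;
	x *= 0x94d049bb133111ebULL;
	x ^= x >> 31;
	return x;
}

struct nfs_fsid128 {		/* hypothetical: the fsid's two 64-bit halves */
	uint64_t major;
	uint64_t minor;
};

/*
 * Hypothetical helper: keep the filesystem's identity in one half of
 * the fsid and fold the subvolume id into the other half, so false
 * sharing requires both halves to collide at once.
 */
static struct nfs_fsid128 btrfs_subvol_fsid(uint64_t fs_id_hi,
					    uint64_t fs_id_lo,
					    uint64_t subvol)
{
	struct nfs_fsid128 fsid = {
		.major = fs_id_hi,
		.minor = mix64(fs_id_lo ^ subvol),
	};
	return fsid;
}
```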

One awkwardness with doing this is that NFSv4 needs to report a mounted-on fileid for any mount point, and the root of a btrfs subvol would look like a mount point. But there is no mounted-on file in the btrfs filesystem to provide a mounted-on fileid. A reasonable solution is to provide a "fake" fileid which is otherwise unused by btrfs, such as 255. All mountpoints would have the same mounted-on fileid, but these are almost invisible in practice and I don't think it would actually matter.

A more significant problem is that the subvols would pollute the mount table on the client. For a local btrfs filesystem, the subvols do NOT appear in the mount table (unless they are explicitly mounted). This is intentional, and I received push-back when I suggested changing it. Partly this is because there can be a large number of subvols. Partly it was because path names to subvols may be private, and placing them in the mount table makes them public. There might be other reasons.

Subvols exported from btrfs would only appear in the mount table on the client while they are actually being accessed, and for a few minutes afterwards. This suggests that the large number of subvols might not end up being excessive on the client, because most of them would not be active/visible. But there might be use-cases where lots of subvols could appear on the client. If a private subvol were accessed over NFS, the name would be publicly visible while it was being accessed. It is hard to know if this is actually a problem in practice, but it might be.

So the problems of exposing subvols to NFS as separate filesystems are different sorts of problems than the problems associated with keeping all subvols in a single filesystem. There is a much smaller risk of a correctness problem, but an unknowable risk of exposing private data or overwhelming the mount table.

chucklever commented 7 months ago

[NeilBrown 2022-06-21 04:37:18 UTC] Second extra note: The "ideal" solution, in my mind, is for btrfs to limit the number of inodes per subvol to 2^40, and limit the number of subvols to 2^24. It could then provide a guaranteed-unique fileid to NFS by simply concatenating these two numbers.
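In code form the scheme would amount to nothing more than a bounds check and a shift. This is illustrative only; btrfs does not enforce these limits today, and the helper name is made up:

```c
#include <stdint.h>

#define SUBVOL_BITS	24	/* at most 2^24 subvolumes */
#define INO_BITS	40	/* at most 2^40 inodes per subvolume */

/*
 * With the limits above, concatenation alone yields a fileid that is
 * unique across the whole filesystem.  Returns -1 if either identifier
 * is outside the range the scheme can represent.
 */
static int btrfs_concat_fileid(uint64_t subvol, uint64_t ino, uint64_t *fileid)
{
	if (subvol >= (1ULL << SUBVOL_BITS) || ino >= (1ULL << INO_BITS))
		return -1;
	*fileid = (subvol << INO_BITS) | ino;
	return 0;
}
```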

I think these sizes are large enough not to be a problem in practice (though possibly 38/26 would be safer). The difficulty is that btrfs never reuses inode numbers or subvol numbers. Re-using inode numbers would require keeping track of at least some of the unused numbers in some structure that was reasonably efficient to access. There would be some performance cost in this, but I believe that a modest cost could be amortized over a large number of inode allocations, making the average cost minimal.

Re-using subvol numbers is more awkward. If I understand correctly, subvol numbers are assumed to be monotonically increasing. This allows a "first" and "last" subvol to be recorded for when some block is "live" so that it is clear the block is not part of any subvol outside of that range.

I suspect this is solvable. The internal subvol numbers could combine 24 bits that are externally visible and reused as needed with 40 bits that are internal and monotonic. The 40 bit number could be used for the range checking. Creating 30 subvols a second would take 1000 years to exhaust the 40 bits. However I don't know all the details of btrfs internals so I cannot be sure of this.
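Purely as a sketch of that idea, not a description of btrfs's actual internals, the combined internal number could be laid out like this (all names are hypothetical):

```c
#include <stdint.h>

#define EXT_BITS	24	/* externally visible, reusable id */
#define MONO_BITS	40	/* internal, monotonically increasing */

/* Pack a reusable external id together with the monotonic counter. */
static inline uint64_t subvol_internal(uint32_t ext_id, uint64_t mono)
{
	return ((uint64_t)(ext_id & ((1u << EXT_BITS) - 1)) << MONO_BITS) |
	       (mono & ((1ULL << MONO_BITS) - 1));
}

/* The 24-bit id reported to userspace and to NFS. */
static inline uint32_t subvol_external(uint64_t internal)
{
	return internal >> MONO_BITS;
}

/* The 40-bit monotonic part, usable for the "first"/"last" range checks. */
static inline uint64_t subvol_monotonic(uint64_t internal)
{
	return internal & ((1ULL << MONO_BITS) - 1);
}
```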

The summary is that I think it would be possible to change btrfs to "do the right thing". It would not be easy and convincing the btrfs developers to do the work (or even accept the maintenance burden if someone else did the work) would probably require more than a promise that NFS export would be a tiny bit more reliable.

chucklever commented 7 months ago

[Jeff Layton 2022-10-24 13:13:12 UTC]

> If a private subvol were accessed over NFS, the name would be publicly visible while it was being accessed. It is hard to know if this is actually a problem in practice, but it might be.

How would this be visible on the NFS client? If the server is just generating a fileid and fsid for each inode, how would the NFS client view the private subvol name?

Personally, I lean toward just doing solution #2 and treating each subvolume as a separate filesystem in NFS. The NFS spec is a bit vague as to what constitutes a filesystem, but one primary rule is that you shouldn't have inode number collisions within one. If btrfs is violating that across subvolumes, then I don't see how we can do anything other than treat them as different filesystems.

A client would need to touch a ton of subvolumes in order to blow out the mount table. That's already an issue on local machines anyway, most likely, especially if they are synthesizing a different st_dev for each subvol.

In any case, such configurations should hopefully be outliers, and admins should be able to take precautions by not exporting a tree that allows access to too many subvols.

The mounted-on-fileid thing is a bit of a wrinkle though -- having a dummy inode (or set of them) for this seems like a plausible solution, though we'll need to discuss that with the btrfs folks.

If we can talk them into limiting their inode number space per subvolume though, then that would make things much simpler. I imagine that NFS is not the only thing confused by btrfs's behavior, so it might be interesting to consider what applications may have had issues on btrfs due to this, and how they addressed them.

chucklever commented 7 months ago

[NeilBrown 2022-10-24 22:48:04 UTC]

> How would this be visible on the NFS client? If the server is just generating a fileid and fsid for each inode, how would the NFS client view the private subvol name?

Whenever the Linux NFS client sees a new fsid it creates a new mount (much like an autofs auto-mount point). This appears in /proc/mounts, and so the full path is publicly visible. The mount will disappear after it has been unused for some minutes (15?).

We could possibly hack /proc/mounts so that some details are blurred out for non-privileged users.

Having a large number of btrfs mounts clutter the client-side mount table is probably not a big problem. I think most applications which access a lot of snapshots/subvols want other btrfs functionality and so don't make sense over NFS. At worst we might need to trigger an earlier timeout on nfs automounts when there are lots of them.

> If we can talk them into limiting their inode number space per subvolume though, then that would make things much simpler. I imagine that NFS is not the only thing confused by btrfs's behavior, so it might be interesting to consider what applications may have had issues on btrfs due to this, and how they addressed them.

I suspect that convincing the btrfs developers to use fewer than 64 bits is not worth the effort - it almost certainly won't work.

There are certainly other tools confused by btrfs behaviour. One example that I know of involves the fake st_dev that it creates for subvols. This fake number appears in stat-family results but not anywhere else. There are multiple other places that report the device number, and these are inconsistent for btrfs subvols. See these two SUSE patches for some details.

https://github.com/openSUSE/kernel-source/blob/master/patches.suse/btrfs-provide-super_operations-get_inode_dev
https://github.com/openSUSE/kernel-source/blob/master/patches.suse/vfs-add-super_operations-get_inode_dev

This results in occasional problems for audit, trace-events, event-poll ....

A "proper" fix for this requires a separate vfsmount for each active btrfs subvol. Some people claim to have extremely large numbers of such subvols concurrently active. Maybe /proc/mounts just doesn't scale any more and we need to discard it?

chucklever commented 7 months ago

[Jeff Layton 2022-10-26 10:13:45 UTC]

> > How would this be visible on the NFS client? If the server is just generating a fileid and fsid for each inode, how would the NFS client view the private subvol name?
>
> Whenever the Linux NFS client sees a new fsid it creates a new mount (much like an autofs auto-mount point). This appears in /proc/mounts, and so the full path is publicly visible. The mount will disappear after it has been unused for some minutes (15?).
>
> We could possibly hack /proc/mounts so that some details are blurred out for non-privileged users.

Why would we bother? Isn't this info "public"? If the subvolume is reachable by path, and that path is exported, then why would we consider that path "sensitive"?

> Having a large number of btrfs mounts clutter the client-side mount table is probably not a big problem. I think most applications which access a lot of snapshots/subvols want other btrfs functionality and so don't make sense over NFS. At worst we might need to trigger an earlier timeout on nfs automounts when there are lots of them.
>
> > If we can talk them into limiting their inode number space per subvolume though, then that would make things much simpler. I imagine that NFS is not the only thing confused by btrfs's behavior, so it might be interesting to consider what applications may have had issues on btrfs due to this, and how they addressed them.
>
> I suspect that convincing the btrfs developers to use fewer than 64 bits is not worth the effort - it almost certainly won't work.
>
> There are certainly other tools confused by btrfs behaviour. One example that I know of involves the fake st_dev that it creates for subvols. This fake number appears in stat-family results but not anywhere else. There are multiple other places that report the device number, and these are inconsistent for btrfs subvols. See these two SUSE patches for some details.
>
> https://github.com/openSUSE/kernel-source/blob/master/patches.suse/btrfs-provide-super_operations-get_inode_dev
> https://github.com/openSUSE/kernel-source/blob/master/patches.suse/vfs-add-super_operations-get_inode_dev
>
> This results in occasional problems for audit, trace-events, event-poll ....
>
> A "proper" fix for this requires a separate vfsmount for each active btrfs subvol. Some people claim to have extremely large numbers of such subvols concurrently active. Maybe /proc/mounts just doesn't scale any more and we need to discard it?

It has always been possible to create tons of mountpoints, sometimes even by doing things that seem innocuous (particularly with NFS). The admin is in charge of what gets exported on the server. If the concern is too many mountpoints, then they can limit it to just exporting some subvolumes, or take care on the client not to traverse too many mountpoints at once.

chucklever commented 7 months ago

[NeilBrown 2022-10-26 22:01:40 UTC]

> Why would we bother? Isn't this info "public"? If the subvolume is reachable by path, and that path is exported, then why would we consider that path "sensitive"?

In btrfs, private users can create their own private subvolumes (or so I'm told) and can choose whatever name they like. If the parent isn't publicly readable, then the name is private. You wouldn't want private data (or names) to become public just because you accessed them via NFS - would you?

https://lore.kernel.org/linux-nfs/20210729023751.GL10170@hungrycats.org/

> It has always been possible to create tons of mountpoints, sometimes even by doing things that seem innocuous (particularly with NFS). The admin is in charge of what gets exported on the server. If the concern is too many mountpoints, then they can limit it to just exporting some subvolumes, or take care on the client not to traverse too many mountpoints at once.

The admin is not in a position to "take care on the client". Private users must do that.

And the admin is not completely in charge of what gets exported on the server. Exporting a btrfs filesystem unavoidably exports all subvolumes reachable from the export-point. Requiring each individual subvol to be explicitly exported is seen as unacceptable. (I can't immediately find the email where someone said it would break their use case, but I'm certain that I have one).

And there have certainly been real performance problems with enormous /proc/mounts. Maybe they have been fixed, and maybe we shouldn't care. But if so, that should be a conscious choice.