openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.64k stars 1.75k forks source link

Neither security xattrs nor posix ACL xattrs are atomically managed #2718

Open dweeezil opened 10 years ago

dweeezil commented 10 years ago

As alluded to in #2445, neither the "security" xattr nor the xattr related to a posix ACL are managed atomically in the ZPL with respect to the object to which they are attached. This can lead to various weird situations such as files or directories being created without their related xattr.

A related, and as far as I'm aware currently unreported problem is that these xattrs are not managed at all during ZIL replay (mkdir, for example). [EDIT: see next comment]

Both these issues are specific to zfsonlinux; other OpenZFS implementations' host operating systems don't have an analogous feature.

I have a WIP solution which is to push these operations down into the ZFS/DMU layer so they can happen as part of the same transaction as the related ZPL operation. The complication, unfortunately, is that the the ZFS layer has no business touching ZPL-related structures nor does it even have access to same. My proposed solution is to add callbacks to some of the "zfs_..." functions to allow for in-transaction post-processing. Of course, this causes us to diverge from the upstream code.

I'm posting this issue now in light of the possibly tangentially-related issues #2717, #2663, #2700 and others which I can't remember at this time. Those issues all are tantalizingly related to the management of a dnode's associated xattrs. There's a good chance those issues aren't related at all to the problem I've outlined above but I wanted to mention them in any case.

dweeezil commented 10 years ago

Regarding the issue of ZIL replay, this shouldn't be an issue with the current design because the xattr-creating functions should create their own replay entries. In other words, this shouldn't actually be a problem as the code sits right now. To that end, I've updated the title of this issue to remove that part.

behlendorf commented 10 years ago

Thanks for filing this. It's been something of a subtle long standing issue I'd love to see resolved. I think your approach off adding a callback here to deal with the layering issues is probably the only sane way to handle this. Whenever you're ready I'll be interested to see the patch.

dweeezil commented 10 years ago

An update on this issue is in order. This issue could easily have been titled "Linux Posix layer requires additional support from the 'native' Posix layer". The problem at hand is that some operations in the Linux Posix layer perform multiple calls to the "Native" ZFS Posix layer which should occur in a single DMU transaction but which don't because the transactional encapsulation occurs within the Native Posix layer.

My initial proposition of a simple vnop-to-zpl callback scheme doesn't help matters much because the support doesn't exist for a vnop function (i.e. zfs_create) to perform xattr creation within its transaction.

I think this issue exposes one of many fundamental differences between Solaris and Linux: In Solaris, xattrs are mainly a userland facility wheres Linux has several facilities (selinux and Posix ACLs) which ascribe kernel-level semantics to them (xattrs).

Clearly we (Zfsonlinux) want to support both Selinux and Posix ACLs as well as possible; they're both quite important in the Linux ecosystem. We need ZFS to be able to add xattrs to a newly-created object as atomically as possible with respect to the creation of the object. To that end, my current WIP patch adds a new suite of xattr-creating functions which mostly parallel their corresponding functions in zpl_xattr.c. In addition, some of the existing "vnop helper" functions such as zfs_make_xattrdir are being modified to optionally perform their operation in the context of an existing transaction. This is all greatly complicated by the need to support both directory-style and SA xattrs.

My hope was that I could come up with something rather quickly and that it would expose a race condition or something similar which is causing the various cases of corrupted dnodes which seems to be occurring with increasing regularity when SA xattrs are being used to support Posix ACLs. Unfortunately, fixing this the right was will be neither quick nor easy.

EDIT: See my next comment.

dweeezil commented 10 years ago

I had an idea which came to me after thinking about what a mess it will be to implement the scheme I outlined in my previous comment: Rather than storing these "special" xattrs in the traditional manner, how about leaning on the SA facility and creating new system attributes for them. I'd propose creating a new SA for each of "system.posix_acl_access", "system.posix_acl_default", "security.selinux" and "security.capability".

@behlendorf What are your thoughts on this idea? I'm asking because I don't want to plow ahead and implement something along these lines if it's a show-stopper with no chance of being adopted.

I'd suggest this feature be controlled by a whole new filesystem property or, maybe a feature flag. If it were to be implemented, we'd likely want to suggest that xattr=on be used in order that other xattrs not bloat the available SA space.

Some of the advantages to the implementation are:

The disadvantages seem to be similar to those related to xattr=sa:

These new system attributes would be manipulated in the vnop functions with the existing SA interface and would be tied into the xattr mechanism via the xattr "handler" mechanism.

An extension to this concept would be to store all "security." xattrs either as a single SA or as a series of SAs with names based on the attribute's suffix. This would allow to support SMACK and other security features but (at least in the case of packing all in a single SA) would require additional code to manage the packed value (likely stored as an nvlist).

behlendorf commented 10 years ago

@dweeezil I've had similar thoughts about this. What you're proposing is actually more consistent with how SAs were originally designed to be used. It's just a little less flexible since you're tied in to those specific SA names at build time and then we're committed to them forever. It also requires that we add custom code to bypass the normal xattr call paths for these special xattr keys.

That said, I think this is still a reasonable approach to pursue for exactly the reasons you mentioned. Most importantly it keeps the interface as clean and allows us to ensure everything ends up in the same transaction. As for the specific downsides:

Not compatible with other OpenZFS implementations

This is a valid concern but I don't think it should prevent us from doing this. The worst case here is that we have a few SAs which are unknown on the other ports. This is no worse that the current xattr=SA interface today. Plus we can add the appropriate feature flags for compatibility purposes. This would also be a great opportunity to do the same for the existing xattr=SA interface.

Limited space available to store the attributes.

Very little. I believe that's only around 100 bytes left in the dnode. Perhaps not even that much. If at all possible we should ensure the common case for SELinux contexts do not force us to kick in a spill block. That would be very bad for performance. That said, we should be adding support for dnodes larger than 512 in the next 6-12 months. Perhaps sooner. When this happens it would be reasonable to update the default dnode size to 1k which would be large enough for everything we need and a few user xattrs.

No clean upgrade path for existing filesystems.

This could be done cleanly if we add a feature flag for this. We should be able to convert either directory or SA xattrs with these keys to the new format as they are encountered. This sort of thing has been done before in the code. For example, old style znodes (v4) are converted to the newer SA format (v5).

Confusion if the attribute already exists, stored in the traditional manner.

All the more reasons to force the conversion above. Frankly this would also provide an opportunity to simplify the xattr=SA semantics. The current behavior was designed to be flexible but it leads to confusion which may cause more harm than good.

Additional logic required when enumerating xattrs.

This is unfortunate, but I think it's just a cost we're going to need to pay.

An extension to this concept would be to store all "security." xattrs either as a single SA or as a series of SAs with names based on the attribute's suffix.

How would this be so different that what we're doing today? We have an exist SA which stores all of the key/value pairs in an nvlist. We just happen to access them through the xattr interfaces. There's nothing technically preventing us from bypassing those interfaces, directly manipulating the nvlist, and packing it in to the TX. The only wrinkle is getting the locking for this correct. I could see how adding a new "security" and probably "system" nvlist namespace would simplify that.

And this is all a long way of saying I think this is a pretty reasonable plan to explore. There might be gotchas discovered along the way but in principle this is a reasonable approach.

dweeezil commented 10 years ago

@behlendorf It sounds like we're on the same page with this. I did actually go ahead and start implementing it this afternoon and am close to having a working patch for review to deal with the xattr-as-SA part of it. Once that's ironed out, however, fixing the problem for which this issue was created should be fairly trivial.

aarcane commented 10 years ago

For this issue and another similar issue I recall seeing, the ideal behavior seems to be to read in legacy data and honor it, but to replace it with current gen data and remove old data whenever an appropriate write action is occurring On Oct 13, 2014 5:43 PM, "Tim Chase" notifications@github.com wrote:

@behlendorf https://github.com/behlendorf It sounds like we're on the same page with this. I did actually go ahead and start implementing it this afternoon and am close to having a working patch for review to deal with the xattr-as-SA part of it. Once that's ironed out, however, fixing the problem for which this issue was created should be fairly trivial.

— Reply to this email directly or view it on GitHub https://github.com/zfsonlinux/zfs/issues/2718#issuecomment-58975430.

dweeezil commented 10 years ago

I'm very close to submitting a pull request for the "native SA xattr" feature which is the foundation upon which a fix for this issue will be based. Here's a preview of a new file creation on a ZFS file system with selinux enabled (ubuntu trusty with the "ubuntu" selinux policy modified to support zfs):

# echo test > test
# setfacl -m u:1234:rw test
# setfattr -n user.test -v test test
# sync
# ls -i test
7 test
# zdb -dddd tank/fs 7
Dataset tank/fs [ZPL], ID 41,...

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         7    1    16K    512    512    512  100.00  ZFS plain file
                                        312   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED 
    dnode maxblkid: 0
    path    /test
...
    pflags  40800000004
    SA xattrs: 56 bytes, 1 entries
        user.test = test
    Native SA xattrs:
        security.selinux = system_u:object_r:file_t:s0\000
        system.posix_acl_access = \002\000\000\000\001\000\006...

I've modified zdb to display the native SA xattrs following the traditional SPL_DXATTR xattrs.

And the SA attribution registrations and layouts"

# zdb -dddd tank/fs 5 6
Dataset tank/fs [ZPL], ID 41...

        ZPL_PARENT =  8000007 : [8:0:7]
...
        ZPL_SECURITY_SELINUX =  30015 : [0:3:21]
...
        ZPL_SECURITY_CAPABILITY =  30016 : [0:3:22]
...
        ZPL_SYSTEM_POSIX_ACL_ACCESS =  30017 : [0:3:23]
        ZPL_DXATTR =  30014 : [0:3:20]
...
        ZPL_SYSTEM_POSIX_ACL_DEFAULT =  30018 : [0:3:24]
...

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         6    1    16K    16K  7.00K    32K  100.00  SA attr layouts
...
        5 = [ 20  23 ]
        13 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  21  23 ]
        14 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  21  23  20 ]
        2 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19 ]
        8 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  24 ]
        6 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  17 ]
        10 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  23  24 ]
        9 = [ 24  23 ]
        3 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  20 ]
        12 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  21 ]
        4 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  23 ]
        11 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  24  23 ]
        7 = [ 17 ]

I've run the Posix ACL test suite on this file system and done a bunch of other tests so it has built up a bunch of interesting SA layouts.

The main issue I'm wrestling with is how to deal with existing SPL_DXATTR-style xattrs. I'd really like to wipe them but it will add a lot of code. Ideally, I'd like to purge the entire SPL_DXATTR SA if it ever becomes empty (which is something the current code doesn't do). I'm not thrilled with the potential overhead of this type of cleanup. The new code does cause the new native SA xattrs to have precedence over the DXATTR versions similarly to the way the existing xattr=sa code gives them precedence over the directory-style xattrs.

dweeezil commented 10 years ago

Refreshed as dweeezil/zfs@533cf28. The zfs_sa_native_get_xattr() function now uses error returns consistent with the other zfs_ functions. It uses an in/out pass-by-reference parameter to communicate the xattr's size. Other changes as recommended in the code comments.

I have not yet added a/any feature flag(s) to accommodate this facility or the existing xattr=sa facility.

dweeezil commented 10 years ago

As part of my work on this this in #2809, I realized the title of this issue could really use the word "managed" rather than "created". The same problem also exists in the zpl_setattr() path in which the posix ACL operations are not performed atomically with respect to the underlying zfs_setattr() operation. Unfortunately the posix ACL operations are much more difficult to push down a layer versus the selinux operations because the former require much of the work to be performed within the file system whereas the latter is mostly handled by the kernel.

dweeezil commented 5 years ago

@kpande The closest thing we have to a fix for this problem is 214806c which should avoid the problem, however, I'm not sure it's a complete fix. I think something like 533cf287c8 is a better long-term solution, but it's a major on-disk format change and the only way something like it would likely ever be adopted would be to use a feature flag to indicate its use and then to have a background task which would "upgrade" existing "xattr=sa" style xattrs to the new format (similar to enabling userobj_accounting).

This is an area that will likely become of more interest if the ZoL code base actually becomes canonical (insofar as OpenZFS is concerned).

stale[bot] commented 4 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 3 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.