opencontainers / runtime-spec

OCI Runtime Specification
http://www.opencontainers.org
Apache License 2.0

bundle: filesystem metadata format #11

Closed philips closed 8 years ago

philips commented 9 years ago

@stevvooe and I caught up in person about our digest discussion and the need to serialize file-system metadata. If you want to read my attempt, it is here: https://github.com/opencontainers/specs/issues/5#issuecomment-114208979

Problem: a rootfs for a container bundle sitting on-disk may not reflect the exact intended state of the bundle when it was copied to its current location. Possible causes include: filesystems with varying levels of metadata support (NFS without xattrs), accidental property changes (a stray chown -R), or purposeful changes (xattrs added to enforce local policies).

Obviously the files' contents will be identical, so that isn't a concern.

Solution: if we hope to create a stable digest of the bundle in the face of these likely scenarios, we should store the intended filesystem metadata in a file itself. This can be done in a variety of ways, and this issue is a place to discuss the pros and cons. As prior art, @vbatts has implemented https://github.com/vbatts/tar-split, and the Linux package managers ship tools to verify and restore filesystem metadata from a database with rpm -a --setperms and rpm -V.

stevvooe commented 9 years ago

After some thought, we need a format that does the following:

  1. A manifest enumerates files in a bundle.
    1. Provides a content hash.
    2. Provides a path.
    3. Provides a file type.
    4. Provides standard file mode.
    5. Provides xattr.
    6. Provides an extension mechanism.
      1. Geared towards multiple OS support.
      2. Not infinitely extensible, but adding a new field should be easy.
  2. Bundle contents attributes can be reset to contents of file manifest.
    1. Bundle is scanned and any differences from manifest are rectified.
    2. Unames/Gnames/Uid/Gid can be mapped during "reset".
  3. Bundle contents can be verified against manifest.
    1. Content hash can be checked.
    2. Attributes can be checked.
      1. Certain attributes can be checked against machine-local mapping (uid/gid, etc.).
    3. Manifest can optionally be signed.

Meeting the above three use cases with this format puts us above the test for Cost problems with the ideas from #5. Requirement 2 above has a common use case, by avoiding placing unreasonable requirements on transports. Requirement 3 above gives us the functionality of #5 with the extra benefits of Requirement 2.

@shykes @philips @crosbymichael

philips commented 9 years ago

@stevvooe I am having a hard time parsing this sentence: "Meeting the above three use cases with this format puts us above the test for Cost problems with the ideas from #5."

I agree with all three needs overall despite my confusion above.

stevvooe commented 9 years ago

@philips That is a poor sentence where I've shoved in a lot of meaning.

What I'm saying is that, given the goals of #5 (cryptographic verification), the cost of scanning a bundle is not warranted. Given these new goals, the cost is warranted and it makes bundles portable over different transports. Basically, we have a solid reason for filesystem scanning that could also be used as a signable target.

philips commented 9 years ago

@stevvooe Ack. So the next step is a .proto file?

stevvooe commented 9 years ago

No better way to get started than with a straw man:

syntax = "proto3";

package ocf.bundle;

// BundleManifest specifies the entries in a container bundle, keyed and
// sorted by path.
message BundleManifest {

    message Entry {
        // path specifies the path from the bundle root
        string path = 1;

        // NOTE(stevvooe): Need to define clear precedence for user/group/uid/gid precedence.

        string user = 2;
        string group = 3;

        uint32 uid = 4;
        uint32 gid = 5;

        // mode defines the file mode and permissions. We've used the same
        // bit-packing from Go's os package,
        // http://golang.org/pkg/os/#FileMode, since they've done the work of
        // creating a cross-platform layout.
        uint32 mode = 6;

        // NOTE(stevvooe): Beyond here, we start defining type specific fields.

        // digest specifies the content digest of the target file. Only valid for
        // regular files. The strings are formatted as <alg>:<digest hex bytes>.
        // The digests are added in order of precedence favored by the 
        // generating party.
        repeated string digest = 7;

        // target defines the target of a hard or soft link, relative to the
        // bundle root.
        string target = 8;

        // major and minor specify device numbers for character and block devices.
        string major = 9;
        string minor = 10;

        message XAttr {
            string name = 1;
            string value = 2;
        }

        // xattr provides storage for extended attributes for the target resource.
        repeated XAttr xattr = 11;

        // AlternateDataStream represents NTFS Alternate Data Streams for 
        // the targeted resource.
        message AlternateDataStream {
            string name = 1;
            bytes value = 2;
        }

        // ads stores one or more alternate data streams for the given resource.
        repeated AlternateDataStream ads = 12;
    }

    repeated Entry entries = 1;
}

Changes:

philips commented 9 years ago

@stevvooe looks pretty good. Two questions:

stevvooe commented 9 years ago

What is an ADS?

This is the NTFS equivalent of extended attributes (sort of), known as "Alternate Data Streams". The semantics are slightly different, so I've pulled it out into a separate type. Notice the use of type bytes for the value, instead of string. I'd like to get some feedback from a Windows expert to see if this is sufficient.

Should digest be repeated so we can deprecate old hashes and upgrade to new ones over time?

In this case, I don't see why not.

I've updated the comment in-line.

stevvooe commented 9 years ago

@philips We may also want to add an exclusion operator to the manifest specification, since it operates at the bundle level.

enum Op {
    // EXCLUDE specifies that the matched files should be explicitly excluded
    // from the manifest. They may still be part of the bundle.
    EXCLUDE = 0;

    // INCLUDE specifies that the resource should be included in the manifest.
    // This has the effect of "pinning" the resource. For example, if the resource
    // is later matched by an exclude statement, it will still be included.
    INCLUDE = 1;
}

message PathSpec {
    Op operation = 1; // proto3 has no field defaults; the zero value, EXCLUDE, applies

    // path specifies a path relative to the bundle root. If the path is a
    // directory, the entire tree will be excluded.
    string path = 2;

    // pattern specifies a glob to match resources and apply the operation.
    string pattern = 3;
}

// pathSpec specifies the bundle paths covered by the manifest. Specifications
// are ordered by precedence: for a given path, only the first matching
// specification applies. During processing, it is important to fully process
// all resources, even if a directory is excluded, since child resources may
// first match an inclusion.
repeated PathSpec pathSpec = 2;
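The first-match-wins precedence, including the "pinning" behavior, could look like this in Go. This is a sketch under the assumptions of the straw man; the `pathSpec` type mirrors the proto message, and names here are illustrative.

```go
package main

import (
	"path/filepath"
	"strings"
)

type op int

const (
	opExclude op = iota
	opInclude
)

// pathSpec mirrors the straw-man PathSpec message: either an exact path
// (directories cover their whole subtree) or a glob pattern.
type pathSpec struct {
	operation op
	path      string
	pattern   string
}

// included applies the first-match-wins precedence described above: the
// first spec that matches a path decides; an unmatched path stays in the
// manifest.
func included(specs []pathSpec, p string) bool {
	for _, s := range specs {
		if s.path != "" {
			if p == s.path || strings.HasPrefix(p, s.path+"/") {
				return s.operation == opInclude
			}
			continue
		}
		if s.pattern != "" {
			if ok, _ := filepath.Match(s.pattern, p); ok {
				return s.operation == opInclude
			}
		}
	}
	return true
}
```

Note how an early INCLUDE spec pins a resource even when a later, broader EXCLUDE spec would otherwise match it, which is why excluded directories still have to be fully walked.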

Benefits:

Cost:

Another possibility is to allow this to be specified on the command line when first building the manifest. That doesn't allow us to catch "extra" files, but that may not be that important and likely doesn't warrant the extra complexity.

bitshark commented 9 years ago

Okay, so after reading this, I think this makes a bit more sense to me . . . I think the area where I'm having trouble is understanding the scope vis a vis the goals...

I think what I've read so far is well thought out and reasoned, so props to everyone , heh. Forgive me ahead of time, but I wanted to share some thoughts. Discard if you don't think they are useful.

There seems to be a basic consensus -- in the limited info I've read on the container crypto goals -- that everyone would probably be on board with the following ideas in principle, certainly as options. These are labeled as the 'Basic Themes' below: areas where there's agreement.

Basic Themes

Just as a general thought, I'm going to put these here as problems I have run into myself with the design of cryptography. You guys may know this already, but writing it down helps me organize my thoughts.

I'll be back with specifics tomorrow or Saturday.

Statements in general I've found true in crypto engineering. May be useful in this context.

- Define our threat model before we do anything else. Who are we trying to protect against? What are we trying to protect (data, secrets, access, etc.)? What is our tolerance for a threat versus its complexity/likelihood?

I've got a series of specific questions regarding the proposed standards here (which I think are pretty awesome), but I'm tired at the moment -- I'll post the specific questions and comments tomorrow. I hope this helps in the meantime. This is my take on how to think about crypto engineering.

General Questions:

Specific Questions (assuming key distribution is solved, heh... like with OpenPGP):

BTW, these questions aren't all meant to have answers, obviously. They are more like engineering food for thought.

Anyway -- thanks, good work, and good luck, gentlemen. Looking forward to writing about the details tomorrow if time permits; you all have some really good ideas here, and I'm sure you'll sort this all out.

vbatts commented 9 years ago

This fileset would be a binary-packed format, like CrAU?

stevvooe commented 9 years ago

@vbatts https://github.com/stevvooe/continuity has been opened up to continue this research.

duglin commented 8 years ago

Probably related to #302. Will need to be considered as part of that.

cgwalters commented 8 years ago

I think the https://github.com/GNOME/ostree format has a lot of advantages. It was designed from the start to be checksummed. If implementing anything else, at least study it.

cgwalters commented 8 years ago

For example:

stevvooe commented 8 years ago

@cgwalters We have been researching a number of approaches while working on https://github.com/stevvooe/continuity. The big difference is that continuity does not prescribe a distribution format while keeping metadata transport consistent.

It intentionally does not include device files, because why would you have devices in container images?

We've found that having an opinion here will work the system into odd chicken-egg problems. For example, if we rely on runc to create a device, how do we specify the ownership parameters in the archive format? We'd have to call into runc to create the devices, then call back out to the archiver to apply the metadata, then back into runc for runtime.

There are also other filesystem objects, such as sockets and named pipes, that may need to be serialized when migrating a process.

It doesn't include timestamps per file, because immutable containers don't need them.

We've gone back and forth on this requirement. The main issue here is that if you want stable regeneration, you cannot have timestamps in the metadata. However, let's say you want to pause a compilation process mid-build and then resume it on another node. Modification times are very important there. When you start examining this, there are a number of applications that would behave oddly when all of the timestamps are from the extraction time.

Mostly, we can obviate this need by not trying to regenerate an expanded artifact. IMHO, it imposes challenging requirements on the transport format that don't ultimately serve the user while introducing security problems in the pursuit of hash stability (see: tarsum).

(And if you do need timestamps, just do what git does and derive them from the commit object timestamp).

Interesting. I did not know this. Very cool!

xattrs are part of the per-file checksum (Although I think container images shouldn't include xattrs, we should drop setuid binaries and file caps for more secure containers)

We have this in continuity to some degree. There are lots of applications that cannot work correctly without xattrs, in addition to setups that require setuid.

In the past few weeks of development and experimentation, we've actually found the right model is to have continuity collect as much information as possible, then provide tools to selectively apply metadata and verify the on disk data.

cgwalters commented 8 years ago

On Fri, Feb 12, 2016, at 03:00 PM, Stephen Day wrote:

We've found that having an opinion here will work the system into odd chicken-egg problems. For example, if we rely on runc to create a device, how do we specify the ownership parameters in the archive format? We'd have to call into runc to create the devices, then call back out to the archiver to apply the metadata, then back into runc for runtime.

Any non-privileged container should only see the "API" devices (/dev/null etc.). Any privileged container is, well, privileged and can create the device nodes itself. Why would you ship pre-created device nodes in an image?

There are also other filesystem objects, such as sockets and named pipes, that may need to be serialized when migrating a process.

Migration is data, not images.  Use tar or whatever for that.  And data should be cleanly separated in storage from the image.

It doesn't include timestamps per file, because immutable containers don't need them.

We've gone back and forth on this requirement. The main issue here is that if you want stable regeneration, you cannot have timestamps in the metadata. However, let's say you want to pick a compilation process mid build and then resume it on another node. Modification times are very important here. When you start examining this, there are a number of applications that would behave in odd manners when all of the timestamps are from the extraction time.

Again, that's a data case, not immutable images.  I think using container images as a backup format doesn't make sense.  A vast amount of backup software already exists.  Yes, one needs to cleanly separate code from data, but that's a fundamental requirement for upgrades anyways.

stevvooe commented 8 years ago

@cgwalters I am not sure if you saw it, but I made the following point at the bottom of my comment:

we've actually found the right model is to have continuity collect as much information as possible, then provide tools to selectively apply metadata and verify the on disk data.

This approach is compatible with all of the points identified, while not limiting the capability of containers.

In general, images are data as well. Indeed, a large amount of backup software and many distribution channels for filesystem images already exist. Why not make an archive format that is compatible with all of them? Conversely, why require a backup solution in addition to the ability to snapshot and archive containers? Both are acceptable use cases at either end of a continuum. It would be unfortunate to disallow one based on an arbitrary opinion, even if well-grounded.

Ultimately, deciding what a container or image archive can and cannot do just isn't productive. Shipping metadata is inexpensive and the user can always choose to unpack them or not.

cgwalters commented 8 years ago

In one view, sure it's all "just files". But I think there's a strong argument to have separate tools and data formats for different problem domains (source code, binaries, database backups) that share ideas rather than trying to do one format for everything. git is already good for source code and text, etc.

Don't underestimate the cost of inventing a new file format for things like mirroring, versioning, language bindings for parsers, etc.

cgwalters commented 8 years ago

Going back to the top of the motivation here:

Problem: a rootfs for a container bundle sitting on-disk may not reflect the exact intended state of the bundle when it was copied to its current location.

I'd say the correct solution here is for the container runtime to work with the storage layer to ensure immutability. See http://www.spinics.net/lists/linux-fsdevel/msg75085.html for a proposal there. It'd require plumbing through from the filesystem to the block level, but I think the end result would be simply better than classic tools like tripwire and IMA, as well as whatever verification is invented here. (Yes, that proposal doesn't cover xattrs, we'd want a way to freeze specific xattrs too likely)

stevvooe commented 8 years ago

@cgwalters Is there a windows port for OSTree?

wking commented 8 years ago

On Thu, Feb 25, 2016 at 01:30:26PM -0800, Colin Walters wrote:

Going back to the top of the motivation here:

Problem: a rootfs for a container bundle sitting on-disk may not reflect the exact intended state of the bundle when it was copied to its current location.

I'd say the correct solution here is for the container runtime to work with the storage layer to ensure immutability.

The O_OBJECT proposal you link is about preserving filesystem content after it lands on the filesystem, but it looks like @philips' initial concern was about landing it on the filesystem in the first place. For example, “my FAT-16 filesystem doesn't support POSIX permissions, so my rootfs/foo/bar seems to have 0777 instead of the source's 0600”.

cgwalters commented 8 years ago

I struggle to understand a scenario where one would reasonably want to unpack container content onto FAT-16 and expect to run it. Maybe inspection, but even then, you can do that from userspace easily enough with libarchive or whatever. If you have a Linux container, you have Linux...hence you have xfs/ext4/etc.

wking commented 8 years ago

On Thu, Feb 25, 2016 at 03:28:36PM -0800, Colin Walters wrote:

I struggle to understand a scenario where one would reasonably want to unpack container content onto FAT-16…

A poor choice of example, but @philips was pointing out that not all filesystems support the same attributes (he pointed out NFS without xattrs, among other things). Regardless of the specific examples, unpacking into a local filesystem (what @philips was talking about) and maintaining content after that unpacking (what you were talking about) are two separate things.

cgwalters commented 8 years ago

Anyways my goal here is to try to ensure sharing of ideas, not necessarily code in this area - OSTree is certainly not going to take over the world as a way to get content from A to B any more than other projects in this area. Another good project to look at is Clear Linux: https://lists.clearlinux.org/pipermail/dev/2016-January/000159.html

A good example of a mistake in OSTree - I've come to realize the git-like Merkle tree model was a mistake for binary content, because it's really common with software updates for one "package" to change multiple paths (due to /usr/bin and /usr/share etc.) For git and source code it's a lot more common to only change one subdirectory.

So the Clear Linux manifest makes sense - there's no streaming, but that's fine because we aren't storing huge amounts of content to tape drives.

Also, OSTree not including the size in the tree metadata was really dumb but that's papered over with static deltas.

Speaking of deltas...that's another area where Docker really lacks, and for OSTree I ended up taking a ton of inspiration from http://dev.chromium.org/chromium-os/chromiumos-design-docs/filesystem-autoupdate. For more on that see https://ostree.readthedocs.org/en/latest/manual/formats/

cgwalters commented 8 years ago

Regarding NFS... sure, but how does it help a user/admin to determine after the fact that things are broken? Basically, the system is either going to munch the fscaps on /bin/ping or not; a system that tells you "hey, the fscaps are missing" may lead you to Google faster, but that's about it.

fscap binaries can be pretty easily worked around in an NFS root scenario by copying them into tmpfs or something on boot. Yes, it's ugly, see: https://bugzilla.redhat.com/show_bug.cgi?id=648654#c19

cgwalters commented 8 years ago

Also, I'd like to go on a crusade to kill off setuid binaries in containers - they're legacy, and in a container world we should always run with NO_NEW_PRIVS on. Use containers as a reason to leave behind the continual security issues of setuid, and just have them on the host until someone rewrites PAM and /sbin/unix_chkpwd etc.

wking commented 8 years ago

On Thu, Feb 25, 2016 at 06:31:09PM -0800, Colin Walters wrote:

Regarding NFS... sure, but how does it help a user/admin to determine after the fact that things are broken? Basically, the system is either going to munch the fscaps on /bin/ping or not; a system that tells you "hey, the fscaps are missing" may lead you to Google faster, but that's about it.

Agreed if the goal is going image → filesystem → running container, but I think @philips was concerned with round-tripping from image files to filesystem bundles (image 1 → filesystem → image 2), since he links tar-split which is focused on unpacking and repacking tarballs while preserving the tarball's hash. Folks that are interested in round-tripping through the filesystem would be concerned about mismatches between attributes represented in the filesystem and attributes represented in the image file, but not about freezing content once it's on the filesystem. And folks that want to round-trip in the face of limited filesystems can write tools that stash the unsupported attributes elsewhere and pull them back in when checking for changes, so they can do better than failing fast.

Personally, I don't think round-tripping is particularly useful, because:

philips commented 8 years ago

I am closing this out. The image format work is now part of the OCI Image Format project: https://github.com/opencontainers/image-spec

advancedwebdeveloper commented 4 years ago

I have an aside question: if there is a bug ("undefined: syscall.TIOCGPTN, syscall.TIOCSPTLCK" during compilation of cri-o, gollvm-related) involving https://github.com/containerd/containerd/tree/master/vendor/github.com/containerd/continuity/fs -- where should I open issues/reports?

Ivan

thaJeztah commented 4 years ago

@advancedwebdeveloper source code of that package is in https://github.com/containerd/continuity