Closed philips closed 8 years ago
After some thought, we need a format that does the following:
Meeting the above three use cases with this format puts us above the test for Cost problems with the ideas from #5. Requirement 2 above has a common use case, by avoiding placing unreasonable requirements on transports. Requirement 3 above gives us the functionality of #5 with the extra benefits of Requirement 2.
@shykes @philips @crosbymichael
@stevvooe I am having a hard time parsing this sentence: "Meeting the above three use cases with this format puts us above the test for Cost problems with the ideas from #5."
I agree with all three needs overall despite my confusion above.
@philips That is a poor sentence where I've shoved in a lot of meaning.
What I'm saying is that, given the goals of #5 (cryptographic verification), the cost of scanning a bundle is not warranted. Given these new goals, the cost is warranted and it makes bundles portable over different transports. Basically, we have a solid reason for filesystem scanning that could also be used as a signable target.
@stevvooe Ack. So the next step is a .proto file?
No better way to get started than with a straw man:
syntax = "proto3";

package ocf.bundle;

// BundleManifest specifies the entries in a container bundle, keyed and
// sorted by path.
message BundleManifest {

    message Entry {
        // path specifies the path from the bundle root.
        string path = 1;

        // NOTE(stevvooe): Need to define clear precedence for
        // user/group/uid/gid.
        string user = 2;
        string group = 3;
        uint32 uid = 4;
        uint32 gid = 5;

        // mode defines the file mode and permissions. We've used the same
        // bit-packing from Go's os package,
        // http://golang.org/pkg/os/#FileMode, since they've done the work of
        // creating a cross-platform layout.
        uint32 mode = 6;

        // NOTE(stevvooe): Beyond here, we start defining type-specific fields.

        // digest specifies the content digest of the target file. Only valid
        // for regular files. The strings are formatted as
        // <alg>:<digest hex bytes>. The digests are added in order of
        // precedence favored by the generating party.
        repeated string digest = 7;

        // target defines the target of a hard or soft link, relative to the
        // bundle root.
        string target = 8;

        // major and minor specify device numbers for character and block
        // devices.
        string major = 9;
        string minor = 10;

        message XAttr {
            string name = 1;
            string value = 2;
        }

        // xattr provides storage for extended attributes for the target
        // resource.
        repeated XAttr xattr = 11;

        // AlternateDataStream represents NTFS Alternate Data Streams for
        // the targeted resource.
        message AlternateDataStream {
            string name = 1;
            bytes value = 2;
        }

        // ads stores one or more alternate data streams for the given
        // resource.
        repeated AlternateDataStream ads = 12;
    }

    repeated Entry entries = 1;
}
Changes:
- AlternateDataStream type, formerly ADS
@stevvooe looks pretty good. Two questions:
What is an ADS?
This is the NTFS equivalent of extended attributes (sort of), known as "Alternate Data Streams". The semantics are slightly different, so I've pulled it out into a separate type. Notice the use of type bytes for the value, instead of string. I'd like to get some feedback from a Windows expert to see if this is sufficient.
Should digest be repeated so we can deprecate old hashes and upgrade to new ones over time?
In this case, I don't see why not.
I've updated the comment in-line.
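A sketch of how a consumer might use the repeated digest field to survive hash deprecation: walk the digests in the generator's precedence order and verify against the first algorithm the implementation still supports. This is hypothetical Go code, assuming the <alg>:<hex> format from the comment:

```go
package main

import (
	"crypto/sha256"
	"crypto/sha512"
	"encoding/hex"
	"fmt"
	"hash"
	"strings"
)

// supported maps algorithm names to constructors; older algorithms can be
// dropped from this map over time while manifests keep carrying their digests.
var supported = map[string]func() hash.Hash{
	"sha256": sha256.New,
	"sha512": sha512.New,
}

// verify checks content against the first digest (in the generator's
// precedence order) whose algorithm this implementation still supports.
func verify(content []byte, digests []string) (bool, error) {
	for _, d := range digests {
		alg, want, ok := strings.Cut(d, ":")
		if !ok {
			return false, fmt.Errorf("malformed digest %q", d)
		}
		newHash, ok := supported[alg]
		if !ok {
			continue // deprecated or unknown algorithm: try the next one
		}
		h := newHash()
		h.Write(content)
		return hex.EncodeToString(h.Sum(nil)) == want, nil
	}
	return false, fmt.Errorf("no supported digest algorithm")
}

func main() {
	// sha256 of "hello"
	digests := []string{"sha256:2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"}
	ok, err := verify([]byte("hello"), digests)
	fmt.Println(ok, err) // prints true <nil>
}
```

An old manifest carrying both a deprecated and a current digest keeps verifying after the deprecated entry is removed from the supported set.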
@philips We may also want to define an exclusion operator to the manifest specification, since it operates at the bundle level.
enum Op {
    // EXCLUDE specifies that the matched files should be explicitly excluded
    // from the manifest. They may still be part of the bundle.
    EXCLUDE = 0;
    // INCLUDE specifies that the resource should be included in the manifest.
    // This has the effect of "pinning" the resource. For example, if the
    // resource is later matched by an exclude statement, it will still be
    // included.
    INCLUDE = 1;
}

message PathSpec {
    // operation defaults to EXCLUDE. Note that proto3 does not allow explicit
    // field defaults, so EXCLUDE must take enum value 0.
    Op operation = 1;

    // path specifies a path relative to the bundle root. If the path is a
    // directory, the entire tree will be excluded.
    string path = 2;

    // pattern specifies a glob to match resources and apply the operation.
    string pattern = 3;
}

// pathSpec specifies the bundle paths covered by the manifest. Specifications
// are ordered by precedence. For a given path, only the first matching
// specification applies. During processing, it is important to fully process
// all resources, even if a directory is excluded, since child resources may
// first match an inclusion.
repeated PathSpec pathSpec = 2;
Benefits:
Cost:
Another possibility is to allow this to be specified on the command line when first building the manifest. That doesn't allow us to catch "extra" files, but that may not be that important and likely doesn't warrant the extra complexity.
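For illustration, the first-match precedence rules above could be implemented along these lines. This is a hypothetical Go sketch: it assumes directory paths cover their whole subtree and that unmatched resources are included by default, so that an earlier INCLUDE can pin a child of a later-excluded directory.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

type Op int

const (
	Exclude Op = iota // mirrors EXCLUDE = 0 in the proposed enum
	Include           // mirrors INCLUDE = 1
)

// PathSpec mirrors the proposed message: an operation plus either a literal
// path (directories cover their whole subtree) or a glob pattern.
type PathSpec struct {
	Operation Op
	Path      string
	Pattern   string
}

// included applies the first matching spec, as the proposal requires.
// Resources that match no spec default to inclusion (an assumption of this
// sketch), so an INCLUDE listed before a broader EXCLUDE "pins" a child of
// an excluded directory.
func included(p string, specs []PathSpec) bool {
	for _, s := range specs {
		if s.Path != "" && (p == s.Path || strings.HasPrefix(p, s.Path+"/")) {
			return s.Operation == Include
		}
		if s.Pattern != "" {
			if ok, _ := path.Match(s.Pattern, p); ok {
				return s.Operation == Include
			}
		}
	}
	return true
}

func main() {
	specs := []PathSpec{
		{Operation: Include, Path: "var/cache/keep.txt"}, // pinned: listed first
		{Operation: Exclude, Path: "var/cache"},          // excludes the subtree
	}
	fmt.Println(included("var/cache/keep.txt", specs)) // prints true
	fmt.Println(included("var/cache/tmp.dat", specs))  // prints false
	fmt.Println(included("etc/hosts", specs))          // prints true (no match)
}
```

Note the sketch still has to visit every resource: even though var/cache is excluded, var/cache/keep.txt must be examined because it matches the earlier inclusion, which is exactly the processing requirement stated in the comment above.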
Okay, so after reading this, I think this makes a bit more sense to me... I think the area where I'm having trouble is understanding the scope vis-à-vis the goals...
I think what I've read so far is well thought out and reasoned, so props to everyone, heh. Forgive me ahead of time, but I wanted to share some thoughts. Discard them if you don't think they are useful.
There seems to be a basic consensus -- in the limited info I've read on the container crypto goals -- that everyone would probably be on board with the following ideas, in principle, certainly as options... These are labeled as the 'Basic Themes' below: areas where there's agreement.
Just as a general thought, first I'm going to put these here as problems I have run into myself with the design of cryptography... You guys may know this already, but writing it down helps me organize my thoughts.
I'll be back with specifics tomorrow or Saturday.
What are we trying to protect (data, secrets, access, etc.)? What is our tolerance for a threat versus its complexity/likelihood?
I've got a series of specific questions regarding the proposed standards here (which I think are pretty awesome)... but I'm tired at the moment -- I'll post the specific questions/comments tomorrow. I hope this helps in the meantime... This is my take on how to think about crypto engineering.
General Questions:
Specific Questions (assuming key distribution is solved, heh... like with OpenPGP):
BTW, these questions aren't all meant to have answers, obviously. They are more like engineering food.
Anyway -- thanks, good work, and good luck, gentlemen. Looking forward to writing about the details tomorrow if time permits, as you all have some really good ideas here, and I'm sure you'll sort this all out.
@vbatts https://github.com/stevvooe/continuity has been opened up to continue this research.
Probably related to #302. Will need to be considered as part of that.
I think the https://github.com/GNOME/ostree format has a lot of advantages. It was designed from the start to be checksummed. If implementing anything else, at least study it.
For example:
@cgwalters We have been researching a number of approaches while working on https://github.com/stevvooe/continuity. The big difference is that continuity does not prescribe a distribution format while keeping metadata consistent across transports.
It intentionally does not include device files, because why would you have devices in container images?
We've found that having an opinion here will work the system into odd chicken-egg problems. For example, if we rely on runc to create a device, how do we specify the ownership parameters in the archive format? We'd have to call into runc to create the devices, then call back out to the archiver to apply the metadata, then back into runc for runtime.
There are also other filesystem objects, such as sockets and named pipes, that may need to be serialized when migrating a process.
It doesn't include timestamps per file, because immutable containers don't need them.
We've gone back and forth on this requirement. The main issue here is that if you want stable regeneration, you cannot have timestamps in the metadata. However, let's say you want to pick a compilation process mid build and then resume it on another node. Modification times are very important here. When you start examining this, there are a number of applications that would behave in odd manners when all of the timestamps are from the extraction time.
Mostly, we can obviate this need by not trying to regenerate an expanded artifact. IMHO, it imposes challenging requirements on the transport format that don't ultimately serve the user while introducing security problems in the pursuit of hash stability (see: tarsum).
(And if you do need timestamps, just do what git does and derive them from the commit object timestamp).
Interesting. I did not know this. Very cool!
xattrs are part of the per-file checksum (Although I think container images shouldn't include xattrs, we should drop setuid binaries and file caps for more secure containers)
We have this in continuity to some degree. There are lots of applications that cannot work correctly without xattrs, in addition to setups that require setuid.
In the past few weeks of development and experimentation, we've actually found the right model is to have continuity collect as much information as possible, then provide tools to selectively apply metadata and verify the on disk data.
On Fri, Feb 12, 2016, at 03:00 PM, Stephen Day wrote:
We've found that having an opinion here will work the system into odd chicken-egg problems. For example, if we rely on runc to create a device, how do we specify the ownership parameters in the archive format? We'd have to call into runc to create the devices, then call back out to the archiver to apply the metadata, then back into runc for runtime.
Any non-privileged container should only see the "API" devices (/dev/null etc.). Any privileged container is, well, privileged and can create the device nodes itself. Why would you ship pre-created device nodes in an image?
There are also other filesystem objects, such as sockets and named pipes, that may need to be serialized when migrating a process.
Migration is data, not images. Use tar or whatever for that. And data should be cleanly separated in storage from the image.
It doesn't include timestamps per file, because immutable containers don't need them.
We've gone back and forth on this requirement. The main issue here is that if you want stable regeneration, you cannot have timestamps in the metadata. However, let's say you want to pick a compilation process mid build and then resume it on another node. Modification times are very important here. When you start examining this, there are a number of applications that would behave in odd manners when all of the timestamps are from the extraction time.
Again, that's a data case, not immutable images. I think using container images as a backup format doesn't make sense. A vast amount of backup software already exists. Yes, one needs to cleanly separate code from data, but that's a fundamental requirement for upgrades anyways.
@cgwalters I am not sure if you saw it, but I made the following point at the bottom of my comment:
we've actually found the right model is to have continuity collect as much information as possible, then provide tools to selectively apply metadata and verify the on disk data.
This approach is compatible with all of the points identified, while not limiting the capability of containers.
In general, images are data, as well. Indeed, a large number of backup software and distribution channels for filesystem images do already exist. Why not make an archive format that is compatible with all of them? Conversely, why require a backup solution in addition to the ability to snapshot and archive containers? Both are acceptable use cases at either end of a continuum. It would be unfortunate to disallow one based on an arbitrary opinion, even if well-grounded.
Ultimately, deciding what a container or image archive can and cannot do just isn't productive. Shipping metadata is inexpensive and the user can always choose to unpack them or not.
In one view, sure it's all "just files". But I think there's a strong argument to have separate tools and data formats for different problem domains (source code, binaries, database backups) that share ideas rather than trying to do one format for everything. git is already good for source code and text, etc.
Don't underestimate the cost of inventing a new file format for things like mirroring, versioning, language bindings for parsers, etc.
Going back to the top of the motivation here:
Problem: a rootfs for a container bundle sitting on-disk may not reflect the exact intended state of the bundle when it was copied to its current location.
I'd say the correct solution here is for the container runtime to work with the storage layer to ensure immutability. See http://www.spinics.net/lists/linux-fsdevel/msg75085.html for a proposal there. It'd require plumbing through from the filesystem to the block level, but I think the end result would be simply better than classic tools like tripwire and IMA, as well as whatever verification is invented here. (Yes, that proposal doesn't cover xattrs, we'd want a way to freeze specific xattrs too likely)
@cgwalters Is there a windows port for OSTree?
On Thu, Feb 25, 2016 at 01:30:26PM -0800, Colin Walters wrote:
Going back to the top of the motivation here:
Problem: a rootfs for a container bundle sitting on-disk may not reflect the exact intended state of the bundle when it was copied to its current location.
I'd say the correct solution here is for the container runtime to work with the storage layer to ensure immutability.
The O_OBJECT proposal you link is about preserving filesystem content after it lands on the filesystem, but it looks like @philips' initial concern was about landing it on the filesystem in the first place. For example, “my FAT-16 filesystem doesn't support POSIX permissions, so my rootfs/foo/bar seems to have 0777 instead of the source's 0600”.
I struggle to understand a scenario where one would reasonably want to unpack container content onto FAT-16 and expect to run it. Maybe inspection, but even then, you can do that from userspace easily enough with libarchive or whatever. If you have a Linux container, you have Linux... hence you have xfs/ext4/etc.
On Thu, Feb 25, 2016 at 03:28:36PM -0800, Colin Walters wrote:
I struggle to understand a scenario where one would reasonably want to unpack container content onto FAT-16…
A poor choice of example, but @philips was pointing out that not all filesystems support the same attributes (he pointed out NFS without xattrs, among other things). Regardless of the specific examples, unpacking into a local filesystem (what @philips was talking about 1) and maintaining content after that unpacking (what you were talking about 2) are two separate things.
Anyways my goal here is to try to ensure sharing of ideas, not necessarily code in this area - OSTree is certainly not going to take over the world as a way to get content from A to B any more than other projects in this area. Another good project to look at is Clear Linux: https://lists.clearlinux.org/pipermail/dev/2016-January/000159.html
A good example of a mistake in OSTree - I've come to realize the git-like Merkle tree model was a mistake for binary content, because it's really common with software updates for one "package" to change multiple paths (due to /usr/bin and /usr/share etc.). For git and source code it's a lot more common to only change one subdirectory.
So the Clear Linux manifest makes sense - there's no streaming, but that's fine because we aren't storing huge amounts of content to tape drives.
Also, OSTree not including the size in the tree metadata was really dumb but that's papered over with static deltas.
Speaking of deltas...that's another area where Docker really lacks, and for OSTree I ended up taking a ton of inspiration from http://dev.chromium.org/chromium-os/chromiumos-design-docs/filesystem-autoupdate. For more on that see https://ostree.readthedocs.org/en/latest/manual/formats/
Regarding NFS... sure, but how does it help a user/admin to determine after the fact that things are broken? Basically the system is either going to munch the fscaps on /bin/ping or not; a system that tells you "hey, the fscaps are missing" may lead you to Google faster, but that's about it...
fscap binaries can be pretty easily worked around in an NFS root scenario by copying them into tmpfs or something on boot. Yes, it's ugly, see: https://bugzilla.redhat.com/show_bug.cgi?id=648654#c19
Also, I'd like to go on a crusade to kill off setuid binaries in containers - they're legacy, and in a container world we should always run with NO_NEW_PRIVS on. Use containers as a reason to leave behind the continual security issues of setuid, and just have them on the host until someone rewrites PAM and /sbin/unix_chkpwd etc.
On Thu, Feb 25, 2016 at 06:31:09PM -0800, Colin Walters wrote:
Regarding NFS... sure, but how does it help a user/admin to determine after the fact that things are broken? Basically the system is either going to munch the fscaps on /bin/ping or not; a system that tells you "hey, the fscaps are missing" may lead you to Google faster, but that's about it...
Agreed if the goal is going image → filesystem → running container, but I think @philips was concerned with round-tripping from image files to filesystem bundles (image 1 → filesystem → image 2), since he links tar-split which is focused on unpacking and repacking tarballs while preserving the tarball's hash. Folks that are interested in round-tripping through the filesystem would be concerned about mismatches between attributes represented in the filesystem and attributes represented in the image file, but not about freezing content once it's on the filesystem. And folks that want to round-trip in the face of limited filesystems can write tools that stash the unsupported attributes elsewhere and pull them back in when checking for changes, so they can do better than failing fast.
Personally, I don't think round-tripping is particularly useful, because:
Folks who want to generate a new image that reuses content addressable objects from an earlier image (e.g. adding a few files to a stock Debian image to create a new image) can handle that locally (e.g. with something like Git's staging area to bless changes they're interested in). There's no need to address this at the protocol / file-format level.
Subject: Re: OCI Bundle Digests Summary
Date: Thu, 15 Oct 2015 16:52:42 -0700
Message-ID: 20151015235242.GD28418@odin.tremily.us
I am closing this out. The image format work is now part of the OCI Image Format project: https://github.com/opencontainers/image-spec
I have an aside question: if there is a bug (undefined syscall.TIOCGPTN and syscall.TIOCSPTLCK during compilation of cri-o; gollvm-related) related to https://github.com/containerd/containerd/tree/master/vendor/github.com/containerd/continuity/fs, where should I open issues/report it?
Ivan
@advancedwebdeveloper source code of that package is in https://github.com/containerd/continuity
@stevvooe and I caught up in person about our digest discussion and the need to serialize filesystem metadata. If you want to read my attempt, it is found here: https://github.com/opencontainers/specs/issues/5#issuecomment-114208979
Problem: a rootfs for a container bundle sitting on-disk may not reflect the exact intended state of the bundle when it was copied to its current location. Possible causes might include: running on filesystems with varying levels of metadata support (nfs w/o xattrs), accidental property changes (chown -R), or purposeful changes (xattrs added to enforce local policies).
Obviously the files' contents will be identical, so that isn't a concern.
Solution: If we hope to create a stable digest of the bundle in the face of these likely scenarios, we should store the intended filesystem metadata in a file itself. This can be done in a variety of ways, and this issue is a place to discuss pros/cons. As a piece of prior art, @vbatts has implemented https://github.com/vbatts/tar-split, and we have the Linux package managers with tools to verify and restore filesystem metadata from a database, with rpm -a --setperms and rpm -V.