michael-yuji / xc

FreeBSD container engine

Questions #1

Closed xorander00 closed 1 year ago

xorander00 commented 1 year ago

Hi, just ran into this project. Looks good! I'm actually in the process of implementing something similar, though not nearly as far along as your project is.

I'm going to look at the code next week sometime, but I figured I'd start off with some questions/comments in the meantime.

I'll take a look at the source a bit later, and if I can contribute anything, then I'd be happy to do so (though my time is a bit short right now so it'll take a while).

xorander00 commented 1 year ago

Oh, and volume mounting is another thing. Since jail.conf supports mount.fstab, it should be simple enough to mount NFSv4 paths into the jail, along with jailed ZFS datasets.
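
For reference, the jail.conf side of that could look something like this (the jail name, paths, and NFS export below are hypothetical):

```
# /etc/jail.conf (hypothetical)
mycontainer {
    path = "/jails/mycontainer";
    mount.fstab = "/etc/fstab.mycontainer";
}
```

```
# /etc/fstab.mycontainer (hypothetical): an NFSv4 export
nfs-server:/export/data  /jails/mycontainer/data  nfs  rw,nfsv4  0  0
```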

michael-yuji commented 1 year ago
  • I took a quick look at the patches to ifconfig and route, glad to see they're finally going to get support for jails. I was using jexec for that purpose and wasn't a big fan of having to do that (plus it allows for thinner jails since the host executable can be used). I did a quick search for netgraph but didn't see anything; are you using epair or some other alternative? If not using netgraph, is there any interest in using it?

For managed networks with VNET jails I'm currently using epair rather than netgraph. Implementing netgraph support as an alternative isn't hard, and I'm certainly interested, but I just don't have the time to implement it. The same goes for ipfw (in fact, ipfw seems to be a better choice than pf for a lot of the things I want to do).
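
For anyone following along, the epair flow is roughly the following (a Rust sketch shelling out to ifconfig; the bridge handling and error handling are assumptions, not xc's actual code):

```rust
use std::io;
use std::process::Command;

/// Create an epair, attach the `a` end to a host bridge, and push the
/// `b` end into a running VNET jail. Sketch only; xc's plumbing differs.
fn plumb_epair(jail_name: &str, bridge: &str) -> io::Result<()> {
    // `ifconfig epair create` prints the name of the `a` end, e.g. "epair0a"
    let out = Command::new("ifconfig").args(["epair", "create"]).output()?;
    let a_end = String::from_utf8_lossy(&out.stdout).trim().to_string();
    let b_end = format!("{}b", a_end.trim_end_matches('a'));

    // host side: add the `a` end to the bridge and bring it up
    Command::new("ifconfig").args([bridge, "addm", a_end.as_str(), "up"]).status()?;
    Command::new("ifconfig").args([a_end.as_str(), "up"]).status()?;
    // jail side: move the `b` end into the jail's vnet
    Command::new("ifconfig").args([b_end.as_str(), "vnet", jail_name]).status()?;
    Ok(())
}
```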

  • How are OCI manifests (both index and image) being handled in terms of FreeBSD-specific configuration? I assume the index just lists FreeBSD, which shouldn't be an issue (I don't think). Are runtime-specific settings being stored in the free-form image manifest?

I'm currently using my own image config format, which lets xc do quite a few cool things: for example, determining the environment variables a container needs even before it starts, and carrying volume hints so the image can tell the user the best way to create its volumes (such as suggested ZFS properties).
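
To make that concrete, such a config could be modelled roughly like this (the field names here are invented for illustration, not xc's actual schema):

```rust
use std::collections::BTreeMap;
use serde::{Deserialize, Serialize};

/// Hypothetical shape of an image config carrying env-var requirements
/// and volume hints; illustrative only.
#[derive(Serialize, Deserialize)]
struct ImageConfig {
    /// environment variables a container needs, knowable before it starts
    envs: BTreeMap<String, EnvSpec>,
    /// hints telling the user how best to create each volume
    volume_hints: BTreeMap<String, VolumeHint>,
}

#[derive(Serialize, Deserialize)]
struct EnvSpec {
    required: bool,
    description: Option<String>,
}

#[derive(Serialize, Deserialize)]
struct VolumeHint {
    mountpoint: String,
    /// e.g. suggested ZFS properties such as "recordsize" => "16k"
    zfs_properties: BTreeMap<String, String>,
}
```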

In terms of a runtime specification, there isn't an equivalent of the OCI runtime spec; the daemon just takes the image config and creates the containers with jail(2). In fact, xc does not use jail.conf at all.
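
For the general flavour of driving jail(2) from Rust without jail.conf, here's a minimal sketch using the community `jail` crate (API from memory; this is not xc's actual code, which sets its own parameters):

```rust
use jail::StoppedJail;

/// Start a container-style jail directly via jail(2), no jail.conf involved.
fn start_container(root: &str, name: &str) -> Result<jail::RunningJail, jail::JailError> {
    StoppedJail::new(root) // directory that becomes the jail root
        .name(name)
        .start()           // issues the jail_set(2) call under the hood
}
```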

I'll take a look at the source a bit later, and if I can contribute anything, then I'd be happy to do so (though my time is a bit short right now so it'll take a while).

That'd be super awesome! I'm currently refactoring and implementing many things, so the code can change a lot, but the core architecture should stay about the same.

michael-yuji commented 1 year ago

Oh, and volume mounting is another thing. Since jail.conf supports mount.fstab, it should be simple enough to mount NFSv4 paths into the jail, along with jailed ZFS datasets.

Currently xc does not use jail.conf at all; mounting is done ad hoc by the daemon and is implemented in an interesting way...

For context, xc is intended to be multi-user friendly; in fact, if you change the ownership of the main socket you can use xc as an unprivileged user. The challenge is making that safe in a multi-user environment.

Currently, to copy files into a container, xcd requires the client process to first open the file and pass the fd to the daemon. That way, if the user can't open the file in the first place, there's no way to steal its contents by creating a container and copying the file into it.
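
The handoff itself can be done with standard SCM_RIGHTS fd passing over the control socket. A minimal send-side sketch with the nix crate (0.26-style API; not xcd's actual wire protocol):

```rust
use std::io::IoSlice;
use std::os::unix::io::RawFd;
use nix::sys::socket::{sendmsg, ControlMessage, MsgFlags, UnixAddr};

/// Pass an already-opened file descriptor to the daemon over a Unix socket.
/// If the client couldn't open the file, there is nothing to pass.
fn send_fd(sock: RawFd, file_fd: RawFd) -> nix::Result<usize> {
    let fds = [file_fd];
    let cmsg = [ControlMessage::ScmRights(&fds)];
    let payload = [0u8; 1]; // at least one data byte must accompany the cmsg
    sendmsg::<UnixAddr>(sock, &[IoSlice::new(&payload)], &cmsg, MsgFlags::empty(), None)
}
```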

Mounting, however, is done quite differently. By default we support mounting by path, but the daemon checks the client credentials to determine whether a user with that uid/gid can actually rwx the directory they intend to mount. This has the problem that a user with the same uid/gid could come from another jail. The right way to implement it is to have the client send a dirfd along with the path, so the daemon can verify that the directory at that path is indeed the same inode.
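
A sketch of that verification (a hypothetical helper, not code xc ships): the daemon fstat(2)s the received dirfd, stat(2)s the claimed path, and compares device and inode numbers.

```rust
use std::os::unix::io::RawFd;
use nix::sys::stat::{fstat, stat};

/// Verify that the directory fd the client handed over really is the
/// directory at the path it claims to want mounted.
fn fd_matches_path(dirfd: RawFd, claimed_path: &str) -> nix::Result<bool> {
    let by_fd = fstat(dirfd)?;
    let by_path = stat(claimed_path)?;
    Ok(by_fd.st_dev == by_path.st_dev && by_fd.st_ino == by_path.st_ino)
}
```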

For these reasons, directly mounting NFS into a container by the user is currently unsupported: doing it safely requires implementing managed volumes first (similar to Docker volumes), so that the access-control checks are untied from OS primitives and rely solely on ACL/RBAC rules stored somewhere. (Not that xc has great security right now, as the ACL/RBAC part is still unimplemented, along with rctl, but I want xc to be fairly usable and meet expectations first.)

I personally have a lot of interest in managed volumes, as I'm using xc to build FreeBSD myself (see this repo), but it's currently not the top priority.

xorander00 commented 1 year ago
  • I took a quick look at the patches to ifconfig and route, glad to see they're finally going to get support for jails. I was using jexec for that purpose and wasn't a big fan of having to do that (plus it allows for thinner jails since the host executable can be used). I did a quick search for netgraph but didn't see anything; are you using epair or some other alternative? If not using netgraph, is there any interest in using it?

For managed networks with VNET jails I'm currently using epair rather than netgraph. Implementing netgraph support as an alternative isn't hard, and I'm certainly interested, but I just don't have the time to implement it. The same goes for ipfw (in fact, ipfw seems to be a better choice than pf for a lot of the things I want to do).

I haven't used ipfw in 15+ years now, so that's interesting. I'll have to take a look at it again compared to pf, which has been my default choice for a long time now.

  • How are OCI manifests (both index and image) being handled in terms of FreeBSD-specific configuration? I assume the index just lists FreeBSD, which shouldn't be an issue (I don't think). Are runtime-specific settings being stored in the free-form image manifest?

I'm currently using my own image config format, which lets xc do quite a few cool things: for example, determining the environment variables a container needs even before it starts, and carrying volume hints so the image can tell the user the best way to create its volumes (such as suggested ZFS properties).

Ah yeah, that's the approach I'm taking too. I'd like to directly support OCI, but the design seems fairly Linux-centric. I was just going to add the extra configuration as annotations under a FreeBSD-specific namespace (e.g. org.freebsd.oci). I also looked at possibly using buildah to build and publish images and then using one of the existing crates for fetching, verifying, and loading the image. Ideally I want to take advantage of ZFS incremental snapshots such that the process would practically look like this for my usage:

  1. Fetch the image manifest from the registry.
  2. Verify manifest integrity using its signature.
  3. Read the list of layers required for the container from the manifest.
  4. Iterate over the list of required layers in descending order (newest to oldest) and check whether an existing ZFS dataset already matches the layer; if not, check whether the local blob cache has the blob archive for the layer (a zstd-compressed zfs-send dump in my case); and if it's found in neither place, mark the layer as needing to be fetched. Continue iterating until the oldest layer is reached OR until the ZFS dataset can be fully loaded (basically, when a layer is reached that's a full snapshot dump rather than an incremental snapshot).
  5. Fetch missing blobs from the registry, verify their integrity using their signatures, and save them into the local blob cache.
  6. Re-run step 4 above to produce a final list of steps for loading the ZFS dataset (i.e. zfs-clone, zfs-send+zfs-recv).
  7. Finally load the ZFS dataset.
  8. Set metadata on the ZFS dataset using custom properties.

After step 8, the ZFS dataset can then be cloned (or send-recv if preferred instead), mounted, and used as the root for the jail container. The OCI specification expects the usage of tarballs though (and I think also expects white-out files). ZFS is just so much nicer here, IMO. I think I did see an initiative to revise/extend the OCI spec to natively utilize ZFS instead of assuming tarballed file systems. The only downside is that ZFS doesn't seem to have a tested and stable user API. There are a couple of community options, but the alternative would be to generate native bindings, which isn't too terrible.
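
A compressed sketch of how steps 4 to 6 could look (the dataset naming, the cache lookup, and the full-vs-incremental flag are all placeholders for illustration):

```rust
use std::process::Command;

/// Walk the layer chain newest -> oldest and decide what must be fetched.
/// `layers` is ordered oldest-first, as in an OCI manifest; the bool marks
/// layers whose blob is a full (non-incremental) zfs-send dump.
fn plan_fetch(
    pool: &str,
    layers: &[(&str, bool)],
    blob_cached: impl Fn(&str) -> bool, // local blob cache lookup
) -> Vec<String> {
    let mut to_fetch = Vec::new();
    for &(digest, full_dump) in layers.iter().rev() {
        // A dataset for this layer already exists: everything newer can be
        // built on top of it, so the walk can stop here.
        let dataset = format!("{pool}/layers/{digest}");
        let present = Command::new("zfs")
            .args(["list", "-H", "-o", "name", dataset.as_str()])
            .output()
            .map(|o| o.status.success())
            .unwrap_or(false);
        if present {
            break;
        }
        if !blob_cached(digest) {
            to_fetch.push(digest.to_string()); // needs a registry fetch
        }
        if full_dump {
            break; // a full snapshot dump needs nothing older
        }
    }
    to_fetch
}
```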

In terms of a runtime specification, there isn't an equivalent of the OCI runtime spec; the daemon just takes the image config and creates the containers with jail(2). In fact, xc does not use jail.conf at all.

Yup, my approach is almost the same. My preference is to utilize native libs and calls. I currently use Nomad as my orchestrator, so I'm writing this as a Nomad task driver.

I'll take a look at the source a bit later, and if I can contribute anything, then I'd be happy to do so (though my time is a bit short right now so it'll take a while).

That'd be super awesome! I'm currently refactoring and implementing many things, so the code can change a lot, but the core architecture should stay about the same.

Sounds good!

michael-yuji commented 1 year ago

For managed networks with VNET jails I'm currently using epair rather than netgraph. Implementing netgraph support as an alternative isn't hard, and I'm certainly interested, but I just don't have the time to implement it. The same goes for ipfw (in fact, ipfw seems to be a better choice than pf for a lot of the things I want to do).

I haven't used ipfw in 15+ years now, so that's interesting. I'll have to take a look at it again compared to pf, which has been my default choice for a long time now.

pf is great for many reasons, but ipfw being able to match on jid and self makes it very interesting; additionally, it supports npt66.

Ah yeah, that's the approach I'm taking too. I'd like to directly support OCI, but the design seems fairly Linux-centric. I was just going to add the extra configuration as annotations under a FreeBSD-specific namespace (e.g. org.freebsd.oci). I also looked at possibly using buildah to build and publish images and then using one of the existing crates for fetching, verifying, and loading the image.

That said, xc does support OCI configs directly too; it just converts them to the native image config format. I'm considering whether we need to do the reverse (so we can push a "normal" OCI config up), but if there isn't any other OCI-compatible FreeBSD container implementation anyway, I'm not sure it's worth it.

Ideally I want to take advantage of ZFS incremental snapshots such that the process would practically look like this for my usage:

The procedure you described is almost identical to the current xc one (except that only the final "product" is preserved, as the sheer number of datasets in zfs list was an eyesore to me). The thing I'm going to switch to is tagging (zfs snapshot) each intermediate layer after it's extracted, and only zfs clone-ing when branches are encountered.
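
In zfs terms, the per-layer switch amounts to something like this (a sketch; the dataset and snapshot naming are made up):

```rust
use std::io;
use std::process::Command;

/// After extracting a layer into `dataset`, tag it so later images that
/// share the chain can reuse it without a clone per layer.
fn tag_layer(dataset: &str, digest: &str) -> io::Result<()> {
    let snap = format!("{dataset}@{digest}");
    Command::new("zfs").args(["snapshot", snap.as_str()]).status()?;
    Ok(())
}

/// Only when two images diverge at this layer do we branch off a clone.
fn branch_layer(dataset: &str, digest: &str, child: &str) -> io::Result<()> {
    let snap = format!("{dataset}@{digest}");
    Command::new("zfs").args(["clone", snap.as_str(), child]).status()?;
    Ok(())
}
```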

ZFS send/recv can be less than ideal for the container use case though, because of how it works: even if the results of a send/recv are identical at the object level, the streams themselves operate at the block level, so you may get different representations, and there is no way to check whether two streams are the same without actually receiving them. That makes it a poor choice for layers.

If you are interested, you can check out the ocitar crate in this project, which acts as a proxy for bsdtar that injects/processes whiteout files in a stream.
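
(For context: in the OCI layer format, an entry named `.wh.foo` means "delete foo from the lower layers". Below is a generic sketch of that processing with the tar crate; ocitar's actual streaming implementation differs.)

```rust
use std::fs;
use std::io::Read;
use std::path::Path;
use tar::Archive;

/// Apply one layer tarball onto an extracted root, honouring OCI-style
/// whiteout entries. Opaque whiteouts (".wh..wh..opq") are omitted here.
fn apply_layer<R: Read>(layer: R, root: &Path) -> std::io::Result<()> {
    let mut archive = Archive::new(layer);
    for entry in archive.entries()? {
        let mut entry = entry?;
        let path = entry.path()?.into_owned();
        let name = path.file_name().and_then(|n| n.to_str()).unwrap_or("");
        if let Some(victim) = name.strip_prefix(".wh.") {
            // whiteout: remove the shadowed file instead of extracting
            let parent = path.parent().unwrap_or(Path::new(""));
            let target = root.join(parent).join(victim);
            if target.is_dir() {
                fs::remove_dir_all(&target)?;
            } else if target.exists() {
                fs::remove_file(&target)?;
            }
        } else {
            entry.unpack_in(root)?;
        }
    }
    Ok(())
}
```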

michael-yuji commented 1 year ago

In terms of task drivers, my current plan is to implement a CRI layer for Kubernetes, since xc is already capable of most of what CRI requires. Of course, a Nomad driver would be way easier, since in that case porting kubelet is not required.