hpc / charliecloud

Now hosted on GitLab.
https://gitlab.com/charliecloud/main
Apache License 2.0

Add support for bind mounts to directories not existing within a container on read-only FS #96

Closed olifre closed 10 months ago

olifre commented 6 years ago

On the WLCG containers mailing list, somebody suggested the following magic to be able to add bind-mount points to a read-only container without using overlayfs.

Is there a logic error in this approach?

olifre commented 6 years ago

A side effect I see is that the user in the user namespace is then the effective owner of / (I think), so new, potentially problematic possibilities arise (such as renaming /etc and recreating a new /etc, filling up the temporary local directory, etc.) which were impossible with a read-only container filesystem.

reidpr commented 6 years ago

We do this for /home right now, except we only add /home/$USER and not the other home directories.

I'm a little hesitant to put it in the C code since it seems kind of complex. Would a mkdir(2) at unpack time suffice?

We haven't yet clarified the contract on the unpacked image. For example, is it portable between machines? Can you just tar it up again to get a valid image tarball?

olifre commented 6 years ago

Would a mkdir(2) at unpack time suffice?

For the use case in mind, this would not be sufficient: the idea is that a third party (in our case, CERN / WLCG) provides containers via CVMFS as read-only directories, so they perform the unpack stage. These containers will be used at many different places on vastly different machines, which may require different bind mount points to make local filesystems accessible, and the directories for these may not yet exist in the containers.

The trick described here (and by now also suggested here: https://github.com/singularityware/singularity/issues/1207) would allow any bind mount point inside the container to be specified freely, without requiring a decision already at the unpack stage.

I think the C code is the only suitable place in that case, since the bind mounts need to be performed after activating the user namespace. But I agree, it's kind of complex, so I am not sure it should be at the top of the priority list of enhancements ;-).

reidpr commented 6 years ago

OK, thanks for the clarification.

Do you (or does anyone) know the prospects for overlayfs? I did try to implement ch-run using it, which worked fine on Ubuntu, but when I went to the upstream kernel I learned that it wasn't supported in combination with user namespaces.

reidpr commented 6 years ago

Also, is this a showstopper for you?

olifre commented 6 years ago

Also, is this a showstopper for you?

For our site, no (we don't need anything special in terms of bind-mounts). Our main showstopper right now is HTCondor's lack of correct support for any container implementation, which at the moment means only setuid root containers work correctly. Sadly, their upstream is very unresponsive, so I'm working on workarounds for now, and until this is done, we have to stay with privileged Singularity.

Since WLCG (Worldwide LHC Computing Grid) is currently on the "Singularity train", they will likely also not really care (but it would be a showstopper for one of the experiments if Singularity did not implement it).

My goal here would be to have the useful functionality in Charliecloud to have an alternative runtime which fulfills all the necessary requirements, and also, it looks like a reasonable extension, since it makes containers built by a third-party and distributed in a read-only manner more portable. I'll likely also ask the runC people about it. Sadly, I don't know anything about the prospects of overlayfs, I only know that it does not work with user namespaces as of yet, which is really sad, since this would of course be a significantly easier solution.

DrDaveD commented 6 years ago

Ubuntu has made their own modification to allow unprivileged overlayfs and it's not expected to get into the mainstream kernel anytime soon. I haven't found a definitive source I can point you to for that, but see https://lwn.net/Articles/671641/, especially the comments at the end.

DrDaveD commented 6 years ago

On the other hand https://lwn.net/Articles/718062/ says that "There has been a fair amount of work in adding support for unprivileged containers" to overlayfs. No details though.

reidpr commented 6 years ago

Thinking about whether this should go into 0.2.4.

Couple other options. Would these satisfy the use case?

olifre commented 6 years ago

Provide the read-only images as .tar.gz and unpack into RAM on each node (e.g. /var/tmp).

This seems very inefficient: distributing full .tar.gz files via CVMFS (or other means) wastes significantly more space on the servers and in caches than distributing just the deltas. On our site, we build new containers of several flavours at least once a day, with sizes of about 1 GB. The deltas to the last build are just a few MB, though, so only the small changes need to be transferred on demand. If old jobs are still running, they will use old versions of the containers, while newly started ones will use new versions, so several full containers need to be stored. The extraction (if done for each user's computing job, which would be the easiest, and in any case needed if we allow for custom containers) would be a significant overhead and would use a significant amount of memory (56 user jobs per host, 1 GB per container). Also, the image distribution technique chosen for WLCG images via CVMFS is already pretty much fixed to be the extracted file structure and not tarballs (since it's handled more efficiently by CVMFS). And: RAM (and I/O) are usually the limiting factors for high-throughput computing clusters (which is of course different for pure HPC clusters), so whatever can be saved in this regard is crucial, and CVMFS is really helpful here.

CVMFS helper on each node, privileged, that puts an overlayfs on top of the CVMFS mount.

This could work for local use on sites, but could not easily be used independently of the site, and could not easily be made available to users. I expect there will be some users working directly with images from CVMFS, e.g. the ones publicly provided by CERN, OpenScienceGrid, and probably other providers in the future. They may like to do that on their regular desktop machines, laptops, etc. For these cases, it would be best if the container runtime allowed custom bind mounts to be specified completely independently of the image, and without executing a privileged helper.

I think this pretty much summarizes the use case, maybe @DrDaveD can expand, he is closer to the WLCG working group on the topic. Since a safe and well-tested implementation in the C code base (string handling, lots of error handling) is not so quickly done, maybe this is too much for 0.2.4 - but I don't know.

reidpr commented 6 years ago

This seems very inefficient ....

OK

This could work for local use on sites, but could not be easily used independent from the site

OK

Since a safe and well-tested implementation in the C code base (string handling, lots of error handling) is not so quickly done

I'm not actually convinced of that; we already have a partial solution that does it for /home/$USER and it's not too hairy. Let's at least develop a patch and see how it looks.

DrDaveD commented 6 years ago

@olifre I saw your message but you summarized well, I really can't think of anything to add.

One thing that I don't see mentioned in this issue but which is in the second comment of singularityware/singularity#1207 and is an answer to the second comment in this issue, is to make '/' be a read-only bind mount of the separate scratch area, so the user cannot modify it.
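A minimal sketch of the sequence this seems to describe, written as a pure planning function rather than real mount calls. It is an assumption about the approach, not the code from the mailing list or from Singularity: the function name, the step names, and the restriction to single-component bind targets are all hypothetical. On Linux a read-only bind mount takes two mount(2) calls (MS_BIND first, then MS_REMOUNT|MS_BIND|MS_RDONLY), which the last two steps stand for.

```python
import posixpath


def underlay_steps(image, scratch, newroot, top_level, binds):
    """Return the ordered operations for a hypothetical underlay setup.

    image     -- read-only image directory (e.g. on CVMFS)
    scratch   -- empty writable directory, e.g. a fresh tmpfs
    newroot   -- directory that will become the container's /
    top_level -- names of the directories directly under the image root
                 (assumed to be plain directories, not files or symlinks)
    binds     -- (host_path, container_path) pairs; each container_path
                 is a single component under /, such as "/data"
    """
    steps = []
    # 1. Mirror every top-level image directory into the scratch area.
    for name in sorted(top_level):
        dest = posixpath.join(scratch, name)
        steps.append(("mkdir", dest))
        steps.append(("bind", posixpath.join(image, name), dest))
    # 2. Create mount points the image lacks, then bind the host paths.
    for host, target in binds:
        name = target.strip("/")
        if not name or "/" in name:
            raise ValueError(f"unsupported bind target: {target!r}")
        dest = posixpath.join(scratch, name)
        if name not in top_level:
            steps.append(("mkdir", dest))
        steps.append(("bind", host, dest))
    # 3. Expose the scratch area as the new root (recursively, so the
    #    binds from steps 1 and 2 come along) and remount it read-only,
    #    so the user cannot rename or recreate /etc.
    steps.append(("rbind", scratch, newroot))
    steps.append(("remount-ro", newroot))
    return steps
```

Calling `underlay_steps("/cvmfs/img", "/tmp/scratch", "/tmp/newroot", ["etc", "usr"], [("/local/data", "/data")])` (all paths made up for illustration) yields the mkdir and bind steps for etc and usr, then a mkdir and bind for /data, then the two read-only steps. The read-only remount applies only to the root mount itself, so a writable host directory bound in step 2 should stay writable.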

olifre commented 4 years ago

I just stumbled upon: https://github.com/containers/fuse-overlayfs which could also solve this issue by providing OverlayFS-like functionality rootlessly (but requiring rather recent libfuse and a recent kernel).

DrDaveD commented 4 years ago

@olifre I'm sorry you weren't aware of that, I have known about it for quite some time. It does require linux kernels >= 4.18, such as on CentOS 8. Meanwhile it's not been reported here, but singularity has had the underlay feature for over a year, first in the C++ 2.6 series and soon thereafter in the golang 3.x series.

olifre commented 4 years ago

@DrDaveD Thanks for chiming in! I have indeed only been aware of the technical possibility for a while (and I know about https://github.com/cvmfs-contrib/cvmfsexec of course), but I was unaware of fuse-overlayfs as a ready-to-use tool for integration with container runtimes, in a similar spirit to slirp4netns for networking.

tylerjereddy commented 4 years ago

I can't tell if this is the exact same issue, but it sounds similar to the title, I think: one thing Docker allows (mounting to a nested path that does not already exist in the image) seems to be prohibited by Charliecloud:

ch-run --no-home -b /home/tyler/some/git-repo:/workdir/git-repo /var/tmp/image-name:0.1.0 -- /bin/bash

ch-run[16232]: can't bind: not found: ... (ch_core.c:100)
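One reading of the error above, assuming ch-run requires the bind target to exist in the image already: for a nested target such as /workdir/git-repo, several directories may be missing, and the deepest ancestor that does exist is the one that would have to become writable (or be replaced by a scratch directory) before the mount points can be created. The sketch below computes both for a given target. The function name and the set-of-directories input are hypothetical and are not part of Charliecloud.

```python
from pathlib import PurePosixPath


def plan_mount_point(image_dirs, target):
    """Return (anchor, to_create) for one absolute bind target.

    image_dirs -- set of absolute directory paths present in the image
    anchor     -- deepest directory on the target's path that already
                  exists in the image
    to_create  -- directories missing from the image, parents first
    """
    path = PurePosixPath(target)
    if not path.is_absolute():
        raise ValueError(f"bind target must be absolute: {target!r}")
    missing = []
    # Walk from the target up to "/", e.g. /a/b, then /a, then /.
    for candidate in [path, *path.parents]:
        if str(candidate) in image_dirs:
            return str(candidate), [str(p) for p in reversed(missing)]
        missing.append(candidate)
    raise ValueError("image_dirs does not contain the root directory")


# Hypothetical image that has /etc and /home but no /workdir.
image_dirs = {"/", "/etc", "/home"}
anchor, to_create = plan_mount_point(image_dirs, "/workdir/git-repo")
print(anchor, to_create)
```

Here the anchor is / itself, which is why a read-only image cannot satisfy the request without some writable layer at the root. The sketch does not resolve symlinks or ".." components; a real implementation would have to.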