hpc / charliecloud

Lightweight user-defined software stacks for high-performance computing.
https://hpc.github.io/charliecloud
Apache License 2.0

ch-fromhost: make it work on read-only images #286

Open reidpr opened 5 years ago

reidpr commented 5 years ago

Background

For various reasons, SquashFS is a viable alternative to the current recommendation of unpacking a tarball (with ch-tar2dir) into a tmpfs. SquashFS mounts are read-only. Issue #96 describes a different image mount approach (CVMFS) that is also read-only.

While ch-run is happy with read-only image directories (and in fact re-mounts them read-only), ch-fromhost currently depends on modifying an unpacked image directory, which doesn't work with read-only image directories.

This seems sub-optimal. Below are some options to get ch-fromhost-like features to inject shared libraries and other files on read-only images. We have not articulated the pros and cons at this point, merely enumerated some options.

Option 0: Status quo

The current method to deal with this is, assuming SquashFS:

  1. ch-tar2dir
  2. ch-fromhost
  3. mksquashfs(1) the modified directory, creating a non-portable image archive
  4. Mount the squashball.
  5. ch-run

Option 1: Add another directory to ld.so.conf

Edit ld.so.conf to add a new directory, which we prepare in temporary space and bind-mount into the container at run time.

This lets us add shared libraries to the standard search paths (i.e., to be found by ld.so, the dynamic linker), but it doesn't address use cases like MCA modules for OpenMPI (which are also .so files) or miscellaneous files for Cray MPICH.

  1. Copy /etc/ld.so.conf from the image directory to /tmp/ld.so.conf.
  2. Add a new directory to the top of this file, say /mnt/chlib.
  3. Create a zero-byte file at /tmp/ld.so.cache.
  4. Create an empty directory /tmp/chlib.
  5. Put the shared libraries to inject in /tmp/chlib.
  6. Bind-mount:
    • /tmp/ld.so.conf → /etc/ld.so.conf (overmount)
    • /tmp/ld.so.cache → /etc/ld.so.cache (overmount)
    • /tmp/chlib → /mnt/chlib
  7. Run ldconfig within the container.
  8. Run user code.
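
The host-side preparation in steps 1 through 6 can be sketched in Python as follows. This is a minimal illustration rather than how ch-fromhost works: the function name stage_option1 and its parameters are hypothetical, and the actual bind-mounting, ldconfig run, and user code (steps 6 through 8) are left to whatever launches the container. The function only returns the mapping of host paths to container paths that step 6 would need.

```python
import shutil
from pathlib import Path


def stage_option1(image_dir, stage_dir, libs, guest_lib_dir="/mnt/chlib"):
    """Prepare host-side files for steps 1-5 and return the step 6 bind plan.

    All names here are hypothetical; stage_dir plays the role of /tmp.
    """
    image_dir, stage_dir = Path(image_dir), Path(stage_dir)
    stage_dir.mkdir(parents=True, exist_ok=True)

    # Steps 1-2: copy the image's ld.so.conf and list the new directory first.
    original = (image_dir / "etc/ld.so.conf").read_text()
    conf = stage_dir / "ld.so.conf"
    conf.write_text(guest_lib_dir + "\n" + original)

    # Step 3: zero-byte cache file, to be rewritten by ldconfig in the container.
    cache = stage_dir / "ld.so.cache"
    cache.write_bytes(b"")

    # Steps 4-5: directory holding the shared libraries to inject.
    chlib = stage_dir / "chlib"
    chlib.mkdir(exist_ok=True)
    for lib in libs:
        shutil.copy2(lib, chlib / Path(lib).name)

    # Step 6 (plan only): host path -> path inside the container.
    return {
        str(conf): "/etc/ld.so.conf",
        str(cache): "/etc/ld.so.cache",
        str(chlib): guest_lib_dir,
    }
```

Nothing in the image directory is written, so the sketch is compatible with a read-only image; everything that changes lives under the staging directory.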

Option 2: Overmount directories with recursive copies

Make a recursive copy of any directories we want to inject into (e.g., in the case of OpenMPI, the first directory in ld.so.conf and /usr/local/lib/openmpi) into host /tmp. Add our files to those directories and bind-mount them in. Bind-mount in an empty, writeable ld.so.cache. Run ldconfig.

Note that we do need the first directory in ld.so.conf because we need to be able to override shared libraries installed anywhere. The first .so found wins.

The recursive copies can be substantial. For example, /usr/local/lib on Reid's development box contains 5,500 files.
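
A rough Python sketch of the copy-and-inject step, assuming the image is available as a directory on the host. The function name stage_option2, the overmountN directory names, and the inject mapping are all hypothetical. It also reports how many files had to be copied from the image, which is the cost discussed above. Bind-mounting the results and running ldconfig are not shown.

```python
import os
import shutil
from pathlib import Path


def stage_option2(image_dir, stage_dir, inject):
    """Copy each target directory out of the image and add the new files.

    inject maps a directory inside the image (e.g. "/usr/local/lib/openmpi")
    to a list of host files to add to it. Returns (bind plan, files copied).
    """
    image_dir, stage_dir = Path(image_dir), Path(stage_dir)
    plan, copied = {}, 0
    for i, (guest_dir, files) in enumerate(sorted(inject.items())):
        src = image_dir / guest_dir.lstrip("/")
        dst = stage_dir / f"overmount{i}"
        # Full recursive copy; symlinks=True keeps library symlinks as links.
        shutil.copytree(src, dst, symlinks=True)
        # Count what came from the image before adding anything new.
        copied += sum(len(names) for _, _, names in os.walk(dst))
        for f in files:
            shutil.copy2(f, dst / Path(f).name)
        # Host path -> path inside the container to overmount.
        plan[str(dst)] = guest_dir
    return plan, copied
```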

Option 3: Overmount directories with symlink farms

Like Option 2, except instead of copying the overmounted directories, we bind-mount them to a second location within the image. In /tmp on the host, we create a new directory containing symlinks to the second location for all existing items, then add our new items.

This is not a recursive process because we need only address the first level in the overmounted directory.
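
To make the layout concrete, here is a small Python sketch of building one symlink farm. It is an illustration under assumptions, not a tested design: build_symlink_farm and its parameters are hypothetical, and second_location stands for whatever path inside the image the original directory gets bind-mounted to. Because the image is read-only, that path would have to exist in the image already. The symlinks point at container paths, so they dangle on the host and resolve only once both bind mounts are in place.

```python
import os
import shutil
from pathlib import Path


def build_symlink_farm(image_dir, farm_dir, guest_dir, second_location, new_files):
    """Build a first-level symlink farm for guest_dir and return the mounts.

    Returns (host path, container path) pairs in the order they would need
    to be bind-mounted: original directory first, then the farm over it.
    """
    image_dir, farm_dir = Path(image_dir), Path(farm_dir)
    farm_dir.mkdir(parents=True)
    src = image_dir / guest_dir.lstrip("/")

    # First level only: one symlink per existing entry, file or directory.
    for name in sorted(os.listdir(src)):
        os.symlink(f"{second_location}/{name}", farm_dir / name)

    # New items replace the symlink of the same name, if there is one.
    for f in new_files:
        dest = farm_dir / Path(f).name
        if dest.is_symlink():
            dest.unlink()
        shutil.copy2(f, dest)

    return [(str(src), second_location), (str(farm_dir), guest_dir)]
```

For example, with guest_dir set to /usr/local/lib and second_location set to /mnt/orig (again a made-up path), the farm would contain an entry openmpi pointing to /mnt/orig/openmpi plus a real copy of each injected library.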

trandles-lanl commented 5 years ago

Opinions on the enumerated options.

Option 0: This can all be accomplished as a normal user but seems cumbersome. It also might not work well on any system where step 4 is a privileged mount handled by some automatic process on the cluster (e.g., the user left the squashball somewhere the automatic process couldn't find it). However, this is essentially what the Shifter image gateway does, so it is workable in principle.

Option 1: This is the approach I would instinctively (and perhaps naively) pursue. Can you elaborate more on why it doesn't address the Cray MPICH-style use cases that ch-fromhost already supports? From my limited understanding of how ch-fromhost handles the Cray MPICH case, it's not obvious to me why it's unsupported.

Option 2: This might be a reasonable approach for some large fraction of applications. In my experience, Reid's /usr/local/lib is abnormally large. On my development box /usr/local/lib is 33 files.

Option 3: Yuck? This smells bad. EDIT: Sorry, I re-read the OP again and this doesn't smell as bad. I'd be interested in seeing this demonstrated by hand to have a really good grasp of how things end up looking at ch-run time.

reidpr commented 5 years ago

> Option 1: [...] Can you elaborate more on why it doesn't address the Cray MPICH-style use cases that ch-fromhost already supports?

The trick is that we need to put arbitrary files of arbitrary type in arbitrary directories. In the case of Cray MPICH, there are some random files and directories scattered here and there; in the case of OpenMPI we need shared libraries in OpenMPI's directories (e.g., /usr/local/lib/openmpi).

> Option 2: This might be a reasonable approach for some large fraction of applications. In my experience, Reid's /usr/local/lib is abnormally large. On my development box /usr/local/lib is 33 files.

The bulk of it is Python modules for 3 different versions of libraries.

I believe this also gets caught up in the "arbitrary locations" problem. E.g. what if we need to inject into /opt or /usr/local?

> Option 3: [...] I'd be interested in seeing this demonstrated by hand to have a really good grasp of how things end up looking at ch-run time.

Likewise.