nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Incompatibility with charliecloud@0.34 #4463

nschan opened this issue 1 year ago (status: Open)

nschan commented 1 year ago

Bug report

Running nextflow@23.10.0 together with charliecloud@0.34 yields an error:

error: can't run directory images from storage (hint: run by name)

Expected behavior and actual behavior

The expected behaviour is that nextflow is able to start the container. This is related to a change in charliecloud's behaviour (https://github.com/hpc/charliecloud/pull/1505).

Steps to reproduce the problem

This problem should occur in any case where charliecloud > 0.33 is used as the container engine for nextflow. I do not think the problem is specific to the nextflow version used.

Program output

Command exit status:
  1

Command output:
  (empty)

Command error:
  ch-run[64707]: error: can't run directory images from storage (hint: run by name) (ch-run.c:320)

nschan commented 1 year ago

I realized that this issue is tied to $CH_IMAGE_STORAGE. ch-run refuses to run images from its storage directory by path, and accepts them only by name. One possible workaround is to unset CH_IMAGE_STORAGE and point NXF_CHARLIECLOUD_CACHEDIR at the same directory before attempting to run a pipeline.
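A minimal sketch of that workaround, assuming a typical site setup (the fallback storage path is an example; substitute your own):

```shell
# ch-run rejects path-based access to images inside $CH_IMAGE_STORAGE,
# so hand the same directory to Nextflow via NXF_CHARLIECLOUD_CACHEDIR
# and unset CH_IMAGE_STORAGE before launching. Fallback path is an example.
export NXF_CHARLIECLOUD_CACHEDIR="${CH_IMAGE_STORAGE:-/var/tmp/$USER.ch}"
unset CH_IMAGE_STORAGE
# ...then launch the pipeline as usual, e.g.: nextflow run <pipeline>
```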

reidpr commented 1 year ago

Charliecloud project lead here. Thanks for this bug report!

This behavior is by design. In general, the storage directory has never been a stable API, so folks should not depend on its structure. However, we've only been erroring on ch-run against images in the storage directory since Charliecloud 0.31, I think.

You can now run images in storage by specifying their name, e.g. instead of ch-run $CH_IMAGE_STORAGE/img/foo ... you would say ch-run foo .... (Note that image names are encoded, so you can't reliably say ch-run $(basename $CH_IMAGE_STORAGE/img/foo).)

However, ch-run -w foo is disallowed because this is quite likely to corrupt the storage directory. In this case you would have to export to another directory with ch-convert, then ch-run -w that other directory. (However, best practice is to avoid -w and bind-mount output directories and whatnot.)
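As an illustrative sketch of that route (the image name "foo" and the target path are hypothetical; the guards make it harmless on machines without Charliecloud or without such an image):

```shell
# Export the image out of the storage directory, then -w is safe
# because the writable copy no longer lives inside $CH_IMAGE_STORAGE.
if command -v ch-convert >/dev/null 2>&1 \
   && ch-image list 2>/dev/null | grep -qx foo; then
  ch-convert foo /var/tmp/foo.dir      # storage image -> plain directory
  ch-run -w /var/tmp/foo.dir -- true   # writable run outside storage
  status=ran
else
  echo "Charliecloud (or image 'foo') not available; shown for illustration"
  status=skipped
fi
```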

reidpr commented 1 year ago

An alternative to ch-convert would be to build a small derived image containing the desired mount directory.

See also the embarrassingly old hpc/charliecloud#96.

nschan commented 10 months ago

Since charliecloud will soon provide --write-fake, are there plans to add support for this to nextflow? I guess it would require some checks for the charliecloud version used, or removing support for versions that do not support --write-fake. As I understand it, versions < 0.33 are fine with -w from the cache directory, but > 0.33 are not. Presumably --write-fake would be supported at some point in the future (> 0.35?)

reidpr commented 10 months ago

Unfortunately, no version of Charliecloud is fine with -w in the storage directory, though consequences vary. We didn't enforce this until recently, however.

--write-fake will be available in 0.36, to be released soon.

nschan commented 10 months ago

I guess I should have said 'works, although it is not a great idea' :). I think using --write-fake for upcoming versions is the best solution; the main issue I see is how older versions of charliecloud could be supported, or whether that is even necessary. @phue what are your thoughts on this?

phue commented 10 months ago

The sole purpose of that -w flag was to enable bind-mounting of the nextflow work directory when using public container images. Now that charliecloud will soon support overlayfs, it should be dropped. And yes, compatibility with earlier versions is problematic, but that has happened before because of another breaking change.

The shared (multi-user) cache directory you are using is another can of worms though, I currently don't see how that can work properly with the rather recent git-based layer caching used in charliecloud. I think the way forward is to just drop the CharliecloudCache class and leave it to the user to pre-pull images if desired.

@nschan Could you try these config settings with the current charliecloud git, to see if that already works with the latest nextflow version?

charliecloud {
  cacheDir = false
  readOnlyInputs = true
  runOptions = "--write-fake"
}

nschan commented 10 months ago

I generally agree that breaking backwards compatibility could be justified in this case.

I see the issues with the multi-user cache. In practice that cache is pretty much only used by me, but we will move away from a shared cache.

Anyhow, the approach you suggested does not currently work (nextflow@23.10.1, charliecloud@0.36~pre+78aae5e), but I assume that is something for @reidpr ?

Command error:
  ch-run[47171]: error: can't read: /ch/environment: Stale file handle (ch_misc.c:259 116)

When I try to run the container manually with -W or --write-fake I get this:

ch-run[35682]: error: can't overlay: Operation not permitted (ch_core.c:323 1)

This is using quay.io/biocontainers/agat:1.1.0--pl5321hdfd78af_1

Edit: Since you mention dropping CharliecloudCache: it is in any case somewhat necessary to pre-pull images, since pipelines often start a bunch of different processes in the first step, which frequently leads to #3566
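One way to pre-pull could look like the sketch below, assuming Charliecloud's ch-image as the puller; the image list is an example, and the guards keep the sketch harmless where Charliecloud is absent or the pull fails:

```shell
# Pre-pull the images a pipeline needs, so parallel first-step tasks
# find them in the cache instead of racing to pull the same image.
for img in quay.io/biocontainers/agat:1.1.0--pl5321hdfd78af_1; do
  if command -v ch-image >/dev/null 2>&1; then
    ch-image pull "$img" || echo "pull failed: $img"
  else
    echo "would pull: $img"
  fi
done
```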

Another edit: hpc/charliecloud#1793 mentions kernel > 5.11, ideally > 6.6. The HPC system I am using runs 4.12.14, so I guess that explains why --write-fake does not work for me...
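A quick pre-flight check for that constraint; the version_ge helper is my own sketch, and the 5.11 threshold is the one quoted above:

```shell
# version_ge A B: succeeds if version A >= version B (sort -V comparison).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

if version_ge "$(uname -r)" 5.11; then
  echo "kernel may be new enough for --write-fake (unprivileged overlayfs)"
else
  echo "kernel too old for --write-fake"
fi
```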

nschan commented 10 months ago

To sum up:

I think the fact that relying on --write-fake would break compatibility with older kernels is very problematic. Not supporting older charliecloud versions is one thing, but dropping support for older kernels seems a bit drastic.

I think an alternative could be something along the lines of building writable, process-specific containers. This would hopefully solve most of the problems associated with running containers from a cache. I admit my understanding of containers is pretty basic, but as I read the current implementation, all parallel processes that use the same container image run in the same container (irrespective of whether $CH_IMAGE_STORAGE is set). Wouldn't it be better to have one dedicated container per process? Using a cache and ch-convert (as suggested by @reidpr earlier) seems like a reasonable approach to me. This would keep things backwards compatible, avoid the current issues with -w, and might actually be the correct way to run containers..?

It would require some adjustments to the CharliecloudBuilder, essentially adding a ch-convert step to create the container. I am not sure if the new container should be stored in the current work directory, or if that will create a horrible mess when the work directory, including the container, is mounted into the container?

Of course, I guess adding -w --unsafe instead of -w would be another (strongly discouraged) option.

reidpr commented 10 months ago

Anyhow, the approach you suggested does not currently work (nextflow@23.10.1, charliecloud@0.36~pre+78aae5e), but I assume that is something for @reidpr ?

Command error:
  ch-run[47171]: error: can't read: /ch/environment: Stale file handle (ch_misc.c:259 116)

“Stale file handle” is something about inode re-use. I know it often comes up in NFS. If it’s a Charliecloud bug we definitely want to fix it; is there a reproducer without Nextflow?

I read the current implementation all parallel processes that use the same container image run in the same container (irrespective $CH_IMAGE_STORAGE set). Wouldn't it be better to have one dedicated container per process?

Processes on the same node that are in different containers cannot use certain system calls to communicate, and this has performance impact. That’s the motivation for --join, though it’s not the default.

Note that multiple containers can be running the same image. I’d be skeptical of a separate image per process. That’d be a lot of images on a wide computer.

Another thing to consider is hpc/charliecloud#1408. This would be a supported way to create a container modified by a shell command (e.g., to add mount points).

One could also ch-convert, then manually adjust the resulting directory.

I hope that helps.

nschan commented 10 months ago

I could not reproduce the "stale file handle" outside nextflow; it might also have other, filesystem-related causes.

Processes on the same node that are in different containers cannot use certain system calls to communicate, and this has performance impact. That’s the motivation for --join, though it’s not the default.

I was referring to nextflow processes (tasks), which I think should anyway not communicate with each other while running.

What is not really clear to me is what would happen if I have several (simultaneous) jobs running containers of the same image, and for some reason input-specific files are created in a certain folder of the image (e.g. /opt/someprogram/somefile.tmp) with -w. As I understand it, all jobs see what was written to /opt/someprogram by other jobs. If the contents of somefile.tmp are specific to the task input, then this could turn out to be somewhat of a problem, no?
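The hazard can be sketched without containers at all: a plain shared directory stands in for one writable image used by two concurrent tasks (paths and file contents below are made up):

```shell
shared=$(mktemp -d)   # stands in for /opt/someprogram inside one shared -w image
echo "state for input A" > "$shared/somefile.tmp" &   # task A
echo "state for input B" > "$shared/somefile.tmp" &   # task B
wait
# Whichever task wrote last wins; the other task may then read
# state that belongs to a different input.
cat "$shared/somefile.tmp"
```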

If that could happen, I think this would argue in favour of using ch-convert for each nextflow task. If it is not an issue, then I think it should be OK for nextflow to just store images in $workDir/charliecloud, as it currently does when $CH_IMAGE_STORAGE and $NXF_CHARLIECLOUD_CACHEDIR are not set.

reidpr commented 10 months ago

Looks like I misunderstood. Thanks for clarifying.