labsyspharm / mcmicro

Multiple-choice microscopy pipeline
https://mcmicro.org/
MIT License

Caching segmentation models #527

Closed. MDHowe4 closed this issue 10 months ago.

MDHowe4 commented 11 months ago

On our cluster, compute nodes do not have internet access. I can download and cache all of the container images easily, either manually or by doing a dry run on a head node, but I am scratching my head over how to cache the TissueNet segmentation models for use with Mesmer/Cellpose/etc., since they are downloaded while the containers are running in the pipeline. This isn't a problem when I've used these containers outside Nextflow, because the models just get cached in $HOME. As I understand it, with Nextflow the models will instead be cached in the Nextflow working directory work; if that directory changes, the models have to be downloaded again, which is a problem. My issue seems similar to labsyspharm/mcmicro/issues/515, but I am hoping there is an answer here that doesn't involve hardcoding the container paths, rebuilding the container, and redeploying it. I wonder if defining a permanent NXF_WORK directory would work, but I worry that cleaning the working directory would force the models to be downloaded again. Maybe someone here has overcome this problem before?
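(For reference, the work directory can be pinned either by exporting NXF_WORK before launching or via workDir in a config file; a minimal sketch with an illustrative path, though by itself this does not keep the models around once the work directory is cleaned:)

// minimal sketch, illustrative path; pins where Nextflow creates task work dirs
workDir = '/shared/scratch/mcmicro_work'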

ArtemSokolov commented 11 months ago

Hi @MDHowe4,

I believe this happens because we explicitly set $HOME inside the container to the work directory: https://github.com/labsyspharm/mcmicro/blob/master/config/nf/singularity.config#L3

The change was introduced in #453, but I don't remember what was breaking that required this.

Can you try creating a custom.config and adding this line to it to reset $HOME back to its location on the host system?

singularity.runOptions = '-C -H "$HOME"'

You can pass this config to nextflow with

nextflow run labsyspharm/mcmicro -c custom.config --in myproject -profile singularity

If it can't find the input files, you can also try explicitly exposing the work directory (without setting it to be $HOME) with:

singularity.runOptions = '-C -H "$HOME" -B "$PWD"'
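For completeness, the resulting custom.config might look roughly like this (a minimal sketch; the enabled/autoMounts lines may be redundant if the singularity profile already sets them):

// custom.config -- minimal sketch; use whichever runOptions variant works for you
singularity.enabled    = true
singularity.autoMounts = true
singularity.runOptions = '-C -H "$HOME" -B "$PWD"'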

If it's still failing after that, can you post the output of grep singularity .command.run from the work directory of the failing process?

MDHowe4 commented 10 months ago

Thank you for the help @ArtemSokolov,

Both commands resulted in this error:

Caused by:
  Process registration:ashlar terminated with an error exit status (1)

Command executed:

  ashlar 'exemplar-001-cycle-06.ome.tiff' 'exemplar-001-cycle-07.ome.tiff' 'exemplar-001-cycle-08.ome.tiff'  -m 30 --ffp exemplar-001-cycle-06-ffp.tif exemplar-001-cycle-07-ffp.tif exemplar-001-cycle-08-ffp.tif --dfp exemplar-001-cycle-06-dfp.tif exemplar-001-cycle-07-dfp.tif exemplar-001-cycle-08-dfp.tif -o exemplar-001.ome.tif

Command exit status:
  1

Command output:
  Stitching and registering input images
  Cycle 0:
      reading exemplar-001-cycle-06.ome.tiff

Command error:
  INFO:    squashfuse not found, will not be able to mount SIF
  INFO:    Converting SIF file to temporary sandbox...
  Traceback (most recent call last):
    File "/usr/local/bin/ashlar", line 8, in <module>
      sys.exit(main())
    File "/usr/local/lib/python3.10/dist-packages/ashlar/scripts/ashlar.py", line 212, in main
      return process_single(
    File "/usr/local/lib/python3.10/dist-packages/ashlar/scripts/ashlar.py", line 237, in process_single
      reader = build_reader(filepaths[0], plate_well=plate_well)
    File "/usr/local/lib/python3.10/dist-packages/ashlar/scripts/ashlar.py", line 358, in build_reader
      reader = reader_class(path, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/ashlar/reg.py", line 398, in __init__
      self.metadata = BioformatsMetadata(self.path)
    File "/usr/local/lib/python3.10/dist-packages/ashlar/reg.py", line 210, in __init__
      self._init_metadata()
    File "/usr/local/lib/python3.10/dist-packages/ashlar/reg.py", line 234, in _init_metadata
      self._reader.setId(self.path)
    File "jnius/jnius_export_class.pxi", line 1177, in jnius.JavaMultipleMethod.__call__
    File "jnius/jnius_export_class.pxi", line 885, in jnius.JavaMethod.__call__
    File "jnius/jnius_export_class.pxi", line 982, in jnius.JavaMethod.call_method
    File "jnius/jnius_utils.pxi", line 91, in jnius.check_exception
  jnius.JavaException: JVM exception occurred: exemplar-001-cycle-06.ome.tiff (No such file or directory) java.io.FileNotFoundException
  INFO:    Cleaning up image...

Work dir:
  /research/labs/hematology/hemedata/projects/spatial/mcmicro_test/work/bf/6d4d46e4f192133d2acdb9c05047ac

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

The output of grep apptainer .command.run (our cluster has apptainer rather than singularity) was:

set +u; env - PATH="$PATH" ${TMP:+APPTAINERENV_TMP="$TMP"} ${TMPDIR:+APPTAINERENV_TMPDIR="$TMPDIR"} APPTAINERENV_NXF_DEBUG=${NXF_DEBUG:=0} apptainer exec --pid -B /research/labs/hematology/hemedata/projects/spatial/mcmicro_test -C --unsquash -H "$HOME" -B "$PWD" /research/labs/hematology/hemedata/apptainer/containers/labsyspharm-ashlar-1.17.0.img /bin/bash /research/labs/hematology/hemedata/projects/spatial/mcmicro_test/work/bf/6d4d46e4f192133d2acdb9c05047ac/.command.run nxf_trace

For the sake of posterity, here are my config and submission script. Prior to our discussion I was running apptainer.runOptions = '-C --unsquash -H "$PWD"', which is able to find the exemplar-001 images.

config:

// Execution environment for miscellaneous tasks
params.roadie = 'labsyspharm/roadie:2023-03-08'

report.overwrite     = true
apptainer.enabled    = true
apptainer.autoMounts = true
apptainer.runOptions = '-C --unsquash -H "$HOME" -B "$PWD"'
params.contPfx       = 'docker://'

process{
  executor = 'slurm'
  queue    = 'cpu-short'
  cpus     = 4
  time     = '6h'
  memory   = '64GB'
}

submission:

#!/bin/sh

#SBATCH --partition=cpu-short
#SBATCH --job-name=mcmicro_test1
#SBATCH -o mcmicro-%J.log
#SBATCH --time=1:00:00
#SBATCH --mem=8G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=
#SBATCH --chdir /research/labs/hematology/hemedata/projects/spatial/mcmicro_test

module purge
module load nextflow
module load apptainer

export NXF_APPTAINER_CACHEDIR="/research/labs/hematology/hemedata/apptainer/containers"

nextflow -C testconf_int.config run labsyspharm/mcmicro --in exemplar-001 -with-report metrics.html

ArtemSokolov commented 10 months ago

Yea, I was afraid of that.

Let me play around with it this week, but one other quick thing to try is:

apptainer.runOptions = '-C --unsquash -H "$PWD" -B "$HOME/.deepcell":"$PWD/.deepcell"'

This keeps the home directory set to work as before, but also exposes the host machine's DeepCell models directory as .deepcell inside work. Hopefully, lookups of ~/.deepcell inside the container will then resolve to the correct location on the host machine.

MDHowe4 commented 10 months ago

It does indeed resolve to exactly where it is needed, so I believe this issue is now solved. I was able to get Mesmer working with a .keras folder and Cellpose with a .cellpose folder, each containing pre-downloaded models in the correct directory structure under my $HOME directory. With that in place I could segment and quantify exemplar-002 on our cluster with Mesmer through MCMICRO. I did hit a problem with the Cellpose pre-trained tissuenet model (but not the cyto model) when running MCMICRO: the segmentation masks are empty and there is no quantification output for cells. I will need to troubleshoot a bit to see whether this is actually problematic behavior or an issue on my end; it is not cluster-related, since it also fails on exemplar-001 on my local machine. Regardless, I can now run MCMICRO reproducibly on our cluster with these containers. Thank you very much for the help, Artem.
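For anyone finding this later, the working setup amounts to bind-mounting the pre-downloaded model caches from the host $HOME into the in-container home; a rough sketch of the final runOptions (the exact cache folders, here .keras and .cellpose, depend on which segmentation modules you use):

// sketch of the resolved apptainer options, assuming models were pre-downloaded
// into $HOME/.keras and $HOME/.cellpose on the host
apptainer.runOptions = '-C --unsquash -H "$PWD" ' +
                       '-B "$HOME/.keras":"$PWD/.keras" ' +
                       '-B "$HOME/.cellpose":"$PWD/.cellpose"'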

ArtemSokolov commented 10 months ago

Great to hear that you got the containers working!

I'm going to close this issue, but feel free to open a new one if you spot a bug related to how MCMICRO uses Cellpose.