Hi Sadhana,
Do you get the same out of memory error when running the Mesmer container outside of MCMICRO? Here is the set of instructions to apply the container directly to an image: https://github.com/vanvalenlab/deepcell-applications#using-docker
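For example, a bare-bones call along the lines of that README might look like the sketch below (just an illustration; the image tag, file names, and channel indices are placeholders to adapt to your data):

```sh
# Rough sketch only -- file names, channel indices, and the image tag are
# placeholders; see the deepcell-applications README for the full option list.
docker run -it --gpus all \
  -v "$PWD":/data \
  vanvalenlab/deepcell-applications:latest-gpu \
  mesmer \
  --nuclear-image /data/pilot_2.ome.tiff --nuclear-channel 0 \
  --membrane-image /data/pilot_2.ome.tiff --membrane-channel 22 35 58 26 17 51 55 38 \
  --compartment "both" --image-mpp 0.5 \
  --output-directory /data --output-name mask.tif
```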
Hi Artem,
I am able to run the Mesmer container outside of MCMICRO. I tried the GPU version of deepcell-applications, and it is able to generate the mask file.
Hi @sadhanagaddam3,
Can you try changing your params.yml to the following:

```yaml
workflow:
  start-at: segmentation
  stop-at: quantification
  viz: true
  segmentation: mesmer
options:
  mesmer: --image-mpp 0.5 --compartment "both" --nuclear-channel 0 --membrane-image pilot_2.ome.tiff --membrane-channel 22 35 58 26 17 51 55 38
modules:
  segmentation:
    - name: mesmer
      version: 0.4.0-gpu
```
Notable changes:
- `--squeeze` and `--nuclear-image` will be handled by MCMICRO, so they can be dropped.
- `--membrane-image` can refer directly to the image, since it will be staged by Nextflow in the work directory.
- `--membrane-channel` should not contain commas, if I'm reading the Mesmer docs correctly.

If you are still having issues, can you please share the following:
- the `.command.run` script generated by Nextflow in the work directory (the path to the work directory will appear near the bottom of the error message), and
- the command you used to get `deepcell-applications` working outside MCMICRO.

Hi @ArtemSokolov, thanks for your timely advice. I have been working closely with @sadhanagaddam3 on the same project and will reply on her behalf on this issue. We tried the params.yml you suggested and still ran into the error about failing to use the GPU.
(1) The output from `bash .command.run` is shown here in detail:

```
WARNING: DEPRECATED USAGE: Forwarding SINGULARITYENV_TMPDIR as environment variable will not be supported in the future, use APPTAINERENV_TMPDIR instead
2023-01-23 13:25:06.815433: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: UNKNOWN ERROR (34)
2023-01-23 13:25:06.815545: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (sh02-ln01.stanford.edu): /proc/driver/nvidia/version does not exist
2023-01-23 13:25:06.819901: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:No training configuration found in save file, so the model was not compiled. Compile it manually.
[2023-01-23 13:25:16,618]:[WARNING]:[tensorflow]: No training configuration found in save file, so the model was not compiled. Compile it manually.
/oak/stanford/groups/oro/nancyliy/CODEX/Nextflow/work/83/269e02e1afbaa5c3fad66dc481f99f/.command.sh: line 2: 200436 Killed python /usr/src/app/run_app.py mesmer --squeeze --output-directory . --output-name cell.tif --nuclear-image pilot_2.ome.tiff --image-mpp 0.5 --compartment "both" --nuclear-channel 0 --membrane-image pilot_2.ome.tiff --membrane-channel 22 35 58 26 17 51 55 38
```
(2) The command we ran to get `deepcell-applications` working outside MCMICRO:

```sh
ml system libnvidia-container
export APPLICATION=mesmer
singularity run --nv deepcell-applications_latest-gpu.sif $APPLICATION --image-mpp 0.5 --nuclear-image /oak/stanford/groups/oro/Sadhana/Nancy/codex/CODEX_pilot/registration/pilot_2.ome.tiff --nuclear-channel 0 --membrane-image /oak/stanford/groups/oro/Sadhana/Nancy/codex/CODEX_pilot/registration/pilot_2.ome.tiff --membrane-channel 22 35 58 26 17 51 55 38 --output-directory output/ --output-name Mask_pilot2.tif --compartment "both"
```
Thank you for your kind input!
Hi @nancyliy. Sorry if my previous message was confusing. I was hoping to see the actual content of `.command.run`, not the output of running it. Specifically, I am curious what the header looks like (i.e., whether the first few lines have any `SBATCH` directives) and how the `singularity run` command inside `.command.run` compares to what you ran manually.
One thing that catches my attention in your more recent message is that your manual run specified `--mem-per-gpu=128G`, while the original issue listed `--mem=64GB` in its run_nextflow.txt. What happens if you provide MCMICRO with the same amount of memory as your manual Mesmer run, or vice versa?
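For context, when Nextflow submits jobs to SLURM, the top of `.command.run` normally carries the job directives; it looks roughly like the sketch below (purely illustrative -- the job name, path, and values are made up and will differ on your system):

```sh
#!/bin/bash
#SBATCH -J nf-segmentation_mesmer      # job name (example only)
#SBATCH -o /path/to/work/dir/.command.log
#SBATCH --no-requeue
#SBATCH --mem 65536                    # memory in MB, present only if a memory directive was set
```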
Hi @ArtemSokolov, thank you for your prompt reply!
Here, for your insight, is the actual content of `.command.run` and `.command.sh`, respectively. Meanwhile, I will try setting a different amount of RAM as you recommended and see what happens (I suspect MCMICRO was not making use of the GPU resources allocated to it, regardless of the memory allowance, because it seems to have trouble finding the GPU device).
The following is the actual content of `.command.run`. I printed it into a .txt file named command_run.txt (attached).
The following is the actual content of `.command.sh`:

```sh
python /usr/src/app/run_app.py mesmer --squeeze --output-directory . --output-name cell.tif --nuclear-image pilot_2.ome.tiff --image-mpp 0.5 --compartment "both" --nuclear-channel 0 --membrane-image pilot_2.ome.tiff --membrane-channel 22 35 58 26 17 51 55 38
```
Let me know how it goes with increased RAM.
I am not seeing anything problematic in `.command.run`. Specifically, there are no directives modifying the amount of allocated memory, and the `singularity exec` command matches what you ran by hand:

```sh
singularity exec -B /oak/stanford/groups/oro/nancyliy/CODEX/Nextflow --nv /oak/stanford/groups/oro/nancyliy/CODEX/Nextflow/work/singularity/vanvalenlab-deepcell-applications-0.4.0-gpu.img /bin/bash -ue /oak/stanford/groups/oro/nancyliy/CODEX/Nextflow/work/5c/f5cb9e26a710970815048c74882be4/.command.sh
```

In particular, I see the `--nv` flag, which makes the GPU visible inside the container, and `-0.4.0-gpu`, which runs the GPU version of Mesmer. I believe the original problem with the GPU was due to MCMICRO running the non-GPU Mesmer container by default (which is now correctly overridden by `version: 0.4.0-gpu`).
So, yes, I would try to match the SBATCH flags (specifically, `--mem-per-gpu=128G`) between the MCMICRO and manual Mesmer runs.
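If it helps, one place to do that is the custom config you pass to Nextflow; a minimal sketch might look like the following (the `withName` selector and the SLURM options are assumptions -- check the actual process name printed during your run and the partition/GPU flags used on your cluster):

```groovy
// Sketch of a custom Nextflow config; the selector and cluster options are assumptions.
process {
  withName: 'mesmer' {
    memory = '128 GB'                            // match the manual 128G memory request
    clusterOptions = '--partition=gpu --gres=gpu:1'
  }
}
```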
Hi @ArtemSokolov, thanks so much for your timely advice -- increasing the RAM to 128 GB worked and segmentation ran smoothly :) Another, independent error occurred at the `quantification:mcquant` stage -- I will open a separate issue for that. Thanks so much; this is much appreciated!
Glad to hear you got it working.
Hello,
I'm working on segmentation of a CODEX whole-slide image using Mesmer. We have successfully run the tool on a single tile, but when I run it on the whole slide I end up with the error below. I have set up a custom_config file to run it on GPU and edited the params.yml file as per our requirements; I'm attaching all files for your reference. Could you provide your input on how to fix the issue?
```
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=7107951.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
```
Detailed error:
```
2023-01-17 14:18:21.761836: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/.singularity.d/libs
2023-01-17 14:18:21.761915: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform. Skipping registering GPU devices...
2023-01-17 14:18:21.766590: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:No training configuration found in save file, so the model was not compiled. Compile it manually.
[2023-01-17 14:18:31,062]:[WARNING]:[tensorflow]: No training configuration found in save file, so the model was not compiled. Compile it manually.
.command.sh: line 2: 330 Killed python /usr/src/app/run_app.py mesmer --squeeze --output-directory . --output-name cell.tif --nuclear-image pilot_2.ome.tiff --squeeze
```
Attachments: run_nextflow.txt, custom_config.txt, params.yml.txt, Nextflow_segmentation.out.txt
Thanks, Sadhana