labsyspharm / mcmicro

Multiple-choice microscopy pipeline
https://mcmicro.org/
MIT License

out-of-memory issue when trying to run using GPU #472

Closed: sadhanagaddam3 closed this issue 1 year ago

sadhanagaddam3 commented 1 year ago

Hello,

I'm running segmentation with Mesmer on a CODEX whole-slide image. We have successfully run the tool on a single tile, but when I run it on the whole slide I end up with the error below. I have set up a custom_config file to run it on GPU and edited the params.yml file to match our requirements; I'm attaching all files for your reference. Could you provide your input on how to fix this issue?

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=7107951.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

Detailed error:

2023-01-17 14:18:21.761836: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/.singularity.d/libs
2023-01-17 14:18:21.761915: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform. Skipping registering GPU devices...
2023-01-17 14:18:21.766590: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:No training configuration found in save file, so the model was not compiled. Compile it manually.
[2023-01-17 14:18:31,062]:[WARNING]:[tensorflow]: No training configuration found in save file, so the model was not compiled. Compile it manually.
.command.sh: line 2: 330 Killed    python /usr/src/app/run_app.py mesmer --squeeze --output-directory . --output-name cell.tif --nuclear-image pilot_2.ome.tiff --squeeze

run_nextflow.txt custom_config.txt params.yml.txt Nextflow_segmentation.out.txt
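(For reference, a GPU-enabling Nextflow config for a SLURM cluster generally contains something like the sketch below; the partition name and resource values are placeholders for illustration, not the contents of the attached custom_config.txt.)

process {
  executor = 'slurm'
  queue = 'gpu'                           // placeholder GPU partition name
  clusterOptions = '-G 1 --time=3:00:00'  // placeholder GPU and walltime request
}
singularity.runOptions = '--nv'           // expose the NVIDIA driver inside Singularity containers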

Thanks, Sadhana

ArtemSokolov commented 1 year ago

Hi Sadhana,

Do you get the same out of memory error when running the Mesmer container outside of MCMICRO? Here is the set of instructions to apply the container directly to an image: https://github.com/vanvalenlab/deepcell-applications#using-docker
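For example, a rough sketch along the lines of that README (the host path, channel index, and image tag here are placeholders; adjust them to your data):

docker run -it --gpus 1 -v /path/to/your/images:/data vanvalenlab/deepcell-applications:latest-gpu mesmer --nuclear-image /data/pilot_2.ome.tiff --nuclear-channel 0 --output-directory /data --output-name mask.tif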

sadhanagaddam3 commented 1 year ago

Hi Artem,

I am able to run the Mesmer container outside of MCMICRO. I tried the GPU version of deepcell-applications and it was able to generate the mask file.

ArtemSokolov commented 1 year ago

Hi @sadhanagaddam3,

Can you try changing your params.yml to the following:

workflow:
  start-at: segmentation
  stop-at: quantification
  viz: true
  segmentation: mesmer
options:
  mesmer: --image-mpp 0.5 --compartment "both" --nuclear-channel 0 --membrane-image pilot_2.ome.tiff --membrane-channel 22 35 58 26 17 51 55 38
modules:
  segmentation:
    -
      name: mesmer
      version: 0.4.0-gpu

Notable changes:

  1. All Mesmer options should be provided under a single YAML header
  2. Tell MCMICRO to use the GPU version of the container
  3. --squeeze and --nuclear-image will be handled by MCMICRO, so they can be dropped
  4. --membrane-image can refer directly to the image, since it will be staged by Nextflow in the work directory
  5. --membrane-channel should not contain commas, if I'm reading Mesmer docs correctly
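With that params.yml in place, the run is launched in the usual way, roughly as follows (illustrative; keep whatever project path and custom config you already use in run_nextflow.txt):

nextflow run labsyspharm/mcmicro --in /path/to/codex/project --params params.yml -c custom_config.txt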

If you are still having issues, could you please share the following:

nancyliy commented 1 year ago

Hi @ArtemSokolov, thanks for your timely advice. I have been working closely with @sadhanagaddam3 on the same project and will reply on her behalf on this issue. We tried the params.yml that you suggested and still encountered an error related to GPU usage.

(1) The output from bash .command.run is shown here in detail:

WARNING: DEPRECATED USAGE: Forwarding SINGULARITYENV_TMPDIR as environment variable will not be supported in the future, use APPTAINERENV_TMPDIR instead
2023-01-23 13:25:06.815433: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: UNKNOWN ERROR (34)
2023-01-23 13:25:06.815545: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (sh02-ln01.stanford.edu): /proc/driver/nvidia/version does not exist
2023-01-23 13:25:06.819901: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:No training configuration found in save file, so the model was not compiled. Compile it manually.
[2023-01-23 13:25:16,618]:[WARNING]:[tensorflow]: No training configuration found in save file, so the model was not compiled. Compile it manually.
/oak/stanford/groups/oro/nancyliy/CODEX/Nextflow/work/83/269e02e1afbaa5c3fad66dc481f99f/.command.sh: line 2: 200436 Killed    python /usr/src/app/run_app.py mesmer --squeeze --output-directory . --output-name cell.tif --nuclear-image pilot_2.ome.tiff --image-mpp 0.5 --compartment "both" --nuclear-channel 0 --membrane-image pilot_2.ome.tiff --membrane-channel 22 35 58 26 17 51 55 38

(2) the command we ran to get deepcell-applications working outside MCMICRO:

#!/bin/bash
#SBATCH -p gpu --mem-per-gpu=128G --cpus-per-gpu=2 -G 1 --job-name=segmentation --output=segmentation.out --error=segmentation.err --time=3:00:00 --qos=normal

ml system libnvidia-container

export APPLICATION=mesmer

singularity run --nv deepcell-applications_latest-gpu.sif $APPLICATION --image-mpp 0.5 --nuclear-image /oak/stanford/groups/oro/Sadhana/Nancy/codex/CODEX_pilot/registration/pilot_2.ome.tiff --nuclear-channel 0 --membrane-image /oak/stanford/groups/oro/Sadhana/Nancy/codex/CODEX_pilot/registration/pilot_2.ome.tiff --membrane-channel 22 35 58 26 17 51 55 38 --output-directory output/ --output-name Mask_pilot2.tif --compartment "both"

Thank you for your kind input!

ArtemSokolov commented 1 year ago

Hi @nancyliy. Sorry if my previous message was confusing. I was hoping to see the actual content of .command.run, not the output of running it. Specifically, I am curious what the header looks like (i.e., whether the first few lines have any SBATCH directives) and how the singularity run command inside .command.run compares to what you ran manually.

One thing that catches my attention in your more recent message is that your manual run specified --mem-per-gpu=128G, while the original issue listed --mem=64GB in its run_nextflow.txt. What happens if you provide MCMICRO with the same amount of memory as your manual Mesmer run or vice versa?

nancyliy commented 1 year ago

Hi @ArtemSokolov , Thank you for your prompt reply!

Here, for your insight, is the actual content of .command.run and .command.sh, respectively. Meanwhile, I will try setting different amounts of RAM as you recommended to see what happens (I suspect MCMICRO was somehow not making use of the GPU resources allocated to it, regardless of the memory allowance, because it seems to have trouble finding the GPU device).

The following is the actual content of .command.run. I printed it into a .txt file named command_run.txt (attached)

command_run.txt

The following is the actual content of .command.sh

#!/bin/bash -ue

python /usr/src/app/run_app.py mesmer --squeeze --output-directory . --output-name cell.tif --nuclear-image pilot_2.ome.tiff --image-mpp 0.5 --compartment "both" --nuclear-channel 0 --membrane-image pilot_2.ome.tiff --membrane-channel 22 35 58 26 17 51 55 38

ArtemSokolov commented 1 year ago

Let me know how it goes with increased RAM.

I am not seeing anything problematic in .command.run. Specifically, there are no directives modifying the amount of allocated memory, and the singularity exec command matches what you ran by hand:

singularity exec -B /oak/stanford/groups/oro/nancyliy/CODEX/Nextflow --nv /oak/stanford/groups/oro/nancyliy/CODEX/Nextflow/work/singularity/vanvalenlab-deepcell-applications-0.4.0-gpu.img /bin/bash -ue /oak/stanford/groups/oro/nancyliy/CODEX/Nextflow/work/5c/f5cb9e26a710970815048c74882be4/.command.sh

Specifically, I see the --nv flag, which makes the GPU visible inside the container, and -0.4.0-gpu, which runs the GPU version of Mesmer. I believe the original GPU problem was due to MCMICRO running the non-GPU Mesmer container by default (which is now correctly overridden by version: 0.4.0-gpu).
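If you want to confirm that the GPU is reachable from inside that container on a node that actually has one (the earlier cuInit error came from running .command.run on what looks like a login node), a quick sanity check is something along these lines, assuming the same gpu partition as your SBATCH header:

srun -p gpu -G 1 --pty singularity exec --nv /oak/stanford/groups/oro/nancyliy/CODEX/Nextflow/work/singularity/vanvalenlab-deepcell-applications-0.4.0-gpu.img nvidia-smi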

So, yea, I would try to match the SBATCH flags (specifically, --mem-per-gpu=128G) between the MCMICRO and the manual Mesmer runs.
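A minimal sketch of that change, assuming run_nextflow.txt is an sbatch script with a header similar to your manual one (the exact header will differ):

#!/bin/bash
#SBATCH -p gpu --mem-per-gpu=128G --cpus-per-gpu=2 -G 1 --time=3:00:00
# ... followed by the usual nextflow run labsyspharm/mcmicro command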

nancyliy commented 1 year ago

Hi @ArtemSokolov, thanks so much for your timely advice -- increasing RAM to 128 GB worked and segmentation ran smoothly :) A separate, independent error occurred at the quantification:mcquant stage -- I will open a new issue for that :P Thanks so much, this is much appreciated!

ArtemSokolov commented 1 year ago

Glad to hear you got it working.