vanvalenlab / deepcell-tf

Deep Learning Library for Single Cell Analysis
https://deepcell.readthedocs.io

Mesmer flaky instantiation on HPC cluster #733

Open colganwi opened 2 months ago

colganwi commented 2 months ago

Describe the bug
It seems like Mesmer reads the TF SavedModel in write mode, which means that multiple processes cannot load Mesmer simultaneously. This results in flaky instantiation when running Mesmer in parallel on an HPC cluster.

To Reproduce
Run the code below with >20 cores. If one core is currently loading Mesmer, the other cores will throw "Read less bytes than requested" or a number of other errors.

Code:

import time

from deepcell.applications import Mesmer

# Retry instantiation because concurrent jobs intermittently fail to load the model.
attempts = 10
model = None
for attempt in range(attempts):
    try:
        model = Mesmer()
        break  # If successful, exit the loop
    except Exception as e:
        print(f"Attempt {attempt + 1} failed: {e}")
        time.sleep(10)
if model is None:
    print("Failed to initialize Mesmer after 10 attempts.")
else:
    print("Model initialized successfully.")

Running:

#!/bin/bash
# Configuration values for SLURM job submission.
#SBATCH --job-name=mesmer
#SBATCH --nodes=1 
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=8gb
#SBATCH --array=1-400%50

FOV=$(($SLURM_ARRAY_TASK_ID - 1))
echo "FOV: ${FOV}"

source activate deepcell-env
python run_mesmer.py

Error:

INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
2024-09-19 08:27:00.092152: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wcolgan/miniconda3/envs/py10-env/lib/python3.10/site-packages/cv2/../../lib64:
2024-09-19 08:27:00.092216: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2024-09-19 08:27:00.092255: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (c3b7): /proc/driver/nvidia/version does not exist
2024-09-19 08:27:00.092772: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-09-19 08:27:10.302641: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at save_restore_v2_ops.cc:222 : OUT_OF_RANGE: Read less bytes than requested
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
2024-09-19 08:27:31.767108: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at save_restore_v2_ops.cc:222 : OUT_OF_RANGE: Read less bytes than requested
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
2024-09-19 08:28:14.570058: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at save_restore_v2_ops.cc:222 : OUT_OF_RANGE: Read less bytes than requested
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
Attempt 1 failed: Read less bytes than requested [Op:RestoreV2]
Attempt 2 failed: Read less bytes than requested [Op:RestoreV2]
Attempt 3 failed: Read less bytes than requested
Attempt 4 failed: Read less bytes than requested [Op:RestoreV2]
Attempt 5 failed: Read less bytes than requested
Attempt 6 failed: Read less bytes than requested
Attempt 7 failed: Read less bytes than requested
Model initialized successfully.

Expected behavior
Instantiating Mesmer should be reliable and should not involve any file locks or write operations.



rossbar commented 2 months ago

Thanks for reporting - this does indeed look like a real issue in the model caching layer. The model downloading component implements a simple cache, but the model extraction component doesn't, so the .tar.gz is re-extracted on every run. That can definitely cause failures if one process is reading the SavedModel while another is overwriting it with a newly extracted copy.

I suspect the most straightforward fix would be to add caching to the model extraction piece as well.
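
Roughly what I have in mind (just a sketch, not the actual deepcell internals - extract_once, the sentinel file, and the lock file are placeholders):

import fcntl
import os
import tarfile

def extract_once(archive_path, cache_dir):
    """Extract archive_path into cache_dir at most once, even across concurrent jobs.

    A lock file serializes extraction and a sentinel file records that it already
    happened, so later processes read the extracted SavedModel without rewriting it.
    """
    sentinel = archive_path + ".extracted"
    with open(archive_path + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # only one process extracts at a time
        try:
            if not os.path.exists(sentinel):
                with tarfile.open(archive_path, "r:gz") as tar:
                    tar.extractall(cache_dir)
                open(sentinel, "w").close()  # mark extraction as complete
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
    return cache_dir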

colganwi commented 2 months ago

Thanks for looking into this. Let me know when you have a patch. For now, I'm able to work around it by not running too many jobs in parallel and using the try/except above.
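
If it helps anyone else in the meantime, a heavier-handed workaround might be to serialize instantiation with an exclusive file lock instead of retrying. This is an untested sketch; the lock path just assumes the default ~/.deepcell/models cache shown in the log:

import fcntl
import os

from deepcell.applications import Mesmer

# Hypothetical lock file next to the cached models.
LOCK_PATH = os.path.expanduser("~/.deepcell/models/.mesmer.lock")

def load_mesmer_serialized():
    """Instantiate Mesmer while holding an exclusive file lock.

    Each array task still re-extracts the archive, but no task reads the
    SavedModel while another task is overwriting it.
    """
    with open(LOCK_PATH, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            return Mesmer()
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)

model = load_mesmer_serialized()
print("Model initialized successfully.")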