Running Topaz on HPC got RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

CFDavidHou commented 4 months ago

Dear Topaz Community,

I hope this message finds you well.

I am running Topaz from within Relion 5 on our HPC server. Relion 5 is loaded with the module command, but Topaz is not preinstalled alongside it. As advised, I installed Topaz under my home directory, following the installation instructions closely.

However, when I run a Topaz picking or training job with the Topaz executable set to ~/.conda/envs/topaz/bin/topaz, the job fails with the following error:

+ Will use topaz for training a model 
 + Written out list of input training coordinates: AutoPick/job016/input_training_coords.star
 + Setting topaz downscale factor to 15 (assuming resnet8 model and 2*particle_diameter receptive box)
 + Setting topaz radius to 5 downscaled pixels (based on 25% of particle_diameter/2)
 + Using GPU device 0
 + Training with 79 picks in test set; and 315 picks in work set
 + By setting aside 4 micrographs for the test set 
# Loading model: resnet8
# Model parameters: units=32, dropout=0.0, bn=on
# Loading pretrained model: resnet8_u32
# Receptive field: 71
# Using device=0 with cuda=True
# Loaded 20 training micrographs with 315 labeled particles
# Loaded 4 test micrographs with 79 labeled particles
# source    split p_observed  num_positive_regions    total_regions
# 0   train 0.0203      25515 1254760
# 0   test  0.0255      6399  250952
# Specified expected number of particle per micrograph = 40.0
# With radius = 5
# Setting pi = 0.05164334215308107
# minibatch_size=256, epoch_size=1000, num_epochs=10
Traceback (most recent call last):
  File "/home/ch1225/.conda/envs/topaz/bin/topaz", line 33, in <module>
    sys.exit(load_entry_point('topaz-em==0.2.5', 'console_scripts', 'topaz')())
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/topaz/main.py", line 148, in main
    args.func(args)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/topaz/commands/train.py", line 695, in main
    , save_prefix=save_prefix, use_cuda=use_cuda, output=output)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/topaz/commands/train.py", line 577, in fit_epochs
    , use_cuda=use_cuda, output=output)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/topaz/commands/train.py", line 557, in fit_epoch
    metrics = step_method.step(X, Y)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/topaz/methods.py", line 103, in step
    score = self.model(X).view(-1)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/topaz/model/classifier.py", line 28, in forward
    z = self.features(x)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/topaz/model/features/resnet.py", line 54, in forward
    z = self.features(x)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/topaz/model/features/resnet.py", line 270, in forward
    y = self.conv(x)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ch1225/.conda/envs/topaz/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 338, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Could this be a problem with communication between the HPC system and its GPU nodes, rather than with Topaz itself?
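One way to narrow this down might be to run a single small convolution on the GPU node with the same conda environment, outside Relion and Topaz. The sketch below is only a diagnostic idea, not part of Topaz: the function name `cudnn_smoke_test` and the status strings are made up for illustration, and it assumes the `torch` from the `topaz` environment is the one being imported.

```python
def cudnn_smoke_test(device_index=0):
    """Run one tiny convolution on the GPU and report what happened.

    Hypothetical helper for illustration; not part of Topaz or PyTorch.
    """
    try:
        import torch
    except ImportError:
        return {"status": "torch-not-installed"}

    report = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # CUDA version torch was built against
    }
    if not torch.cuda.is_available():
        report["status"] = "cuda-not-available"
        return report

    report["cudnn"] = torch.backends.cudnn.version()
    report["device"] = torch.cuda.get_device_name(device_index)
    report["capability"] = torch.cuda.get_device_capability(device_index)

    try:
        device = torch.device("cuda", device_index)
        x = torch.randn(1, 1, 64, 64, device=device)
        conv = torch.nn.Conv2d(1, 8, kernel_size=3).to(device)
        conv(x)
        torch.cuda.synchronize()  # force the kernel to actually finish
        report["status"] = "ok"
    except RuntimeError as err:
        report["status"] = "conv-failed"
        report["error"] = str(err)
    return report


print(cudnn_smoke_test())
```

If this tiny convolution also fails, the problem is probably in the PyTorch/CUDA/cuDNN installation or its match with the GPU rather than in Topaz or the job size. If it succeeds, memory pressure during training becomes a more plausible explanation.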

Any input is much appreciated!

Best regards,

David

tbepler commented 4 months ago

This looks like a possible GPU RAM issue. Sometimes CUDA gives weird errors like this when it runs out of GPU RAM. Is anything else running on the GPU at the same time?
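A quick way to check whether something else is occupying the GPU is to query `nvidia-smi` on the node right before the job starts. The following is a minimal sketch, assuming `nvidia-smi` is on the `PATH` of the GPU node and reports numeric memory values; the helper names `parse_gpu_memory` and `query_gpu_memory` are illustrative only.

```python
import subprocess

QUERY_FIELDS = "index,name,memory.used,memory.total"


def parse_gpu_memory(csv_text):
    """Parse the output of
    `nvidia-smi --query-gpu=index,name,memory.used,memory.total
                --format=csv,noheader,nounits`
    into a list of dicts (memory values are in MiB).
    """
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # Split from both ends so a comma inside the GPU name cannot break parsing.
        index, rest = line.split(",", 1)
        name, used, total = rest.rsplit(",", 2)
        gpus.append({
            "index": int(index),
            "name": name.strip(),
            "used_mib": int(used),
            "total_mib": int(total),
        })
    return gpus


def query_gpu_memory():
    """Call nvidia-smi and return per-GPU memory usage."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + QUERY_FIELDS,
         "--format=csv,noheader,nounits"],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # works on Python 3.6, unlike text=True
        check=True,
    )
    return parse_gpu_memory(result.stdout)
```

Calling `query_gpu_memory()` on the GPU node and comparing `used_mib` with `total_mib` for device 0 shows whether the card is already mostly full before training starts. On a shared HPC node, another user's job holding most of the memory would be consistent with the out-of-memory explanation.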

tbepler commented 3 months ago

Closing this issue since it hasn't had any more discussion. @CFDavidHou feel free to reopen it if there is more to discuss.

tbepler / topaz

Running Topaz on HPC got RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #193