unable to find cuDNN algorithm to run convolution

smilesun commented 10 months ago

before controller: current mu: {'beta_d': 1e-06, 'beta_y': 1e-06, 'beta_x': 1e-06, 'gamma_d': 1e-06, 'mu_recon': 1e-06}
epo reg loss: [70404077.34736842, -1061.8578844572369, 0.0, -1058.1541947214228, 90.4799047369706]
name reg loss:['mu_recon', 'beta_d', 'beta_x', 'beta_y', 'gamma_d']
after contoller: current mu: {'beta_d': 1e-06, 'beta_y': 1e-06, 'beta_x': 1e-06, 'gamma_d': 1.0009052085015096e-06, 'mu_recon': 0.0001484131591025766}
[Sat Nov  4 08:39:49 2023]
Error in rule run_experiment:
    jobid: 0
    input: zoutput/benchmarks/pacs_diva_fbopt_and_baselines/hyperparameters.csv
    output: zoutput/benchmarks/pacs_diva_fbopt_and_baselines/rule_results/8.csv

RuleException:
RuntimeError in file /home/aih/xudong.sun/domainlab_fbopt/domainlab/exp_protocol/benchmark.smk, line 151:
Unable to find a valid cuDNN algorithm to run convolution
  File "/home/aih/xudong.sun/domainlab_fbopt/domainlab/exp_protocol/benchmark.smk", line 151, in __rule_run_experiment
  File "/home/aih/xudong.sun/domainlab_fbopt/domainlab/exp_protocol/run_experiment.py", line 153, in run_experiment
  File "/home/aih/xudong.sun/domainlab_fbopt/domainlab/compos/exp/exp_main.py", line 71, in execute
  File "/home/aih/xudong.sun/domainlab_fbopt/domainlab/algos/trainers/train_fbopt_b.py", line 133, in tr_epoch
  File "/home/aih/xudong.sun/domainlab_fbopt/domainlab/algos/trainers/train_basic.py", line 32, in tr_epoch
  File "/home/aih/xudong.sun/domainlab_fbopt/domainlab/algos/trainers/train_basic.py", line 56, in tr_batch
  File "/home/aih/xudong.sun/domainlab_fbopt/domainlab/models/a_model.py", line 43, in cal_loss
  File "/home/aih/xudong.sun/domainlab_fbopt/domainlab/models/a_model_classif.py", line 115, in cal_task_loss
  File "/home/aih/xudong.sun/domainlab_fbopt/domainlab/models/model_vae_xyd_classif.py", line 27, in cal_logit_y
  File "/home/aih/xudong.sun/domainlab_fbopt/domainlab/compos/vae/compos/encoder_xyd_parallel.py", line 35, in infer_zy_loc
  File "/home/aih/xudong.sun/anaconda3/envs/domainlab_py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
  File "/home/aih/xudong.sun/domainlab_fbopt/domainlab/compos/vae/compos/encoder_zy.py", line 47, in forward
  File "/home/aih/xudong.sun/anaconda3/envs/domainlab_py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
  File "/home/aih/xudong.sun/domainlab_fbopt/domainlab/compos/nn_zoo/nn_torchvision.py", line 21, in forward
  File "/home/aih/xudong.sun/anaconda3/envs/domainlab_py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
  File "/home/aih/xudong.sun/domainlab_fbopt/examples/nets/resnet50domainbed.py", line 46, in forward
  File "/home/aih/xudong.sun/anaconda3/envs/domainlab_py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
  File "/home/aih/xudong.sun/anaconda3/envs/domainlab_py39/lib/python3.9/site-packages/torch/nn/modules/container.py", line 204, in forward
  File "/home/aih/xudong.sun/anaconda3/envs/domainlab_py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
  File "/home/aih/xudong.sun/anaconda3/envs/domainlab_py39/lib/python3.9/site-packages/torch/nn/modules/container.py", line 204, in forward
  File "/home/aih/xudong.sun/anaconda3/envs/domainlab_py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
  File "/home/aih/xudong.sun/anaconda3/envs/domainlab_py39/lib/python3.9/site-packages/torchvision/models/resnet.py", line 146, in forward
  File "/home/aih/xudong.sun/anaconda3/envs/domainlab_py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
  File "/home/aih/xudong.sun/anaconda3/envs/domainlab_py39/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 463, in forward
  File "/home/aih/xudong.sun/anaconda3/envs/domainlab_py39/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
  File "/home/aih/xudong.sun/anaconda3/envs/domainlab_py39/lib/python3.9/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Log file: run_experiment-index=8-14154294.err

smilesun commented 10 months ago

Docstring for class torchvision.transforms.transforms.RandomGrayscale
=====================================================================
RandomGrayscale(p=0.1)

Randomly convert image to grayscale with a probability of p (default 0.1).
If the image is torch Tensor, it is expected
to have [..., 3, H, W] shape, where ... means an arbitrary number of leading dimensions

Args:
    p (float): probability that image should be converted to grayscale.

Returns:
    PIL Image or Tensor: Grayscale version of the input image with probability p and unchanged
    with probability (1-p).
    - If input image is 1 channel: grayscale version is 1 channel
    - If input image is 3 channel: grayscale version is 3 channel with r == g == b

agisga commented 10 months ago

Here it says that this error ("Unable to find a valid cuDNN algorithm to run convolution") can occur when you run out of GPU memory. Could you maybe try to reduce the batch size and see if you still get the error?

smilesun commented 10 months ago

It looks like you are right: whether the error occurs depends on the batch size.
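
For anyone hitting the same error, here is a minimal sketch of the "reduce the batch size" check discussed above. It is not DomainLab code: `run_with_batch_backoff`, `is_gpu_memory_error`, and the `train_step` callable are hypothetical names used only for illustration, and matching on the error message text is an assumption based on the traceback in this issue plus the usual wording of CUDA out-of-memory errors.

```python
def is_gpu_memory_error(exc: RuntimeError) -> bool:
    """Heuristic (assumption): treat these messages as GPU-memory related."""
    msg = str(exc)
    return (
        "Unable to find a valid cuDNN algorithm" in msg
        or "out of memory" in msg
    )


def run_with_batch_backoff(train_step, batch_size, min_batch_size=1):
    """Call train_step(batch_size); halve the batch size on memory-like errors.

    train_step is a hypothetical callable that runs one training step (or
    epoch) with the given batch size. Returns (batch_size_used, result).
    """
    while True:
        try:
            return batch_size, train_step(batch_size)
        except RuntimeError as exc:
            # Re-raise unrelated errors, or give up at the smallest batch size.
            if not is_gpu_memory_error(exc) or batch_size <= min_batch_size:
                raise
            try:
                # Optional: release cached GPU memory if PyTorch is installed.
                import torch

                if torch.cuda.is_available():
                    torch.cuda.empty_cache()
            except ImportError:
                pass
            batch_size = max(min_batch_size, batch_size // 2)
```

If the step succeeds only after the batch size has been halved, that supports the GPU-memory explanation. Any other `RuntimeError` is re-raised unchanged, so unrelated bugs are not masked.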

marrlab / DomainLab, issue #629