kishwarshafin / pepper

PEPPER-Margin-DeepVariant
MIT License

Pepper SNP training fails on DGX1 server #121

Closed GuillaumeHolley closed 2 years ago

GuillaumeHolley commented 2 years ago

Hi hi,

So I am in the midst of training Pepper SNP on my HG002 data corrected with Ratatosk, and I am currently running step 4, which is the training step itself. On a machine with 2 GPUs it runs fine, but with 1000 epochs I estimate the total wall-clock time at just under 3 days. I tried to accelerate this by running the same step on a DGX1 server with 8 GPUs, and I get the following error:

[01-05-2022 09:29:59] INFO: TRAIN MODEL MODULE SELECTED
[01-05-2022 09:30:00] INFO: TOTAL GPU AVAILABLE: 8
[01-05-2022 09:30:00] INFO: AVAILABLE GPU DEVICES: [0, 1, 2, 3, 4, 5, 6, 7]
[01-05-2022 09:30:00] ERROR: PEPPER HP TRAINING INITIATED.
[01-05-2022 09:30:00] INFO: LOADING DATA
[01-05-2022 09:30:00] INFO: STARTING TO LOAD IMAGES.
[01-05-2022 09:34:43] INFO: IMAGE LOADING FINISHED.
[01-05-2022 09:34:43] [ELAPSED TIME: 4 Min 43 Sec]
[01-05-2022 09:34:43] INFO: TOTAL TRAINABLE PARAMETERS: 8730655
[01-05-2022 09:34:43] INFO: MODEL OPTIMIZER PARAMETERS:
[01-05-2022 09:34:43] INFO: LEARNING RATE: 0.001
[01-05-2022 09:34:43] INFO: WEIGHT DECAY: 1e-05
<REDACTED>/pepper-r0.7/venv/lib/python3.8/site-packages/torch/cuda/__init__.py:106: UserWarning: 
A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the A100-SXM4-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Traceback (most recent call last):
  File "<REDACTED>/pepper-r0.7/venv/bin/pepper_variant_train", line 33, in <module>
    sys.exit(load_entry_point('pepper-polish==0.7.5', 'console_scripts', 'pepper_variant_train')())
  File "<REDACTED>/pepper-r0.7/venv/lib/python3.8/site-packages/pepper_polish-0.7.5-py3.8-linux-x86_64.egg/pepper_variant/pepper_variant_train.py", line 336, in main
    train_pepper_model(options)
  File "<REDACTED>/pepper-r0.7/venv/lib/python3.8/site-packages/pepper_polish-0.7.5-py3.8-linux-x86_64.egg/pepper_variant/modules/python/TrainModule.py", line 167, in train_pepper_model
    tm.train_model_distributed(options.device_ids, options.callers_per_gpu)
  File "<REDACTED>/pepper-r0.7/venv/lib/python3.8/site-packages/pepper_polish-0.7.5-py3.8-linux-x86_64.egg/pepper_variant/modules/python/TrainModule.py", line 99, in train_model_distributed
    train_distributed(self.train_file,
  File "<REDACTED>/pepper-r0.7/venv/lib/python3.8/site-packages/pepper_polish-0.7.5-py3.8-linux-x86_64.egg/pepper_variant/modules/python/models/train_distributed.py", line 366, in train_distributed
    train(train_file, test_file, batch_size, test_batch_size, step_size, epochs, gpu_mode, num_workers, retrain_model, retrain_model_path,
  File "<REDACTED>/pepper-r0.7/venv/lib/python3.8/site-packages/pepper_polish-0.7.5-py3.8-linux-x86_64.egg/pepper_variant/modules/python/models/train_distributed.py", line 129, in train
    transducer_model = transducer_model.cuda()
  File "<REDACTED>/pepper-r0.7/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 637, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "<REDACTED>/pepper-r0.7/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 530, in _apply
    module._apply(fn)
  File "<REDACTED>/pepper-r0.7/venv/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 189, in _apply
    self.flatten_parameters()
  File "<REDACTED>/pepper-r0.7/venv/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 175, in flatten_parameters
    torch._cudnn_rnn_flatten_weight(
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Seems like the PyTorch package doesn't handle this GPU architecture, which seems weird to me.
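For what it's worth, here is a minimal diagnostic sketch (plain PyTorch APIs, nothing PEPPER-specific) that compares the compute capabilities the installed wheel was compiled for against what the GPU actually reports:

import torch

# What the installed wheel was built for vs. what the hardware needs.
print("torch version:     ", torch.__version__)
print("built against CUDA:", torch.version.cuda)
print("compiled arches:   ", torch.cuda.get_arch_list())           # here: sm_37 sm_50 sm_60 sm_70
print("GPU 0 capability:  ", torch.cuda.get_device_capability(0))  # A100 reports (8, 0), i.e. sm_80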

Thanks, Guillaume

kishwarshafin commented 2 years ago

@GuillaumeHolley,

So this message:

A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.

Is related to this: https://discuss.pytorch.org/t/pytorch-1-7-0-support-for-a100-gpus/108211
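For anyone landing here later: the usual resolution from that thread is to reinstall PyTorch from a wheel built against CUDA 11.x, which ships sm_80 kernels for Ampere GPUs. A hedged sketch (the version pin below is illustrative, not a tested recommendation; pick the build matching your setup at https://pytorch.org/get-started/locally/):

# Illustrative install command (example pin, choose your own from pytorch.org):
#   pip install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
import torch

# An Ampere-capable build should now include sm_80 in its compiled arch list.
assert "sm_80" in torch.cuda.get_arch_list(), "this build still lacks Ampere (sm_80) kernels"
print("OK:", torch.__version__, "built with CUDA", torch.version.cuda)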

Can you please let me know which version of PyTorch you have on your system?

GuillaumeHolley commented 2 years ago

Unfortunately this machine is rather busy and I couldn't get access to it again. I believe the link you provided is the solution, so I will close this for now. Thanks for the help!