microsoft / HMNet

Official Implementation of "A Hierarchical Network for Abstractive Meeting Summarization with Cross-Domain Pretraining"

cublas runtime error #3

Open lars-at-styx opened 3 years ago

lars-at-styx commented 3 years ago

I'm following the README to try and fine-tune HMNet on the AMI dataset. My only modification to the instructions is that I have only 1 visible device (my full command thus becomes CUDA_VISIBLE_DEVICES="0" mpirun -np 1 --allow-run-as-root python PyLearn.py train ExampleConf/conf_hmnet_AMI).

The process exits with an error.

Here's the full output.

{'MODEL': 'MeetingNet_Transformer', 'TASK': 'HMNet', 'CRITERION': 'MLECriterion', 'SEED': 1033, 'RESUME': True, 'MAX_NUM_EPOCHS': 20, 'SAVE_PER_UPDATE_NUM': 400, 'UPDATES_PER_EPOCH': 2000, 'OPTIMIZER': 'RAdam', 'NO_AUTO_LR_SCALING': True, 'START_LEARNING_RATE': 0.001, 'LR_SCHEDULER': 'LnrWrmpInvSqRtDcyScheduler', 'WARMUP_STEPS': 16000, 'WARMUP_INIT_LR': 0.0001, 'WARMUP_END_LR': 0.001, 'GRADIENT_ACCUMULATE_STEP': 20, 'GRAD_CLIPPING': 2, 'USE_REL_DATA_PATH': True, 'TRAIN_FILE': '../ExampleRawData/meeting_summarization/AMI_proprec/train_ami.json', 'DEV_FILE': '../ExampleRawData/meeting_summarization/AMI_proprec/valid_ami.json', 'TEST_FILE': '../ExampleRawData/meeting_summarization/AMI_proprec/test_ami.json', 'ROLE_DICT_FILE': '../ExampleRawData/meeting_summarization/role_dict_ext.json', 'MINI_BATCH': 1, 'MAX_PADDING_RATIO': 1, 'BATCH_READ_AHEAD': 10, 'DOC_SHUFFLE_BUF_SIZE': 10, 'SAMPLE_SHUFFLE_BUFFER_SIZE': 10, 'BATCH_SHUFFLE_BUFFER_SIZE': 10, 'MAX_TRANSCRIPT_WORD': 8300, 'MAX_SENT_LEN': 30, 'MAX_SENT_NUM': 300, 'DROPOUT': 0.1, 'VOCAB_DIM': 512, 'ROLE_SIZE': 32, 'ROLE_DIM': 16, 'POS_DIM': 16, 'ENT_DIM': 16, 'USE_ROLE': True, 'USE_POSENT': True, 'USE_BOS_TOKEN': True, 'USE_EOS_TOKEN': True, 'TRANSFORMER_EMBED_DROPOUT': 0.1, 'TRANSFORMER_RESIDUAL_DROPOUT': 0.1, 'TRANSFORMER_ATTENTION_DROPOUT': 0.1, 'TRANSFORMER_LAYER': 6, 'TRANSFORMER_HEAD': 8, 'TRANSFORMER_POS_DISCOUNT': 80, 'PRE_TOKENIZER': 'TransfoXLTokenizer', 'PRE_TOKENIZER_PATH': '../ExampleInitModel/transfo-xl-wt103', 'PYLEARN_MODEL': '../ExampleInitModel/HMNet-pretrained', 'EXTRA_IDS': 1000, 'BEAM_WIDTH': 6, 'MAX_GEN_LENGTH': 512, 'MIN_GEN_LENGTH': 320, 'EVAL_TOKENIZED': True, 'EVAL_LOWERCASE': True, 'NO_REPEAT_NGRAM_SIZE': 3, 'cuda': True, 'confFile': 'ExampleConf/conf_hmnet_AMI', 'datadir': 'ExampleConf', 'basename': 'conf_hmnet_AMI', 'command': 'train', 'conf_file': 'ExampleConf/conf_hmnet_AMI', 'cluster': 'local', 'dist_init_path': './tmp', 'fp16': False, 'fp16_opt_level': 'O1', 'no_cuda': False}
Using Cuda

Saving logs, model, checkpoint, and evaluation in ExampleConf/conf_hmnet_AMI_conf~/run_2
 1.2.0  is high
Number of GPUs is  1 
Effective batch size is increased from  1  to  1 
Gradient accumulation steps =  20 
Effective batch size =  20 
[9d66c296629d:03515] pml_ucx.c:285  Error: UCP worker does not support MPI_THREAD_MULTIPLE
Select command: train
train on rank 0
-----------------------------------------------
Initializing model...
Loading Tokenizer from ExampleConf/../ExampleInitModel/transfo-xl-wt103...
Using pad_token, but it is not set yet.
Using bos_token, but it is not set yet.
Use POS and ENT
USE_ROLE

Total trainable parameters: 204488240
Loaded data on rank 0.
Using custom optimizer: RAdam
Optimizer parameters: {'lr': 0.001}
Using custom lr scheduler: LnrWrmpInvSqRtDcyScheduler
Lr scheduler parameters: {'warmup_steps': 16000, 'warmup_init_lr': 0.0001, 'warmup_end_lr': 0.001}
Cannot find checkpoint path from conf_hmnet_AMI_resume_checkpoint.json.
Make sure ExampleConf/conf_hmnet_AMI_resume_checkpoint.json exists.
Continue without loading checkpoint
Epoch 0
Traceback (most recent call last):
  File "PyLearn.py", line 71, in <module>
    trainer.train()
  File "/root/HMNet/Models/Trainers/HMNetTrainer.py", line 273, in train
    self.update(batch)
  File "/root/HMNet/Models/Trainers/HMNetTrainer.py", line 358, in update
    loss = self.network(batch)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/HMNet/Models/Trainers/HMNetTrainer.py", line 38, in forward
    output = self.model(batch)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/HMNet/Models/Networks/MeetingNet_Transformer.py", line 100, in forward
    outputs = self._forward(**batch)
  File "/root/HMNet/Models/Networks/MeetingNet_Transformer.py", line 125, in _forward
    token_encoder_outputs, sent_encoder_outputs = self.encoder(encoder_input_ids, encoder_input_roles, encoder_input_pos, encoder_input_ent)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/HMNet/Models/Networks/MeetingNet_Transformer.py", line 1130, in forward
    embedded = self.embedder(vocab_x.view(batch_size, -1))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/HMNet/Models/Networks/Transformer.py", line 387, in forward
    x_pos = self.pos_emb(torch.arange(x_len).type(torch.cuda.FloatTensor)) # len x n_state
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/HMNet/Models/Networks/Transformer.py", line 86, in forward
    sinusoid_inp = torch.ger(pos_seq, self.inv_freq)
RuntimeError: cublas runtime error : the GPU program failed to execute at /pytorch/aten/src/THC/THCBlas.cu:120
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[42501,1],0]
  Exit code:    1
--------------------------------------------------------------------------
rgowtham commented 3 years ago

I am facing the same issue as above. I am trying to run this on a Mac with no GPU. My command and output are as follows:

root@c85979e176ac:~/HMNet# python PyLearn.py train ExampleConf/conf_hmnet_AMI --no_cuda
{'MODEL': 'MeetingNet_Transformer', 'TASK': 'HMNet', 'CRITERION': 'MLECriterion', 'SEED': 1033, 'MAX_NUM_EPOCHS': 20, 'SAVE_PER_UPDATE_NUM': 400, 'UPDATES_PER_EPOCH': 2000, 'OPTIMIZER': 'RAdam', 'NO_AUTO_LR_SCALING': True, 'START_LEARNING_RATE': 0.001, 'LR_SCHEDULER': 'LnrWrmpInvSqRtDcyScheduler', 'WARMUP_STEPS': 16000, 'WARMUP_INIT_LR': 0.0001, 'WARMUP_END_LR': 0.001, 'GRADIENT_ACCUMULATE_STEP': 20, 'GRAD_CLIPPING': 2, 'USE_REL_DATA_PATH': True, 'TRAIN_FILE': '../ExampleRawData/meeting_summarization/AMI_proprec/train_ami.json', 'DEV_FILE': '../ExampleRawData/meeting_summarization/AMI_proprec/valid_ami.json', 'TEST_FILE': '../ExampleRawData/meeting_summarization/AMI_proprec/test_ami.json', 'ROLE_DICT_FILE': '../ExampleRawData/meeting_summarization/role_dict_ext.json', 'MINI_BATCH': 1, 'MAX_PADDING_RATIO': 1, 'BATCH_READ_AHEAD': 10, 'DOC_SHUFFLE_BUF_SIZE': 10, 'SAMPLE_SHUFFLE_BUFFER_SIZE': 10, 'BATCH_SHUFFLE_BUFFER_SIZE': 10, 'MAX_TRANSCRIPT_WORD': 8300, 'MAX_SENT_LEN': 30, 'MAX_SENT_NUM': 300, 'DROPOUT': 0.1, 'VOCAB_DIM': 512, 'ROLE_SIZE': 32, 'ROLE_DIM': 16, 'POS_DIM': 16, 'ENT_DIM': 16, 'USE_ROLE': True, 'USE_POSENT': True, 'USE_BOS_TOKEN': True, 'USE_EOS_TOKEN': True, 'TRANSFORMER_EMBED_DROPOUT': 0.1, 'TRANSFORMER_RESIDUAL_DROPOUT': 0.1, 'TRANSFORMER_ATTENTION_DROPOUT': 0.1, 'TRANSFORMER_LAYER': 6, 'TRANSFORMER_HEAD': 8, 'TRANSFORMER_POS_DISCOUNT': 80, 'PRE_TOKENIZER': 'TransfoXLTokenizer', 'PRE_TOKENIZER_PATH': '../ExampleInitModel/transfo-xl-wt103', 'PYLEARN_MODEL': '../ExampleInitModel/HMNet-pretrained', 'EXTRA_IDS': 1000, 'BEAM_WIDTH': 6, 'MAX_GEN_LENGTH': 512, 'MIN_GEN_LENGTH': 320, 'EVAL_TOKENIZED': True, 'EVAL_LOWERCASE': True, 'NO_REPEAT_NGRAM_SIZE': 3, 'cuda': False, 'confFile': 'ExampleConf/conf_hmnet_AMI', 'datadir': 'ExampleConf', 'basename': 'conf_hmnet_AMI', 'command': 'train', 'conf_file': 'ExampleConf/conf_hmnet_AMI', 'cluster': 'local', 'dist_init_path': './tmp', 'fp16': False, 'fp16_opt_level': 'O1', 'no_cuda': True}
Using CPU

Saving logs, model, checkpoint, and evaluation in ExampleConf/conf_hmnet_AMI_conf~/run_12
 1.2.0  is high
Number of GPUs is  1 
Effective batch size is increased from  1  to  1 
Gradient accumulation steps =  20 
Effective batch size =  20 
[c85979e176ac:00029] pml_ucx.c:285  Error: UCP worker does not support MPI_THREAD_MULTIPLE
Select command: train
train on rank 0
-----------------------------------------------
Initializing model...
Loading Tokenizer from ExampleConf/../ExampleInitModel/transfo-xl-wt103...
Using pad_token, but it is not set yet.
Using bos_token, but it is not set yet.
Use POS and ENT
USE_ROLE
Total trainable parameters: 204488240
Loaded data on rank 0.
Using custom optimizer: RAdam
Optimizer parameters: {'lr': 0.001}
Using custom lr scheduler: LnrWrmpInvSqRtDcyScheduler
Lr scheduler parameters: {'warmup_steps': 16000, 'warmup_init_lr': 0.0001, 'warmup_end_lr': 0.001}
Epoch 0
Killed

I am specifying --no_cuda, but it still says the number of GPUs is 1. It also does not give a clear error message about where it is failing. Can someone help by looking into this?

irenebenedetto commented 3 years ago

When I tried to reproduce the results, I noticed that in the Transformer class there are some variables that call .cuda() without being controlled by the option opt['cuda']. Did you try to modify them?
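
For reference, a guard along these lines can make such calls respect the config flag. This is a minimal sketch under the assumption that the configuration dict (opt) carries the 'cuda' key shown in the printed config above; the helper name place_on_device is hypothetical and not part of the repository.

```python
# Sketch only: gate device placement on opt['cuda'] instead of calling
# .cuda() unconditionally. place_on_device is a hypothetical helper.
import torch

def place_on_device(tensor, opt):
    """Move a tensor to the GPU only when the config asks for it and CUDA is available."""
    if opt.get('cuda', False) and torch.cuda.is_available():
        return tensor.cuda()
    return tensor

opt = {'cuda': False}                 # e.g. the effect of passing --no_cuda
pos_seq = torch.arange(10).float()    # built on CPU
pos_seq = place_on_device(pos_seq, opt)
print(pos_seq.device)                 # prints "cpu" under this configuration
```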

rgowtham commented 3 years ago

Hi @irenebenedetto, yes - before making those CUDA-related changes, the error message was something like the one below:

Epoch 0
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=50 error=35 : CUDA driver version is insufficient for CUDA runtime version
Traceback (most recent call last):
  File "PyLearn.py", line 71, in <module>
    trainer.train()
  File "/root/HMNet/Models/Trainers/HMNetTrainer.py", line 273, in train
    self.update(batch)
  File "/root/HMNet/Models/Trainers/HMNetTrainer.py", line 358, in update
    loss = self.network(batch)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/HMNet/Models/Trainers/HMNetTrainer.py", line 38, in forward
    output = self.model(batch)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/HMNet/Models/Networks/MeetingNet_Transformer.py", line 101, in forward
    outputs = self._forward(**batch)
  File "/root/HMNet/Models/Networks/MeetingNet_Transformer.py", line 126, in _forward
    token_encoder_outputs, sent_encoder_outputs = self.encoder(encoder_input_ids, encoder_input_roles, encoder_input_pos, encoder_input_ent)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/HMNet/Models/Networks/MeetingNet_Transformer.py", line 1131, in forward
    embedded = self.embedder(vocab_x.view(batch_size, -1))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/HMNet/Models/Networks/Transformer.py", line 387, in forward
    x_pos = self.pos_emb(torch.arange(x_len).type(torch.cuda.FloatTensor)) # len x n_state
RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at /pytorch/aten/src/THC/THCGeneral.cpp:50

Once I remove the .cuda() parts from here, here and here, I get the error message I posted above. I was expecting to see more CUDA-related errors if the code still tries to access the GPU.

Were you able to get it running after you changed all the places where .cuda() was used?

irenebenedetto commented 3 years ago

Looking at the error message, I see other variables on CUDA (line 387 in forward). Did you also convert all the variables that use .type(torch.cuda.FloatTensor) to .type(torch.FloatTensor)?
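
For reference, the line in the traceback (x_pos = self.pos_emb(torch.arange(x_len).type(torch.cuda.FloatTensor))) can be made device-agnostic instead. The sketch below is illustrative only: the names mirror the traceback (pos_emb, inv_freq, x_len), but this is not the repository's exact code.

```python
# Sketch: build the position indices on whatever device the module lives on,
# instead of hard-coding torch.cuda.FloatTensor.
import torch

class SinusoidalPositionalEmbedding(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        inv_freq = 1.0 / (10000 ** (torch.arange(0.0, dim, 2.0) / dim))
        self.register_buffer('inv_freq', inv_freq)  # follows the module's device

    def forward(self, pos_seq):
        # torch.ger works on CPU or GPU as long as both operands share a device
        sinusoid_inp = torch.ger(pos_seq, self.inv_freq)
        return torch.cat([sinusoid_inp.sin(), sinusoid_inp.cos()], dim=-1)

pos_emb = SinusoidalPositionalEmbedding(16)
x_len = 8
# Device-agnostic replacement for .type(torch.cuda.FloatTensor):
pos_seq = torch.arange(x_len, dtype=torch.float, device=pos_emb.inv_freq.device)
x_pos = pos_emb(pos_seq)  # runs on CPU here; follows the GPU if the module is moved
```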

rgowtham commented 3 years ago

Yes, the error message I posted was from before the CUDA changes. After I make the changes (everywhere CUDA was used in the Transformer.py script), I see the same error posted in this message.

irenebenedetto commented 3 years ago

Ah okay, sorry. And did you also check the MeetingNet_Transformer class here: https://github.com/microsoft/HMNet/blob/1f5a24d656e8bf111560551daa66d81a5028dd93/Models/Networks/MeetingNet_Transformer.py#L85 ? (I used checkpoint = torch.load(os.path.join(load_dir, 'model.pt'), map_location=torch.device('cpu')).)
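
For completeness, a minimal sketch of that CPU-safe loading approach; load_dir here is just an example path taken from the PYLEARN_MODEL setting in the printed config, so adjust it to your setup.

```python
# Sketch: remap GPU-saved tensors to CPU at load time so no CUDA driver is needed.
import os
import torch

load_dir = '../ExampleInitModel/HMNet-pretrained'  # example path; adjust as needed
checkpoint = torch.load(os.path.join(load_dir, 'model.pt'),
                        map_location=torch.device('cpu'))
```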

rgowtham commented 3 years ago

Yes, this is changed too, to load from the CPU.