lasigeBioTM / K-RET

K-RET: Knowledgeable Biomedical Relation Extraction System
Apache License 2.0

CUDA error #5

Closed · jakubMitura14 closed this 5 months ago

jakubMitura14 commented 7 months ago

Hello, using your Docker image dpavot/kret:update I get the following error:

root@jm-Z490-AORUS-ULTRA:/workspaces/K-RET#  CUDA_VISIBLE_DEVICES='0,1' python3 -u run_classification.py \
>     --pretrained_model_path /workspaces/K-RET/models/pre_trained_model_scibert/output_model.bin \
>     --config_path /workspaces/K-RET/models/pre_trained_model_scibert/scibert_scivocab_uncased/config.json \
>     --vocab_path /workspaces/K-RET/models/pre_trained_model_scibert/scibert_scivocab_uncased/vocab.txt \
>     --train_path /workspaces/K-RET/datasets/pgr_corpus/train.tsv \
>     --dev_path /workspaces/K-RET/datasets/pgr_corpus/dev.tsv \
>     --test_path /workspaces/K-RET/datasets/pgr_corpus/test.tsv \
>     --class_weights True \
>     --weights "[0.234, 3.377, 4.234, 6.535, 24.613]" \
>     --epochs_num 30 \
>     --batch_size 32 \
>     --kg_name "['ChEBI']" \
>     --output_model_path /workspaces/K-RET/outputs/scibert_ddi.bin | tee /workspaces/K-RET/outputs/scibert_ddi.log &
[1] 421
root@jm-Z490-AORUS-ULTRA:/workspaces/K-RET# 
root@jm-Z490-AORUS-ULTRA:/workspaces/K-RET# [nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Vocabulary file line 30107 has bad format token
Vocabulary Size:  31090
Namespace(batch_size=32, bidirectional=False, block_size=2, class_weights='True', config_path='/workspaces/K-RET/models/pre_trained_model_scibert/scibert_scivocab_uncased/config.json', dev_path='/workspaces/K-RET/datasets/pgr_corpus/dev.tsv', dropout=0.1, emb_size=768, encoder='bert', epochs_num=30, feedforward_size=3072, heads_num=12, hidden_size=768, kernel_size=3, kg_name="['ChEBI']", labels_num=2, layers_num=12, learning_rate=2e-05, mean_reciprocal_rank=False, no_vm=False, output_model_path='/workspaces/K-RET/outputs/scibert_ddi.bin', pooling='first', pretrained_model_path='/workspaces/K-RET/models/pre_trained_model_scibert/output_model.bin', report_steps=100, seed=7, seq_length=256, sub_layers_num=2, sub_vocab_path='models/sub_vocab.txt', subencoder='avg', subword_type='none', target='bert', test_path='/workspaces/K-RET/datasets/pgr_corpus/test.tsv', testing=False, to_test_model=None, tokenizer='bert', train_path='/workspaces/K-RET/datasets/pgr_corpus/train.tsv', vocab=<uer.utils.vocab.Vocab object at 0x7f001c4a5160>, vocab_path='/workspaces/K-RET/models/pre_trained_model_scibert/scibert_scivocab_uncased/vocab.txt', warmup=0.1, weights='[0.234, 3.377, 4.234, 6.535, 24.613]', workers_num=1)
[BertClassifier] use visible_matrix: True
2 GPUs are available. Let's use them.
[KnowledgeGraph] Loading spo from /workspaces/K-RET/brain/kgs/chebi.spo
Start training.
Loading sentences from /workspaces/K-RET/datasets/pgr_corpus/train.tsv
There are 4050 sentence in total. We use 1 processes to inject knowledge into sentences.
Progress of process 0: 0/4050
Shuffling dataset
Trans data to tensor.
input_ids
label_ids
mask_ids
pos_ids
vms
Batch size:  32
The number of training instances: 4050
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL Error 1: unhandled cuda error
Fatal Python error: Aborted

Current thread 0x00007f0108ce5740 (most recent call first):
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/comm.py", line 40 in broadcast_coalesced
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/_functions.py", line 21 in forward
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/replicate.py", line 13 in replicate
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 147 in replicate
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 142 in forward
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489 in __call__
  File "run_classification.py", line 581 in main
  File "run_classification.py", line 622 in <module>
dpavot commented 6 months ago

Hello, sorry for the delay in getting back to you. I hope you have managed to fix it by now, but I believe this is related to CUDA versioning, which is always annoying to adjust. I've checked on my end, but without more information about your specs I can't be of much help. I found this, though I'm sure you have already looked into similar things: https://stackoverflow.com/questions/66807131/how-to-solve-the-famous-unhandled-cuda-error-nccl-version-2-7-8-error I really hope you were able to find the correct configuration!
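
One generic sketch along the lines of the suggestions in that thread (an assumption, not a confirmed fix for this setup): NCCL_DEBUG and NCCL_P2P_DISABLE are standard NCCL environment variables, and they can be set at the very top of run_classification.py, before torch is imported:

# Generic NCCL diagnostics; must take effect before the first CUDA/NCCL call.
# Whether these help with this particular setup is an assumption based on the
# linked thread, not a verified fix.
import os
os.environ.setdefault("NCCL_DEBUG", "INFO")      # make NCCL print the underlying CUDA failure
os.environ.setdefault("NCCL_P2P_DISABLE", "1")   # bypass GPU peer-to-peer, a common culprit

With NCCL_DEBUG=INFO the log usually names the failing CUDA call, which narrows down whether a driver/toolkit mismatch is to blame. Alternatively, running with a single visible GPU (CUDA_VISIBLE_DEVICES='0') sidesteps the DataParallel broadcast entirely, at the cost of single-GPU training.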

jakubMitura14 commented 6 months ago

Thanks for the response; I moved to another solution in the meantime, but I will consider coming back to this 🙂