huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Longformer training : CUDA error: device-side assert triggered #10852

Closed · manchandasahil closed this issue 3 years ago

manchandasahil commented 3 years ago

Environment info

Who can help

Information

Model I am using (Bert, XLNet ...): Longformer

The problem arises when using:

The tasks I am working on are:

To reproduce

When I use the same configuration to train model type `bert` it works, but it does not work for `longformer`. Steps to reproduce the behavior:

```shell
/opt/conda/bin/python -m torch.distributed.launch \
    --nnodes=$WORLD_SIZE \
    --node_rank=$RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    --nproc_per_node=1 $SCRIPT \
    --output_dir=$OUT_DIR \
    --logging_dir=$OUT_DIR \
    --tokenizer_name=$TOKENIZER \
    --model_type=longformer --do_train --do_eval \
    --cache_dir=$CACHE_DIR \
    --overwrite_cache \
    --validation_file=$EVAL_DATA \
    --overwrite_output_dir \
    --train_file=$TRAIN_DATA_FOLDER \
    --dataset_name=$DATASET_NAME \
    --line_by_line \
    --learning_rate=${INIT_LR} \
    --save_steps=${SAVE_STEPS} \
    --max_seq_length=${BLOCK_SIZE} \
    --gradient_accumulation_steps=${GRAD_ACCUM_STEPS} \
    --fp16 \
    --num_train_epochs=$EPOCHS \
    --per_device_train_batch_size=$BATCH_SIZE_PER_GPU \
    --local_rank=$LOCAL_RANK \
    --train_dataset_info_path=$TRAIN_DATASET_INFO \
    --test_dataset_info_path=$TEST_DATASET_INFO \
    --sharded_ddp
```
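Not part of the original report, but a general debugging aid: CUDA device-side asserts are reported asynchronously, so the Python frame that raises the `RuntimeError` is often not the op that actually failed. Relaunching with `CUDA_LAUNCH_BLOCKING=1` (or running a few steps on CPU) makes the assert surface at the real failing kernel. A minimal sketch of a wrapper that re-runs the same command with that flag set (`build_debug_env` and `run_debug` are illustrative names, not part of any library):

```python
import os
import subprocess

def build_debug_env(base=None):
    """Copy an environment dict and force synchronous CUDA kernel launches."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"  # assert is raised at the failing op
    return env

def run_debug(cmd):
    """Re-run a training command (argv list of strings) with the debug env."""
    return subprocess.run(cmd, env=build_debug_env())
```

With the flag set, the traceback should point at the exact operation whose kernel asserted, instead of a later, unrelated call.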

```
Traceback (most recent call last):
  File "/data/atc_tenant/bert_data/smancha5/run_mlm.py", line 661, in <module>
    main()
  File "/data/atc_tenant/bert_data/smancha5/run_mlm.py", line 465, in main
    train_result = trainer.train(resume_from_checkpoint=model_path)
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1003, in train
    tr_loss += self.training_step(model, inputs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1443, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1477, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/fairscale/nn/data_parallel/sharded_ddp.py", line 218, in forward
    return self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/models/longformer/modeling_longformer.py", line 1765, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/models/longformer/modeling_longformer.py", line 1669, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/models/longformer/modeling_longformer.py", line 1245, in forward
    is_global_attn = is_index_global_attn.flatten().any().item()
RuntimeError: CUDA error: device-side assert triggered

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7fc78c43d99b in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xc10 (0x7fc78c680280 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fc78c425dfd in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5414e2 (0x7fc7c549d4e2 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x19aaae (0x5603f8975aae in /opt/conda/bin/python)
frame #5: <unknown function> + 0xf2868 (0x5603f88cd868 in /opt/conda/bin/python)
frame #6: <unknown function> + 0x1f0d91 (0x5603f89cbd91 in /opt/conda/bin/python)
frame #7: <unknown function> + 0xf270d (0x5603f88cd70d in /opt/conda/bin/python)
frame #8: <unknown function> + 0x19aa90 (0x5603f8975a90 in /opt/conda/bin/python)
frame #9: <unknown function> + 0xf2868 (0x5603f88cd868 in /opt/conda/bin/python)
frame #10: <unknown function> + 0x1f0d91 (0x5603f89cbd91 in /opt/conda/bin/python)
frame #11: <unknown function> + 0xf2828 (0x5603f88cd828 in /opt/conda/bin/python)
frame #12: <unknown function> + 0x19aa90 (0x5603f8975a90 in /opt/conda/bin/python)
frame #13: <unknown function> + 0xf2868 (0x5603f88cd868 in /opt/conda/bin/python)
frame #14: <unknown function> + 0x1f0d91 (0x5603f89cbd91 in /opt/conda/bin/python)
frame #15: <unknown function> + 0x1688cb (0x5603f89438cb in /opt/conda/bin/python)
frame #16: _PyGC_CollectNoFail + 0x2a (0x5603f89cb79a in /opt/conda/bin/python)
frame #17: PyImport_Cleanup + 0x278 (0x5603f897ffa8 in /opt/conda/bin/python)
frame #18: Py_FinalizeEx + 0x61 (0x5603f89ea961 in /opt/conda/bin/python)
frame #19: Py_Main + 0x35e (0x5603f89f4cae in /opt/conda/bin/python)
frame #20: main + 0xee (0x5603f88bef2e in /opt/conda/bin/python)
frame #21: __libc_start_main + 0xe7 (0x7fc7f2cf3b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #22: <unknown function> + 0x1c327f (0x5603f899e27f in /opt/conda/bin/python)
```

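One frequent cause of this kind of assert (an assumption, not confirmed anywhere in this thread) is an out-of-range index reaching an embedding lookup. Longformer uses RoBERTa-style position embeddings, which are offset by `padding_idx + 1`, so the usable sequence length is `config.max_position_embeddings - 2`; setting `--max_seq_length` equal to the full `max_position_embeddings`, or feeding token ids outside the tokenizer's vocabulary, produces exactly this device-side assert. A hypothetical pre-flight check on a batch of token-id lists (`check_batch` is illustrative, not a Transformers API):

```python
def check_batch(input_ids, vocab_size, max_position_embeddings, pad_offset=2):
    """Return human-readable problems found in one batch of token-id lists.

    pad_offset=2 reflects the RoBERTa/Longformer position-id offset; an
    empty result means the batch should index embeddings safely.
    """
    problems = []
    usable_len = max_position_embeddings - pad_offset
    for row, ids in enumerate(input_ids):
        bad = [t for t in ids if not 0 <= t < vocab_size]
        if bad:
            problems.append(f"row {row}: token ids outside vocab: {bad[:5]}")
        if len(ids) > usable_len:
            problems.append(f"row {row}: length {len(ids)} > usable {usable_len}")
    return problems
```

For example, `allenai/longformer-base-4096` ships with `max_position_embeddings = 4098` for a usable length of 4096, so a `--max_seq_length` of 4098 would overrun the position table.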

Expected behavior

matteomedioli commented 3 years ago

This seems to be the same as my issue. Maybe this can help: https://github.com/huggingface/transformers/issues/10832

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

happy-nlp commented 2 years ago

How do you fix this? I ran into this issue too.

akedjouadj commented 1 year ago

Also facing this issue.

jonathanvevance commented 5 months ago

Any fix for this? I'm facing the same issue.