RuntimeError when testing using CoNLL2003 dataset

f1amigo commented 1 year ago

Hi, I've followed the instructions on the README to test the model on the CoNLL2003 dataset. However, I ran into a RuntimeError when I tried to run python run_ner.py conf/conll03.json. Any help in resolving this issue would be appreciated.

The following are the logs I got during the error:

Traceback (most recent call last):
  File "/home/chenweiyi/Binder/run_ner.py", line 742, in <module>
    main()
  File "/home/chenweiyi/Binder/run_ner.py", line 683, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/chenweiyi/.conda/envs/binder/lib/python3.9/site-packages/transformers/trainer.py", line 1501, in train
    return inner_training_loop(
  File "/home/chenweiyi/.conda/envs/binder/lib/python3.9/site-packages/transformers/trainer.py", line 1749, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/chenweiyi/.conda/envs/binder/lib/python3.9/site-packages/transformers/trainer.py", line 2508, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/chenweiyi/.conda/envs/binder/lib/python3.9/site-packages/transformers/trainer.py", line 2540, in compute_loss
    outputs = model(**inputs)
  File "/home/chenweiyi/.conda/envs/binder/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/.conda/envs/binder/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/chenweiyi/.conda/envs/binder/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/chenweiyi/.conda/envs/binder/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
    output.reraise()
  File "/home/chenweiyi/.conda/envs/binder/lib/python3.9/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/chenweiyi/.conda/envs/binder/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/home/chenweiyi/.conda/envs/binder/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/Binder/src/model.py", line 239, in forward
    start_negative_mask = ner["start_negative_mask"].view(batch_size * num_types, seq_length)
RuntimeError: shape '[16, 256]' is invalid for input of size 8192
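
For readers trying to interpret the final line of the traceback, the arithmetic can be sketched as follows. `view` only succeeds when the requested shape accounts for exactly as many elements as the tensor holds. The helper `can_view` is a hypothetical name used for illustration, and the two numbers are copied from the error message; nothing here is taken from the Binder source beyond what the traceback shows.

```python
from math import prod


def can_view(numel: int, shape: tuple[int, ...]) -> bool:
    """Return True when `shape` accounts for exactly `numel` elements.

    This mirrors only the element-count requirement of `Tensor.view`;
    it does not model the stride/contiguity requirement.
    """
    return prod(shape) == numel


# Values copied from the error message in the traceback.
requested_shape = (16, 256)  # (batch_size * num_types, seq_length)
actual_numel = 8192          # number of elements in ner["start_negative_mask"]

expected_numel = prod(requested_shape)
ratio = actual_numel // expected_numel

print(can_view(actual_numel, requested_shape))  # False
print(expected_numel)                           # 4096
print(ratio)                                    # 2
```

The mask holds exactly twice as many elements as the requested shape allows. That would be consistent with the mask and the `batch_size` used in `view` being split differently across devices by `DataParallel`, which appears in the traceback, but the thread does not confirm the cause, so treat this as a hypothesis.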

mukurgupta commented 1 year ago

@f1amigo I'm also facing a similar error. Did you find a fix?

f1amigo commented 1 year ago

@mukurgupta Unfortunately not. I suspect it may be related to the size of the GPU, since the run succeeded after I switched to an RTX 3090, which is much larger than the GPU I was using before.

andrew-umjangyun commented 1 year ago

If you are using multiple GPUs, restricting the run to a single GPU will resolve the error above.
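
One way to restrict a run to a single GPU is to hide the other devices from the process with the `CUDA_VISIBLE_DEVICES` environment variable, so that only one device is visible and the `DataParallel` path seen in the traceback is not taken. The following is a minimal sketch, not something from the Binder repository: the helper names `single_gpu_env` and `run_on_single_gpu` are hypothetical, and device index `"0"` is an assumption.

```python
import os
import subprocess
import sys
from typing import Mapping, Optional, Sequence


def single_gpu_env(base_env: Optional[Mapping[str, str]] = None,
                   device: str = "0") -> dict:
    """Copy an environment and expose only one CUDA device in the copy."""
    env = dict(os.environ if base_env is None else base_env)
    # The variable must be set before the child process initializes CUDA.
    env["CUDA_VISIBLE_DEVICES"] = device
    return env


def run_on_single_gpu(script_args: Sequence[str],
                      device: str = "0") -> subprocess.CompletedProcess:
    """Run a Python script in a child process that sees one GPU only."""
    return subprocess.run(
        [sys.executable, *script_args],
        env=single_gpu_env(device=device),
        check=True,  # raise CalledProcessError if the script fails
    )
```

With these helpers, the command from the original post could be launched as `run_on_single_gpu(["run_ner.py", "conf/conll03.json"])`. The per-device batch size then applies to the single visible GPU, so the effective batch size may differ from a multi-GPU run.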

mukurgupta commented 1 year ago

Yes, it runs fine on 1 GPU. But I couldn't find any fix for running it on multiple GPUs.

microsoft/binder, issue #2