Error during validation when using 2 GPUs (but it's OK when using one GPU)

ZHO9504 commented 5 years ago

Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
EM: 61.2193, f1: 69.6262, qas_used_fraction: 1.0000, loss: 4.3453 ||: : 17502it [6:26:59, 1.33s/it]
2019-07-20 15:09:22,954 - INFO - allennlp.training.trainer - Validating
EM: 48.9301, f1: 59.0550, qas_used_fraction: 1.0000, loss: 5.1889 ||: : 94it [00:41, 2.15it/s]
Traceback (most recent call last):
  File "/home/gpu245/anaconda3/envs/emnlp/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/gpu245/anaconda3/envs/emnlp/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/run.py", line 21, in <module>
    run()
  File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/train.py", line 116, in train_model_from_args
    args.cache_prefix)
  File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/train.py", line 160, in train_model_from_file
    cache_directory, cache_prefix)
  File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/train.py", line 243, in train_model
    metrics = trainer.train()
  File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/trainer.py", line 493, in train
    val_loss, num_batches = self._validation_loss()
  File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/trainer.py", line 430, in _validation_loss
    loss = self.batch_loss(batch_group, for_training=False)
  File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/trainer.py", line 258, in batch_loss
    output_dict = training_util.data_parallel(batch_group, self.model, self._cuda_devices)
  File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/util.py", line 336, in data_parallel
    losses = gather([output['loss'].unsqueeze(0) for output in outputs], used_device_ids[0], 0)
  File "/home/gpu245/.local/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
    return gather_map(outputs)
  File "/home/gpu245/.local/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 54, in gather_map
    return Gather.apply(target_device, dim, outputs)
  File "/home/gpu245/.local/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/gpu245/.local/lib/python3.7/site-packages/torch/cuda/comm.py", line 165, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: tensor.ndimension() == static_cast(expected_size.size()) ASSERT FAILED at /pytorch/torch/csrc/cuda/comm.cpp:232, please report a bug to PyTorch. (gather at /pytorch/torch/csrc/cuda/comm.cpp:232)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f6d3dad8441 in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f6d3dad7d7a in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: torch::cuda::gather(c10::ArrayRef, long, c10::optional) + 0x962 (0x7f6d132be792 in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
frame #3: + 0x5a3d1c (0x7f6d33e0bd1c in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x130fac (0x7f6d33998fac in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: _PyMethodDef_RawFastCallKeywords + 0x264 (0x5567e0e3c6e4 in python3.7)
frame #6: _PyCFunction_FastCallKeywords + 0x21 (0x5567e0e3c801 in python3.7)
frame #7: _PyEval_EvalFrameDefault + 0x4e8c (0x5567e0e982bc in python3.7)
frame #8: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #9: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #10: _PyEval_EvalFrameDefault + 0x4aa9 (0x5567e0e97ed9 in python3.7)
frame #11: _PyEval_EvalCodeWithName + 0xbb9 (0x5567e0dd9db9 in python3.7)
frame #12: _PyFunction_FastCallDict + 0x1d5 (0x5567e0dda5d5 in python3.7)
frame #13: THPFunction_apply(_object, _object*) + 0x6b1 (0x7f6d33c1c301 in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #14: PyCFunction_Call + 0xe7 (0x5567e0dffbe7 in python3.7)
frame #15: _PyEval_EvalFrameDefault + 0x5d21 (0x5567e0e99151 in python3.7)
frame #16: _PyEval_EvalCodeWithName + 0xbb9 (0x5567e0dd9db9 in python3.7)
frame #17: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #18: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #19: _PyEval_EvalCodeWithName + 0xbb9 (0x5567e0dd9db9 in python3.7)
frame #20: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #21: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #22: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #23: _PyEval_EvalFrameDefault + 0x4aa9 (0x5567e0e97ed9 in python3.7)
frame #24: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #25: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #26: _PyEval_EvalFrameDefault + 0x14ce (0x5567e0e948fe in python3.7)
frame #27: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #28: _PyEval_EvalFrameDefault + 0x6a0 (0x5567e0e93ad0 in python3.7)
frame #29: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #30: _PyEval_EvalFrameDefault + 0x6a0 (0x5567e0e93ad0 in python3.7)
frame #31: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #32: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #33: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #34: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #35: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #36: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #37: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #38: _PyEval_EvalFrameDefault + 0x4aa9 (0x5567e0e97ed9 in python3.7)
frame #39: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #40: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #41: _PyEval_EvalFrameDefault + 0x14ce (0x5567e0e948fe in python3.7)
frame #42: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #43: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #44: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #45: PyEval_EvalCodeEx + 0x44 (0x5567e0dda3c4 in python3.7)
frame #46: PyEval_EvalCode + 0x1c (0x5567e0dda3ec in python3.7)
frame #47: + 0x1e004d (0x5567e0ea304d in python3.7)
frame #48: _PyMethodDef_RawFastCallKeywords + 0xe9 (0x5567e0e3c569 in python3.7)
frame #49: _PyCFunction_FastCallKeywords + 0x21 (0x5567e0e3c801 in python3.7)
frame #50: _PyEval_EvalFrameDefault + 0x4755 (0x5567e0e97b85 in python3.7)
frame #51: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #52: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #53: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #54: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #55: _PyFunction_FastCallDict + 0x1d5 (0x5567e0dda5d5 in python3.7)
frame #56: + 0x222d77 (0x5567e0ee5d77 in python3.7)
frame #57: + 0x23ae95 (0x5567e0efde95 in python3.7)
frame #58: _Py_UnixMain + 0x3c (0x5567e0efdf7c in python3.7)
frame #59: __libc_start_main + 0xf0 (0x7f6d4ea12830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #60: + 0x1e0122 (0x5567e0ea3122 in python3.7)

I don't know why....
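For anyone reading the traceback: the failing assertion (`tensor.ndimension() == ...expected_size.size()`) sits inside the gather of the per-GPU `output['loss']` tensors, and it appears to fire when those tensors do not all have the same number of dimensions. Why the losses would differ in rank on some datasets is not established in this thread. The sketch below is only an illustration of that check in plain Python, working on shape tuples instead of real tensors; `find_gather_mismatch` and `unsqueeze0` are hypothetical helper names, not part of allennlp or PyTorch.

```python
from typing import Optional, Sequence, Tuple

Shape = Tuple[int, ...]


def unsqueeze0(shape: Shape) -> Shape:
    """Shape a tensor would have after .unsqueeze(0)."""
    return (1,) + tuple(shape)


def find_gather_mismatch(shapes: Sequence[Shape], dim: int = 0) -> Optional[Tuple[int, str]]:
    """Return (index, reason) for the first shape that cannot be gathered
    with shapes[0] along `dim` (assumed non-negative), or None if all are compatible.

    Mirrors the rule that gathered tensors need the same number of dimensions
    and the same size everywhere except along the gather dimension.
    """
    if not shapes:
        return None
    expected = shapes[0]
    for i, shape in enumerate(shapes[1:], start=1):
        if len(shape) != len(expected):
            return i, f"has {len(shape)} dims, expected {len(expected)}"
        for d, (got, want) in enumerate(zip(shape, expected)):
            if d != dim and got != want:
                return i, f"size {got} at dim {d}, expected {want}"
    return None


# Hypothetical case: GPU 0 returns a scalar loss, GPU 1 returns a loss of shape (1,).
per_gpu_loss_shapes = [unsqueeze0(()), unsqueeze0((1,))]
print(find_gather_mismatch(per_gpu_loss_shapes))
```

With real tensors, one could log `tuple(output['loss'].shape)` for each replica just before the gather and pass those tuples to the helper to see which GPU produced the odd one out.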

ZHO9504 commented 5 years ago

My run command is:

python3.7 -m allennlp.run train /home/gpu245/haiou/emnlpworkshop/MRQA-Shared-Task-2019/baseline/MRQA_BERTLarge.jsonnet -s Models/large_f5/ -o "{'dataset_reader': {'sample_size': 75000}, 'validation_dataset_reader': {'sample_size': 1000}, 'train_data_path': '/home/gpu245/haiou/emnlpworkshop/MRQA-Shared-Task-2019/data/train/TriviaQA-web.jsonl.gz', 'validation_data_path': '/home/gpu245/haiou/emnlpworkshop/MRQA-Shared-Task-2019/data/dev-indomain/TriviaQA-web.jsonl.gz', 'trainer': {'cuda_device': [0,1], 'num_epochs': '2', 'optimizer': {'type': 'bert_adam', 'lr': 3e-05, 'warmup': 0.1, 't_total': '50000'}}}" --include-package mrqa_allennlp

The error occurs whatever the train_data_path is.
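Since the run reportedly works on one GPU, a possible stopgap is to rerun the same command with `cuda_device` set to a single device. The sketch below builds the `-o` overrides string with `json.dumps` so the quoting stays valid, and switches between one and two GPUs through a single argument. `build_overrides` and `build_command` are hypothetical helper names, and the paths are placeholders, not the real ones from the command above.

```python
import json
import shlex


def build_overrides(cuda_device, train_path, validation_path):
    """Return the overrides as a JSON string for the `-o` flag.

    cuda_device: an int for a single GPU (e.g. 0) or a list for multi-GPU (e.g. [0, 1]).
    The other values are copied from the command in this thread.
    """
    overrides = {
        "dataset_reader": {"sample_size": 75000},
        "validation_dataset_reader": {"sample_size": 1000},
        "train_data_path": train_path,
        "validation_data_path": validation_path,
        "trainer": {
            "cuda_device": cuda_device,
            "num_epochs": "2",
            "optimizer": {"type": "bert_adam", "lr": 3e-05, "warmup": 0.1, "t_total": "50000"},
        },
    }
    return json.dumps(overrides)


def build_command(config_path, serialization_dir, overrides):
    """Assemble the training command as an argument list."""
    return [
        "python3.7", "-m", "allennlp.run", "train", config_path,
        "-s", serialization_dir,
        "-o", overrides,
        "--include-package", "mrqa_allennlp",
    ]


# Placeholder paths; replace with the real config and data locations.
single_gpu = build_overrides(0, "path/to/train.jsonl.gz", "path/to/dev.jsonl.gz")
cmd = build_command("path/to/MRQA_BERTLarge.jsonnet", "Models/large_f5/", single_gpu)
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing `[0, 1]` instead of `0` reproduces the two-GPU setting from the original command, which makes it easy to compare both runs on the same validation file.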

alontalmor commented 5 years ago

Hi ZHO9504, I will try to reproduce this. Because it does not happen on 1 GPU, it is likely an allennlp problem with multi-GPU training. Which version of allennlp are you using? Thanks.

ZHO9504 commented 5 years ago

> Hi ZHO9504, I will try to reproduce this. Because it does not happen on 1 GPU, it is likely an allennlp problem with multi-GPU training. Which version of allennlp are you using? Thanks.

Thank you for your reply. The version of allennlp I use is 0.8.5-unreleased (as reported by allennlp --version). I also had the same issue using allennlp v0.8.4 with torch 1.1.0.
Validation is OK on HotpotQA and SearchQA using one or two GPUs, but I get the error when validating on TriviaQA/NaturalQuestionsShort/SearchQA with 2 GPUs. It's a little strange.

alontalmor commented 5 years ago

It sounds like some edge case that's a bit difficult to reproduce... Does it happen when you evaluate only on TriviaQA or NaturalQuestionsShort?

ZHO9504 commented 5 years ago

> It sounds like some edge case that's a bit difficult to reproduce... Does it happen when you evaluate only on TriviaQA or NaturalQuestionsShort?

Yes, I evaluated on each of them separately, and only HotpotQA and SearchQA went well. As soon as the evaluation data includes a dataset such as TriviaQA, the run fails with this error.

alontalmor commented 5 years ago

OK, I'm trying to reproduce and solve this, but it may take a few days.

Alex-Fabbri commented 5 years ago

I also got this error during multi-GPU validation, but it works fine on a single GPU. Using allennlp v0.8.4 and torch 1.1.0.

Kaimary commented 5 years ago

+1. I also got this error during the multi-GPU validation phase. Using allennlp v0.8.4 and torch 1.1.0.

lucadiliello commented 2 years ago

I was able to train on every MRQA task with any number of GPUs using pytorch-lightning. I published the scripts here: https://github.com/lucadiliello/mrqa-lightning

mrqa / MRQA-Shared-Task-2019

Error during validation when using 2 GPUs (but it's OK when using one GPU) #17