Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)
run_funsd.py fails with NCCL error in: /opt/conda/conda-bld/pytorch_1607370156314/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8 #18
The command shown at the end of this issue fails with the following error on PyTorch 1.7.1 with CUDA 11.0:
Each of the four launched processes prints the same traceback:

Traceback (most recent call last):
  File "run_funsd.py", line 369, in <module>
    main()
  File "run_funsd.py", line 50, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/transformers/hf_argparser.py", line 187, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 67, in __init__
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/transformers/training_args.py", line 570, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/transformers/file_utils.py", line 1470, in wrapper
    return func(*args, **kwargs)
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/transformers/training_args.py", line 717, in device
    return self._setup_devices
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/transformers/file_utils.py", line 1460, in __get__
    cached = self.fget(obj)
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/transformers/file_utils.py", line 1470, in wrapper
    return func(*args, **kwargs)
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/transformers/training_args.py", line 702, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370156314/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
The launcher then exits with:

Traceback (most recent call last):
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/cydal/anaconda3/envs/liltfinetune/bin/python', '-u', 'run_funsd.py', '--local_rank=3', '--model_name_or_path', 'lilt-roberta-en-base', '--tokenizer_name', 'roberta-base', '--output_dir', 'ser_funsd_lilt-roberta-en-base', '--do_train', '--do_predict', '--max_steps', '2000', '--per_device_train_batch_size', '8', '--warmup_ratio', '0.1', '--fp16']' returned non-zero exit status 1.
nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   76C    P0    33W /  70W |   5874MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
If I upgrade to PyTorch 1.8 with CUDA 11.1, the error becomes "CUDA invalid device ordinal" instead. I have been trying to set up this environment for the last 3 days and have tried various combinations of versions, but none of them worked. Can you provide a list of dependencies with the exact versions that work on a fresh Ubuntu 18.04 instance?
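For reference, the pinned conda install that the PyTorch "previous versions" page gives for the 1.7.1 / CUDA 11.0 combination is the following. This is only a starting point for reproducing the environment, not a combination verified to work with this repo:

```shell
# Official conda pins for PyTorch 1.7.1 + CUDA 11.0
# (from the pytorch.org previous-versions instructions);
# whether it resolves this issue is untested here.
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch
```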
The command being run (PyTorch 1.7.1, CUDA 11.0; it produces the traceback shown above):

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=4 run_funsd.py --model_name_or_path lilt-roberta-en-base --tokenizer_name roberta-base --output_dir ser_funsd_lilt-roberta-en-base --do_train --do_predict --max_steps 2000 --per_device_train_batch_size 8 --warmup_ratio 0.1 --fp16
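One detail worth noting about the command above: CUDA_VISIBLE_DEVICES=0 exposes a single GPU, while --nproc_per_node=4 starts four worker processes, and torch.distributed.launch gives each worker a --local_rank that it binds as a device ordinal among the visible GPUs. A small sketch of that consistency check (the function names and the physical_gpus parameter are illustrative, not from torch):

```python
from typing import Optional


def visible_gpu_count(cuda_visible_devices: Optional[str], physical_gpus: int) -> int:
    """How many GPUs a process can see, following CUDA's reading of
    CUDA_VISIBLE_DEVICES: unset means all physical devices are visible."""
    if cuda_visible_devices is None:
        return physical_gpus
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


def launch_is_consistent(cuda_visible_devices: Optional[str],
                         nproc_per_node: int,
                         physical_gpus: int = 4) -> bool:
    """torch.distributed.launch assigns local_rank 0..nproc_per_node-1;
    each rank needs a distinct valid ordinal among the visible GPUs."""
    return nproc_per_node <= visible_gpu_count(cuda_visible_devices, physical_gpus)


# The command in this issue: one visible GPU, four workers.
print(launch_is_consistent("0", 4))        # False
# Either expose four GPUs ...
print(launch_is_consistent("0,1,2,3", 4))  # True
# ... or launch a single worker.
print(launch_is_consistent("0", 1))        # True
```

When the check fails, ranks above 0 have no valid device to bind, which is consistent with both symptoms reported here (NCCL "invalid usage" on 1.7.1, "invalid device ordinal" on 1.8), though that diagnosis is an assumption, not something confirmed in this issue.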
Below is the conda list: