run_funsd.py fails with NCCL error in: /opt/conda/conda-bld/pytorch_1607370156314/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=4 run_funsd.py --model_name_or_path lilt-roberta-en-base --tokenizer_name roberta-base --output_dir ser_funsd_lilt-roberta-en-base --do_train --do_predict --max_steps 2000 --per_device_train_batch_size 8 --warmup_ratio 0.1 --fp16

The above command fails with the error below on PyTorch 1.7.1 with CUDA 11.0:

All 4 worker processes print the same traceback:

Traceback (most recent call last):
  File "run_funsd.py", line 369, in <module>
    main()
  File "run_funsd.py", line 50, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/transformers/hf_argparser.py", line 187, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 67, in __init__
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/transformers/training_args.py", line 570, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/transformers/file_utils.py", line 1470, in wrapper
    return func(*args, **kwargs)
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/transformers/training_args.py", line 717, in device
    return self._setup_devices
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/transformers/file_utils.py", line 1460, in __get__
    cached = self.fget(obj)
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/transformers/file_utils.py", line 1470, in wrapper
    return func(*args, **kwargs)
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/transformers/training_args.py", line 702, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370156314/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
Traceback (most recent call last):
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/cydal/anaconda3/envs/liltfinetune/bin/python', '-u', 'run_funsd.py', '--local_rank=3', '--model_name_or_path', 'lilt-roberta-en-base', '--tokenizer_name', 'roberta-base', '--output_dir', 'ser_funsd_lilt-roberta-en-base', '--do_train', '--do_predict', '--max_steps', '2000', '--per_device_train_batch_size', '8', '--warmup_ratio', '0.1', '--fp16']' returned non-zero exit status 1.

Below is conda list:

# packages in environment at /home/cydal/anaconda3/envs/liltfinetune:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
absl-py                   1.2.0                    pypi_0    pypi
antlr4-python3-runtime    4.9.3                    pypi_0    pypi
appdirs                   1.4.4                    pypi_0    pypi
astunparse                1.6.3                      py_0  
black                     21.4b2                   pypi_0    pypi
blas                      1.0                         mkl  
brotlipy                  0.7.0           py37h27cfd23_1003  
bzip2                     1.0.8                h7b6447c_0  
c-ares                    1.18.1               h7f8727e_0  
ca-certificates           2022.07.19           h06a4308_0  
cachetools                5.2.0                    pypi_0    pypi
certifi                   2022.6.15        py37h06a4308_0  
cffi                      1.15.1           py37h74dc2b5_0  
charset-normalizer        2.1.1                    pypi_0    pypi
click                     8.1.3                    pypi_0    pypi
cloudpickle               2.2.0                    pypi_0    pypi
cmake                     3.19.6               h973ab73_0  
cryptography              37.0.1           py37h9ce1e76_0  
cudatoolkit               11.0.221             h6bb024c_0  
cycler                    0.11.0                   pypi_0    pypi
dataclasses               0.8                pyh6d0b6a4_7  
datasets                  1.6.2                    pypi_0    pypi
detectron2                0.5+cu110                pypi_0    pypi
dill                      0.3.5.1                  pypi_0    pypi
expat                     2.4.4                h295c915_0  
filelock                  3.8.0                    pypi_0    pypi
fonttools                 4.37.1                   pypi_0    pypi
freetype                  2.11.0               h70c0345_0  
fsspec                    2022.8.2                 pypi_0    pypi
future                    0.18.2                   py37_1  
fvcore                    0.1.5.post20220512          pypi_0    pypi
giflib                    5.2.1                h7b6447c_0  
google-auth               2.11.0                   pypi_0    pypi
google-auth-oauthlib      0.4.6                    pypi_0    pypi
grpcio                    1.48.1                   pypi_0    pypi
huggingface-hub           0.0.19                   pypi_0    pypi
hydra-core                1.2.0                    pypi_0    pypi
idna                      3.3                pyhd3eb1b0_0  
importlib-metadata        4.12.0                   pypi_0    pypi
importlib-resources       5.9.0                    pypi_0    pypi
intel-openmp              2021.4.0          h06a4308_3561  
iopath                    0.1.8                    pypi_0    pypi
joblib                    1.1.0                    pypi_0    pypi
jpeg                      9b                   h024ee3a_2  
kiwisolver                1.4.4                    pypi_0    pypi
krb5                      1.19.2               hac12032_0  
lcms2                     2.12                 h3be6417_0  
ld_impl_linux-64          2.38                 h1181459_1  
libcurl                   7.84.0               h91b91d3_0  
libedit                   3.1.20210910         h7f8727e_0  
libev                     4.33                 h7f8727e_1  
libffi                    3.3                  he6710b0_2  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libnghttp2                1.46.0               hce63b2e_0  
libpng                    1.6.37               hbc83047_0  
libssh2                   1.10.0               h8f2d780_0  
libstdcxx-ng              11.2.0               h1234567_1  
libtiff                   4.1.0                h2733197_1  
libuv                     1.40.0               h7b6447c_0  
libwebp                   1.2.0                h89dd481_0  
liltfinetune              1.0                      pypi_0    pypi
lz4-c                     1.9.3                h295c915_1  
magma-cuda110             2.5.2                         1    pytorch
markdown                  3.4.1                    pypi_0    pypi
markupsafe                2.1.1                    pypi_0    pypi
matplotlib                3.5.3                    pypi_0    pypi
mkl                       2021.4.0           h06a4308_640  
mkl-include               2022.1.0           h06a4308_224  
mkl-service               2.4.0            py37h7f8727e_0  
mkl_fft                   1.3.1            py37hd3c417c_0  
mkl_random                1.2.2            py37h51133e4_0  
multiprocess              0.70.13                  pypi_0    pypi
mypy-extensions           0.4.3                    pypi_0    pypi
ncurses                   6.3                  h5eee18b_3  
ninja                     1.10.2               h06a4308_5  
ninja-base                1.10.2               hd09550d_5  
numpy                     1.21.6                   pypi_0    pypi
numpy-base                1.21.5           py37ha15fc14_3  
oauthlib                  3.2.1                    pypi_0    pypi
omegaconf                 2.2.3                    pypi_0    pypi
openssl                   1.1.1q               h7f8727e_0  
packaging                 21.3                     pypi_0    pypi
pandas                    1.3.5                    pypi_0    pypi
pathspec                  0.10.1                   pypi_0    pypi
pillow                    9.2.0                    pypi_0    pypi
pip                       22.1.2           py37h06a4308_0  
portalocker               2.5.1                    pypi_0    pypi
protobuf                  3.19.4                   pypi_0    pypi
pyarrow                   9.0.0                    pypi_0    pypi
pyasn1                    0.4.8                    pypi_0    pypi
pyasn1-modules            0.2.8                    pypi_0    pypi
pycocotools               2.0.4                    pypi_0    pypi
pycparser                 2.21               pyhd3eb1b0_0  
pydot                     1.4.2                    pypi_0    pypi
pyopenssl                 22.0.0             pyhd3eb1b0_0  
pyparsing                 3.0.9                    pypi_0    pypi
pysocks                   1.7.1                    py37_1  
python                    3.7.13               h12debd9_0  
python-dateutil           2.8.2                    pypi_0    pypi
pytorch                   1.7.1           py3.7_cuda11.0.221_cudnn8.0.5_0    pytorch
pytz                      2022.2.1                 pypi_0    pypi
pyyaml                    6.0                      pypi_0    pypi
readline                  8.1.2                h7f8727e_1  
regex                     2022.9.13                pypi_0    pypi
requests                  2.28.1           py37h06a4308_0  
requests-oauthlib         1.3.1                    pypi_0    pypi
rhash                     1.4.1                h3c74f83_1  
rsa                       4.9                      pypi_0    pypi
sacremoses                0.0.53                   pypi_0    pypi
scikit-learn              1.0.2                    pypi_0    pypi
scipy                     1.7.3                    pypi_0    pypi
seqeval                   1.2.2                    pypi_0    pypi
setuptools                63.4.1           py37h06a4308_0  
six                       1.16.0             pyhd3eb1b0_1  
sqlite                    3.39.2               h5082296_0  
tabulate                  0.8.10                   pypi_0    pypi
tensorboard               2.10.0                   pypi_0    pypi
tensorboard-data-server   0.6.1                    pypi_0    pypi
tensorboard-plugin-wit    1.8.1                    pypi_0    pypi
termcolor                 2.0.1                    pypi_0    pypi
threadpoolctl             3.1.0                    pypi_0    pypi
tk                        8.6.12               h1ccaba5_0  
tokenizers                0.10.3                   pypi_0    pypi
toml                      0.10.2                   pypi_0    pypi
torch                     1.7.1+cu110              pypi_0    pypi
torchaudio                0.7.2                    pypi_0    pypi
torchvision               0.8.2+cu110              pypi_0    pypi
tqdm                      4.49.0                   pypi_0    pypi
transformers              4.5.1                    pypi_0    pypi
typed-ast                 1.5.4                    pypi_0    pypi
typing_extensions         4.3.0            py37h06a4308_0  
urllib3                   1.26.12                  pypi_0    pypi
werkzeug                  2.2.2                    pypi_0    pypi
wheel                     0.37.1             pyhd3eb1b0_0  
xxhash                    3.0.0                    pypi_0    pypi
xz                        5.2.5                h7f8727e_1  
yacs                      0.1.8                    pypi_0    pypi
yaml                      0.2.5                h7b6447c_0  
zipp                      3.8.1                    pypi_0    pypi
zlib                      1.2.12               h5eee18b_3  
zstd                      1.4.9                haebb681_0

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   76C    P0    33W /  70W |   5874MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

If I upgrade to PyTorch 1.8 with CUDA 11.1, the error changes to a CUDA "invalid device ordinal" error. I have been trying to set up this environment for the last 3 days and have tried various combinations of versions, but none of them worked. Could you provide a list of dependencies, with exact versions, that works on a fresh Ubuntu 18.04 instance?
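
One thing that may be worth checking before changing versions: the command above starts 4 worker processes (`--nproc_per_node=4`) while `CUDA_VISIBLE_DEVICES=0` exposes a single GPU, and `nvidia-smi` lists only one Tesla T4. Both the NCCL "invalid usage" error and the "invalid device ordinal" error would be consistent with more workers than visible GPUs, although the report does not confirm that this is the cause. The sketch below is a hypothetical pre-launch check, not part of `run_funsd.py` or of PyTorch. The function names are made up for illustration, it handles only integer device indices (not GPU UUIDs), and the physical GPU count is passed in by hand.

```python
def visible_gpu_count(cuda_visible_devices, physical_gpus):
    """Count the GPUs a process would see for a given CUDA_VISIBLE_DEVICES value.

    cuda_visible_devices: the variable's value as a string, or None if unset.
    physical_gpus: number of GPUs installed (assumed known, e.g. from nvidia-smi).
    Simplification: only integer indices are understood; counting stops at the
    first entry that is not a valid index.
    """
    if cuda_visible_devices is None:
        return physical_gpus
    count = 0
    for entry in cuda_visible_devices.split(","):
        entry = entry.strip()
        if not entry.isdigit() or int(entry) >= physical_gpus:
            break
        count += 1
    return count


def check_launch(nproc_per_node, cuda_visible_devices, physical_gpus):
    """Return a list of problems with the launch settings (empty if none found)."""
    problems = []
    visible = visible_gpu_count(cuda_visible_devices, physical_gpus)
    if visible == 0:
        problems.append("no GPU is visible to the worker processes")
    elif nproc_per_node > visible:
        problems.append(
            f"--nproc_per_node={nproc_per_node} exceeds the {visible} visible GPU(s)"
        )
    return problems


if __name__ == "__main__":
    # Values taken from this report: 4 workers, CUDA_VISIBLE_DEVICES=0, one Tesla T4.
    for problem in check_launch(4, "0", 1):
        print("warning:", problem)
```

With the values from this report the check flags a mismatch, while `--nproc_per_node=1` with the same environment passes. If that is indeed the cause, launching with `--nproc_per_node=1` on this machine would be the first thing to try.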

jpWang / LiLT #18