open-mmlab / mmselfsup

OpenMMLab Self-Supervised Learning Toolbox and Benchmark
https://mmselfsup.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Invalid usage of NCCL library when starting the downstream classification task #731

Closed letdivedeep closed 1 year ago

letdivedeep commented 1 year ago

Branch

1.x branch (1.x version, such as v1.0.0rc2, or dev-1.x branch)

Prerequisite

Environment

sys.platform: linux
Python: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA A10G
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.109
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.12.1+cu113
PyTorch compiling details: PyTorch built with:
TorchVision: 0.13.1+cu113
OpenCV: 4.7.0
MMEngine: 0.6.0
MMCV: 2.0.0rc4
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 11.3
MMSelfSup: 1.0.0rc6+6c13b42

Describe the bug

I trained a MixMIM pretext model using the config provided in the zip, and started the pretext training with the following command:

bash tools/dist_train.sh saved_models/mixmim/encoder/mixmim-base-p16_16xb128-coslr-410_encoder.py 1 --work-dir mixmim/encoder

The training started successfully. I then extracted the backbone weights from the saved checkpoint with this command:

python tools/model_converters/extract_backbone_weights.py \
  saved_models/mixmim/encoder/epoch_2.pth \
  saved_models/mixmim/encoder/mixmim_backbone-weights.pth
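
As a quick sanity check of the extracted file (a minimal sketch, assuming the output path from the command above and that the file is either a plain state dict or wraps one under a 'state_dict' key), the keys can be inspected before wiring it into the downstream config:

python - <<'EOF'
import torch
# path produced by extract_backbone_weights.py above; adjust if yours differs
ckpt = torch.load('saved_models/mixmim/encoder/mixmim_backbone-weights.pth', map_location='cpu')
state = ckpt.get('state_dict', ckpt)  # handle either a wrapped or a plain state dict
print(len(state), 'tensors; first keys:', sorted(state)[:5])
EOF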

I then built the config for the downstream task (also provided in the zip) and started the downstream linear classification task with this command:

bash tools/benchmarks/classification/mim_dist_train.sh saved_models/mixmim/linear_v2_410_cls_28Mar/mixmim_vit-base-p16_8xb256-fp16-coslr-300e_in1k_410cls_linear_eval.py saved_models/mixmim/encoder_v2_410cls_pretext_27feb/cae_backbone-weights.pth

This produced the error shown below.

Reproduces the problem - code sample

No response

Reproduces the problem - command or script

bash tools/benchmarks/classification/mim_dist_train.sh saved_models/mixmim/linear_v2_410_cls_28Mar/mixmim_vit-base-p16_8xb256-fp16-coslr-300e_in1k_410cls_linear_eval.py saved_models/mixmim/encoder_v2_410cls_pretext_27feb/cae_backbone-weights.pth

Reproduces the problem - error message

+ CFG=saved_models/mixmim/linear_v2_410_cls_28Mar/mixmim_vit-base-p16_8xb256-fp16-coslr-300e_in1k_410cls_linear_eval.py
+ PRETRAIN=saved_models/mixmim/encoder_v2_410cls_pretext_27feb/cae_backbone-weights.pth
+ GPUS=8
+ PY_ARGS=
++ dirname tools/benchmarks/classification/mim_dist_train.sh
+ PYTHONPATH=tools/benchmarks/classification/..:
+ mim train mmcls saved_models/mixmim/linear_v2_410_cls_28Mar/mixmim_vit-base-p16_8xb256-fp16-coslr-300e_in1k_410cls_linear_eval.py --launcher pytorch -G 8 --cfg-options model.backbone.init_cfg.type=Pretrained model.backbone.init_cfg.checkpoint=saved_models/mixmim/encoder_v2_410cls_pretext_27feb/cae_backbone-weights.pth model.backbone.init_cfg.prefix=backbone.
Using port 20966 for synchronization.
Training command is /opt/conda/bin/python -m torch.distributed.launch --nproc_per_node=8 --master_port=20966 /opt/conda/lib/python3.7/site-packages/mmcls/.mim/tools/train.py saved_models/mixmim/linear_v2_410_cls_28Mar/mixmim_vit-base-p16_8xb256-fp16-coslr-300e_in1k_410cls_linear_eval.py --launcher pytorch --cfg-options model.backbone.init_cfg.type=Pretrained model.backbone.init_cfg.checkpoint=saved_models/mixmim/encoder_v2_410cls_pretext_27feb/cae_backbone-weights.pth model.backbone.init_cfg.prefix=backbone..
/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/opt/conda/lib/python3.7/site-packages/mmengine/utils/dl_utils/setup_env.py:57: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  'Setting MKL_NUM_THREADS environment variable for each process'
[the same UserWarning is printed once by each of the 8 worker processes]
[each of the 8 worker processes raises the identical traceback; their interleaved output is deduplicated below]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/mmcls/.mim/tools/train.py", line 162, in <module>
    main()
  File "/opt/conda/lib/python3.7/site-packages/mmcls/.mim/tools/train.py", line 155, in main
    runner = Runner.from_cfg(cfg)
  File "/opt/conda/lib/python3.7/site-packages/mmengine/runner/runner.py", line 458, in from_cfg
    cfg=cfg,
  File "/opt/conda/lib/python3.7/site-packages/mmengine/runner/runner.py", line 345, in __init__
    self.setup_env(env_cfg)
  File "/opt/conda/lib/python3.7/site-packages/mmengine/runner/runner.py", line 650, in setup_env
    broadcast(timestamp)
  File "/opt/conda/lib/python3.7/site-packages/mmengine/dist/dist.py", line 312, in broadcast
    torch_dist.broadcast(data_on_device, src, group)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1193, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 212) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py", line 755, in run

  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 214)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-03-28_18:55:05
  host      : 071fe4403cf1
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 215)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2023-03-28_18:55:05
  host      : 071fe4403cf1
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 216)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2023-03-28_18:55:05
  host      : 071fe4403cf1
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 217)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2023-03-28_18:55:05
  host      : 071fe4403cf1
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 218)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2023-03-28_18:55:05
  host      : 071fe4403cf1
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 219)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-28_18:55:05
  host      : 071fe4403cf1
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 212)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Traceback (most recent call last):
  File "/opt/conda/bin/mim", line 8, in <module>
    sys.exit(cli())
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/mim/commands/train.py", line 111, in cli
    other_args=other_args)
  File "/opt/conda/lib/python3.7/site-packages/mim/commands/train.py", line 262, in train
    cmd, env=dict(os.environ, MASTER_PORT=str(port)))
  File "/opt/conda/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-m', 'torch.distributed.launch', '--nproc_per_node=8', '--master_port=20966', '/opt/conda/lib/python3.7/site-packages/mmcls/.mim/tools/train.py', 'saved_models/mixmim/linear_v2_410_cls_28Mar/mixmim_vit-base-p16_8xb256-fp16-coslr-300e_in1k_410cls_linear_eval.py', '--launcher', 'pytorch', '--cfg-options', 'model.backbone.init_cfg.type=Pretrained', 'model.backbone.init_cfg.checkpoint=saved_models/mixmim/encoder_v2_410cls_pretext_27feb/cae_backbone-weights.pth', 'model.backbone.init_cfg.prefix=backbone.']' returned non-zero exit status 1.

Additional information

Attached the configs: Archive.zip


letdivedeep commented 1 year ago

@YuanLiuuuuuu I was able to resolve this issue. The error indicated that NCCL could not find the external network plugin library libnccl-net.so and was falling back to its internal implementation for communication. The plugin library provides optimized network transport implementations for various hardware and software environments.
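
For anyone hitting the same error, the symptom can be checked up front with standard tools (a rough sketch; ldconfig -p simply lists the libraries the dynamic loader can see, so it shows whether libnccl and any libnccl-net plugin are visible at all):

# check whether the NCCL libraries (and the optional net plugin) are visible to the loader
ldconfig -p | grep -i nccl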

I fixed it by installing these packages:

sudo apt-get install libnccl2 libnccl-dev

and then adding the library path:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/
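
With the path exported, re-running the same benchmark command with verbose NCCL logging is an easy way to confirm that NCCL now initializes cleanly (NCCL_DEBUG=INFO is a standard NCCL environment variable and is inherited by the launched worker processes; the paths are the same ones used above):

export NCCL_DEBUG=INFO
bash tools/benchmarks/classification/mim_dist_train.sh \
  saved_models/mixmim/linear_v2_410_cls_28Mar/mixmim_vit-base-p16_8xb256-fp16-coslr-300e_in1k_410cls_linear_eval.py \
  saved_models/mixmim/encoder_v2_410cls_pretext_27feb/cae_backbone-weights.pth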