Closed: letdivedeep closed this issue 1 year ago
@YuanLiuuuuuu I was able to resolve this issue. The message indicated that NCCL was unable to find the external plugin library libnccl-net.so and was falling back to its internal implementation for communication. The plugin library provides optimized network transport implementations for various hardware and software environments.
I fixed it by installing these packages:
sudo apt-get install libnccl2 libnccl-dev
and then adding the library path:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/
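For anyone hitting the same message, here is a minimal sketch for checking whether the plugin file is visible on the library path after the steps above. The helper name `find_shared_library` is hypothetical (it is not part of NCCL or MMSelfSup), and it only inspects the directories listed in `LD_LIBRARY_PATH`; the dynamic loader also consults other locations, so a miss here does not prove the library is absent from the system.

```python
import os
from pathlib import Path


def find_shared_library(name, search_path):
    """Return the full path of `name` in the first directory of a
    path-separated search string that contains it, or None."""
    for entry in search_path.split(os.pathsep):
        if not entry:
            continue  # skip empty entries, e.g. from a leading separator
        candidate = Path(entry) / name
        if candidate.is_file():
            return str(candidate)
    return None


if __name__ == "__main__":
    # Assumption: we only look at LD_LIBRARY_PATH, not the loader cache.
    search_path = os.environ.get("LD_LIBRARY_PATH", "")
    hit = find_shared_library("libnccl-net.so", search_path)
    print(hit or "libnccl-net.so not found on LD_LIBRARY_PATH")
```

Launching training with the environment variable `NCCL_DEBUG=INFO` set should also make NCCL log whether it loaded a network plugin or fell back to its internal implementation.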
Branch
1.x branch (1.x version, such as v1.0.0rc2, or dev-1.x branch)
Prerequisite
Environment
sys.platform: linux
Python: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA A10G
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.109
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.12.1+cu113
PyTorch compiling details: PyTorch built with:
TorchVision: 0.13.1+cu113
OpenCV: 4.7.0
MMEngine: 0.6.0
MMCV: 2.0.0rc4
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 11.3
MMSelfSup: 1.0.0rc6+6c13b42
Describe the bug
I trained a pretext MixMIM model using the config provided in the zip. I started the pretext model training using the following command:
The model training started successfully. I then converted the saved checkpoints to PyTorch format using this command:
I built the config for the downstream task (config provided in the zip) and started the downstream linear classification task using this command:
The run then failed with the error shown below.
Reproduces the problem - code sample
No response
Reproduces the problem - command or script
Reproduces the problem - error message
Additional information
I have attached the configs as Archive.zip.
No response