Error when {num_gpus} is more than 1

sparklingyueran commented 1 year ago

When {num_gpus} is 1, 'bash tools/xxx.sh' works. But when I set {num_gpus} to 2 or 3, it shows an error:

RuntimeError: Address already in us ....... subprocess.CalledProcessError: Command '[...............,'--local_rank=0']' returned non-zero exit status 1.

kennymckormick commented 1 year ago

Hi, sparklingyueran, the problem is that you have another unfinished job running, which is using the designated port. To run multiple jobs at the same time on the same machine, make sure to set a different port for each job by adding PORT=XXX (XXX is the port number) before your training / testing command.
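As a concrete illustration of the advice above, here is a minimal bash sketch that looks for a port nobody is listening on and prints the launch command with that port set. The helper name find_free_port, the starting value 29500, and the GPU count 2 are assumptions made for this example and are not part of pyskl; tools/xxx.sh is the placeholder from the original question. The check relies on bash's /dev/tcp feature and only detects ports that have an active listener.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of pyskl): print the first port at or above $1
# that accepts no TCP connection on localhost, i.e. has no active listener.
find_free_port() {
    local port=$1
    # The connection attempt succeeds only if something is listening on the port.
    while (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null; do
        port=$((port + 1))
    done
    echo "${port}"
}

# Assumed starting point for the search; any unprivileged port would do.
PORT=$(find_free_port 29500)

# Print the command to run; tools/xxx.sh and the GPU count are placeholders.
echo "PORT=${PORT} bash tools/xxx.sh 2"
```

Ports below 1024 are normally reserved for privileged processes, so a value in the unprivileged range is the safer choice. A second job started on the same machine would repeat the lookup and end up with a different port, as long as the first job is already listening.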

yuchen-ji commented 1 year ago

I have a similar problem. Errors occur no matter whether one or more GPUs are used. I tested multiple versions of mmcv/mmdet/mmpose, but it still hasn't been solved. Details are as follows:

subprocess.CalledProcessError: Command '['/root/anaconda3/envs/pyskl3/bin/python', '-u', 'tools/data/custom_2d_skeleton.py', '--local_rank=7', '--video-list', 'notebook/diving48.list', '--out', 'notebook/diving48_annos.pkl']' died with <Signals.SIGSEGV: 11>.

kennymckormick commented 1 year ago

Hi, xiaoxin-Crayon, I'm not sure what the problem is given the existing context. Perhaps you can delve deeper into the custom_2d_skeleton.py (run it block by block) to see what might be the issue.
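One way to narrow down a SIGSEGV without stepping through the script by hand is Python's built-in faulthandler module, which prints a Python traceback when the interpreter receives a fatal signal. The following is a minimal sketch, assuming the crash happens inside a Python process started from the same shell. The environment variable is standard Python; the command to rerun is whatever you launched before.

```shell
# Enable Python's faulthandler for every Python process started from this shell,
# including workers spawned by a distributed launcher (they inherit the environment).
export PYTHONFAULTHANDLER=1

# To confirm it is active, this one-liner should print True:
#   python -c 'import faulthandler; print(faulthandler.is_enabled())'

# Then rerun the failing command unchanged. On a segfault, the crashing process
# writes its Python traceback to stderr before it dies.
```

The traceback points at the Python line that was executing when the native code crashed, which often identifies the extension module (and therefore the package) involved.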

yuchen-ji commented 1 year ago

Hi, xiaoxin-Crayon, I'm not sure what the problem is given the existing context. Perhaps you can delve deeper into the custom_2d_skeleton.py (run it block by block) to see what might be the issue.

Thanks for your reply. I have already run it block by block and found that the error occurs when executing this line: https://github.com/kennymckormick/pyskl/blob/bd46e966ab0e332a5cd3a806ae319e6fd4df9ed0/tools/data/custom_2d_skeleton.py#L119

yuchen-ji commented 1 year ago

Hi, xiaoxin-Crayon, I'm not sure what the problem is given the existing context. Perhaps you can delve deeper into the custom_2d_skeleton.py (run it block by block) to see what might be the issue.

I guess the package versions caused this problem. Could you provide the specific package versions you use (including CUDA, cuDNN, and the conda environment)?

sparklingyueran commented 1 year ago

Hi, sparklingyueran, the problem is that you have another unfinished job running, which is using the designated port. To run multiple jobs at the same time on the same machine, make sure to set a different port for each job by adding PORT=XXX (XXX is the port number) before your training / testing command.

Do you mean CUDA_VISIBLE_DEVICES=3 PORT=3 bash tools/xxx.sh 1? I tried this command, but it failed. Could you please give a specific example?

yuchen-ji commented 1 year ago

My error seems to have been caused by a compatibility problem between PyTorch distributed training and the versions of the mm-x packages. I tried different versions of mmcv-full, mmdet, and mmpose, and finally solved it. Follow these steps:

Docker:

Pull the image (make sure your NVIDIA driver supports the image's CUDA version)

docker pull nvidia/cuda:11.0.3-cudnn8-devel-ubuntu20.04

Start the container (make sure --shm-size is large enough)

docker run --gpus all --shm-size 40g -t -i --name ycworkspace -v /your-path/pyskl:/pyskl nvidia/cuda:11.0.3-cudnn8-devel-ubuntu20.04 /bin/bash

Conda env:

PyTorch version (I tested several CUDA 11.x builds, but none of them ran the 'generate-custom-dataset' demo successfully)

conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=10.2 -c pytorch

opencv-headless (if you use a remote server that doesn't have a GUI)

pip install opencv-python-headless

mmcv

pip install mmcv-full==1.5.0 -f https://download.openmmlab.com/mmcv/dist/cu102/torch1.10/index.html

mmdet

git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection
pip install -r requirements/build.txt
pip install -v -e .

mmpose

git clone https://github.com/open-mmlab/mmpose.git
cd mmpose
pip install -r requirements.txt
pip install -v -e .

Install the remaining packages from the requirements list:

decord matplotlib moviepy numpy pymemcache scipy tqdm

pyskl install

pip install -e .

If libGL.so.1 is reported missing ("No such file or directory"):

apt install ffmpeg libsm6 libxext6 -y
apt install libgl1

If you get "No module named 'fvcore'":

pip install -U 'git+https://github.com/facebookresearch/fvcore'
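To confirm that the environment ended up with the versions installed in the steps above, a quick report like the following can help when comparing setups. This is a sketch only: report_versions is a hypothetical helper name, and it assumes the python on your PATH is the one from the conda environment you just set up.

```shell
# Hypothetical helper: print the installed version of each package from the
# steps above, or a note when the package cannot be imported in this environment.
report_versions() {
    for mod in torch torchvision mmcv mmdet mmpose; do
        python -c "import ${mod}; print('${mod}', ${mod}.__version__)" 2>/dev/null \
            || echo "${mod} not importable"
    done
}

report_versions
```

Posting this output together with the CUDA version of the Docker image makes it easier for others to reproduce the working combination.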
