sparklingyueran opened this issue 1 year ago
Hi, sparklingyueran, the problem is that you have another unfinished job running, which is using the designated port. To run multiple jobs at the same time on the same machine, make sure to set a different port for each job by adding PORT=XXX (XXX is the port number)
before your training / testing command.
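The advice above can be sketched as follows. This is a hedged example: the script name `tools/dist_train.sh` and its `<config>` argument are placeholders for your actual launch command, and the helper picks a free port via the Python stdlib so two jobs never collide:

```shell
# Ask the OS for a currently free TCP port (bind to port 0, read back the assignment).
free_port() {
    python -c 'import socket; s = socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()'
}

PORT_A=$(free_port)
PORT_B=$(free_port)
echo "job 1 -> port $PORT_A, job 2 -> port $PORT_B"

# Hypothetical launch commands; substitute your real script and config:
# CUDA_VISIBLE_DEVICES=0 PORT=$PORT_A bash tools/dist_train.sh <config_a> 1 &
# CUDA_VISIBLE_DEVICES=1 PORT=$PORT_B bash tools/dist_train.sh <config_b> 1 &
```

Any two distinct unused ports work just as well; the helper only avoids having to pick them by hand.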
I have a similar question. No matter whether one or more GPUs are used, the error occurs. I tested multiple versions of mmcv/mmdet/mmpose, but it still hasn't been solved. Details are as follows: subprocess.CalledProcessError: Command '['/root/anaconda3/envs/pyskl3/bin/python', '-u', 'tools/data/custom_2d_skeleton.py', '--local_rank=7', '--video-list', 'notebook/diving48.list', '--out', 'notebook/diving48_annos.pkl']' died with <Signals.SIGSEGV: 11>.
Hi, xiaoxin-Crayon, I'm not sure what the problem is given the existing context. Perhaps you can delve deeper into custom_2d_skeleton.py
(run it block by block) to see what might be the issue.
Hi, xiaoxin-Crayon, I'm not sure what the problem is given the existing context. Perhaps you can delve deeper into
custom_2d_skeleton.py
(run it block by block) to see what might be the issue. Thanks for your reply. I have already run it block by block, and I found that the error occurs when executing this line: https://github.com/kennymckormick/pyskl/blob/bd46e966ab0e332a5cd3a806ae319e6fd4df9ed0/tools/data/custom_2d_skeleton.py#L119
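A SIGSEGV from a Python process almost always originates in native code (CUDA ops, OpenCV, decord), so the usual Python traceback is lost. One way to narrow it down, sketched here, is to run the script with the interpreter's built-in faulthandler enabled, which dumps each thread's Python stack at the moment of the crash and points at the exact Python line that entered the crashing native call:

```shell
# -X faulthandler enables the crash-time stack dumper; this just confirms it is on:
python -X faulthandler -c 'import faulthandler; print(faulthandler.is_enabled())'

# Applied to the failing command from this thread (paths copied from the error above):
# python -X faulthandler -u tools/data/custom_2d_skeleton.py \
#     --video-list notebook/diving48.list --out notebook/diving48_annos.pkl
```

Setting the environment variable `PYTHONFAULTHANDLER=1` has the same effect when you cannot change the command line.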
Hi, xiaoxin-Crayon, I'm not sure what the problem is given the existing context. Perhaps you can delve deeper into
custom_2d_skeleton.py
(run it block by block) to see what might be the issue. I suspect the package versions caused this problem. Could you provide the specific package versions (including the CUDA, cuDNN, and conda environments)?
Hi, sparklingyueran, the problem is that you have another unfinished job running, which is using the designated port. To run multiple jobs at the same time on the same machine, make sure to set a different port for each job by adding
PORT=XXX (XXX is the port number)
before your training / testing command.
Do you mean CUDA_VISIBLE_DEVICES=3 PORT=3 bash tools/xxx.sh 1
? I tried this command, but it failed. Could you please give a specific example?
My error may have been caused by a PyTorch distributed-training compatibility problem introduced by the mm-x versions. I tried different versions of mmcv-full, mmdet, and mmpose, and finally solved it. Follow these steps:
docker pull nvidia/cuda:11.0.3-cudnn8-devel-ubuntu20.04
docker run --gpus all --shm-size 40g -t -i --name ycworkspace -v /your-path/pyskl:/pyskl nvidia/cuda:11.0.3-cudnn8-devel-ubuntu20.04 /bin/bash
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=10.2 -c pytorch
pip install opencv-python-headless
pip install mmcv-full==1.5.0 -f https://download.openmmlab.com/mmcv/dist/cu102/torch1.10/index.html
git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection
pip install -r requirements/build.txt
pip install -v -e .
git clone https://github.com/open-mmlab/mmpose.git
cd mmpose
pip install -r requirements.txt
pip install -v -e .
pip install decord matplotlib moviepy numpy pymemcache scipy tqdm
pip install -e .
apt install ffmpeg libsm6 libxext6 -y
apt install libgl1
pip install -U 'git+https://github.com/facebookresearch/fvcore'
When {num_gpus} is 1, 'bash tools/xxx.sh' works. But when I set {num_gpus} to 2 or 3, it shows the error:
RuntimeError: Address already in use ....... subprocess.CalledProcessError: Command '[...............,'--local_rank=0']' returned non-zero exit status 1.
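"Address already in use" means the rendezvous port of a previous (possibly crashed or still-running) job is still held. A quick stdlib-only check, sketched below, tells you whether the port is free; 29500 is PyTorch's default distributed master port, so substitute whatever value you pass via PORT:

```shell
PORT=29500  # PyTorch's default master port; change if you launch with PORT=XXX

# Try to bind the port: success means it is free, failure means a job still holds it.
if python -c "import socket; s = socket.socket(); s.bind(('', $PORT)); s.close()" 2>/dev/null; then
    echo "port $PORT is free"
else
    echo "port $PORT is busy: kill the stale job or launch with a different PORT=XXX"
fi
```

If the port is busy, either terminate the stale process or simply relaunch with a different PORT value, as suggested earlier in this thread.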