hkchengrex / Mask-Propagation

[CVPR 2021] MiVOS - Mask Propagation module. Reproduced STM (and better) with training code :star2:. Semi-supervised video object segmentation evaluation.
https://hkchengrex.github.io/MiVOS/
MIT License

subprocess.CalledProcessError #3

Closed · longmalongma closed this issue 3 years ago

longmalongma commented 3 years ago

Hi, thanks for your great work! When I try to run CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 9842 --nproc_per_node=2 train.py --id retrain_s0 --stage 0, I run into this problem. Can you help me?

File "/home/longma/anaconda2/envs/p3torchstm/lib/python3.6/site-packages/torch/distributed/launch.py", line 242, in main cmd=cmd) subprocess.CalledProcessError: Command '['/home/longma/anaconda2/envs/p3torchstm/bin/python', '-u', 'train.py', '--local_rank=1', '--id', 'retrain_s0', '--stage', '0']' returned non-zero exit status 1.

hkchengrex commented 3 years ago

You are going to need two GPUs so there should be two CUDA IDs instead of one.
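For example (the same command as above, just assuming a machine where two GPUs are visible), the launch line would look like CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 9842 --nproc_per_node=2 train.py --id retrain_s0 --stage 0.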

longmalongma commented 3 years ago

You are going to need two GPUs so there should be two CUDA IDs instead of one.

Thanks for your reply, but I only have one GPU (an RTX 2080). How can I run it?

hkchengrex commented 3 years ago

Change the nproc_per_node parameter to 1, i.e. CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 9842 --nproc_per_node=1 train.py --id retrain_s0 --stage 0

It is also likely that you need to decrease the batch size because 2080 has less memory than the 1080Ti that we used. You can try different batch sizes like: CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 9842 --nproc_per_node=1 train.py --id retrain_s0 --stage 0 --batch_size 4

The performance would not be the same though as the effective batch size is smaller. I suggest using AMP -- it will decrease memory usage and allow you to fit more images into a single GPU and thus reduce the batch size gap.
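For reference, here is a minimal sketch of a mixed-precision training step using PyTorch's native AMP (torch.cuda.amp, available from PyTorch 1.6). This is a generic example, not this repository's own AMP integration; model, optimizer, and loss_fn are placeholder names.

```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # dynamically scales the loss to avoid fp16 gradient underflow

def train_step(model, optimizer, loss_fn, images, labels):
    optimizer.zero_grad()
    with autocast():                    # forward pass runs in mixed precision
        logits = model(images)
        loss = loss_fn(logits, labels)
    scaler.scale(loss).backward()       # backward pass on the scaled loss
    scaler.step(optimizer)              # unscales gradients, then calls optimizer.step()
    scaler.update()                     # adjusts the scale factor for the next iteration
    return loss.item()
```

The memory saved by running activations in fp16 is what lets a larger batch fit on a single 2080.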

longmalongma commented 3 years ago

Change the nproc_per_node parameter to 1, i.e. CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 9842 --nproc_per_node=1 train.py --id retrain_s0 --stage 0

It is also likely that you need to decrease the batch size because 2080 has less memory than the 1080Ti that we used. You can try different batch sizes like: CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 9842 --nproc_per_node=1 train.py --id retrain_s0 --stage 0 --batch_size 4

The performance would not be the same though as the effective batch size is smaller. I suggest using AMP -- it will decrease memory usage and allow you to fit more images into a single GPU and thus reduce the batch size gap.

OK, thank you very much. I will continue to follow your work.

longmalongma commented 3 years ago

subprocess.CalledProcessError: Command '['/data/dangjisheng/anaconda3/envs/mivos2/bin/python', '-u', 'train.py', '--local_rank=0', '--id', 'retrain_s0', '--stage', '0', '--batch_size', '4']' returned non-zero exit status 1.

Hi, this error does not seem to be a problem with the GPU. I now have a machine with four 2080Ti GPUs, but I still encounter this problem.

hkchengrex commented 3 years ago

What command did you use? What is the full error message?

longmalongma commented 3 years ago

What command did you use? What is the full error message?

The full error message is: [image]

hkchengrex commented 3 years ago

Right on the second line, it said "CUDA driver initialization failed". You probably need to check your CUDA setup, e.g., GPU driver.
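As a quick sanity check (a generic suggestion, not something specific to this repo), running python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)" inside the training environment should print True when the driver and PyTorch's CUDA build can see each other.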

longmalongma commented 3 years ago

Right on the second line, it said "CUDA driver initialization failed". You probably need to check your CUDA setup, e.g., GPU driver.

I think it may not be CUDA that is causing the problem, because I have tried it with both CUDA 10 and CUDA 11. [image]

hkchengrex commented 3 years ago

And now it is showing a completely different error! The folder has to be a git repo (for logging purposes; you can disable it by commenting out the relevant code).
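For illustration, a hypothetical sketch of how such a check can be made optional instead of removed. This is not the repository's actual logging code; it only assumes the commit hash is read with GitPython for bookkeeping.

```python
try:
    import git  # GitPython
    repo = git.Repo(search_parent_directories=True)
    commit_hash = repo.head.object.hexsha  # recorded purely for logging
except Exception:
    commit_hash = 'unknown'  # folder is not a git repo (or GitPython is missing)
```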

longmalongma commented 3 years ago

And now it is showing a completely different error! The folder has to be a git repo (for logging purposes; you can disable it by commenting out the relevant code). [image]

I have commented it out, but the error still remains.

hkchengrex commented 3 years ago

This is a different error, check the error message carefully.

This usually means a previous zombie process is still living. Use top or nvidia-smi to locate and kill it.
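For example, nvidia-smi lists the PIDs of processes still holding GPU memory, and kill -9 <PID> (with <PID> replaced by the number it reports) terminates the leftover process.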

longmalongma commented 3 years ago

This is a different error, check the error message carefully.

This usually means a previous zombie process is still living. Use top or nvidia-smi to locate and kill it.

Previously, when I tried to reproduce your code on a server with four 2080Ti GPUs, I got this error: [image] I thought it was caused by insufficient memory, but when I ran your code on a supercomputer with two Tesla V100s, I got the same error:

[image] This shows that it is probably not a problem with my machine or its memory. Is there any way to fix this error?

hkchengrex commented 3 years ago

It is not a memory problem, and the two errors are not the same problem. Look further into the error messages -- the first one indicates a driver problem, and the second says thinplate is not installed.
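For the second error, installing the thin-plate-spline dependency from the README's installation steps should fix it (it is typically installed from GitHub, e.g. something like pip install git+https://github.com/cheind/py-thin-plate-spline, but please follow the README to be sure).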

longmalongma commented 3 years ago

It is not a memory problem, and the two errors are not the same problem. Look further into the error messages -- the first one indicates a driver problem, and the second says thinplate is not installed.

Thank you very much for your reply. After making the changes, I ran into this problem. Do you know how to solve it? [image]

hkchengrex commented 3 years ago

Looks like the code has been modified (Line 28, model.py).

hkchengrex commented 3 years ago

The cause of the error is usually apparent if you read the error message thoroughly.

longmalongma commented 3 years ago

The cause of the error is usually apparent if you read the error message thoroughly.

Thank you very much! I got it.