Closed longmalongma closed 3 years ago
You are going to need two GPUs so there should be two CUDA IDs instead of one.
Thanks for your reply, but I only have one GPU (RTX 2080). How can I run it?
Change the nproc_per_node parameter to 1, i.e.
CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 9842 --nproc_per_node=1 train.py --id retrain_s0 --stage 0
It is also likely that you need to decrease the batch size because 2080 has less memory than the 1080Ti that we used. You can try different batch sizes like:
CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 9842 --nproc_per_node=1 train.py --id retrain_s0 --stage 0 --batch_size 4
The performance would not be the same though as the effective batch size is smaller. I suggest using AMP -- it will decrease memory usage and allow you to fit more images into a single GPU and thus reduce the batch size gap.
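To illustrate the AMP suggestion, here is a minimal sketch of a mixed-precision training step using PyTorch's native torch.cuda.amp. The function name train_step_amp and the loop shape are hypothetical, not the repo's actual training code; the point is the autocast/GradScaler pattern.

```python
import torch
from torch import nn

def train_step_amp(model, optimizer, scaler, images, targets, loss_fn, device="cuda"):
    """One training step with automatic mixed precision (AMP).

    Forward pass runs under autocast (half precision where safe), and the
    backward pass uses a GradScaler to avoid underflow in fp16 gradients.
    With device="cpu" both are disabled, so the step degrades to full fp32.
    """
    optimizer.zero_grad()
    use_amp = (device == "cuda")
    with torch.cuda.amp.autocast(enabled=use_amp):
        out = model(images.to(device))
        loss = loss_fn(out, targets.to(device))
    scaler.scale(loss).backward()   # scale loss before backward
    scaler.step(optimizer)          # unscales gradients, then optimizer.step()
    scaler.update()                 # adjust the scale factor for next step
    return loss.item()
```

A typical setup would create the scaler once before the loop, e.g. `scaler = torch.cuda.amp.GradScaler()`, and reuse it for every step.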
Ok, thank you very much. I will continue to follow your work.
subprocess.CalledProcessError: Command '['/data/dangjisheng/anaconda3/envs/mivos2/bin/python', '-u', 'train.py', '--local_rank=0', '--id', 'retrain_s0', '--stage', '0', '--batch_size', '4']' returned non-zero exit status 1. Hi, this error does not seem to be a GPU problem. I now have a machine with four 2080Ti GPUs, but I still encounter it.
What command did you use? What is the full error message?
What command did you use? What is the full error message? The full error message is:
Right on the second line, it said "CUDA driver initialization failed". You probably need to check your CUDA setup, e.g., GPU driver.
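A quick way to check whether the driver side of the CUDA setup is healthy is to query PyTorch directly. This is a generic diagnostic sketch (the helper name cuda_diagnostics is made up for illustration); if cuda_available comes back False on a machine with GPUs, the driver or its version mismatch is usually the culprit.

```python
def cuda_diagnostics():
    """Return basic CUDA/driver availability info as a dict.

    If torch itself cannot be imported, report that instead of crashing,
    so the check can run in any environment.
    """
    try:
        import torch
    except ImportError as e:
        return {"error": str(e)}
    info = {
        "torch_version": torch.__version__,
        "built_cuda": torch.version.cuda,          # CUDA version torch was built with
        "cuda_available": torch.cuda.is_available(),  # False usually means a driver problem
    }
    if info["cuda_available"]:
        info["device_count"] = torch.cuda.device_count()
        info["device_name"] = torch.cuda.get_device_name(0)
    return info

print(cuda_diagnostics())
```

Comparing `built_cuda` against the version reported by `nvidia-smi` often reveals a driver/toolkit mismatch.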
I think maybe it's not CUDA that's causing the problem, because I have tried it with both CUDA 10 and CUDA 11.
And now it is showing a completely different error! The folder has to be a git repo (for logging purposes, you can disable it by commenting out relevant code).
I have commented it out, but the error still remains.
This is a different error, check the error message carefully.
This usually means a previous zombie process is still living. Use top or nvidia-smi to locate and kill it.
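The cleanup described above can be done from the shell. The commands below sketch the pattern; the demonstration uses a harmless `sleep` process as a stand-in for a stale trainer, since actually killing a training job here would depend on the machine's state.

```shell
# Stale workers from a previous run can keep holding the GPU or the port.
# To find them in practice:
#   nvidia-smi            # shows PIDs of processes using the GPU
#   pgrep -f train.py     # finds processes by command line
# Then force-kill the stale PID.

# Demonstration with a stand-in process instead of a real trainer:
sleep 300 &
STUCK_PID=$!

kill -9 "$STUCK_PID"           # force-kill the stale process
wait "$STUCK_PID" 2>/dev/null  # reap it so it does not linger as a zombie
echo "killed $STUCK_PID"
```

If the error is a port conflict, changing `--master_port` to an unused port is an alternative to killing the old process.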
Earlier, when reproducing your code on a server with four 2080Ti GPUs, I got this error: I thought it was caused by insufficient memory, but when I ran your code on a supercomputer with two Tesla V100s, I got the same error again:
This shows it is probably not a problem with my machine or its memory. Is there any way to fix this error?
It is not a memory problem, nor are they having the same problem. Look further in the error message -- the first one indicates a driver problem, and the second says thinplate is not installed.
Thank you very much for your reply. After making the changes, I ran into this problem. Do you know how to solve it?
Looks like the code has been modified (Line 28, model.py).
The cause of the error is usually apparent if you read the error message thoroughly.
Thank you very much! I got it.
Hi, thanks for your great work! When I try to run CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 9842 --nproc_per_node=2 train.py --id retrain_s0 --stage 0, I meet this problem. Can you help me?
File "/home/longma/anaconda2/envs/p3torchstm/lib/python3.6/site-packages/torch/distributed/launch.py", line 242, in main cmd=cmd) subprocess.CalledProcessError: Command '['/home/longma/anaconda2/envs/p3torchstm/bin/python', '-u', 'train.py', '--local_rank=1', '--id', 'retrain_s0', '--stage', '0']' returned non-zero exit status 1.