Closed — 2181382zht closed this issue 5 years ago
Thanks. Could you let me know what your PyTorch, CUDA, and GPU driver versions are? Also, you said you were using multiple GPUs when training the model, right?
CUDA 10.0, cuDNN 7.5, Python 3.7, PyTorch 1.1.0. I'm not using multiple GPUs.
Thanks. Could you copy the full error message from when you hit the out-of-memory issue? Also, is there any reason you changed your batch size from 4 to 2? On my testing machine, batch size 4 should only use about 6.7 GB of memory, which is well under your 2080 Ti's total capacity.
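For comparison, actual allocator usage can be read from the standard `torch.cuda` API. A minimal sketch (the device index `cuda:0` is an assumption):

```python
import torch

# Minimal sketch: report how much memory PyTorch's caching allocator is
# using on the first GPU. "allocated" counts live tensors; the peak is a
# high-water mark, so real pressure can be higher due to fragmentation.
if torch.cuda.is_available():
    dev = torch.device("cuda:0")
    allocated = torch.cuda.memory_allocated(dev)
    peak = torch.cuda.max_memory_allocated(dev)
    print(f"allocated: {allocated / 2**20:.1f} MiB, peak: {peak / 2**20:.1f} MiB")
else:
    print("CUDA is not available on this machine")
```

Printing these two numbers once per epoch makes it easy to see whether usage grows over time or jumps at a specific step.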
The first error that came up was `RuntimeError: CUDA error: unspecified launch failure`. When I ran train.py again, I got: `RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.00 GiB total capacity; 4.16 GiB already allocated; 748.80 KiB free; 412.29 MiB cached)`
I changed the batch size just as an experiment, to avoid running out of memory. Thank you.
Hi, the launch failure issue (as well as the out-of-memory issue) is most likely caused by GPU drivers. Please refer to this post from Soumith (one of the main developers of PyTorch) for details - "it’d be either related to an nvidia driver bug, or that your GPU is faulty". The early drivers for Nvidia 20 series GPUs are often unstable and buggy. Please upgrade your GPU driver to the latest stable version. If the issue still persists, please try other GPUs.
Also, try rebooting your system first if you have not done so yet. See this post.
When I trained on the deep_lession dataset, GPU memory usage reached 10.7 GB. Training has not crashed so far, but the program runs very slowly. This only starts after the 10th epoch.
I see. So, have you tried my two previous suggestions? Also, the 10th-epoch issue may be related to the model saving step. Could you try changing the save_step config to another number (say 20) and see if the issue happens at a different epoch?
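If the slowdown does track the saving step, one thing to rule out is the checkpoint keeping GPU-side state alive during serialization. A minimal sketch, assuming a plain `state_dict` checkpoint (the helper name `save_checkpoint` is hypothetical; only the `state_dict`/`torch.save` calls are standard PyTorch API):

```python
import torch

def save_checkpoint(model, path):
    # Hypothetical helper: copy the weights to CPU before serializing,
    # so the saved tensors do not hold extra GPU memory during the dump.
    state = {k: v.cpu() for k, v in model.state_dict().items()}
    torch.save(state, path)
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached blocks to the driver
```

If the slowdown persists even with a CPU-side save like this, the saving step itself is probably not the culprit.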
I have tried the two suggestions you mentioned, and the memory overflow problem has not reappeared since I reinstalled the environment. I have now changed save_step to 20 as you suggested. After the change, the first 20 epochs did run at normal speed, but after the model was saved, training became very slow again, just as before.
Thanks! This seems to be a reproducible issue. I just received an Nvidia 20 series GPU today, so I will try to reproduce your bug on it to see whether it is GPU-dependent. BTW, which dataset (or both) were you using when you hit this bug?
I ran the same experiment on both datasets and had the same problem.
Have you tested it on a 20 series GPU?
Hi, sorry for getting back to you late. I tested the model on my 2080 Ti recently and everything worked fine, so I am not sure what the problem is on your side. Here are several things I would check for further verification:
I reinstalled the system and the environment, tried again, and it succeeded. Thank you!
That's great! I am glad to hear that. Thanks for reporting it, btw.
Before training I changed the batch size from 4 to 2. I watched GPU memory usage throughout training, and no other programs were using GPU memory. The first 10 epochs were fine, but at around step 60 of epoch 11 I saw memory usage jump from about 90% to 100%, and then the error occurred. I wonder whether using two GPUs would help. Thank you.