liaohaofu / adn

ADN: Artifact Disentanglement Network for Unsupervised Metal Artifact Reduction

GPU error #3

Closed 2181382zht closed 5 years ago

2181382zht commented 5 years ago

Before training I changed the batch size from 4 to 2. I watched the GPU memory usage throughout training, and no other programs were using GPU memory. The first 10 epochs ran without problems, but around step 60 of epoch 11 the memory usage jumped from about 90% directly to 100% and then an error occurred. I am wondering whether I should use two GPUs. Thank you.

liaohaofu commented 5 years ago

Thanks. Could you let me know your PyTorch, CUDA, and GPU driver versions? Also, just to confirm: were you using multiple GPUs when training the model?
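
For reference, the PyTorch-visible versions can be printed with a short snippet like the one below (a minimal sketch assuming a standard PyTorch install; the GPU driver version itself is reported by tools such as nvidia-smi rather than by PyTorch):

```python
# Report the library and device versions relevant to this issue.
# Run this from the same Python environment used for training.
import torch

print("PyTorch:", torch.__version__)
print("CUDA (as built):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0))
```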

2181382zht commented 5 years ago

CUDA 10.0, cuDNN 7.5, Python 3.7, PyTorch 1.1.0. I'm not using multiple GPUs.

liaohaofu commented 5 years ago

Thanks. Could you copy the full error message from when you hit the out-of-memory issue? Also, is there any reason you changed your batch size from 4 to 2? On my testing machine, batch size 4 uses only about 6.7 GB of memory, which is well below your 2080 Ti's total memory.

2181382zht commented 5 years ago

The first error that comes up is: RuntimeError: CUDA error: unspecified launch failure. When I run train.py again, I get: RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.00 GiB total capacity; 4.16 GiB already allocated; 748.80 KiB free; 412.29 MiB cached)

I changed the batch size just to try to avoid running out of memory. Thank you.
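
For reference, one way to see whether PyTorch's own allocations grow across epochs is to log its memory counters with a helper like the one below (a minimal sketch; the helper name and device 0 are assumptions, and memory_cached was later renamed memory_reserved in newer PyTorch versions):

```python
import torch

def report_gpu_memory(device=0):
    # Memory currently held by tensors, the peak since the start of the run,
    # and memory held in PyTorch's caching allocator, all in MiB.
    allocated = torch.cuda.memory_allocated(device) / 1024 ** 2
    peak = torch.cuda.max_memory_allocated(device) / 1024 ** 2
    cached = torch.cuda.memory_cached(device) / 1024 ** 2
    print(f"allocated: {allocated:.1f} MiB | peak: {peak:.1f} MiB | cached: {cached:.1f} MiB")

# Call report_gpu_memory() at the end of each epoch to see whether usage grows over time.
```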

liaohaofu commented 5 years ago

Hi, the launch failure issue (as well as the out-of-memory issue) is most likely caused by the GPU driver. Please refer to this post from Soumith (one of the main developers of PyTorch) for details: "it'd be either related to an nvidia driver bug, or that your GPU is faulty". The early drivers for Nvidia 20-series GPUs were often unstable and buggy. Please upgrade your GPU driver to the latest stable version. If the issue still persists, please try other GPUs.

liaohaofu commented 5 years ago

Also, try rebooting your system first if you have not already done so. See this post.

2181382zht commented 5 years ago

When I trained on the deep_lesion dataset, GPU memory usage reached 10.7 GB. Although training has not stopped so far, the program runs very slowly. This only started after the 10th epoch.

liaohaofu commented 5 years ago

I see. So, have you tried my two previous suggestions? Also, the 10th-epoch issue may be related to the model saving step. Could you change the save_step config to another number (say 20) and see whether the issue then appears at a different epoch?

2181382zht commented 5 years ago

I have tried the two suggestions you mentioned before, and the memory overflow problem has not reappeared since I reinstalled the environment. I have now changed save_step to 20 as you suggested. After the change, the program indeed runs at normal speed for the first 20 epochs, but after the model is saved the speed becomes very slow, as before.

liaohaofu commented 5 years ago

Thanks! This looks like a reproducible issue. I just received an Nvidia 20-series GPU today, so I will try to reproduce your bug on that GPU to see whether it is GPU dependent. By the way, which dataset (or both) were you using when you hit this bug?

2181382zht commented 5 years ago

I ran the same experiment on both datasets and had the same problem.

2181382zht commented 5 years ago

Have you tested it on a 20-series GPU?

liaohaofu commented 5 years ago

Hi, sorry for getting back to you late. I tested the model on my 2080 Ti recently and everything worked fine, so I am not sure what the problem is on your side. A few things you could check for further verification:

  1. Try running another PyTorch model (say, ResNet on ImageNet data) and saving the model, to see whether you hit the same problem. That way we can tell whether the problem is specific to my code (see the sketch after this list).
  2. There might also be a problem related to your development environment. I run everything directly from the Linux command line, so if you run the code from an IDE such as PyCharm there might be unknown issues.
  3. There is a small chance that you changed the code unexpectedly, so I suggest cloning the latest version of the ADN code and running everything again.
  4. This might be unrelated, but my 2080 Ti came with a defect and also reported the "unspecified launch failure" error you encountered earlier (that's also why my tests were delayed...). I replaced the 2080 Ti with a new one and reinstalled my OS, and then everything worked properly. If that is also the case for you, you may have to switch to another GPU and try again.
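
A minimal sanity-check sketch for item 1 above (not from the ADN repo; the model choice, random data, and save interval are arbitrary assumptions): train a torchvision ResNet and save a checkpoint every 10 epochs, then watch whether the slowdown after saving also appears outside ADN.

```python
# Standalone sanity check: small ResNet trained on random data, with a
# checkpoint written every 10 "epochs" to mimic a save_step of 10.
import torch
import torch.nn as nn
import torchvision.models as models

device = torch.device("cuda")
model = models.resnet18(num_classes=10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

for epoch in range(1, 41):
    for _ in range(100):  # 100 random batches per "epoch"
        images = torch.randn(16, 3, 224, 224, device=device)
        labels = torch.randint(0, 10, (16,), device=device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    if epoch % 10 == 0:
        torch.save(model.state_dict(), f"resnet_epoch{epoch}.pt")
    print(f"epoch {epoch} done, last loss {loss.item():.4f}")
```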

2181382zht commented 5 years ago

I reinstalled the system and the environment, tried again, and it worked. Thank you.

liaohaofu commented 5 years ago

That's great! I am glad to hear that. Thanks for reporting this, by the way.