Not able to train the model on GPU

mit-acl / rl_collision_avoidance

Training code for GA3C-CADRL algorithm (collision avoidance with deep RL)


Not able to train the model on GPU #21

Closed: SacheetA closed this issue 1 year ago

SacheetA commented 1 year ago

I am getting the error 'could not retrieve CUDA device count: CUDA_ERROR_NOT_INITIALIZED: initialization error' (both on Google Colab and on my personal device), and training happens on the CPU. I have learned that this issue may be caused by multiprocessing, but I'm not able to figure out what changes I need to make. Please provide some assistance if possible; meanwhile, I'm looking for a solution elsewhere.

mfe7 commented 1 year ago

Do you need to train on GPU? In my experience with this repo on a Linux machine, training on CPU was actually faster (time per episode) than on GPU. I assumed this was because the observation space is pretty low-dimensional.

If you do need to train on GPU, I don't have a good idea of what causes this error. I would guess it is TensorFlow trying, unsuccessfully, to find CUDA on your machine.

SacheetA commented 1 year ago

Thanks for the reply! Apparently, the 'fork' start method in multiprocessing does not allow child processes to use the GPU; the 'spawn' method needs to be used instead. I tried that, and the model is able to train on GPU, with some limitations. As you mentioned, there seems to be no gain in performance. I also felt a GPU might be needed if we were to train on an image dataset.
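
For anyone landing here later, the switch from 'fork' to 'spawn' can be sketched roughly as below. This is a minimal, standard-library-only illustration and not the repo's actual training code: `make_context`, `worker`, and `run_workers` are hypothetical names, and the spot where the GPU framework would be imported is only marked with a comment, on the assumption that the import happens inside each child process.

```python
import multiprocessing as mp
import os


def make_context():
    # A context keeps the start-method choice local to this code instead of
    # changing the process-wide default.
    return mp.get_context("spawn")


def worker(worker_id, results):
    # Assumption: the GPU framework (e.g. `import tensorflow as tf`) would be
    # imported and initialised here, inside the child process. It is omitted
    # so that the sketch runs without TensorFlow installed.
    results.put((worker_id, os.getpid()))


def run_workers(num_workers=2):
    ctx = make_context()
    results = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(i, results))
             for i in range(num_workers)]
    for p in procs:
        p.start()
    # Fail with queue.Empty rather than hang if a child never reports back.
    out = [results.get(timeout=30) for _ in procs]
    for p in procs:
        p.join()
    return sorted(out)


if __name__ == "__main__":
    # The guard matters with 'spawn': each child re-imports this module, and
    # unguarded process creation would run again in every child.
    for worker_id, pid in run_workers():
        print(f"worker {worker_id} ran in pid {pid}")
```

With 'spawn', each child starts from a fresh interpreter, so the process target and its arguments must be picklable and defined at module level. That is probably where some of the limitations come from when adapting existing 'fork'-based code.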