Closed · deveshjawla closed this issue 4 years ago
This looks like the same issue that was also reported here: https://github.com/JuliaGPU/CUDA.jl/issues/447. It looks like a CUDA.jl issue but I'll dig deeper into this after I release v0.4.
Also, I am testing MDP support right now so it should be available soon. :-)
I believe it is CUDA.jl v1.3.3 not playing well with CUDA 11.1.
I will downgrade the NVIDIA drivers, CUDA and CuDNN to see if that works.
Because if I run

```
JULIA_CUDA_VERSION=11.1 julia --project --color=yes scripts/alphazero.jl --game connect-four train
```

I get

```
ERROR: LoadError: InitError: CUDA.jl does not yet support CUDA with nvdisasm 11.1.74; please file an issue.
```

which is the same error as in #23.
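In the meantime, I am also trying to pin the toolkit from the Julia side rather than downgrading system-wide. A minimal sketch, assuming the artifact-based CUDA.jl install where JULIA_CUDA_VERSION selects which toolkit CUDA.jl uses (it has to be set before CUDA.jl initializes, e.g. in the shell as in the command above); whether this sidesteps the problem here is just my guess:

```julia
# Sketch: request the CUDA 11.0 toolkit instead of the local CUDA 11.1.
# Assumption: artifact-based CUDA.jl install; the variable must be set
# before CUDA.jl initializes.
ENV["JULIA_CUDA_VERSION"] = "11.0"

using CUDA
CUDA.versioninfo()   # shows which toolkit/driver/cuDNN CUDA.jl actually picked up
```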
> Also, I am testing MDP support right now so it should be available soon. :-)
Thanks for this implementation! And yes, MDP support would be very nice too. With time, I hope to contribute to this implementation as well.
So I tried reducing the number of self-play games, the batch size for learning, and the number of workers in connect4's params.jl (roughly as sketched below). This seems to work on both CUDA 11.1 and CUDA 11.0. Now I am dealing with a variety of errors (see attached pics). Points to note:
And tictactoe trains without any errors. This is enough motivation to play around with the parameters of connect4. Getting to know the implementation better should help.
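The kind of scaling-down I mean, as a sketch only: the field names below are assumptions based on the connect-four params.jl I have locally and may not match other AlphaZero.jl versions, and only the values I lowered are shown.

```julia
# Partial sketch of connect4/params.jl with reduced load.
# Field names are assumptions and may differ between AlphaZero.jl
# versions; fields not shown are kept at the file's original values.
self_play = SelfPlayParams(
  num_games=1000,                          # fewer self-play games per iteration
  num_workers=32,                          # fewer parallel workers
  mcts=MctsParams(num_iters_per_turn=400)) # cheaper MCTS per move

learning = LearningParams(
  use_gpu=true,
  batch_size=256)                          # smaller learning batch size
```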
Thanks for your help investigating this bug! It is very possible that the errors we have been seeing are legitimate "out-of-resources" errors in disguise. That being said, the connect-four example runs on my machine, which has 16GB of RAM and an 8GB RTX 2070 GPU.
> And tictactoe trains without any errors.
This is unsurprising as the tictactoe example is configured to run on CPU.
> This is unsurprising as the tictactoe example is configured to run on CPU.
Yes. And I tried connect4 with CPU only, and it works 100% fine. Then I tried setting use_gpu=true for everything except LearningParams, and it works fine as well, so the problem comes down to the LearningParams part. Hopefully Flux.jl is updated soon and its CUDA.jl dependency is bumped to the latest version. Otherwise, maybe you could hint at where to look, but I am also enjoying exploring the implementation by myself ;)
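In the meantime, this is how I keep an eye on which versions actually get resolved in the project environment (standard Pkg commands; Flux.jl's compat bounds are what can hold CUDA.jl back):

```julia
# List the versions resolved in the active project environment.
using Pkg
Pkg.status()   # full listing, including Flux and CUDA
# In the Pkg REPL, `] status Flux CUDA` narrows the listing to those two packages.
```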
> That being said, the connect-four example runs on my machine, which has 16GB of RAM and an 8GB RTX 2070 GPU.
Actually, I have the same config at my workplace, an RTX 2070 Super and 125 GB of RAM, so I believe it is not a resource issue but really a version-compatibility issue.
> It is very possible that the errors we have been seeing are legitimate "out-of-resources" errors in disguise.
It appears that you were correct. I reduced the training complexity while keeping use_gpu=true everywhere, and the training completed successfully.
But this is very strange: since the resources are the same between our machines, I can't understand why it fails on mine.
OK, I was using more filters than mentioned in the documentation. In the latest master branch you have increased the size of the ResNet; setting num_filters=64 now runs without any problem. I guess we can close this now.
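For reference, this is roughly the network configuration that now works on my 8GB GPU. The hyperparameter names are assumptions taken from connect-four's params.jl and may differ slightly between AlphaZero.jl versions; num_filters is the only value I actually changed.

```julia
# Sketch of the ResNet hyperparameters that fit on an 8GB GPU.
# Names are assumed from connect-four's params.jl; check your version.
netparams = ResNetHP(
  num_filters=64,               # reduced from my earlier, larger setting
  num_blocks=5,
  conv_kernel_size=(3, 3),
  num_policy_head_filters=32,
  num_value_head_filters=32,
  batch_norm_momentum=1.)
```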
Hi guys, I'm getting this error on the master branch, right after self-play has finished. Below the error you'll find that Julia sees CUDA and the device correctly, but it throws a CuDNN error. Can you help?
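For context, the standard CUDA.jl calls for checking this are the following (output not reproduced here):

```julia
# Standard CUDA.jl sanity checks: is the package functional, which device
# is active, and which toolkit/driver/library versions were found.
using CUDA
CUDA.functional()    # true if CUDA.jl initialized successfully
CUDA.device()        # the currently active GPU
CUDA.versioninfo()   # toolkit, driver, and library versions (including CUDNN)
```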