Support for newer CUDA drivers?

CasparQuast commented 1 year ago

Hello,

I'm working on a game implementation and want to train an agent for the game with the AlphaZero approach. I managed to compile and run the tests with the workaround described in https://github.com/google-deepmind/open_spiel/issues/966.

I also had to remove the dqn_torch examples and all references to them in every CMake file to run all tests successfully. Some files from those examples try to import a game for the test suite that no longer exists in this repo.

My question now is whether I can use newer CUDA drivers, for example 12.2 with the latest cuDNN. In global_variables.sh I could only see options for 10.2 and lower, which is very old. Has anybody tested newer CUDA drivers?

According to the NVIDIA toolkit archive, there isn't even a CUDA 10.2 build that supports Ubuntu 22 (the recommended OS for this project), only Ubuntu 18 and lower. https://developer.nvidia.com/cuda-10.2-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu

If this whole approach is just there to be viewed by externals and is not really maintained, is there a better recommended alternative to using ./alpha_zero_torch_example? My goal is just to use my RTX 3070 Ti with CUDA/cuDNN to accelerate training, which seemed straightforward at first sight but turns out to be quite hard to even set up. Has anyone recently experimented with this, or can anyone help me with some tips? Thanks in advance :+1:

lanctot commented 1 year ago

My question now is whether I can use newer CUDA drivers, for example 12.2 with the latest cuDNN. In global_variables.sh I could only see options for 10.2 and lower, which is very old. Has anybody tested newer CUDA drivers?

According to the NVIDIA toolkit archive, there isn't even a CUDA 10.2 build that supports Ubuntu 22 (the recommended OS for this project), only Ubuntu 18 and lower. https://developer.nvidia.com/cuda-10.2-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu

You mean this? https://github.com/google-deepmind/open_spiel/blob/a95a9d6589a57e9c7587aa228785aae16f7597f3/open_spiel/scripts/global_variables.sh#L83

Those are external links, hosted on pytorch.org. All you need to do is find the one you want. You can get those URLs by visiting this page (https://pytorch.org/get-started/locally/) and picking the selections that fit your environment. E.g. I just found the link for CUDA 12.1 on Linux.
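As a rough illustration of what those links look like, here is a minimal Python sketch that assembles a LibTorch archive URL from a version and a CUDA tag. The function name `libtorch_url` is hypothetical, and the naming pattern is only inferred from the CUDA 11.3 link quoted elsewhere in this thread. It is not guaranteed for every release (the older 1.5.1 cu102 link, for instance, has no "+cu102" suffix), so the selector on pytorch.org remains the authoritative source.

```python
from urllib.parse import quote

# Base location of the LibTorch archives, as seen in the links in this thread.
BASE = "https://download.pytorch.org/libtorch"


def libtorch_url(torch_version: str, cuda_tag: str) -> str:
    """Build a LibTorch (cxx11 ABI, shared, with deps) archive URL.

    cuda_tag is a string such as "cu113". The pattern is an assumption
    based on one known link; always confirm that the URL actually exists.
    """
    # The "+" in the local version (e.g. 1.10.1+cu113) is percent-encoded as %2B.
    local_version = quote(f"{torch_version}+{cuda_tag}", safe="")
    return f"{BASE}/{cuda_tag}/libtorch-cxx11-abi-shared-with-deps-{local_version}.zip"


print(libtorch_url("1.10.1", "cu113"))
```

The printed URL is the kind of value that would replace the existing LibTorch download link in global_variables.sh before rerunning the install and build steps.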

If this whole approach is just there to be viewed by externals and is not really maintained, is there a better recommended alternative to using ./alpha_zero_torch_example?

From the README:

Important note: this implementation was a user contribution (see https://github.com/deepmind/open_spiel/pull/319), and is not regularly tested nor maintained by the core team. This means that, at any time, it may not build or work as originally intended due to a change that will not have been caught by our tests. Hence, if bugs occur, please open an issue to let us know so we can fix them.

Unfortunately, we simply don't have the resources to maintain this, so you will have to customize it yourself (though I don't see why it wouldn't work with the correct version of LibTorch unless it's not backward-compatible).

We could use some help from the community, though! For example, things like this:

I also had to remove the dqn_torch examples and all references to them in every CMake file to run all tests successfully. Some files from those examples try to import a game for the test suite that no longer exists in this repo.

Could I ask you to submit a PR with those fixes? That would not only help us out, but every other user would benefit as well. So we'd be quite grateful for the help.

CasparQuast commented 1 year ago

First of all, thanks for the quick reply. I'd be very happy to prepare a PR. I initially thought the CUDA versions mentioned in those comments were the only ones supported. I'll experiment and finish my setup. Once I get my game to train on my GPU with the examples, I will submit a PR. :+1:

CasparQuast commented 1 year ago

I have tried a lot: both the old recommended PyTorch install links from global_variables.sh and newer versions, each with a matching local CUDA/cuDNN installation. I still cannot get rid of the following error when executing open_spiel/build/examples/alpha_zero_torch_example (I reran ./install, rebuilt, and ran all tests, and everything succeeded).

I used the latest PyTorch CUDA link listed in global_variables.sh (GPU, CUDA 10.2): https://download.pytorch.org/libtorch/cu102/libtorch-cxx11-abi-shared-with-deps-1.5.1.zip

(venv) (base) caspar@caspar-5801:~/repos/open_spiel/build/examples$ ./alpha_zero_torch_example 
Logging directory: /tmp/az
Creating model: /tmp/az/vpnet.pb
Playing game: tic_tac_toe
Loading model from step 0
[W TensorCompare.cpp:519] Warning: where received a uint8 condition tensor. This behavior is deprecated and will be removed in a future version of PyTorch. Use a boolean condition instead. (function operator())
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 128 n 1024 k 9 mat1_ld 9 mat2_ld 9 result_ld 128 abcType 0 computeType 68 scaleType 0
Exception raised from gemm_and_bias at ../aten/src/ATen/cuda/CUDABlas.cpp:813 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f363b06e38b in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xbf (0x7f363b068f3f in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libc10.so)
frame #2: <unknown function> + 0x31b6135 (0x7f35cfbb6135 in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x31e372d (0x7f35cfbe372d in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x2f51541 (0x7f35cf951541 in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x2f515f0 (0x7f35cf9515f0 in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cuda.so)
frame #6: at::_ops::addmm::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) + 0xab (0x7f361fffdc2b in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x3a6a247 (0x7f3621c6a247 in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x3a6afc2 (0x7f3621c6afc2 in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #9: at::_ops::addmm::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) + 0x1b1 (0x7f3620062b41 in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #10: torch::nn::LinearImpl::forward(at::Tensor const&) + 0xb3 (0x7f36233cd743 in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x413e5b (0x55ede6fc9e5b in ./alpha_zero_torch_example)
frame #12: <unknown function> + 0x4172e0 (0x55ede6fcd2e0 in ./alpha_zero_torch_example)
frame #13: <unknown function> + 0x417bb1 (0x55ede6fcdbb1 in ./alpha_zero_torch_example)
frame #14: <unknown function> + 0x42b00c (0x55ede6fe100c in ./alpha_zero_torch_example)
frame #15: <unknown function> + 0x4022b8 (0x55ede6fb82b8 in ./alpha_zero_torch_example)
frame #16: <unknown function> + 0x4067bc (0x55ede6fbc7bc in ./alpha_zero_torch_example)
frame #17: <unknown function> + 0x8bf60 (0x55ede6c41f60 in ./alpha_zero_torch_example)
frame #18: <unknown function> + 0x29d90 (0x7f35c5829d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #19: __libc_start_main + 0x80 (0x7f35c5829e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #20: <unknown function> + 0x8af25 (0x55ede6c40f25 in ./alpha_zero_torch_example)

Aborted (core dumped)

Is this problem known? The only thing I could find while researching this error is that it might be caused by a CUDA out-of-memory exception, but I simply changed the device from CPU to GPU for training, and my PC is quite powerful: an RTX 3070 Ti and 16 GB of RAM, which was enough to run the examples on my CPU, so this can't be the issue.
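One possible explanation, not confirmed anywhere in this thread: the RTX 3070 Ti is an Ampere GPU with compute capability 8.6, which CUDA toolkits older than 11.1 cannot target, so a LibTorch build made for CUDA 10.2 would not ship kernels for it. The Python sketch below encodes that check as a small lookup. The table and the name `cuda_supports_gpu` are hand-written for illustration and cover only a few architectures.

```python
# Minimum CUDA toolkit release (major, minor) that can target each
# GPU compute capability. Illustrative subset, not an exhaustive list.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere (A100)
    (8, 6): (11, 1),  # Ampere (RTX 30 series, including the RTX 3070 Ti)
}


def cuda_supports_gpu(cuda_version, capability):
    """Return True if the CUDA toolkit version can target the capability."""
    minimum = MIN_CUDA_FOR_CAPABILITY[capability]
    # Tuples compare element by element, so (10, 2) < (11, 1) < (11, 3).
    return cuda_version >= minimum


for cuda in [(10, 2), (11, 3), (12, 1)]:
    print(cuda, cuda_supports_gpu(cuda, (8, 6)))
```

Under this assumption, only a LibTorch build for CUDA 11.1 or newer would be a candidate for this GPU. Whether that is the actual cause of the CUBLAS_STATUS_NOT_INITIALIZED error above remains unverified.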

TheSQLGuru commented 1 year ago

Is there a way that @CasparQuast can post up the code here at the point he has it - a branch maybe? Then others could try out what he has and collaborate to try to move the ball forward. It would be really great if we could eventually get things working on the latest versions of the entire Torch/CUDA/etc stack. I could try to get it going on Windows, which may be able to offer some additional debugging capabilities. But I recently had to blow away my dev environment, and sadly real life has prevented me from rebuilding it just yet. :(

Speaking of debugging, do you have NVIDIA's full-stack profiling/debugging tooling installed? I am uncertain what they expose on Linux, but I would actually expect it to be more robust than what is offered on Windows, at least when it comes to CUDA development.

lanctot commented 11 months ago

@TheSQLGuru I agree, that would be great. Will require some community coordination. @CasparQuast, are you willing to post your code up somewhere in a fork or pull request (which creates a branch)?

(Apologies for the late reply!)

CasparQuast commented 11 months ago

What I currently have is a CUDA 11.3 version. It is not the newest, but it is newer than the latest version in the project documentation. I didn't manage to get the newest CUDA to run with the torch version. Next week my university project will be over, and then I'll create a pull request for my game implementation and the updated config files. Basically, the only thing I changed is the link inside the install script, to match the torch GPU download link for CUDA 11.3 (https://download.pytorch.org/libtorch/cu113/libtorch-cxx11-abi-shared-with-deps-1.10.1%2Bcu113.zip), and I followed the already posted bug workarounds to build successfully (https://github.com/google-deepmind/open_spiel/issues/966).

google-deepmind / open_spiel

Support for newer CUDA drivers? #1129