Wow, that's amazing. I'm happy that you got this far. We have not been actively supporting libtorch-based code for a while, so it's nice to know it mostly still works (or at least compiles!), and under WSL with CUDA no less.
OK, so the error reminded me of something I fixed on the Python side when we moved to PyTorch 2. We had to change the way we were doing masks to use booleans. Take a look at the changes in dqn.py here: https://github.com/google-deepmind/open_spiel/pull/1141/files
Maybe we now need to change that on the C++ side. My guess is you're using a LibTorch version that is in line with PyTorch 2, but the last time we tested anything with LibTorch it was likely 1.10 (based on the link here).
If that's the case, then maybe we're in luck and the fix might be as easy as it was on the Python side, but I'm not sure how to do it. But, if you're not planning to use C++ DQN then you're probably fine to ignore it. It'd be great to get the libtorch code working with V2 though.
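For reference, here is a rough Python sketch of the kind of boolean masking being described. It is not the code from the PR: the function name, the penalty constant, and the example tensors are all made up for illustration. The only thing it relies on is that `torch.where` takes a boolean condition tensor.

```python
import torch

# Hypothetical penalty for illegal actions; this value is not from OpenSpiel.
ILLEGAL_ACTION_PENALTY = -1e9


def mask_illegal_actions(q_values: torch.Tensor, legal_mask: torch.Tensor) -> torch.Tensor:
    """Replace the Q-values of illegal actions with a large negative number."""
    # Masks often arrive as 0/1 floats or uint8. Convert explicitly to bool,
    # which is the condition dtype torch.where expects.
    legal = legal_mask.to(torch.bool)
    penalty = torch.full_like(q_values, ILLEGAL_ACTION_PENALTY)
    return torch.where(legal, q_values, penalty)


# One state, three actions; action 1 has the highest Q-value but is illegal.
q_values = torch.tensor([[0.5, 2.0, 1.0]])
legal_mask = torch.tensor([[1, 0, 1]], dtype=torch.uint8)

masked = mask_illegal_actions(q_values, legal_mask)
best_action = int(masked.argmax(dim=1).item())
```

With the mask applied, the greedy choice falls on a legal action rather than on the illegal one with the highest raw Q-value. Whether the C++ code needs an equivalent conversion is exactly the open question here.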
Well, the most obvious translation of the Python change doesn't seem to be what's needed, as the mask is already defined to be a bool (at least at first glance): https://github.com/tacertain/open_spiel/blob/1208f832568063c36d0e3076069e103d5b00cf5b/open_spiel/algorithms/dqn_torch/dqn.cc#L170
I will poke around some more tomorrow. I'm going to try to build unoptimized with symbols, etc., to get a better stack trace. I am not familiar with CMake (other than as a naive user), so if there are any pointers to making it build that way, let me know. Otherwise I'll probably just try to get the compilation lines out and run them by hand with the right flags.
I tried changing the build type to Debug in build_and_run_tests.sh, but I still didn't get the symbols in the stack trace. Going to fall back to trying to run the compilation by hand.
Hi @tacertain,
You need to set this environment variable, here: https://github.com/google-deepmind/open_spiel/blob/1208f832568063c36d0e3076069e103d5b00cf5b/open_spiel/CMakeLists.txt#L43
But then you have to entirely get rid of the build/ directory and redo from scratch (CMake has to run on a fresh, empty build dir).
Then, you need to run dqn_torch_test within gdb. You might not get line numbers inside the libtorch functions unless the library has been built with debug symbols.
I think you can set the build type to debug directly in CMakeLists.txt too, which might be easier.
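The steps above (wipe the build directory, reconfigure as Debug, run the test under gdb) could be scripted roughly as below. This is only a sketch using generic CMake and gdb options: `-DCMAKE_BUILD_TYPE=Debug` is the standard CMake variable, and a project's CMakeLists.txt may override it with its own setting, which is why the helper also accepts extra environment variables (pass the one from the line linked above via `extra_env`). The function names and any paths you pass in are hypothetical.

```python
import os
import shutil
import subprocess
from typing import Dict, List, Optional


def debug_build_commands(source_dir: str, build_dir: str) -> List[List[str]]:
    """Return the configure and build commands for a Debug build."""
    return [
        # Generic CMake Debug configuration; the project may override this.
        ["cmake", "-S", source_dir, "-B", build_dir, "-DCMAKE_BUILD_TYPE=Debug"],
        ["cmake", "--build", build_dir],
    ]


def gdb_command(test_binary: str) -> List[str]:
    """Return the command that starts the given test binary under gdb."""
    return ["gdb", "--args", test_binary]


def rebuild_in_debug(source_dir: str, build_dir: str,
                     extra_env: Optional[Dict[str, str]] = None) -> None:
    """Delete the build directory, then configure and build from scratch."""
    # CMake caches settings, so a stale build directory can keep the old
    # build type; removing it forces a clean configuration.
    shutil.rmtree(build_dir, ignore_errors=True)
    env = {**os.environ, **(extra_env or {})}
    for command in debug_build_commands(source_dir, build_dir):
        subprocess.run(command, check=True, env=env)
```

After `rebuild_in_debug(...)` succeeds, run the list returned by `gdb_command(...)` with the path to the built dqn_torch_test binary, wherever your build places it.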
Fixed by https://github.com/google-deepmind/open_spiel/pull/1219 which is now merged. Thanks @tacertain!
I am trying to build OpenSpiel for the first time, using a new Ubuntu 22.04 install under WSL2. I am building with CUDA and libtorch. After a bunch of tinkering, I have gotten to the point where everything builds and only a single test fails:
At some point when I was having more issues, I did get some warning about uint8 vectors being deprecated in some context and to use bools, but I didn't write it down - sorry.
Is this something I should worry about, or something you want me to troubleshoot more directly? I will say that the whole CUDA/PyTorch/WSL marriage seems a little janky, so I am totally willing to believe I screwed something up there.