dqn_torch_test build failure

tacertain commented 7 months ago

I am trying to build OpenSpiel for the first time, using a new Ubuntu 22.04 install under WSL2. I am building with CUDA and libtorch. After a bunch of tinkering, I have gotten to the point where everything builds and there is only a single test failure:

terminate called after throwing an instance of 'c10::Error'
  what():  masked_fill_ only supports boolean masks, but got mask with dtype int
Exception raised from masked_fill_impl_cpu at ../aten/src/ATen/native/TensorAdvancedIndexing.cpp:1910 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f3af2ac42ac in /home/certain/GitHub/open_spiel/open_spiel/libtorch/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7f3af2a6dcbc in /home/certain/GitHub/open_spiel/open_spiel/libtorch/libtorch/lib/libc10.so)
frame #2: <unknown function> + 0x20b2d4d (0x7f3ad8cb2d4d in /home/certain/GitHub/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #3: at::native::masked_fill__cpu(at::Tensor&, at::Tensor const&, c10::Scalar const&) + 0x49 (0x7f3ad8cb2df9 in /home/certain/GitHub/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #4: at::_ops::masked_fill__Scalar::call(at::Tensor&, at::Tensor const&, c10::Scalar const&) + 0x16f (0x7f3ad99ad1df in /home/certain/GitHub/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #5: at::native::masked_fill(at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0xd1 (0x7f3ad8cd6ae1 in /home/certain/GitHub/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x306bcbb (0x7f3ad9c6bcbb in /home/certain/GitHub/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #7: at::_ops::masked_fill_Scalar::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0x92 (0x7f3ad993e592 in /home/certain/GitHub/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x4d4fd2c (0x7f3adb94fd2c in /home/certain/GitHub/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x4d5038e (0x7f3adb95038e in /home/certain/GitHub/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #10: at::_ops::masked_fill_Scalar::call(at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0x17a (0x7f3ad99a4c9a in /home/certain/GitHub/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x3fc95f (0x56439005495f in /home/certain/GitHub/open_spiel/build/algorithms/dqn_torch/dqn_torch_test)
frame #12: <unknown function> + 0x3fa7c1 (0x5643900527c1 in /home/certain/GitHub/open_spiel/build/algorithms/dqn_torch/dqn_torch_test)
frame #13: <unknown function> + 0x814cd (0x56438fcd94cd in /home/certain/GitHub/open_spiel/build/algorithms/dqn_torch/dqn_torch_test)
frame #14: <unknown function> + 0x29d90 (0x7f3a80c77d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #15: __libc_start_main + 0x80 (0x7f3a80c77e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #16: <unknown function> + 0x80c95 (0x56438fcd8c95 in /home/certain/GitHub/open_spiel/build/algorithms/dqn_torch/dqn_torch_test)

At some point when I was having more issues, I did get some warning about uint8 vectors being deprecated in some context and to use bools, but I didn't write it down - sorry.

Is this something I should worry about, or something you want me to troubleshoot more directly? I will say that the whole CUDA/PyTorch/WSL marriage seems a little janky, so I am totally willing to believe I screwed something up there.

lanctot commented 7 months ago

Wow, that's amazing. I'm happy that you got this far. We have not been actively supporting libtorch-based code for a while, so it's nice to know it mostly still works (or at least compiles!), and under WSL with CUDA no less.

Ok, so the error reminded me of something I fixed on the Python side when we moved to PyTorch 2. We had to change the way we were doing masks to use booleans. Take a look at the changes in dqn.py here: https://github.com/google-deepmind/open_spiel/pull/1141/files

Maybe we now need to change that on the C++ side. My guess is you're using a LibTorch version that is in line with PyTorch 2, but the last time we tested anything with LibTorch it was likely 1.10 (based on the link here).

If that's the case, then maybe we're in luck and the fix might be as easy as it was on the Python side, but I'm not sure how to do it. But, if you're not planning to use C++ DQN then you're probably fine to ignore it. It'd be great to get the libtorch code working with V2 though.

tacertain commented 7 months ago

Well, the most-obvious translation of the python change doesn't seem to be what's needed, as it's already defined to be a bool (at least on first glance): https://github.com/tacertain/open_spiel/blob/1208f832568063c36d0e3076069e103d5b00cf5b/open_spiel/algorithms/dqn_torch/dqn.cc#L170

I will poke around some more tomorrow. I'm going to try to build unoptimized with symbols, etc., to get a better stack trace. I am not familiar with cmake (other than as a naive user), so if there are any pointers on making it build that way, lemme know. Otherwise I'll probably just try to get the compilation lines out and run them by hand with the right flags.

tacertain commented 7 months ago

I tried changing the build type to Debug in build_and_run_tests.sh, but I still didn't get the symbols in the stack trace. Going to fall back to trying to run the compilation by hand.

lanctot commented 7 months ago

Hi @tacertain,

You need to set this environment variable, here: https://github.com/google-deepmind/open_spiel/blob/1208f832568063c36d0e3076069e103d5b00cf5b/open_spiel/CMakeLists.txt#L43

But then you have to entirely get rid of the build/ directory and redo from scratch (CMake has to run on a fresh empty build dir).

Then, you need to run dqn_torch_test within gdb. You might not get line numbers inside the libtorch functions unless the library has been built with debug symbols.

lanctot commented 7 months ago

I think you can set the build type to debug directly in CMakeLists.txt too, which might be easier.

lanctot commented 6 months ago

Fixed by https://github.com/google-deepmind/open_spiel/pull/1219 which is now merged. Thanks @tacertain!

dqn_torch_test build failure #1216