lightvector / KataGo

GTP engine and self-play learning in Go
https://katagotraining.org/

TENSORRT compiling error with latest code #874

Open hwj-111 opened 10 months ago

hwj-111 commented 10 months ago

I installed the cuda/cudnn/tensorrt versions recommended by the 1.14 release notes. However, I got the following errors at the end of compilation (it seems to be a link error). Any suggestions? Thanks a lot.

...
[ 96%] Building CXX object CMakeFiles/katago.dir/command/selfplay.cpp.o
[ 97%] Building CXX object CMakeFiles/katago.dir/command/tune.cpp.o
[ 98%] Building CXX object CMakeFiles/katago.dir/command/writetrainingdata.cpp.o
[ 99%] Building CXX object CMakeFiles/katago.dir/main.cpp.o
[100%] Linking CXX executable katago
/usr/bin/ld: /usr/local/cuda/lib64/libcudart_static.a(cudart_static.o): undefined reference to symbol 'pthread_rwlockattr_init@@GLIBC_2.2.5'
/usr/bin/ld: /lib/x86_64-linux-gnu/libpthread.so.0: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
make[2]: *** [CMakeFiles/katago.dir/build.make:1780: katago] Error 1
make[1]: *** [CMakeFiles/Makefile2:95: CMakeFiles/katago.dir/all] Error 2
make: *** [Makefile:103: all] Error 2

hwj-111 commented 10 months ago

My cmake version:

cmake --version
cmake version 3.18.4

hwj-111 commented 10 months ago

And here is my full installation/compiling command history: [image]

hwj-111 commented 10 months ago

I installed the latest cmake (3.28.1) and got the same error

lightvector commented 10 months ago

I'm curious what compiler and compiler version you are using. It looks like it can't find pthreads, which suggests maybe something isn't happening properly with these lines of code when you run CMake: https://github.com/lightvector/KataGo/blob/master/cpp/CMakeLists.txt#L440-L451

The cmake line in the linked lines above, message(STATUS "Setting up build for GNU or Clang."), should print out the corresponding message. Do you see this message when you run cmake? If you delete all the files cmake generated (delete the cmake cache, cmake files, etc., leaving only CMakeLists.txt) and rerun cmake, what does the output look like?

If you find that the cmake lines https://github.com/lightvector/KataGo/blob/master/cpp/CMakeLists.txt#L449-L451 are running but you are still getting this error, then here is another guess, given that the link error happens specifically when linking the CUDA library: is it possible that you are using a different threading library than pthreads, while the CUDA library from NVIDIA assumes you are using pthreads? If so, does it help to run cmake with -DTHREADS_PREFER_PTHREAD_FLAG=1? (as per https://github.com/Kitware/CMake/commit/b7e5c5a23ae8d429186873dd2095e2180f62f522)
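One way to run the check described above is to save the cmake output to a file and search it for the status message. The following is only a sketch: the helper name has_gnu_clang_message and the log file name cmake.log are made up for illustration, and the cmake invocation in the comments assumes the TensorRT build is configured with -DUSE_BACKEND=TENSORRT from the cpp directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of KataGo): succeeds if a saved cmake log
# contains the status message quoted in this thread.
has_gnu_clang_message() {
    # $1: path to a file holding captured cmake output
    grep -qF "Setting up build for GNU or Clang." "$1"
}

# Assumed usage, run by hand from KataGo's cpp directory:
#   cmake . -DUSE_BACKEND=TENSORRT -DTHREADS_PREFER_PTHREAD_FLAG=1 2>&1 | tee cmake.log
#   has_gnu_clang_message cmake.log && echo "GNU/Clang branch ran"
```

If the message is missing from the log, the GNU/Clang branch of CMakeLists.txt most likely did not run, which would fit the missing pthreads link flags.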

hwj-111 commented 10 months ago

@lightvector, thanks a lot for helping to debug this.

I finally "solved" the problem this morning.

My system (MX Linux 23.1) was based on Debian 12 ("bookworm"). Unfortunately, the CUDA version required by TensorRT (12.1.1) only supports Debian 10 and 11, which I did not notice when I downloaded it. I installed everything, but the NVIDIA driver failed to compile. I tried to fix things manually and completely screwed up my apt system (I could not get back to the original state).

So I reinstalled MX Linux. This time I installed an older version (Debian 11 based) but reused my home partition, so I could keep all my KataGo-related files and continue working with them. Then I ran into the issue I reported here.

The solution was to git clone a completely new copy of the repository and compile that. The problem is gone!

But it still puzzles me why the old KataGo repo did not work (I did "git pull" and "make clean" ...)

lightvector commented 10 months ago

Ah, the answer is probably that CMake's state did not get cleaned when you ran make clean. This is because make sits on a lower-level layer than CMake: CMake creates a makefile, and that makefile can then compile the program independently of CMake, so cleaning with the makefile doesn't clean what CMake generated.

Current up-to-date versions of CMake supposedly offer a "clean" command, but I'm not always sure I trust it, so when in doubt the foolproof way is to simply delete the CMake cache file (CMakeCache.txt) and the CMakeFiles directory, and then rerun cmake to create those files again from scratch.
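As a minimal sketch of that manual cleanup, assuming an in-source build where cmake was run directly in the directory containing CMakeLists.txt: the function name clean_cmake_state is made up for illustration, and it removes only the cache file and the CMakeFiles directory, leaving everything else in place.

```shell
#!/bin/sh
# Hypothetical helper (not part of KataGo or CMake): remove CMake's cached
# configuration so the next cmake run starts from scratch.
clean_cmake_state() {
    # $1: directory where cmake was run; refuse to act on an empty argument
    [ -n "$1" ] || return 1
    rm -f "$1/CMakeCache.txt"    # cached variables, compiler and library paths
    rm -rf "$1/CMakeFiles"       # generated per-target build metadata
}

# Assumed usage, run by hand from KataGo's cpp directory:
#   clean_cmake_state .
#   cmake . -DUSE_BACKEND=TENSORRT
#   make
```

Unlike make clean, this discards the compiler, CUDA, and library paths that CMake cached, which matters when the operating system underneath an existing checkout has been reinstalled.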

hwj-111 commented 10 months ago

After successfully compiling the latest TensorRT version, my benchmark results showed about a 5-10% performance increase across various network models on my RTX 3080 card. (I found that I need to set the number of visits to at least 4x the maximum visits/s your card can provide to get a reliable benchmark result.)

old 40b --- 3380 visits/s 18b --- 3960 visits/s 60b --- 1380 visits/s
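The 4x rule of thumb above is simple arithmetic. Here is a small sketch, where the function name min_benchmark_visits is made up for illustration and the inputs are the visits/s figures quoted above, treated as the card's maximum throughput:

```shell
#!/bin/sh
# Hypothetical helper: minimum benchmark visits under the 4x rule of thumb.
min_benchmark_visits() {
    # $1: maximum visits per second the card reaches for a given network
    echo $(( 4 * $1 ))
}

min_benchmark_visits 3380   # 40b figure: prints 13520
min_benchmark_visits 3960   # 18b figure: prints 15840
min_benchmark_visits 1380   # 60b figure: prints 5520
```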