lightvector / KataGo

GTP engine and self-play learning in Go
https://katagotraining.org/

CUDA 12/cudNN 8.7 'make' fails on cudahelpers with unsupported arch compute_35 #725

Open wirelesstevmc opened 1 year ago

wirelesstevmc commented 1 year ago

Hi LV,

Maybe you can suggest a CMakeLists mod? This very same hardware compiles with earlier toolkits/libs, i.e. CUDA 11/cuDNN 8.2, but I need this version of CUDA as it is required for my current NVIDIA driver, NVIDIA-Linux-x86_64-525.60.13.

My OS is Slackware64.

Slackware 15.0+ t460s.attlocal.net 5.18.9 #1 SMP PREEMPT_DYNAMIC Sat Jul 2 20:59:36 CDT 2022 x86_64 AMD Ryzen 7 2700 Eight-Core Processor AuthenticAMD
gcc-11.3.0
...
[ 34%] Building CXX object CMakeFiles/katago.dir/neuralnet/desc.cpp.o
[ 35%] Building CXX object CMakeFiles/katago.dir/neuralnet/cudabackend.cpp.o
[ 36%] Building CUDA object CMakeFiles/katago.dir/neuralnet/cudahelpers.cu.o
nvcc fatal : Unsupported gpu architecture 'compute_35'
make[2]: [CMakeFiles/katago.dir/build.make:615: CMakeFiles/katago.dir/neuralnet/cudahelpers.cu.o] Error 1
make[1]: [CMakeFiles/Makefile2:83: CMakeFiles/katago.dir/all] Error 2
make: *** [Makefile:91: all] Error 2

nvidia-smi -L
GPU 0: NVIDIA GeForce GT 1030 (UUID: GPU-7b0de5ef-916e-fac6-64c1-57502d77d167)

nvidia-smi
Mon Jan 2 14:58:17 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:26:00.0  On |                  N/A |
| 38%   34C    P0    N/A /  30W |    401MiB /  2048MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1609      G   /usr/libexec/Xorg                 205MiB |
|    0   N/A  N/A      1847      G   /usr/bin/kwin_x11                  37MiB |
|    0   N/A  N/A      1885      G   /usr/bin/plasmashell               37MiB |
|    0   N/A  N/A      2466      G   ...AAAAAAAAA= --shared-files       22MiB |
|    0   N/A  N/A      4330      G   ...668745254276985513,131072       93MiB |
+-----------------------------------------------------------------------------+

lightvector commented 1 year ago

Thanks for the report. It's surprisingly hard to find any comprehensive documentation on NVIDIA's website about the compatibility matrix between different versions of CUDA and different GPUs, but after a while I found one on Wikipedia: https://en.wikipedia.org/wiki/CUDA#GPUs_supported

So I updated the CMakeLists for the later CUDA versions in this table in this commit, https://github.com/lightvector/KataGo/commit/4725a698cfd290ef031f9bd4c95a4249fc5a1f51 , which I pushed to the master branch. Can you check if it works for you now?
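For anyone hitting the same `Unsupported gpu architecture` error, here is a minimal sketch of how to check whether the installed toolkit still accepts a given architecture before building. It assumes `nvcc` is on your PATH and that your toolkit supports the `--list-gpu-arch` option (recent ones do); the helper name `arch_supported` is made up for illustration and is not part of KataGo.

```shell
# arch_supported ARCH
# Hypothetical helper: reads a list of architectures (one per line) on stdin
# and succeeds only if ARCH appears in that list as an exact line.
arch_supported() {
    grep -qx -- "$1"
}

# Usage sketch (assumes nvcc is installed and supports --list-gpu-arch):
#   nvcc --list-gpu-arch | arch_supported compute_35 \
#       || echo "compute_35 is not supported by this toolkit"
```

If the check fails for an architecture that the build asks for, that architecture has to be dropped from the build's arch list, which is what the commit above does for the newer CUDA versions.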

wirelesstevmc commented 1 year ago

Yay! Thanks a lot for jumping right on this and fixing it, lightvector!

....
[ 98%] Building CXX object CMakeFiles/katago.dir/command/tune.cpp.o
[ 99%] Building CXX object CMakeFiles/katago.dir/main.cpp.o
[100%] Linking CXX executable katago
[100%] Built target katago
/h//src/git/KataGo/cpp_}

wirelesstevmc commented 1 year ago

OK, we both know that getting the binary to compile and getting it to run are two separate things. It seems the first binary was still looking for some of the -11 toolkit libs. So I tried rebuilding after creating symlinks of the -12 toolkit libs to point to -11 versions. That almost worked, but not quite. Bummer... It seems some things are not linking properly with the new CMakeLists file. Below is the output from the benchmark check...

Your GTP config is currently set to use numSearchThreads = 5
Automatically trying different numbers of threads to home in on the best (board size 19x19):

2023-01-05 15:50:29-0800: GPU -1 finishing, processed 5 rows 5 batches
2023-01-05 15:50:29-0800: nnRandSeed0 = 850665103077862016
2023-01-05 15:50:29-0800: After dedups: nnModelFile0 = /home/cahill/.local/bin/KataGo/default_model.bin.gz useFP16 auto useNHWC auto
2023-01-05 15:50:29-0800: Initializing neural net buffer to be size 19 * 19 exactly
2023-01-05 15:50:29-0800: Cuda backend thread 0: Found GPU NVIDIA GeForce GT 1030 memory 2088828928 compute capability major 6 minor 1
2023-01-05 15:50:29-0800: Cuda backend thread 0: Model version 8 useFP16 = false useNHWC = false
2023-01-05 15:50:29-0800: Cuda backend thread 0: Model name: g170-b20c256x2-s5303129600-d1228401921

Possible numbers of threads to test: 1, 2, 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32,

numSearchThreads = 5: 1 / 10 positions, visits/s = 37.00 (21.7 secs)
terminate called after throwing an instance of 'StringError'
what(): CUDNN Error, for conv1 file /h/cahill/src/git/KataGo/cpp/neuralnet/cudabackend.cpp, func cudnnConvolutionForward( cudaHandles->cudnn, &alpha, inputDescriptors[batchSize], inputBuf, filterDescriptor, filterBuf, convolutionDescriptor, (*convolutionAlgorithms)[batchSize].algo, workspaceBuf, workspaceBytes, &beta, outputDescriptors[batchSize], outputBuf ), line 457, error CUDNN_STATUS_INTERNAL_ERROR
Abort
Exit 134
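When a binary builds but fails at runtime like this, one way to see which CUDA libraries it actually resolves is to filter the output of `ldd`. The following is only a sketch: the helper name `cuda_libs_from_ldd` is hypothetical, and the library name patterns (cudart, cublas, cudnn) are an assumption about what a CUDA build typically links against, not a list taken from the KataGo build files.

```shell
# cuda_libs_from_ldd
# Hypothetical helper: reads `ldd` output on stdin and keeps only the lines
# that mention the CUDA runtime, cuBLAS, or cuDNN shared libraries.
# Prints nothing (and still succeeds) if none are found.
cuda_libs_from_ldd() {
    grep -E 'lib(cudart|cublas|cudnn)[^ ]*\.so' || true
}

# Usage sketch (assumes you are in the directory containing the katago binary):
#   ldd ./katago | cuda_libs_from_ldd
```

Lines that resolve to an unexpected toolkit directory, or that show "not found", point to a mismatch between the libraries used at build time and those found at run time.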

wirelesstevmc commented 1 year ago

Hey LV, I thought I should update this thread... this was a case of complete cockpit error :( I was using the wrong version of the cuDNN libs (8.7.0.84); the version needed for CUDA toolkit 12 is 8.8.0.121. I think the problem was that the newer cuDNN libs were not available at the time. Yikes, living on the bleeding edge! KataGo compiles and runs with no issues on my older GPUs once I updated cuDNN.
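A quick way to confirm which cuDNN version is actually installed is to read the version macros from the `cudnn_version.h` header that ships with cuDNN 8. This is a minimal sketch: the function name `cudnn_version_from_header` is hypothetical, and the header location in the usage line is an assumption that varies by distro and install method.

```shell
# cudnn_version_from_header HEADER
# Hypothetical helper: prints MAJOR.MINOR.PATCHLEVEL as defined by the
# CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL macros in the given header.
cudnn_version_from_header() {
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END { print major "." minor "." patch }
    ' "$1"
}

# Usage sketch (the path is an assumption; adjust to where cuDNN is installed):
#   cudnn_version_from_header /usr/include/cudnn_version.h
```

Comparing that output with the cuDNN version expected for the installed CUDA toolkit catches the kind of mismatch described above before a full rebuild.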

tkuebler commented 1 year ago

I had two problems when I got this. The first was the cuDNN lib and CUDA version mismatch. Once that was resolved, the second was that I had a lot of CUDA versions sitting around (been using it for a while) and cmake was apparently choosing the wrong version of nvcc. Setting -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.1/bin/nvcc fixed it.

cmake . -DUSE_BACKEND=CUDA -DCMAKE_CXX_FLAGS='-march=native' -DBUILD_DISTRIBUTED=1 -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.1/bin/nvcc
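If you are not sure which toolkits are installed, a small sketch like the one below lists the candidate `nvcc` binaries so you can pick the one to pass to `-DCMAKE_CUDA_COMPILER`. The function name `list_nvcc_candidates` is hypothetical, and the default `/usr/local` root with `cuda*` subdirectories is an assumption based on the path used in the command above; other installs may live elsewhere.

```shell
# list_nvcc_candidates [ROOT]
# Hypothetical helper: prints every executable nvcc found under
# ROOT/cuda*/bin, one per line. ROOT defaults to /usr/local (an assumption).
list_nvcc_candidates() {
    root="${1:-/usr/local}"
    for candidate in "$root"/cuda*/bin/nvcc; do
        # An unmatched glob stays literal and fails the -x test, so it is skipped.
        [ -x "$candidate" ] && printf '%s\n' "$candidate"
    done
    return 0
}

# Usage sketch: show each candidate, then run it with --version to compare.
#   list_nvcc_candidates | while read -r nvcc_path; do
#       echo "$nvcc_path"; "$nvcc_path" --version | tail -n 2
#   done
```

Passing the chosen path explicitly avoids relying on whichever `nvcc` cmake happens to find first.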

tkuebler commented 1 year ago

Of course, the first problem also required me to move to head to pick up a fix that just got released, so I can't contribute with my binary, which was the whole reason to compile it. :D Hopefully the next formal release won't break anything from master that is working today.

hadim commented 1 year ago

We are trying to rebuild katago for CUDA 12 on conda-forge at https://github.com/conda-forge/katago-feedstock/pull/5#issuecomment-1583105138.

Any recommendations on how we could get the build to pass?