lightvector / KataGo

GTP engine and self-play learning in Go
https://katagotraining.org/

func cudaSetDevice(gpuIdxForThisThread), error unknown error #122

Closed lukaszlew closed 4 years ago

lukaszlew commented 4 years ago

I think this might be an issue with CUDA drivers. Do you know how to debug it?

~/devel/KataGo/cpp$ /home/lew/devel/KataGo/cpp/katago gtp -model /home/lew/devel/KataGo/cpp/models/b20c256-s447913472-d241840887/model.txt.gz -config /home/lew/devel/KataGo/cpp/configs/lew.cfg
KataGo v1.3.2
Loaded model /home/lew/devel/KataGo/cpp/models/b20c256-s447913472-d241840887/model.txt.gz
GTP ready, beginning main protocol loop
terminate called after throwing an instance of 'StringError'
  what():  CUDA Error, for createComputeHandle file /home/lew/devel/KataGo/cpp/neuralnet/cudabackend.cpp, func cudaSetDevice(gpuIdxForThisThread), line 2706, error unknown error
Aborted

lukaszlew commented 4 years ago

The OpenCL build has a similar error:

KataGo v1.3.2
terminate called after throwing an instance of 'StringError'
  what():  OpenCL error at /home/lew/devel/KataGo/cpp/neuralnet/openclhelpers.cpp, func err, line 188, error CL_PLATFORM_NOT_FOUND_KHR
Aborted

Seems related, but I'm not sure.

lightvector commented 4 years ago

Do you have a GPU, and are its drivers up to date? Your errors suggest that both versions of KataGo, CUDA and OpenCL, are failing to find a GPU on your system.

If you read your error messages you can guess this - failing at "cudaSetDevice(gpuIdxForThisThread)" suggests CUDA is failing when it is trying to set the index of what GPU to use, and "CL_PLATFORM_NOT_FOUND_KHR" sounds like OpenCL cannot find a platform on your computer that has OpenCL or supports accelerated computation.
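The same check can be made outside KataGo by asking the driver libraries directly. The following is a minimal Python sketch, not part of KataGo. It assumes a Linux system where the libraries are named `libcuda.so.1` and `libOpenCL.so.1`; the helper name `probe_gpu_runtimes` is made up for illustration. It calls `cuInit` and `cuDeviceGetCount` from the CUDA driver API and `clGetPlatformIDs` from OpenCL, and reports what they return.

```python
import ctypes

# Error code defined by the cl_khr_icd extension.
CL_PLATFORM_NOT_FOUND_KHR = -1001


def _load(name):
    """Return a ctypes handle for a shared library, or None if it cannot be loaded."""
    try:
        return ctypes.CDLL(name)
    except OSError:
        return None


def probe_gpu_runtimes(cuda_lib="libcuda.so.1", opencl_lib="libOpenCL.so.1"):
    """Hypothetical helper: report whether CUDA and OpenCL can see anything."""
    report = {"cuda": "library not found", "opencl": "library not found"}

    cuda = _load(cuda_lib)
    if cuda is not None:
        rc = cuda.cuInit(ctypes.c_uint(0))  # 0 means CUDA_SUCCESS
        if rc != 0:
            report["cuda"] = f"cuInit failed with code {rc}"
        else:
            count = ctypes.c_int(0)
            rc = cuda.cuDeviceGetCount(ctypes.byref(count))
            if rc == 0:
                report["cuda"] = f"{count.value} device(s)"
            else:
                report["cuda"] = f"cuDeviceGetCount failed with code {rc}"

    opencl = _load(opencl_lib)
    if opencl is not None:
        num = ctypes.c_uint(0)
        # Passing 0 entries and a NULL array only asks for the platform count.
        rc = opencl.clGetPlatformIDs(ctypes.c_uint(0), None, ctypes.byref(num))
        if rc == CL_PLATFORM_NOT_FOUND_KHR:
            report["opencl"] = "CL_PLATFORM_NOT_FOUND_KHR"
        elif rc != 0:
            report["opencl"] = f"clGetPlatformIDs failed with code {rc}"
        else:
            report["opencl"] = f"{num.value} platform(s)"

    return report


if __name__ == "__main__":
    for backend, status in probe_gpu_runtimes().items():
        print(f"{backend}: {status}")
```

If both entries report a missing library or an error code here as well, the problem most likely sits in the driver installation rather than in KataGo.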

lightvector commented 4 years ago

Any updates on this? Is it resolved, or have you given up at this point? As mentioned above, unless you have other evidence, it seems the problem might just be that the GPU drivers are old or incorrect so that the GPU cannot be detected.

lightvector commented 4 years ago

Going ahead and closing. If you have more info and are sure that you do have a working GPU and/or that you have CUDA or OpenCL working but it still doesn't run, feel free to reply back or open a new issue.

npbool commented 4 years ago

Same error. I have a working NVIDIA GPU with the proprietary driver:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 960     Off  | 00000000:01:00.0  On |                  N/A |
|  0%   42C    P0    29W / 120W |   1553MiB /  4040MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       857      G   /usr/lib/Xorg                     532MiB |
|    0   N/A  N/A      1611      G   /usr/bin/kwin_x11                  23MiB |
|    0   N/A  N/A      1794      G   /usr/bin/plasmashell              155MiB |
|    0   N/A  N/A     17214      G   ...gl=desktop --shared-files      727MiB |
|    0   N/A  N/A     21551      G   /usr/bin/krunner                   10MiB |
|    0   N/A  N/A     37360      G   /usr/bin/python3                   91MiB |
+-----------------------------------------------------------------------------+

npbool commented 4 years ago

After some testing, I found it's an NVIDIA driver issue: driver 450.xx with CUDA 11 is incompatible. Downgrading to 440.xx with CUDA 10.2 solves the problem, and it fixes the OpenCL issue too. @lukaszlew
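To compare an installed driver against the versions mentioned here, the numbers can be pulled out of the `nvidia-smi` header. This is a small Python sketch; the function name `parse_nvidia_smi_header` is hypothetical, and the regex assumes the header layout shown in this thread (older `nvidia-smi` releases may not print a CUDA version, in which case it returns `None`). It only extracts the versions; which combinations actually work is the observation reported above, not something the code verifies.

```python
import re


def parse_nvidia_smi_header(text):
    """Return (driver_version, cuda_version) as strings, or None if not found."""
    m = re.search(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", text)
    return (m.group(1), m.group(2)) if m else None


if __name__ == "__main__":
    # Header line copied from the nvidia-smi output earlier in this thread.
    header = "| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |"
    driver, cuda = parse_nvidia_smi_header(header)
    driver_branch = int(driver.split(".")[0])
    print(driver, cuda, driver_branch)  # prints: 450.80.02 11.0 450
```

On a live system the input text could come from running `nvidia-smi` and capturing its standard output.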

peepo commented 3 years ago

Running sudo apt-get install mesa-opencl-icd removed the CL_PLATFORM_NOT_FOUND_KHR error. I had this issue on Ubuntu 20.04 with a Mesa DRI Intel HD Graphics 4000 card.

However, it is still failing with: "no OpenCL devices were found.. bugy drivers". The system is fully up to date, and from my search, GPU driver updates have been automatic since 17.04.
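For anyone checking a similar setup: the OpenCL loader on Linux usually discovers vendor drivers through small `.icd` files, and packages such as mesa-opencl-icd work by registering one. The Python sketch below lists those registrations. It assumes the common default directory `/etc/OpenCL/vendors`, which may differ on some systems, and the function name `list_opencl_icds` is made up for illustration.

```python
from pathlib import Path


def list_opencl_icds(vendors_dir="/etc/OpenCL/vendors"):
    """Return {icd file name: library named inside it} for each registered ICD."""
    root = Path(vendors_dir)
    if not root.is_dir():
        return {}
    return {p.name: p.read_text().strip() for p in sorted(root.glob("*.icd"))}


if __name__ == "__main__":
    icds = list_opencl_icds()
    if not icds:
        print("No ICD files registered; the loader has no platform to offer.")
    for name, library in icds.items():
        print(f"{name} -> {library}")
```

An empty result is consistent with CL_PLATFORM_NOT_FOUND_KHR. A registered platform that still exposes no devices points at the driver behind that ICD rather than at the loader.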