lightvector / KataGo

GTP engine and self-play learning in Go
https://katagotraining.org/

Issues with T550 Laptop OpenCL #785

Closed: massimopavoni closed this issue 1 year ago

massimopavoni commented 1 year ago

I couldn't get the CUDA build to work, so I wanted to try OpenCL before moving on to TensorRT. However, right after tuning finishes (whether I run genconfig or use the default config), I always get an error that suggests a memory shortage. Is my GPU too weak for the engine? Would I be better off trying the Eigen version on the i7-1260P I have?

------------------------------------------------------
2023-05-09 20:58:01+0200: Done tuning, saved results to /home/USER/.katago/opencltuning/tune11_gpuNVIDIAT550LaptopGPU_x19_y19_c384_mv11.txt
2023-05-09 20:58:01+0200: OpenCL backend thread 0: Device 0 Model version 11
2023-05-09 20:58:01+0200: OpenCL backend thread 0: Device 0 Model name: kata1-b18c384nbt-s5832081920-d3223508649
2023-05-09 20:58:02+0200: OpenCL backend thread 0: Device 0 FP16Storage true FP16Compute false FP16TensorCores false FP16TensorCoresFor1x1 false
terminate called after throwing an instance of 'StringError'
  what():  OpenCL error at /home/dwugcloud/data/kata/cpp/neuralnet/openclbackend.cpp, func err, line 1329, error CL_OUT_OF_RESOURCES (possibly ran out of GPU memory?)
Aborted (core dumped)

Also, sorry for forgetting to mention it at first: the model is kata1-b18c384nbt-s5832081920-d3223508649.

massimopavoni commented 1 year ago

A small addition: I think I couldn't use the CUDA build because my installed CUDA version is 12.1, while the prebuilt binary looks for the CUDA 11 cuBLAS library (the error is ./katago-cuda/katago: error while loading shared libraries: libcublas.so.11: cannot open shared object file: No such file or directory). I tried compiling it myself but ran into similar problems. I figure TensorRT won't work either, but I will probably try again.
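For anyone who wants to confirm this kind of mismatch before launching the engine, here is a minimal Python sketch that asks the dynamic loader which cuBLAS sonames it can actually open. The helper name `loadable_libraries` is made up for illustration. The `libcublas.so.11` name comes from the error above, and `libcublas.so.12` is the name a CUDA 12.x install would normally provide.

```python
import ctypes


def loadable_libraries(names):
    """Return the subset of shared-library names the dynamic loader can open."""
    found = []
    for name in names:
        try:
            # CDLL raises OSError when the loader cannot locate or open the library.
            ctypes.CDLL(name)
        except OSError:
            continue
        found.append(name)
    return found


if __name__ == "__main__":
    # libcublas.so.11 is what the prebuilt binary asked for in the error above;
    # libcublas.so.12 is what a CUDA 12.x toolkit is expected to ship.
    candidates = ["libcublas.so.11", "libcublas.so.12"]
    print("Loadable cuBLAS libraries:", loadable_libraries(candidates))
```

If only `libcublas.so.12` shows up, a binary linked against CUDA 11 will fail the way it did here. The options are then to install the matching CUDA 11 libraries or to rebuild against the installed toolkit.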

massimopavoni commented 1 year ago

I found out that the OpenCL problem was my fault: the network was too large. For CUDA, I fixed everything by compiling against CUDA 12.1 on the machine and by using the smaller model (a 15-block extended-training network, which is the default for KaTrain). I hope this helps anyone who runs into similar problems.
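As a rough aid for comparing network sizes, the sketch below pulls the size fields out of a model name. It assumes that the `bNN` and `cNNN` parts of the name stand for the number of blocks and channels. That reading is my assumption about the naming convention, although the `c384` in the tuning file name from the log above fits it. The function `parse_network_size` and the 15-block name in the usage example are hypothetical.

```python
import re

# Assumed convention: "b<blocks>c<channels>" somewhere in the model name.
MODEL_SIZE_PATTERN = re.compile(r"b(?P<blocks>\d+)c(?P<channels>\d+)")


def parse_network_size(model_name):
    """Return {'blocks': int, 'channels': int}, or None if no size field is found."""
    match = MODEL_SIZE_PATTERN.search(model_name)
    if match is None:
        return None
    return {"blocks": int(match["blocks"]), "channels": int(match["channels"])}


if __name__ == "__main__":
    # The model from this issue, followed by a hypothetical smaller network name.
    for name in ["kata1-b18c384nbt-s5832081920-d3223508649", "example-b15c192"]:
        print(name, "->", parse_network_size(name))
```

Under that assumption, the model that failed here has 18 blocks and 384 channels, so it is noticeably wider than a typical 15-block network. That matches the out-of-resources error going away with the smaller model.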