lightvector / KataGo

GTP engine and self-play learning in Go
https://katagotraining.org/

About the config.cfg setup #666

Open wuchaopmp opened 2 years ago

wuchaopmp commented 2 years ago

Hello! I have some questions about the config.cfg setup. My system: B450 motherboard, AMD 7-2700 CPU, 2080 Ti + 2070 graphics cards, 32 GB memory. With katago-v1.11.0-cuda11.2-windows-x64, "katago genconfig -model b40.bin.gz -output cuda2gpu.cfg" runs normally.

With katago-v1.11.0-trt8.2-cuda11.2-windows-x64, "katago genconfig -model b40.bin.gz -output trt2gpu.cfg" does not run normally. The output from the failing run is in the attached file. I hope you can help, thank you!

C:\GO\katago-v1.11.0-trt8.2-cuda11.2-b40>katago genconfig -model b40.bin.gz -output trt2gpu.cfg

about the config.cfg setup.txt

lightvector commented 2 years ago

Hi. It sounds like there might be an issue with your TensorRT installation. Are you sure that your CUDA and TensorRT versions are compatible with each other, that TensorRT is installed correctly, and that your drivers are up to date? The file you posted indicates that your TensorRT was returning bad numeric values (infinite or nonsensical values) when KataGo tried to use it to run your GPU.
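To make "bad numeric values" concrete, here is a minimal Python sketch of the kind of check involved: scanning a list of numbers for NaN or infinite entries. This is only an illustration, not KataGo's actual validation code; the function name `find_bad_values` and the sample numbers are hypothetical and are not taken from the attached log.

```python
import math

def find_bad_values(values):
    """Return (index, value) pairs for entries that are NaN or infinite."""
    return [(i, v) for i, v in enumerate(values) if not math.isfinite(v)]

# Hypothetical outputs for illustration only, not from the attached log.
sample = [0.12, float("inf"), -0.4, float("nan")]
bad = find_bad_values(sample)
print(f"{len(bad)} non-finite value(s) at indices {[i for i, _ in bad]}")
```

A healthy backend should produce no non-finite entries on ordinary inputs, so any hit from a check like this points at the installation or driver rather than at the config file.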

wuchaopmp commented 2 years ago

nvidia-smi.txt

C:\GO\katago-v1.11.0-cuda11.2-b40>nvidia-smi
| NVIDIA-SMI 516.59 Driver Version: 516.59 CUDA Version: 11.7 ...

C:\GO\katago-v1.11.0-trt8.2-cuda11.2-b40>nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Nov_30_19:15:10_Pacific_Standard_Time_2020
Cuda compilation tools, release 11.2, V11.2.67
Build cuda_11.2.r11.2/compiler.29373293_0
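The two outputs above report different things: the "CUDA Version" shown by nvidia-smi is the highest CUDA version the installed driver supports, while nvcc reports the installed CUDA toolkit. A rough Python sketch for comparing them is below. The helper `parse_version` and its regular expressions are illustrative assumptions, and the check says nothing about whether TensorRT itself is installed correctly.

```python
import re

def parse_version(text, pattern):
    """Return the first 'X.Y' match of pattern in text as a tuple of ints, or None."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return tuple(int(part) for part in m.group(1).split("."))

# Lines copied from the outputs posted in this thread.
SMI_LINE = "|NVIDIA-SMI 516.59 Driver Version: 516.59 CUDA Version: 11.7"
NVCC_LINE = "Cuda compilation tools, release 11.2, V11.2.67"

driver_cuda = parse_version(SMI_LINE, r"CUDA Version:\s*(\d+\.\d+)")
toolkit_cuda = parse_version(NVCC_LINE, r"release\s+(\d+\.\d+)")

# The driver should support at least the toolkit's CUDA version.
driver_ok = (
    driver_cuda is not None
    and toolkit_cuda is not None
    and driver_cuda >= toolkit_cuda
)
print(driver_cuda, toolkit_cuda, driver_ok)
```

For the values posted here the driver (11.7) is newer than the toolkit (11.2), so this particular pairing does not look like the problem.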

genconfig-cuda.txt genconfig-opencl.txt genconfig-trt.txt AI running results.txt trt-2gpu.txt opencl_2gpu.txt AI running results.txt

wuchaopmp commented 2 years ago

Thank you for your reply. I sorted and uploaded some materials yesterday, hoping to help solve the problem. I hope to get your guidance again.

lightvector commented 2 years ago

That's a lot of data, and I'm not an expert on GPUs. Try going to https://discord.gg/bqkZAz3 and posting in the help channel and see if anyone else can give you useful suggestions.

lightvector commented 2 years ago

I went back and I took another look through your logs and configs, but I still don't think I know how to help, sorry.

It still just looks to me like you must have some driver or installation issue relating to TensorRT, or there is something wrong with your versions of TensorRT relative to your CUDA version or some other incompatibility. Basically, something is wrong with your TensorRT installation (it could be something that has nothing to do with KataGo), so that when KataGo tries to use your TensorRT, it gets bad values back.

The only thing I can recommend is to try updating your drivers, uninstalling and reinstalling TensorRT, or finding and asking for help from other users that might have more experience than me with TensorRT.
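When checking versions, the release name itself is a useful reference point. The sketch below pulls the `trt` and `cuda` tags out of a release name such as the ones used in this thread. It assumes those tags name the TensorRT and CUDA versions the build expects, which is inferred from the naming and not confirmed here; the function `release_versions` is a hypothetical helper.

```python
import re

def release_versions(name):
    """Extract the version after '-trt' and '-cuda' in a release name, if present."""
    tags = {}
    for key in ("trt", "cuda"):
        m = re.search(rf"-{key}(\d+(?:\.\d+)*)", name)
        tags[key] = m.group(1) if m else None
    return tags

# Release names as they appear in this thread.
print(release_versions("katago-v1.11.0-trt8.2-cuda11.2-windows-x64"))
print(release_versions("katago-v1.11.0-cuda11.2-windows-x64"))
```

The extracted versions can then be compared by hand against the installed TensorRT and CUDA toolkit versions before reinstalling anything.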

Otherwise, unless you really need TensorRT for some reason, you could just use the versions that do work for you, OpenCL or CUDA.

wuchaopmp commented 2 years ago

Thank you for your help.

The CUDA version works normally with both graphics cards.

Thank you for developing such excellent software for us.
