glinscott / leela-chess

**MOVED TO https://github.com/LeelaChessZero/leela-chess ** A chess adaptation of GCP's Leela Zero
http://lczero.org
GNU General Public License v3.0

pre-compiled lczero.exe with cuDNN support? #547

Open pw31 opened 6 years ago

pw31 commented 6 years ago

Looking at https://docs.google.com/spreadsheets/d/1lGFf6PLGmBUSMan-YP7Vul4DpRNfn6K8oeCjBILe6uA/edit#gid=857482380, it seems that cuDNN instead of the default CUDA can boost lczero performance. I tried to install cuDNN under Windows, see https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html#installwindows, but - as far as I understand - this is rather a development kit to be used when compiling it yourself under Windows (which I would not know how to do). Would it be possible to provide a precompiled Windows .exe with cuDNN support? Or is that too difficult / technically impossible? Thanks.

DoubleDoughnut commented 6 years ago

Here you can download the .exe for the lc0-cudnn version: https://crem.xyz/lc0/

However, you will still need to install CUDA v9.0 to get cublas64_90.dll and cudart64_90.dll, and of course you will need to get cudnn64_7.dll from the cuDNN library.

Alternatively, you can get the latest version of CUDA (which is v9.1 with the 3 patches installed), take cublas64_91.dll and cudart64_91.dll from there, and rename them to cublas64_90.dll and cudart64_90.dll respectively.

TL;DR

  1. Get the .exe for the cuDNN version from: https://crem.xyz/lc0/

  2. Install CUDA v9.0 from https://developer.nvidia.com/cuda-90-download-archive (you need an account), take cublas64_90.dll and cudart64_90.dll from the bin directory of CUDA v9.0, and copy them to the same directory as the .exe. (Alternatively, if you don't want to install CUDA, you can just unzip the installer, or look in the folder where it unpacks itself, to get the .dlls.)

  3. Get cuDNN from https://developer.nvidia.com/rdp/cudnn-download (you also need an account for this), take cudnn64_7.dll from the bin directory of the zip, and extract it to the same directory where the .exe is.

  4. Download the latest network from http://lczero.org/networks (the engine will autodetect it whether it is named weights.txt or something else) and put it in the same folder as the .exe.

  5. Now you can run the .exe. If you want to use it in a chess GUI and your preferred GUI requires the --uci argument, use uci as the argument instead (no quotes, no dash or double dash) - see the example below.
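For example, in Arena the engine configuration would then reduce to something like this (assuming the network file is saved next to the .exe as weights.txt; adjust the name to whatever you downloaded):

command line = uci -w weights.txt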

Hope this cleared up all your questions.

Keep in mind that you still can't submit training games, but from my testing cuDNN provides a 4x-5x performance boost compared to the standard GPU client, so it should be stronger (~5500 nps on cuDNN vs ~1100 nps on the official GPU client).

pw31 commented 6 years ago

Many thanks for your detailed message! Great service to provide these executables! I found cudnn64_7.dll in my cuDNN installation as you say. But unfortunately, I could not locate cublas64_90.dll or cudart64_90.dll. I have the latest NVIDIA driver installed, which I think came with CUDA 9.1. When I run the standard lczero, it says ...

lczero.exe -w network_ID227
Using 2 thread(s).
Detecting residual layers...v2...192 channels...15 blocks.
Initializing OpenCL.
Detected 1 OpenCL platforms.
Platform version: OpenCL 1.2 CUDA 9.1.84
Platform profile: FULL_PROFILE
Platform name: NVIDIA CUDA
Platform vendor: NVIDIA Corporation
Device ID: 0
Device name: GeForce GT 525M
Device type: GPU
Device vendor: NVIDIA Corporation
Device driver: 391.35
...

so I guess cublas64_91.dll and cudart64_91.dll should be on my system, but I can't find them in C:\Windows\System32. I tried a system-wide search for cublas64_91.dll, but without result. Any idea where these DLLs could be? Do I still need to install CUDA v9.1, as you say, which asks me to install the NVIDIA R390 driver first?

Thank you!

pw31 commented 6 years ago

Addition: I downloaded CUDA v9.1 as you suggested and double-clicked it. However, instead of carrying out the installation, I just grabbed cublas64_91.dll and cudart64_91.dll from the temporary folder, copied them next to your executable, and renamed them to cublas64_90.dll and cudart64_90.dll as you suggested. I am hesitant to properly install CUDA 9.1 because the webpage says one should first switch to the NVIDIA R390 driver, which I'm afraid is not available for my GPU. I am now running into the following error ...

lc0-win-cuda90-cudnn712.exe -w network_ID227 -t 2 uci
id name The Lc0 chess engine.
id author The LCZero Authors.
option name Network weights file path type string default
option name Number of worker threads type spin default 2 min 1 max 128
option name NNCache size type spin default 200000 min 0 max 999999999
option name NN backend to use type combo default cudnn var cudnn var multiplexing var random
option name NN backend parameters type string default
option name Scale thinking time type string default 2.000000
option name Minibatch size for NN inference type spin default 128 min 1 max 1024
option name Max prefetch nodes, per NN call type spin default 32 min 0 max 1024
option name Cpuct MCTS option type string default 1.700000
option name Initial temperature type string default 0.000000
option name Per move temperature decay type string default 0.000000
option name Add Dirichlet noise at root node type check default false
option name Display verbose move stats type check default false
option name Enable smart pruning type check default true
option name Virtual loss bug type string default 3.000000
option name Do debug logging into file type string default
uciok
isready
readyok
go
Creating backend [cudnn]...
error CUDNN error: CUDNN_STATUS_ARCH_MISMATCH (C:/my/dev/leela-chess/lc0/src/neural/network_cudnn.cu:549)

DoubleDoughnut commented 6 years ago

EDIT: I totally forgot to check the most obvious thing - your GPU is a GeForce GT 525M, which is based on the Fermi architecture.

cuDNN lists the following architectures as supported: "cuDNN is supported on Windows, Linux and MacOS systems with Volta, Pascal, Kepler, Maxwell, Tegra K1, Tegra X1 and Tegra X2 GPUs."

So I guess your GPU is not supported unfortunately.
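If you want to check what a card reports before going through all the steps, here is a minimal sketch using the CUDA runtime API (an illustration, not part of lc0; compile with nvcc, assuming the CUDA toolkit is installed). cuDNN 7 needs compute capability 3.0 (Kepler) or newer, while Fermi cards like the GT 525M report 2.x:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int count = 0;
  cudaGetDeviceCount(&count);
  for (int i = 0; i < count; ++i) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    // Fermi = 2.x, Kepler = 3.x, Maxwell = 5.x, Pascal = 6.x, Volta = 7.x
    std::printf("Device %d: %s (compute capability %d.%d) -> %s\n", i,
                prop.name, prop.major, prop.minor,
                prop.major >= 3 ? "should work with cuDNN"
                                : "not supported by cuDNN");
  }
  return 0;
}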

pw31 commented 6 years ago

Ah, that's too bad then (and good that I didn't try to install that CUDA driver then). Thank you very much anyway, I hope your post will help many other lczero users to boost their performance.

hsntgm commented 6 years ago

@DoubleDoughnut thank you. Very useful info.

I tried a test match, Lc0 vs Lczero, on Windows 10 with the Arena GUI:

Lc0 does not use the GPU while Lczero does. Lczero's GPU utilization is around 40%.

Am I missing any files?

cublas64_90.dll cudart64_90.dll cudnn64_7.dll

libopenblas.dll libgcc_s_seh-1.dll libgfortran-3.dll libquadmath-0.dll libwinpthread-1.dll OpenCL.dll leelaz_opencl_tuning

Arena gui 3+0 Arena Book

Lc0-win-20180506-cuda90-cudnn712
command line = uci -w weights.txt --backend=cudnn --no-smart-pruning

Lczero-win-gpu v0.8 command line = uci -w weights.txt

Hardware GTX 650 Ti Boost

Engine      Score   Lc       Lc       S-B
1: Lczero   5.5/7   ·······  ==1=111  8.25
2: Lc0      1.5/7   ==0=000  ·······  8.25

DoubleDoughnut commented 6 years ago

You aren't missing any files from what I can see (the cuDNN version only needs the 3 .dll files you listed, plus a network file, to work).

Did you try running just the .exe file alone to see if it works properly? Try running a benchmark of the cuDNN version with the arguments "--no-smart-pruning -t 2 --minibatch-size=256" and then just run "go nodes 130000" (see the example below).

Also, I think there is no need to specify the weights file (both the cuDNN and standard versions will detect weights.txt automatically), and there's no need to specify the backend either, as cuDNN is the default for that version anyway.
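In other words, the whole benchmark run would look something like this (exe name taken from the post above; the first line is the command line, the second is typed at the engine's prompt):

lc0-win-cuda90-cudnn712.exe --no-smart-pruning -t 2 --minibatch-size=256
go nodes 130000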

hsntgm commented 6 years ago

@DoubleDoughnut thank you for your reply.

OK, GPU-Z reads the correct values; the Windows 10 Task Manager shows the wrong GPU utilization. I can also confirm that renaming the CUDA 9.1 libraries to the CUDA 9.0 names works.

But there is something interesting about Lc0's nps values and RAM usage...

At the beginning of the match Lc0 gives higher nps than Lczero; in the middlegame Lc0's nps drops, and finally in the endgame Lc0's nps dies :)

On the other hand, RAM usage keeps increasing until the endgame. By the end of the game she took 4GB of RAM, and once the memory was full she stopped working.

Could that be why she is not performing well against Lczero?

Engine      Score   Lc       Lc       S-B
1: Lczero   5.5/7   ·······  ==1=111  8.25
2: Lc0      1.5/7   ==0=000  ·······  8.25

DoubleDoughnut commented 6 years ago

I just tested the 3+0 and 20+0 time formats and I did find that the memory usage of the cuDNN version was way higher than that of the standard v0.8 LcZero.

The cuDNN version had about 1.2GB maximum usage in the 3+0 matches (winning both sides of the board pretty easily) and 2GB maximum usage in 20+0 matches, compared to around 300MB for v0.8 LcZero (in one very long 65-move game in the 20+0 format, v0.8 LcZero suddenly spiked from 300MB to 700MB for an unknown reason).

I am not sure if it's a memory leak or just the way cuDNN functions by default, but it does seem to require way more memory (sometimes even 10x more) than v0.8 LcZero.

I have 16GB of RAM on my machine, so those higher usages weren't really a problem; but on your machine I guess it is a problem, judging by the results you were getting in your tests.

mooskagh commented 6 years ago

For information, see #331 for the MCTS memory optimization, but the cuDNN backend may allocate some large blocks of memory too (those won't grow with the number of nodes though).

Also, there is currently a memory leak, which is being fixed in #555.

DoubleDoughnut commented 6 years ago

@mooskagh Thanks for the explanation!

hsntgm commented 6 years ago

@DoubleDoughnut "winning both sides of the board pretty easily" - this is strange, because my test results say the opposite.

I will try it on Linux and share the results.

I have 8GB of RAM; Lc0 takes half of it and crashes in a 3-minute game.

DoubleDoughnut commented 6 years ago

@hsntgm Try the latest cuDNN .exe - it fixed the memory leak (not sure about its playing strength).

hsntgm commented 6 years ago

@DoubleDoughnut thank you.

I completed the Lc0 Linux test; same situation as on Windows. There is definitely something going wrong with Lc0's playing strength against Lczero.

In my test Lczero is better than Lc0, and her estimated Elo is 2600 on the GTX 650 Ti Boost.

That is 130 Elo weaker than Spike 1.2 Turin on Linux.

I used the Silver opening suite, an optimised version of Nunn's opening suite, with 3-minute blitz in the Arena GUI, without Syzygy tablebase support.

mooskagh commented 6 years ago

@hsntgm What are the exact conditions under which you ran those matches? What time control? Also, I saw you mentioning --no-smart-pruning. That's useful for benchmarking, but for actual games it should be off.

hsntgm commented 6 years ago

@mooskagh I left all command line options at their defaults; I only use the -w flag in Arena. The time control was 3/0.

go nodes 10000 ==> GTX 650 Ti Boost

Compare the nps scaling. At short time controls, OpenCL scales better than cuDNN on my hardware. Maybe it's a GPU-dependent problem (compute capability 3.0 | cuDNN 7.1.3 | CUDA 9.0 | NVIDIA 384).

Lczero OpenCL-Linux Ubuntu 16.04

info depth 6 nodes 2 nps 17 tbhits 0 score cp 15 time 58 pv e2e4 c7c5
info depth 6 nodes 3 nps 30 tbhits 0 score cp 16 time 66 pv d2d4 d7d5
info depth 8 nodes 5 nps 49 tbhits 0 score cp 12 time 81 pv d2d4 d7d5 c2c4
info depth 9 nodes 13 nps 90 tbhits 0 score cp 16 time 132 pv e2e4 c7c5 c2c3 g8f6
info depth 10 nodes 17 nps 100 tbhits 0 score cp 14 time 159 pv e2e4 c7c5 c2c3 g8f6 e4e5
info depth 11 nodes 33 nps 125 tbhits 0 score cp 15 time 255 pv e2e4 c7c5 g1f3 d7d6 d2d4 c5d4
info depth 11 nodes 43 nps 134 tbhits 0 score cp 13 time 313 pv e2e4 c7c5 g1f3 d7d6 d2d4 c5d4
info depth 12 nodes 57 nps 142 tbhits 0 score cp 13 time 392 pv e2e4 c7c5 g1f3 d7d6 d2d4 c5d4 f3d4
............
info depth 20 nodes 6718 nps 234 tbhits 0 score cp 12 time 28738 pv e2e4 c7c5 g1f3 d7d6 f1b5 b8d7 e1g1 g8f6 f1e1 e7e6 c2c3 f8e7 d2d4 e8g8 e4e5 f6d5 c3c4 d5c7 e5d6 e7d6
bestmove e2e4

Result ==> depth 20 nodes 6718 time 28738 nps 234

Lc0 cudnn - Linux

info seldepth 2 time 171 nodes 2 score cp 19 hashfull 0 nps 11 pv e2e4 c7c5
info seldepth 3 time 240 nodes 3 score cp 15 hashfull 0 nps 12 pv e2e4 c7c5 g1f3
info seldepth 4 time 309 nodes 5 score cp 15 hashfull 0 nps 16 pv e2e4 c7c5 g1f3 d7d6
info seldepth 5 time 377 nodes 8 score cp 11 hashfull 0 nps 21 pv e2e4 c7c5 g1f3 d7d6 d2d4
info seldepth 6 time 510 nodes 12 score cp 14 hashfull 0 nps 23 pv e2e4 c7c5 g1f3 d7d6 d2d4 b8d7
info seldepth 7 time 578 nodes 14 score cp 12 hashfull 0 nps 24 pv e2e4 c7c5 g1f3 d7d6 d2d4 b8d7
info seldepth 8 time 781 nodes 31 score cp 13 hashfull 0 nps 39 pv e2e4 c7c5 g1f3 d7d6 d2d4 c5d4 f3d4
info seldepth 9 time 1007 nodes 49 score cp 13 hashfull 0 nps 48 pv e2e4 c7c5 g1f3 d7d6 d2d4 c5d4 f3d4
info seldepth 10 time 1388 nodes 102 score cp 15 hashfull 0 nps 73 pv e2e4 c7c5 g1f3 d7d6 d2d4 c5d4 f3d4 g8f6 b1c3
info seldepth 11 time 1548 nodes 132 score cp 15 hashfull 0 nps 85 pv e2e4 c7c5 g1f3 d7d6 d2d4 c5d4 f3d4 g8f6 b1c3
info seldepth 12 time 1683 nodes 152 score cp 15 hashfull 1 nps 90 pv e2e4 c7c5 g1f3 d7d6 d2d4 c5d4 f3d4 g8f6 b1c3 b8c6
............
info depth 2 seldepth 27 time 19311 nodes 6868 score cp 13 hashfull 28 nps 355 pv e2e4 c7c5 g1f3 d7d6 f1b5 b8d7 e1g1 g8f6 f1e1 e7e6 c2c3 f8e7 d2d4 e8g8 e4e5 f6d5 c3c4 d5c7 e5d6 e7d6
bestmove e2e4 ponder c7c5

seldepth 27 nodes 6868 time 19311 nps 355

hsntgm commented 6 years ago

/lc0/src/neural/network_cudnn.cu

 float alpha = 1.0f, beta = 0.0f;

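  // plain convolution: beta = 0.0 is passed, so the output tensor is simply overwritten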
  if (!(use_relu_ || use_bias_)) {
    reportCUDNNErrors(cudnnConvolutionForward(
        cudnn, &alpha, in_tensor_desc_, input, filter_desc_, weights,
        conv_desc_, convAlgo, scratch, kCudaScratchSize, &beta,
        out_tensor_desc_, output));
  } else if (input2) {
    // fused bias + sum + relu!
    reportCUDNNErrors(cudnnConvolutionBiasActivationForward(
        cudnn, &alpha, in_tensor_desc_, input, filter_desc_, weights,
        conv_desc_, convAlgo, scratch, kCudaScratchSize, &alpha,
        out_tensor_desc_, input2, bias_desc_, biases, activation_,
        out_tensor_desc_, output));
  } else {
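    // fused bias + relu without a skip connection: beta = 0.0 again, so the output is overwritten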
    reportCUDNNErrors(cudnnConvolutionBiasActivationForward(
        cudnn, &alpha, in_tensor_desc_, input, filter_desc_, weights,
        conv_desc_, convAlgo, scratch, kCudaScratchSize, &beta,
        out_tensor_desc_, output, bias_desc_, biases, activation_,
        out_tensor_desc_, output));
  }
}
void FCLayer::eval(int N, float *outputTensor, const float *inputTensor,
                   const float *input2, float *scratch, cudnnHandle_t cudnn,
                   cublasHandle_t cublas) {
  float alpha = 1.0f, beta = 0.0f;
  int numOutputs = C * H * W;
  int numInputs = input_->getC() * input_->getH() * input_->getW();

  if (fp16) {
    // TODO: implement this!
    assert(0);
  } else {
    cublasSgemm(cublas, CUBLAS_OP_T, CUBLAS_OP_N, numOutputs, N, numInputs,
                &alpha, weights_, numInputs, inputTensor, numInputs, &beta,
                outputTensor, numOutputs);
    // ... (excerpt truncated here; the rest of the function is omitted)
  }
}

2.6. Scaling parameters alpha and beta

Many cuDNN routines like cudnnConvolutionForward take pointers to scaling factors (in host memory) that are used to blend computed values with initial values in the destination tensor as follows: dstValue = alpha[0]*computedValue + beta[0]*priorDstValue. When beta[0] is zero, the output is not read and may contain any uninitialized data (including NaN). The storage data type for alpha[0] and beta[0] is float for HALF and FLOAT tensors, and double for DOUBLE tensors. These parameters are passed using a host memory pointer.

Note: For improved performance it is advised to use beta[0] = 0.0. Use a non-zero value for beta[0] only when blending with prior values stored in the output tensor is needed.

ankan-ban commented 6 years ago

@hsntgm I didn't get what you mean. We are already using beta[0] = 0.0 for the fully-connected layers and for the convolutions that don't need a skip connection. For convolutions that do need a skip connection, fusing the skip connection with the convolution (using beta[0] = 1) is significantly faster than doing another pass just to add the skip connection. Do you see any other optimization opportunity?
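To make the blending rule from section 2.6 concrete, here is a plain-C++ sketch of the semantics (illustration only - the real work happens inside cuDNN's kernels, and the function name is made up):

// dst = alpha*computed + beta*priorDst, applied elementwise.
// With beta == 0 the prior contents of dst are never read (safe even on
// uninitialized memory, hence the performance advice above); with a
// non-zero factor the old contents are blended in, which is how the fused
// call folds the skip-connection add into the convolution instead of
// doing a second elementwise pass.
void blend(const float* computed, float* dst, int n, float alpha, float beta) {
  for (int i = 0; i < n; ++i)
    dst[i] = (beta == 0.0f) ? alpha * computed[i]
                            : alpha * computed[i] + beta * dst[i];
}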

ASilver commented 6 years ago

WIndows Task Manager sucks at reporting proper GPU usage. To see it properly, open Task Manager, click on the Performance tab, select GPU. You wil see a number of graphs, most likely near zero. Above them are names with a drop-down menu. Select one like Copy, and change it to Compute_0.