Poor RX Vega performance

inversed-ru commented 6 years ago

I have an RX Vega 56 and its performance is far too low. I measured it by starting lczero and running the "go infinite" command until depth 26 (Id 211, 10x128). I got 1200 nps, which is a terrible result. For comparison, an R9 Fury achieves 1700 nps and an R9 380 achieves 940 nps. These are previous-generation GPUs, and Vega is a compute beast! According to GPU-Z, GPU load is 87%, and memory and core clocks are at their intended values. I asked another Vega owner on Discord and he is experiencing the same issue. Changing the number of threads and running --full-tune did not help.

System info: Windows 7 x64, Radeon Software Version 18.3.4.

lczero console output:

Using 2 thread(s).
Detecting residual layers...v2...128 channels...10 blocks.
Initializing OpenCL.
Detected 1 OpenCL platforms.
Platform version: OpenCL 2.1 AMD-APP (2527.10)
Platform profile: FULL_PROFILE
Platform name: AMD Accelerated Parallel Processing
Platform vendor: Advanced Micro Devices, Inc.
Device ID: 0
Device name: gfx900
Device type: GPU
Device vendor: Advanced Micro Devices, Inc.
Device driver: 2527.10 (PAL,HSAIL)
Device speed: 1590 MHz
Device cores: 56 CU
Device score: 1121
Device ID: 1
Device name: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
Device type: CPU
Device vendor: GenuineIntel
Device driver: 2527.10 (sse2,avx)
Device speed: 3400 MHz
Device cores: 8 CU
Device score: 521
Selected platform: AMD Accelerated Parallel Processing
Selected device: gfx900 with OpenCL 2.1 capability.
Loaded existing SGEMM tuning.
Wavefront/Warp size: 64
Max workgroup size: 256
Max workgroup dimensions: 1024 1024 1024
BLAS Core: Sandybridge

inversed-ru commented 6 years ago

I've just noticed that the tuner might not be working correctly. When I run the tuning, the CPU load hits 100% (single core) and the GPU load remains at 0%. Might be related to #301.

Here is the tuning file:

0;XgemmBatched;128;16;128;16; -DKWG=32 -DKWI=2 -DMDIMA=16 -DMDIMC=16 -DMWG=16 -DNDIMB=16 -DNDIMC=16 -DNWG=16 -DSA=1 -DSB=1 -DSTRM=0 -DSTRN=0 -DVWM=1 -DVWN=1;OpenCL: Advanced Micro Devices, Inc. gfx900 @ 1590MHz
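For anyone comparing tuning files across cards, the entry above can be pulled apart with a few lines of Python. This is only a sketch: it assumes the entry is semicolon-separated, that the last field is the device description, and that the field before it holds the space-separated -D defines. The function name parse_tuning_line is made up for illustration, and the meaning of the leading fields is not interpreted here.

```python
def parse_tuning_line(line):
    """Split one tuning entry into header fields, -D defines, and device string."""
    fields = line.strip().split(";")
    # Assumption: last field = device description,
    # second-to-last field = space-separated -D compiler defines.
    device = fields[-1].strip()
    defines = {}
    for token in fields[-2].split():
        if not token.startswith("-D"):
            continue  # skip anything that is not a -DNAME=VALUE token
        name, _, value = token[2:].partition("=")
        defines[name] = int(value)
    # Everything before the defines is returned untouched as strings.
    return fields[:-2], defines, device


SAMPLE_LINE = (
    "0;XgemmBatched;128;16;128;16; -DKWG=32 -DKWI=2 -DMDIMA=16 -DMDIMC=16 "
    "-DMWG=16 -DNDIMB=16 -DNDIMC=16 -DNWG=16 -DSA=1 -DSB=1 -DSTRM=0 -DSTRN=0 "
    "-DVWM=1 -DVWN=1;OpenCL: Advanced Micro Devices, Inc. gfx900 @ 1590MHz"
)

header, defines, device = parse_tuning_line(SAMPLE_LINE)
print(header)
print(defines["VWM"], defines["VWN"])
print(device)
```

Having the defines in a dictionary makes it easy to diff two tuning results, for example to spot which parameters differ between a Vega and a Polaris card.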

gcp commented 6 years ago

No, this is the expected result. During the tuning, the CPU has to compile all the possible kernels for your device, whereas the GPU only has to execute them a few times to get a benchmark result.

So, while your CPU is busy compiling the next kernel, the GPU just idles. What you are seeing is expected and totally unrelated to the issue you mention.

Not clear why Vega should have worse performance than older AMD cards, though.

gcp commented 6 years ago

-DVWM=1 -DVWN=1

This is rather suspect. In fact, in the upstream Leela Zero code I removed this configuration because it's so unlikely to be faster for modern cards. It looks like the Vega driver might have problems profiling the kernel's performance reliably.

You can try changing this to -DVWM=4 -DVWN=4 and see what happens. I'll try to post an alternative tuning for a Polaris card to compare with too, give me a bit.
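If editing the file by hand is error-prone, the change can be scripted. The sketch below rewrites individual -D defines in a parameter string; set_define is a hypothetical helper name, not part of lczero, and it only edits text. It does not check whether the resulting kernel configuration is valid or produces correct results on the device.

```python
import re


def set_define(params, name, value):
    """Return params with the single -D<name>=<number> token set to value."""
    # (?<!\S) ensures the match starts at a token boundary.
    pattern = re.compile(r"(?<!\S)-D" + re.escape(name) + r"=\d+")
    new_params, count = pattern.subn(f"-D{name}={value}", params)
    if count != 1:
        raise ValueError(f"expected exactly one -D{name}, found {count}")
    return new_params


params = (
    "-DKWG=32 -DKWI=2 -DMDIMA=16 -DMDIMC=16 -DMWG=16 -DNDIMB=16 -DNDIMC=16 "
    "-DNWG=16 -DSA=1 -DSB=1 -DSTRM=0 -DSTRN=0 -DVWM=1 -DVWN=1"
)

# Apply the suggested change: vector widths of 4 instead of 1.
updated = set_define(set_define(params, "VWM", 4), "VWN", 4)
print(updated)
```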

gcp commented 6 years ago

Alternative 1:
-DKWG=16 -DKWI=8 -DMDIMA=16 -DMDIMC=16 -DMWG=64 -DNDIMB=8 -DNDIMC=8 -DNWG=64 -DSA=1 -DSB=1 -DSTRM=0 -DSTRN=0 -DVWM=4 -DVWN=4

Alternative 2:
-DKWG=16 -DKWI=2 -DMDIMA=8 -DMDIMC=8 -DMWG=64 -DNDIMB=8 -DNDIMC=8 -DNWG=32 -DSA=1 -DSB=1 -DSTRM=1 -DSTRN=1 -DVWM=4 -DVWN=2

Look inside your leelaz_opencl_tuning file and replace the values in there by the values above. Try both alternatives.
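The replacement can be sketched in Python as follows. It assumes the tuning entry keeps the semicolon-separated layout posted earlier in this thread, with the defines in the second-to-last field; swap_params is a hypothetical helper name. The header fields and the device string are left unchanged so the entry should still match the same device.

```python
ALTERNATIVES = {
    1: "-DKWG=16 -DKWI=8 -DMDIMA=16 -DMDIMC=16 -DMWG=64 -DNDIMB=8 -DNDIMC=8 "
       "-DNWG=64 -DSA=1 -DSB=1 -DSTRM=0 -DSTRN=0 -DVWM=4 -DVWN=4",
    2: "-DKWG=16 -DKWI=2 -DMDIMA=8 -DMDIMC=8 -DMWG=64 -DNDIMB=8 -DNDIMC=8 "
       "-DNWG=32 -DSA=1 -DSB=1 -DSTRM=1 -DSTRN=1 -DVWM=4 -DVWN=2",
}


def swap_params(line, new_params):
    """Replace the -D defines field of one tuning entry, keeping the rest."""
    fields = line.rstrip("\n").split(";")
    # Assumption: the defines live in the second-to-last field.
    old = fields[-2]
    # Preserve whatever leading whitespace the original field had.
    lead = old[: len(old) - len(old.lstrip())]
    fields[-2] = lead + new_params
    return ";".join(fields)


original = (
    "0;XgemmBatched;128;16;128;16; -DKWG=32 -DKWI=2 -DMDIMA=16 -DMDIMC=16 "
    "-DMWG=16 -DNDIMB=16 -DNDIMC=16 -DNWG=16 -DSA=1 -DSB=1 -DSTRM=0 -DSTRN=0 "
    "-DVWM=1 -DVWN=1;OpenCL: Advanced Micro Devices, Inc. gfx900 @ 1590MHz"
)

for number, params in ALTERNATIVES.items():
    print(f"Alternative {number}:")
    print(swap_params(original, params))
```

To apply it, back up leelaz_opencl_tuning, write one of the printed lines into the file in place of the existing entry, and rerun the benchmark without retuning.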

Does this improve performance on the Vega?

inversed-ru commented 6 years ago

Changing -DVWM and -DVWN to 4 leads to errors ("Error in OpenCL calculation: expected x got y"). The configuration you posted works, but results in a ~30% performance drop.

gcp commented 6 years ago

Thanks. I wonder if someone has a Vega on Linux and can compare. This might give a clue whether it's a driver interaction issue or whether the kernels are somehow not good on Vega.

glinscott / leela-chess

Poor RX Vega performance #459