glinscott / leela-chess

**MOVED TO https://github.com/LeelaChessZero/leela-chess ** A chess adaption of GCP's Leela Zero
http://lczero.org
GNU General Public License v3.0

Current bundled OpenBLAS lib is not scaling on higher threads #272

Open bunkbail opened 6 years ago

bunkbail commented 6 years ago

I have an 18-core, 36-thread Xeon E5-2686v3 (it's an underclocked version of the E5-2699v3), but the CPU version only yields around 1 knps using all threads. Updating OpenBLAS to v0.2.19 gives me a huge nps boost (up to 14 knps). For comparison, I only get 3-4 knps on my GTX 1060.

OpenBLAS lib that I'm using: https://sourceforge.net/projects/openblas/files/v0.2.19/OpenBLAS-v0.2.19-Win64-int32.zip/download

It wasn't statically linked, so here are all the libs I needed for it to work (taken from the latest MinGW 7.3.0), plus the OpenBLAS lib itself: https://drive.google.com/open?id=1ZLjM-mSYmuWXy5th2Xc2Tefuh-3yOhkL

Update: I compiled the latest stable release of OpenBLAS (v0.2.20) with static linking, so no more messy external libs needed. https://drive.google.com/open?id=1Ar3rZrM3PG8AFzAQMzkiRWPQOYduVvD3
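
To put the numbers above in perspective, here is a quick back-of-the-envelope calculation. It is only a sketch: the CPU figures are the ones quoted in this post, the GPU value is an assumed midpoint of the 3-4 knps range, and the function names are made up for illustration.

```python
def speedup(before_nps: float, after_nps: float) -> float:
    """Return how many times faster `after_nps` is than `before_nps`."""
    if before_nps <= 0:
        raise ValueError("before_nps must be positive")
    return after_nps / before_nps


def per_thread_nps(nps: float, threads: int) -> float:
    """Return the average nodes per second contributed by each thread."""
    if threads <= 0:
        raise ValueError("threads must be positive")
    return nps / threads


THREADS = 36          # 18 cores / 36 threads, as stated in the post
BUNDLED_NPS = 1_000   # ~1 knps with the bundled OpenBLAS
UPDATED_NPS = 14_000  # up to 14 knps with OpenBLAS v0.2.19
GPU_NPS = 3_500       # assumption: midpoint of the quoted 3-4 knps on a GTX 1060

print(f"CPU speedup after the update: {speedup(BUNDLED_NPS, UPDATED_NPS):.1f}x")
print(f"Updated CPU vs GPU: {speedup(GPU_NPS, UPDATED_NPS):.1f}x")
print(f"nps per thread, bundled: {per_thread_nps(BUNDLED_NPS, THREADS):.1f}")
print(f"nps per thread, updated: {per_thread_nps(UPDATED_NPS, THREADS):.1f}")
```

The per-thread figure is the more telling one: with the bundled lib, each of the 36 threads contributes very little, which is what you would expect if the work is not actually being spread across them.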

Ipmanchess commented 6 years ago

That's a big gain! I have an i9 7980XE (18c/36t) here. Is it possible to try your compiles? Can you put them in a download link? Thank you, Ipman.

bunkbail commented 6 years ago

@Ipmanchess just download the files from the Google Drive link above and replace all the libs inside your lc0 folder.
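
If you would rather not overwrite things by hand, the replacement step could be scripted roughly like this. This is a minimal sketch, not something shipped with lc0: the function name, the `.bak` suffix, and the folder paths in the usage note are all hypothetical, and it assumes the downloaded libs are `.dll` files sitting in one folder.

```python
import shutil
from pathlib import Path


def replace_libs(download_dir, lc0_dir, backup_suffix=".bak"):
    """Copy every .dll from download_dir into lc0_dir.

    Any DLL that already exists in lc0_dir is first copied to
    '<name><backup_suffix>' so the original can be restored later.
    Returns the list of file names that were copied.
    """
    download_dir = Path(download_dir)
    lc0_dir = Path(lc0_dir)
    copied = []
    for src in sorted(download_dir.glob("*.dll")):
        dst = lc0_dir / src.name
        if dst.exists():
            # Keep the bundled library around in case the new one misbehaves.
            shutil.copy2(dst, dst.with_name(dst.name + backup_suffix))
        shutil.copy2(src, dst)
        copied.append(src.name)
    return copied
```

Usage would be something like `replace_libs(r"C:\Downloads\openblas", r"C:\lc0")` with your own paths. Restoring the original is then just a matter of renaming the `.bak` file back.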

Ipmanchess commented 6 years ago

I have downloaded them. I thought I had to compile something. So I put the libs into the folder, use the latest network, and I'm done? If it's just like that, then I will let you know how it goes ;)

bunkbail commented 6 years ago

cool, keep us updated!

Ipmanchess commented 6 years ago

Are you also on the LCZero chat?

bunkbail commented 6 years ago

yes, bunkbail is my gamertag

Ipmanchess commented 6 years ago

Great thanks!

bunkbail commented 6 years ago

Sure no problem :)

Ipmanchess commented 6 years ago

Thanks, these libs work very well: about 10 times more nodes/sec!! I will post a screenshot on the LCZero chat.

jjoshua2 commented 6 years ago

Another guy on Discord, A pleasant illusion, and I independently tested 2.19 and 2.15 and found they both worked better. It looks like our DLL is 2.14.

We also measured a 2 percent gain single-threaded. We tested on Google and Amazon cloud, and I tested on a native Threadripper.

The weird thing is that Chad on Discord has no thread-limiting problems on Windows, with the same official v4 release.

2.20 is the latest, but I couldn't find a DLL. Someone with MinGW and Visual Studio needs to compile it. It's supposed to have a gain for Ryzen and Skylake. I tested on Windows Subsystem for Linux and had no problem either (compiling on Ubuntu).

It also appears there is an AMD BLIS with OpenBLAS support that might be a lot faster on AMD.
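
Rather than guessing which OpenBLAS build a given DLL is, it may be possible to ask the library itself. The sketch below loads a library with `ctypes` and calls a few of OpenBLAS's introspection functions (`openblas_get_config`, `openblas_get_corename`, `openblas_get_num_threads`). Assumptions: the path is whatever your lc0 folder actually contains, older builds may not export all of these symbols (so each lookup is guarded), and I have not checked what the config string contains for any particular release.

```python
import ctypes


def describe_openblas(lib_path):
    """Return a dict describing the OpenBLAS library at lib_path.

    Returns None if the library cannot be loaded. Functions that the
    library does not export are skipped rather than treated as errors.
    """
    try:
        lib = ctypes.CDLL(str(lib_path))
    except OSError:
        return None

    # (exported symbol, C return type); all three take no arguments.
    queries = (
        ("openblas_get_config", ctypes.c_char_p),
        ("openblas_get_corename", ctypes.c_char_p),
        ("openblas_get_num_threads", ctypes.c_int),
    )
    info = {}
    for name, restype in queries:
        try:
            fn = getattr(lib, name)
        except AttributeError:
            continue  # this build does not export the symbol
        fn.restype = restype
        fn.argtypes = []
        value = fn()
        info[name] = value.decode() if isinstance(value, bytes) else value
    return info
```

Calling `describe_openblas("libopenblas.dll")` from inside the lc0 folder (the file name is an assumption) should at least show the detected core name and how many threads the library thinks it may use, which is relevant to the scaling problem discussed here.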


Tilps commented 6 years ago

The Windows build seems to be done with AppVeyor, using Visual Studio and NuGet to source dependencies. The latest OpenBlas NuGet package is 2.14.1. There is another NuGet package, OpenBlasLib, at 2.19.3, but it specifically calls out processor architectures, so switching to it might restrict the user base.

Both of these NuGet packages are apparently unofficial. Maybe making our own unofficial NuGet package would be easier than trying to integrate binary drops into the AppVeyor build directly. (That is assuming the OpenBlasLib CPU arch restrictions are a problem.)

bunkbail commented 6 years ago

Currently I'm compiling the latest master (even newer than v0.2.20) from the OpenBLAS GitHub repo, but I can't figure out how to compile it as generic 64-bit; it forces me to compile for my CPU arch (Haswell). I can't seem to make a statically linked build either.
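
For anyone attempting the same build: as far as I understand OpenBLAS's `Makefile.rule`, the variables that control this are `DYNAMIC_ARCH=1` (build kernels for several CPU families and pick one at runtime), `TARGET=<name>` (for example `GENERIC`, to avoid tuning for the build machine), and `NUM_THREADS=<n>` (the maximum thread count compiled in). The helper below just assembles such a `make` invocation. It is a sketch with a hypothetical function name, the default of 64 threads is an arbitrary assumption, and whether these settings fix the scaling seen in this issue is untested.

```python
def openblas_make_args(dynamic_arch=True, target=None, max_threads=64):
    """Build the argument list for an OpenBLAS `make` invocation.

    dynamic_arch: add DYNAMIC_ARCH=1 for runtime CPU detection.
    target:       optional TARGET value such as "GENERIC" or "HASWELL".
    max_threads:  value for NUM_THREADS; should be at least the number
                  of hardware threads you intend to use.
    """
    if max_threads < 1:
        raise ValueError("max_threads must be at least 1")
    args = ["make"]
    if dynamic_arch:
        args.append("DYNAMIC_ARCH=1")
    if target:
        args.append(f"TARGET={target}")
    args.append(f"NUM_THREADS={max_threads}")
    return args


# Print the command line instead of running it, so it can be reviewed first.
print(" ".join(openblas_make_args()))
print(" ".join(openblas_make_args(dynamic_arch=False, target="GENERIC")))
```

The resulting list can be passed to `subprocess.run(args, cwd=<OpenBLAS checkout>)` once you are happy with it. Check the variable names against the `Makefile.rule` of the version you are building before relying on them.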

Ipmanchess commented 6 years ago

If you have something new, I can try it, also to have a comparison!

bunkbail commented 6 years ago

Well, I've finished compiling, but the result wasn't satisfying at all. It exhibits the same behavior as the libs included with lc0: stuck at 1 knps with 36 threads. The make process took a long time too, as it doesn't support parallel compilation (make -jN). I'll see whether other compile flags can improve things.

bunkbail commented 6 years ago

I have great news today! I've successfully compiled the latest stable release, v0.2.20, with automatic CPU architecture detection and good speed scaling beyond 4 threads. Download link here: https://drive.google.com/open?id=1JqfQDwvXUm1_f_mMxNMWUDRFYbtL0T3y

@Ipmanchess can you test whether it's faster than v0.2.19? I can't detect any measurable speedup on my machine.

PS: latest dev version v0.3.0-dev master branch has shitty scaling > 4 threads

bunkbail commented 6 years ago

Here's the statically linked compile of version 0.2.20, so you won't need any external libs at all (hopefully lc0 can be shipped with this version). I'll link this one in the original post. https://drive.google.com/open?id=1Ar3rZrM3PG8AFzAQMzkiRWPQOYduVvD3

I did the same with latest master, the scaling was still as bad as it was before.

Ipmanchess commented 6 years ago

Here are some benchmarks done with LCZero v0.5: https://mega.nz/#!TUYSWDDI!hDhJWhrE1WLgmgHgBYusS_Ux4p8T04VcEJLnHIbER60

When I download v0.5 and run it with the default files, it is very slow (using -t 18). With v0.2.19 it is much faster, and still the best (-t 18). The new v0.2.20 is better than the default, but almost half the speed of v0.2.19 (-t 18). Default GPU v0.5 is a little faster than default GPU v0.4, using -t 8 (the Nvidia 960 shows 8 CU).

Ipman.

Tilps commented 6 years ago

So I was doing some investigation for other reasons, and I suspect that the AppVeyor build doesn't depend explicitly on NuGet; it just uses NuGet to source the DLLs. Assuming there are no licensing issues, it is probably easier to commit the built DLLs to the codebase than to try to create a NuGet package out of them. Then just update the AppVeyor config to point to the committed location rather than the NuGet package download path.

gcp commented 6 years ago

FWIW the original Leela Zero release bundles have an OpenBLAS from the latest development branch (0.3.x), compiled to support up to 64 cores.