official-stockfish / Stockfish

A free and strong UCI chess engine
https://stockfishchess.org/
GNU General Public License v3.0

SF NNUE #2728

Closed adentong closed 3 years ago

adentong commented 4 years ago

There has been much discussion on SF NNUE, which apparently is already on par with SF10 (so about 70-80 elo behind current sf dev). People have been saying it can become 100elo stronger than SF, which would basically come from the eval. Since the net is apparently not very big, maybe someone can study the activations of each layer and see if we can extract some eval info from it? In any case, it's probably worth looking into this since it shows so much promise.

NKONSTANTAKIS commented 3 years ago

Handcrafted eval is very hard to develop, SF10->SF11 gained only 15 eval elo in 1 year. And now we have the first primitive nets already so much superior with a search which is completely optimized for the handcrafted eval. So the search definitely has to split as well, in order to further unlock the NNUE potential.

It can be a shocking realisation that handcrafted eval was abruptly obsoleted. Its asset was speed, so it could battle neck and neck with Leela, but NNUE is 60% as fast, not 1000 times slower.

So I would say we should prioritize optimizing SF NNUE. Of course some emotional attachment is understandable, and so is developing eval for the fun of it, so why not keep all options open.

NKONSTANTAKIS commented 3 years ago

@adentong Let's sloppily say x2 speed = 50 elo, and NNUE is 50 elo ahead of vanilla. Let's also pretend that SF search is equally efficient for NNUE and vanilla. This more or less means NNUE eval is 100 elo ahead, so if we increase eval progress to 20 elo/year instead of 15, it will take us 5 years to reach the current performance of NNUE, which took only a few weeks on a home PC.

So I think I can safely abandon my successful career of translating chess-oriented logic into rationale, which in turn had to be translated into coding logic, and focus at areas where I truly shine, such as statistics :kappa:
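The back-of-the-envelope estimate above can be written out as a small sketch. It assumes, as the comment does, that a doubling of speed is worth about 50 Elo; extending that to other speed ratios with a log2 rule is an extra assumption made here, and the 0.5 speed ratio is only one reading of how the 100 Elo figure was reached. All function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Assumption taken from the comment: each doubling of speed is worth ~50 Elo.
constexpr double kEloPerDoubling = 50.0;

// Elo attributed to a speed ratio (2.0 = twice as fast), under the log2 assumption.
double speed_elo(double speed_ratio) {
    assert(speed_ratio > 0.0);
    return kEloPerDoubling * std::log2(speed_ratio);
}

// Eval-only gap: the measured gap plus the Elo the slower engine gives up on speed.
double eval_gap(double measured_gap_elo, double speed_ratio) {
    return measured_gap_elo - speed_elo(speed_ratio);
}

// Years needed to close a gap at a constant yearly rate of progress.
double years_to_close(double gap_elo, double elo_per_year) {
    assert(elo_per_year > 0.0);
    return gap_elo / elo_per_year;
}
```

With a measured lead of 50 Elo at half the speed, `eval_gap(50.0, 0.5)` gives 100 Elo, and `years_to_close(100.0, 20.0)` gives the 5 years quoted in the comment.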

FireFather commented 3 years ago

No need to abandon 'hand-crafted' eval IMO. A UCI option could be used to simply turn NNUE 'on' & 'off'.

Vizvezdenec commented 3 years ago

I agree with Norman.
Also we should think about how we can use fishtest to actually train NNUE nets. I think that we can achieve much better results with fishtest resources than someone using his 5 machines for this... But it requires a lot of work from maintainers and fishtest admins ofc.

FireFather commented 3 years ago

change evaluate.cpp line 895 to UCI option

#if !defined(EVAL_NNUE)

Value Eval::evaluate(const Position& pos) { return Evaluation(pos).value(); }

#endif // defined(EVAL_NNUE)

(and do the same in evaluate_nnue.cpp) should work I think
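A minimal sketch of what replacing the compile-time switch with a runtime one might look like. Everything here is a hypothetical stand-in: the `Position` struct, the two placeholder evaluators and the `useNNUE` flag are invented for illustration and are not the engine's actual code.

```cpp
#include <cassert>

// Hypothetical stand-ins; the real engine has its own Position and Value types.
struct Position { int material; };
using Value = int;

namespace Eval {

// Hypothetical flag that a UCI option handler would set at runtime.
bool useNNUE = false;

// Placeholder evaluators, only here so the dispatch can be exercised.
Value evaluate_classical(const Position& pos) { return pos.material; }
Value evaluate_nnue(const Position& pos) { return pos.material + 1; }

// Runtime dispatch instead of #if !defined(EVAL_NNUE).
Value evaluate(const Position& pos) {
    return useNNUE ? evaluate_nnue(pos) : evaluate_classical(pos);
}

} // namespace Eval
```

The trade-off is one extra branch per evaluation call in exchange for a single binary that can play with either eval.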

vondele commented 3 years ago

I'll be reading a bit the code and try to generate my own net. That seems like a good step for any of the devs interested in this technology.

It is not just about having good Elo performance, we need people that understand the code, can maintain (or bug fix) and refine it. In lines of code, it is roughly doubling the current code base, but there are various parts to the code that are not directly the engine (i.e. the learning infrastructure). We already have a few SF regulars active, on the code, so that is a good start.

Right now, there are still non-Elo-related refinements that can easily go on in the @nodchip branch, for example, making sure it passes the typical CI process, or improving the comments, or making sure all architectures are supported at least in a basic form.

ssj100 commented 3 years ago

> I'll be reading a bit the code and try to generate my own net. That seems like a good step for any of the devs interested in this technology.

Thanks @vondele - I'm normally not a fan of Discord (either), but would you consider participating there?

dorzechowski commented 3 years ago

@vondele While reading the code and trying to understand it I have done a bit of work that I have pushed to my branch of SF NNUE here: https://github.com/dorzechowski/Stockfish-nnue/tree/nnue-player-wip. Maybe it can be useful.

vondele commented 3 years ago

@dorzechowski useful work indeed. I'll have a look, but might only get to this for real next weekend.

The fact that sf nnue requires a more recent compiler might actually make this a little more difficult to deploy on fishtest, some people are still with older toolchains. Eventually, a first step could be to test such a variant of the code for non-regression on fishtest (or accurately measure the Elo loss), before one tries to tackle the more tricky aspect of enabling testing with different nets.

dorzechowski commented 3 years ago

@vondele I believe C++17 version is much more careful with alignment which is crucial for AVX2 instructions. The makefile in my branch requires C++17 but even if it doesn't compile on some machines, it's fine for now. Unfortunately we cannot choose requested CPU capabilities on creating the test but we should start trying anyway I think.

I think a good first test would be to treat it as "normal" Stockfish. I may change default UCI option to not use NNUE and it should at least compile, pass bench test and run some quick master vs master test on some machines. If this works then it would be possible to proceed from there. What do you think?

vondele commented 3 years ago

yes, test as a 'normal' Stockfish and do some non-regression test as a first step was what I was suggesting.

I can see the need for C++17... and in principle don't object. Just that it might be problematic for some older machines on fishtest. However, that's something we can eventually try to fix / workaround.

dorzechowski commented 3 years ago

Great, I will try to push a test in a minute.

dorzechowski commented 3 years ago

@vondele Unfortunately, fishtest doesn't let me create a test. I get: [image] I don't know what the problem with bench is; it looks correct to me. Here are my parameters: [image]

vondele commented 3 years ago

@dorzechowski I think it assumes that the master branch of the test repo (dorzechowski/Stockfish-nnue) is actually the SF master with the matching bench.... is that the case?

dorzechowski commented 3 years ago

@vondele Yes, it's updated to the latest master. But it complains about bench of base master, not test branch, I'm confused.

vondele commented 3 years ago

Base will still be a branch from your repository (not from official-stockfish)... Thus, if I look at https://github.com/dorzechowski/Stockfish-nnue/commits/master (which is what it will pick up as base), it will presumably fail to find the proper base signature (i.e. the latest bench in that branch won't match your number).

dorzechowski commented 3 years ago

Ah, that's right! I have correct master in Stockfish repo but not in Stockfish-nnue. I will push it to my Stockfish repo and try again.

Edit: test pushed!

vondele commented 3 years ago

looks like it fails to start. I guess the next hurdle is that the ARCH=x86-64-modern option has been removed, which is what the workers will use by default (IIRC). Maybe that could be hacked around using the x86-64-sse42 options in the makefile.

dorzechowski commented 3 years ago

Oh, I didn't notice that some options were removed. I will reintroduce x86-64-modern and try again. It's ok just for trying if it compiles but actual NNUE will be very slow on this arch.

Edit: pushed again.

vondele commented 3 years ago

yes, sure. Haswell introduced avx2, roughly 7 years ago, so I'm not too concerned if that would be the required 'modern' for NNUE.

vdbergh commented 3 years ago

A version of SF being tested on Fishtest does not have to be the strongest possible compile since it will only play against a small modification of itself. It is much more important that the barrier of entry for testers is low. So I think making C++17 mandatory would be a bad idea.

The existence of CFish shows that SF has zero need for any of the fancy C++ stuff.

vondele commented 3 years ago

@dorzechowski the current Elo performance (https://tests.stockfishchess.org/tests/view/5f154f61da64229ef7dc17ca) seems to come from a rather significant slowdown on the branch, just as measured by the nps of a bench (about 12% for me, roughly 26 Elo). I guess the origin of that would be useful to figure out & fix.

dorzechowski commented 3 years ago

@vondele This is a bit unexpected as I haven't seen any slowdown with my compiles (gcc 9.3.0, ARCH=x86-64-bmi2, Windows, CPU i7 Kaby Lake). Here are my results from fishbench (base is SF master, test is my nnue-player-wip branch):

Results for 20 tests for each version:

            Base      Test      Diff      
    Mean    1837380   1829760   7620      
    StDev   43427     40355     9416      

p-value: 0,209
speedup: -0,004

I noticed that all machines running this task on fishtest use Linux but I cannot really test on Linux right now. Also, what CPU do you have and which ARCH did you use to compile it?
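For reference, the reported speedup figure is consistent with a plain relative difference of the two mean nps values. The sketch below assumes that definition; the formula fishbench actually uses, and how it derives its p-value, are not shown in the thread.

```cpp
#include <cassert>
#include <cmath>

// Relative speed change of test vs. base, from mean nodes-per-second values.
// Negative means the test build is slower.
double relative_speedup(double base_mean_nps, double test_mean_nps) {
    assert(base_mean_nps > 0.0);
    return (test_mean_nps - base_mean_nps) / base_mean_nps;
}
```

With the means from the table, `relative_speedup(1837380.0, 1829760.0)` comes out at roughly -0.004, i.e. a slowdown of about 0.4%.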

vondele commented 3 years ago

Strange... I used make -j ARCH=x86-64-modern profile-build on Linux, using gcc version 9.3.0. I'll check again, maybe it was a pilot error.

vondele commented 3 years ago

I see, likely due to different compiler flags being passed on master and branch, so a makefile issue. I have

-Wall -Wcast-qual -fno-exceptions -std=c++17 -fprofile-use -fno-peel-loops -fno-tracer -pedantic -Wextra -Wshadow -m64 -DNDEBUG -O3 -DIS_64BIT -msse -DUSE_POPCNT -DUSE_SSE2 -flto 

vs.

-Wall -Wcast-qual -fno-exceptions -std=c++11 -fprofile-generate -pedantic -Wextra -Wshadow -m64 -DNDEBUG -O3 -DIS_64BIT -msse -msse3 -mpopcnt -DUSE_POPCNT -flto

Indeed, it seems to change the popcnt part of the Makefile
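Comparing two long flag lines by eye is error-prone. A small helper along these lines (a sketch, with hypothetical function names) lists the flags that appear in one build but not in the other.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// Split a whitespace-separated flag string into a set of individual flags.
std::set<std::string> split_flags(const std::string& flags) {
    std::istringstream in(flags);
    std::set<std::string> out;
    for (std::string f; in >> f;) out.insert(f);
    return out;
}

// Flags present in `a` but missing from `b`.
std::set<std::string> only_in(const std::string& a, const std::string& b) {
    const auto sa = split_flags(a);
    const auto sb = split_flags(b);
    std::set<std::string> out;
    for (const auto& f : sa)
        if (!sb.count(f)) out.insert(f);
    return out;
}
```

Applied to the two lines above, the flags found only on the master side include -msse3 and -mpopcnt, which matches the popcnt observation. The -std and -fprofile-* differences also show up, but those reflect the C++ standard and the profiling stage rather than the instruction set.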

nodchip commented 3 years ago

I guess that we also need to merge the HEAD of https://github.com/nodchip/Stockfish, and add "sse3 = yes" to "x86-64-modern", because dorzechowski's Makefile does not add "-msse3" when "popcnt = yes".

https://github.com/nodchip/Stockfish/blob/master/src/Makefile#L104
https://github.com/dorzechowski/Stockfish/blob/nnue-player-wip/src/Makefile#L397
https://github.com/official-stockfish/Stockfish/blob/master/src/Makefile#L330

dorzechowski commented 3 years ago

@nodchip Thanks, I fixed my Makefile with pointed changes. @vondele I pushed the change, can you retest speed?

I didn't really do anything in Makefile except getting rid of nnue targets, I must have missed that it was changed before. I only used bmi2 arch and was happy with the performance.

vondele commented 3 years ago

@dorzechowski yes, looks good now.

dorzechowski commented 3 years ago

@vondele Great, I pushed the test again (stopped it before). Fingers crossed.

vondele commented 3 years ago

@noobpwnftw any idea why your workers are not able to join the test https://tests.stockfishchess.org/tests/view/5f156bf5da64229ef7dc17de ? One possible reason would be the used gcc version (needs to support C++17). If so, what do you use?

dorzechowski commented 3 years ago

@vdbergh We don't really use too many fancy C++17 stuff syntax-wise (and what we may have, we can live without). The point is that older versions don't handle AVX2 instructions properly. Sources can be compiled but if AVX2 are not aligned correctly, executable crashes at runtime. Not that it works but slower, it exits with core dump.
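One plausible reading of the alignment point: aligned AVX2 loads and stores fault when the address is not a multiple of 32 bytes, and only since C++17 does a plain `new` of an over-aligned type have to honor the type's alignment. The sketch below shows just the alignment requirement itself; the `AlignedBlock` type and the helper are hypothetical and not taken from the NNUE code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical buffer type; 32 bytes is the width of one AVX2 register.
struct alignas(32) AlignedBlock {
    std::int16_t values[256];
};

// True when p sits on a 32-byte boundary, as aligned AVX2 loads/stores require.
bool is_avx2_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}
```

An object with automatic or static storage gets the requested alignment from `alignas`. Heap allocations of such a type are the case where pre-C++17 toolchains could hand back insufficiently aligned memory.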

noobpwnftw commented 3 years ago

@vondele They use devtoolset-7. gcc version 7.3.1 20180303 (Red Hat 7.3.1-5) (GCC)

Going to update to devtoolset-9. Workers going offline for the change.

vondele commented 3 years ago

@dorzechowski the test looks good, basically consistent with no significant slowdown for normal running. A quick local test with nnue enabled shows performance comparable to what I've seen before (right now: -29.8 +/- 18.7 after about 500 games).

dorzechowski commented 3 years ago

@vondele That's good to hear. Now if we could make a chosen nn.bin available to workers, we could in principle just set two UCI options, EvalFile=path/nn.bin and Use NNUE=true, and try to run it on fishtest without any other changes. So far the AVX2 code wasn't really executed, even if present in the binary, so I expect many rough edges depending on the CPU/compiler/Makefile ARCH combination. It may even run SF dev with no problem and crash when running NNUE.

Perhaps the nets should be provided the same way as books, i.e. downloaded once from trusted Stockfish or fishtest repo. Certainly not good to make workers risk downloading a big binary file from some random github place. Is there a way to tell workers to download a specific file from official repo, provide checksum, etc.?
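On the wire, setting those two options would use the standard UCI `setoption name <id> value <x>` command. A sketch of building the command lines follows; the helper name is hypothetical, and the option names are the ones mentioned in the comment above.

```cpp
#include <cassert>
#include <string>

// Build a UCI "setoption" command line for the given option name and value.
std::string setoption(const std::string& name, const std::string& value) {
    return "setoption name " + name + " value " + value;
}
```

For example, `setoption("EvalFile", "path/nn.bin")` and `setoption("Use NNUE", "true")` produce the two lines a test harness would send to the engine before starting a game.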

noobpwnftw commented 3 years ago

Why is it slower even with options turned off?

FireFather commented 3 years ago

I think the guys doing the training and producing the nets would be looking to test different evals.... so tracking that (via name/version etc.) would be very useful of course... Are you considering something to track nn.bin development as well?

vondele commented 3 years ago

@noobpwnftw as far as I can tell the speed is essentially the same.

dorzechowski commented 3 years ago

@noobpwnftw If it's slower for you, double check compile options. Speed is the same for me, see above https://github.com/official-stockfish/Stockfish/issues/2728#issuecomment-660908228.

noobpwnftw commented 3 years ago

Do you want bmi2 build on AMD? I currently use x86-64-bmi2 on Intels and x86-64-modern on AMDs. Currently it shows consistently slower performance on Intels with bmi2. Compiler version is: gcc version 9.3.1 20200408 (Red Hat 9.3.1-2) (GCC). Is that something related to Makefile?

vondele commented 3 years ago

@FireFather @dorzechowski the next step is non-trivial, i.e. integration in fishtest. I don't know yet how to best do this, it will definitely need some fishtest development. At least the following issues need to be resolved:

suggestions welcome.

dorzechowski commented 3 years ago

@noobpwnftw These are my CXXFLAGS for x86-64-bmi2 from my branch Makefile: -Wall -Wcast-qual -fno-exceptions -std=c++17 -pedantic -Wextra -Wshadow -m64 -DNDEBUG -O3 -DIS_64BIT -msse -msse3 -mpopcnt -DUSE_POPCNT -DUSE_AVX2 -mavx2 -DUSE_SSE42 -msse4.2 -DUSE_SSE41 -msse4.1 -DUSE_SSSE3 -mssse3 -DUSE_SSE3 -msse3 -DUSE_SSE2 -DUSE_PEXT -mbmi2 -flto

For AMD Zen2 there is ARCH=x86-64-avx2. On older AMD CPUs AVX2 is slow, so they shouldn't be used for NNUE.

noobpwnftw commented 3 years ago

Mine says:

CXXFLAGS: -Wall -Wcast-qual -fno-exceptions -std=c++17  -pedantic -Wextra -Wshadow -m64 -DNDEBUG -O3 -DIS_64BIT -msse -msse3 -mpopcnt -DUSE_POPCNT -DUSE_AVX2 -mavx2 -DUSE_SSE42 -msse4.2 -DUSE_SSE41 -msse4.1 -DUSE_SSSE3 -mssse3 -DUSE_SSE3 -msse3 -DUSE_SSE2 -DUSE_PEXT -mbmi2 -flto
LDFLAGS:  -m64 -Wl,--no-as-needed -lpthread -Wall -Wcast-qual -fno-exceptions -std=c++17  -pedantic -Wextra -Wshadow -m64 -DNDEBUG -O3 -DIS_64BIT -msse -msse3 -mpopcnt -DUSE_POPCNT -DUSE_AVX2 -mavx2 -DUSE_SSE42 -msse4.2 -DUSE_SSE41 -msse4.1 -DUSE_SSSE3 -mssse3 -DUSE_SSE3 -msse3 -DUSE_SSE2 -DUSE_PEXT -mbmi2 -flto

/proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
stepping        : 3
microcode       : 0xd6
cpu MHz         : 3699.865
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch invpcid_single intel_pt ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear spec_ctrl intel_stibp flush_l1d
bogomips        : 6816.00
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

vondele commented 3 years ago

@noobpwnftw I seem to measure 1.5% slowdown using bmi2 (on zen2, branch vs master)... is that similar to your number ? There are a few extra branches (basically check for nnue being enabled) that could cause a slight slowdown. Other reason could of course be code generation due to the different flags.

noobpwnftw commented 3 years ago

@vondele Yes, a minor one, but it does run a bit slower for whatever reason.

dorzechowski commented 3 years ago

Options are the same I used. You have Skylake, I tested on Kaby Lake but it should be the same basically. Obviously some very slight slowdown is expected but this is no problem.

noobpwnftw commented 3 years ago

A Git repo containing frequently changing binary files should not be cloned, for obvious reasons. However, you can still put many files there and have Github host them for you, like the current book repo. Fishtest can implement an extra input field for such a file, to be downloaded upon use and cached locally.

FireFather commented 3 years ago

Nice. And to submit an eval...maybe a button or link to open a dialog box for uploading the file, which would get placed in the testing queue. Currently nnue allows the nn.bin to be named anything... So perhaps a strict naming convention...some unique identifier may be needed.

nodchip commented 3 years ago

We may also need to ask net file creators to describe how they created their net files. In detail,

These are necessary to confirm reproducibility. Other net file creators can also learn from those descriptions.

nodchip commented 3 years ago

By the way, I will stop modifying my repository. Because I don't want to interfere with the works in this thread.

If there are questions, or something that I can help you guys, please feel free to ask me.

dorzechowski commented 3 years ago

@vondele To start working on it on fishtest, I suggest taking one step at a time, use just one eval file for the time being and place it manually in the fishtest repo. Known good net is nn-256-gek2706-c157.bin, sha256: c157e0a5755b63e97c227b09f368876fdfb4b1d104122336e0f3d4639e33a4b1. If at least some workers can run it, it's a good start.

Then maybe push the branch nnue-player-wip to the official Stockfish repo (maybe with a better name), so that people can start working on it. There are certainly a lot of things that can be done before starting to test different nets, even in terms of just further adapting, optimizing and cleaning up the code. There are also obvious QOL improvements to make, such as reading network size/architecture from the file header (it's all hardcoded now and needs recompiling) or supporting gzipped nets (the eval file is pretty sparse and compresses easily to at least 50% size).
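The file name quoted above ends in c157, which is also how its sha256 begins. If that were adopted as a naming convention (an assumption here, not something the thread settles), a worker could cheaply check that a net's name and its hash agree. The sketch below only compares strings; computing the SHA-256 itself is left to whatever hashing tool the worker already has.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical convention: the file name ends in "-<tag>.bin", where <tag> is
// the beginning of the file's SHA-256 digest in lowercase hex.
bool name_matches_hash(const std::string& filename, const std::string& sha256_hex) {
    const std::string ext = ".bin";
    if (filename.size() <= ext.size() ||
        filename.compare(filename.size() - ext.size(), ext.size(), ext) != 0)
        return false;
    const std::string stem = filename.substr(0, filename.size() - ext.size());
    const std::size_t dash = stem.rfind('-');
    if (dash == std::string::npos || dash + 1 == stem.size())
        return false;
    const std::string tag = stem.substr(dash + 1);
    // The digest must start with the tag taken from the file name.
    return sha256_hex.compare(0, tag.size(), tag) == 0;
}
```

A longer tag makes accidental collisions less likely; four hex digits, as in the example name, only distinguish about 65 thousand nets.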