
SF NNUE #2728

Closed adentong closed 3 years ago

adentong commented 4 years ago

There has been much discussion of SF NNUE, which is apparently already on par with SF10 (so about 70-80 Elo behind current SF dev). People have been saying it could become 100 Elo stronger than SF, which would essentially come from the eval. Since the net is apparently not very big, maybe someone can study the activations of each layer and see if we can extract some eval info from it? In any case, it's probably worth looking into, since it shows so much promise.

ZagButNoZig commented 4 years ago

I don't know if it's the direction the devs want to go in, but I think integrating ML into SF should be considered, given the impressive results.

vondele commented 4 years ago

We should be open-minded and see how things evolve... it is an interesting development. Let's see how the code base evolves, how the performance goes, etc. Once we have some data and understanding, we should see what the opportunities are.

TesseractA commented 4 years ago

Given that Stockfish tuning attempts to match Leela evaluations have failed in the past, I'm not entirely sure you can extract much useful information from another similar black box, especially since neural networks have convolutional structures that make them useful and less compressible.

EDIT: I found out (anecdotally) that this neural net doesn't use convolutions. If you want to investigate, you should probably ask on the Stockfish discord or in the fork mentioned by vondele below.

Caleb-Kang commented 4 years ago

I don't know much about SF NNUE. What is it? Does NNUE stand for something?

adentong commented 4 years ago

So it's been claimed on discord that NNUE is now 34 Elo stronger than SFDev.

gekkehenker commented 4 years ago

I don't think anybody has claimed that besides the occasional SSS result.

NNUE is definitely much worse at 10+0.1 STC, but quickly gains elo on SF_dev as the TC increases.

vondele commented 4 years ago

Just for reference, this issue refers to the fork being developed here: https://github.com/nodchip/Stockfish with an eval function based on a neural net architecture.

ssj100 commented 4 years ago

The data is sounding more and more convincing on this (look at the jjosh and lkaufman posts): http://talkchess.com/forum3/viewtopic.php?f=2&t=74366&start=10#p850204

"Anecdotally", I have several test positions which SF consistently takes up to 50-100 billion nodes or more (or sometimes never finds it) to find the correct move, that SF NNUE finds within a few million nodes. The difference is night and day.

Is there any chance fishtest resources could be used for this? Or could we somehow run one of these "patches" (SF NNUE) against "master" with the usual SPRT elo bounds at a 180+1.8 TC? I think it might pass very fast!
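
For context, a sketch of the standard Wald test that fishtest's SPRT is based on (not fishtest's exact implementation): it accumulates a log-likelihood ratio over game results for two hypothesized Elo differences (elo0, elo1) and stops as soon as a bound is crossed:

LLR = sum over games of log( P(result | elo1) / P(result | elo0) )
accept H1 if LLR >= log((1 - beta) / alpha)    (about +2.94 for the usual alpha = beta = 0.05)
accept H0 if LLR <= log(beta / (1 - alpha))    (about -2.94)

The larger the true Elo difference relative to the (elo0, elo1) interval, the faster the LLR drifts to a bound, which is why a test like the one proposed here would indeed pass very quickly.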

adentong commented 4 years ago

@ssj100 But look at the number of games, though. It's not even thousands of games, just dozens. That's hardly convincing at all. I would, however, love to see an LTC match of NNUE vs SF, though I don't know if it's supported by fishtest (probably not). @vondele

Vizvezdenec commented 3 years ago

Well, I think we should slowly start thinking about how we can use fishtest to train networks and things like that. This stuff seems really promising if, after just a few weeks of training, it plays at a level "not really worse than master" at LTC on CPUs that support AVX. Sure, most of our hardware is quite old, but we have some modern CPUs, and it can be trained even on older ones, just more slowly. So, what I think should be done :) - we should start to train some nets ourselves, and maybe have two separate code bases or (even better) one code base with both the NN and the handcrafted eval, plus a UCI parameter to switch between them; people with older CPUs can stay on the handcrafted eval and people with modern CPUs can use NNUE.

I honestly think that NNUE will be the future; the newest CPUs make it pretty fast, and it can help to just walk over corner cases that corrupt SF's play a lot. Honestly, the fact that NNUE plays at reasonable strength in its really early days is one of the main reasons why I basically stopped writing eval patches :) I know all of this will require quite a lot of work from developers and maybe even the fishtest maintainers, but some day it still needs to be done, imho.
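
To illustrate the "one code base with a UCI switch" idea, here is a minimal sketch (not actual Stockfish code; the option name and the stub functions are hypothetical):

#include <iostream>

struct Position { /* board state would live here */ };

// Stubs standing in for the real handcrafted and net-based evaluations.
int classical_evaluate(const Position&) { return 25; }  // centipawns
int nnue_evaluate(const Position&)      { return 31; }

// Would be set from a UCI command such as "setoption name UseNNUE value true".
bool useNNUE = false;

int evaluate(const Position& pos) {
    // One binary, two evals: older CPUs keep the handcrafted path,
    // modern CPUs can flip the switch and use the net.
    return useNNUE ? nnue_evaluate(pos) : classical_evaluate(pos);
}

int main() {
    Position pos;
    std::cout << "eval = " << evaluate(pos) << " cp\n";
}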

TesseractA commented 3 years ago

I honestly think that NNUE will be the future; the newest CPUs make it pretty fast, and it can help to just walk over corner cases that corrupt SF's play a lot. Honestly, the fact that NNUE plays at reasonable strength in its really early days is one of the main reasons why I basically stopped writing eval patches :)

"Cornercases that corrupt SF play a lot" I'll bet there's equally many (if not more) corner cases to be met with the NNUE architecture, given that even leela has lots of trouble with its own kind of corner cases, especially those of which that are both distant to mate and require pruning exponentially larger search trees. Current SF has a reasonable combination of search code and eval code to be able to direct it to finding improvements in obscure endgames and make those problems far less difficult by deliberation. This may make it easier to identify and fix specific problems. In my experience with neural networks, specific problems are far harder to fix when trying to generalize evaluation.

Also, NNUE may not provide a higher ceiling than handcrafted evals, because of the inefficiency of information packing in neural networks as opposed to formal handcrafted evaluation. NNUE can only be so large a network, so it'll probably hit its limit and stop improving after a certain point, much like how Leela's network architecture has hardly improved since it first got squeeze-and-excitation (SE) nets. That said, it's easier to train this NNUE than Lc0 because it has so many fewer variables, so designing improvements (in the short term at least) may come easier to it.

So I'd still be a bit skeptical (even though I predict NNUE will be better in the near future) of the long-term implications of NNUE. I fear that SF could get stuck in a local minimum with NNUE when the NN stops improving, and people would lose interest in the SF project instead of returning to the handcrafted evaluations with their higher Elo ceiling.

If AlphaZero had come two years earlier and blown everyone out of the water then, it probably would have made many people abandon SF instead of realizing there was still great potential in handcrafted evaluations.

The SF project is probably one of the largest (if not the largest) open-source projects of handcrafted feature recognition, and in my opinion it would be a shame if it were just to become an exhibit in a GitHub museum.

All this said, it's just my experience from watching from the Lc0 side of things.

Vizvezdenec commented 3 years ago

The difference is that 80% of the Elo SF gains are improvements to search. So even if the eval gets "stuck" - well, it's not THAT big of a deal, tbh. Also, no one prohibits you from continuing to improve the handcrafted eval if the NN gets stuck.

ssj100 commented 3 years ago

I don't think handcrafted evaluation should be abandoned, as the possibility of it having a higher ceiling remains. That being said, as Viz mentioned, handcrafted search appears to be "unthreatened" anyway, so the "SF project" won't become an "exhibit in a github museum" regardless. People shouldn't forget that a big reason why SF NNUE is already so strong is its strong search. For example, I'd predict that if Komodo NNUE were released (Komodo being the 2nd strongest CPU-alone engine), it would still get crushed by native SF.

However, my point was that it may be prudent to do some "testing on fishtest" for the NNUE component, if only to become adept at using/testing/training it. The handcrafted eval component should still continue as much as possible, but perhaps when it comes to submitting SF for tournaments etc., the strongest version of SF at the time should be submitted (whether it's native SF or SF NNUE).

TesseractA commented 3 years ago

From watching the games currently played at CCCC I get the feeling that NNUE will over-evaluate certain endgames and native evaluation would somehow have to take over anyway (to gain elo, that is.) Some stark misevaluations make native SF a more reliable component of the engine in certain cases. That said, search behavior could end up being weird if there was a huge mismatch between NNUE evaluations and native evaluations. What I imagine might happen is that certain endgames get left to some specialized threads which take care of the native evaluations while the other threads search elsewhere with NNUE to prevent holdup. Dynamically updating which threads take care of which might improve behavior.

(e.g. NNUE seemed to evaluate a drawn KRPPP vs KRPP endgame at +3 while native SF was able to evaluate it at +1)

noobpwnftw commented 3 years ago

The problem is that you don't really have a way to decide which eval is correct and which is not, even with a shallow search. With the native eval, people spot certain problems and write patches, and those patches still often break more than they fix, failing fishtest; so how NNUE is going to magically make this problem disappear is beyond me.

gekkehenker commented 3 years ago

From watching the games currently played at CCCC I get the feeling that NNUE will over-evaluate certain endgames and native evaluation would somehow have to take over anyway (to gain elo, that is.) Some stark misevaluations make native SF a more reliable component of the engine in certain cases. That said, search behavior could end up being weird if there was a huge mismatch between NNUE evaluations and native evaluations. What I imagine might happen is that certain endgames get left to some specialized threads which take care of the native evaluations while the other threads search elsewhere with NNUE to prevent holdup. Dynamically updating which threads take care of which might improve behavior.

(e.g. NNUE seemed to evaluate a drawn KRPPP vs KRPP endgame at +3 while native SF was able to evaluate it at +1)

Those misevaluations are mostly the result of the data it's been trained on.* At the end of the day it's still a net that has only seen a lot of depth 8 games and a bunch of depth 12 games.

Things should eventually improve, once we can get fishtest, Leela, or Noob's data to work.

Anyway, I turned skeptical about its scaling after seeing a fixed-node test at 1m, 10m and 20m nodes. But maybe Jjoshua's net has fixed that. We'll see over at TCEC; Jjosh's net should be stronger than mine, and TCEC is less likely to bork settings than CCC.

*But a lot of them will exist even if we use deeper data; SF evaluating a drawn endgame as +1 is just as wrong as Leela saying +0.8 or NNUE +3.4.

vondele commented 3 years ago

What kind of training data should those games be turned into? All fishtest LTC games are available with scores for each position (roughly depth 20-25); that's literally billions of scored positions.
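
As a rough sketch of how such scored positions could become training examples (the lambda blend of search score and game result is similar in spirit to what the NNUE learners do; the struct and all names here are invented):

#include <cmath>
#include <cstdio>
#include <string>

// One scored position as it might be extracted from a fishtest LTC game.
struct ScoredPosition {
    std::string fen;    // the position itself
    int         score;  // search score in centipawns, at roughly depth 20-25
    float       result; // game outcome for the scored side: 1, 0.5 or 0
};

// Logistic mapping from a centipawn score to an expected score in [0, 1].
float win_prob(int cp) { return 1.0f / (1.0f + std::pow(10.0f, -cp / 400.0f)); }

// Training target: interpolate between the deep-search score and the
// actual game result; lambda = 1 would trust the search score alone.
float target(const ScoredPosition& p, float lambda) {
    return lambda * win_prob(p.score) + (1.0f - lambda) * p.result;
}

int main() {
    ScoredPosition p{"startpos", 20, 0.5f};
    std::printf("training target = %.3f\n", target(p, 0.7f));
}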

gekkehenker commented 3 years ago

What kind of training data should those games be turned into? All fishtest LTC games are available with scores for each position (roughly depth 20-25); that's literally billions of scored positions.

A few others have experimented with the data but saw some strange behaviour, either because the positions weren't converted correctly or maybe because of an issue with the learning function itself.

vondele commented 3 years ago

concerning settings and nets, it would be useful if the nodchip github repo would indicate in the readme what the current optimal settings are, and give a download link to the current best net. I gave up trying to find the info when I wanted to test the fork. I know that there is, of course, a variety of opinions on these topics, but for people that want to get something running quickly, that would be very helpful.

TesseractA commented 3 years ago

@gekkehenker it's much harder* to tune a neural network to give desired relative evaluations than it is for the handcrafted alternatives.**

*This might have to be proven true, but Stockfish's evaluations are tuned to beat other versions of itself. That makes the patches that come out of fishtest alive very good at introducing adversarial play, which a small neural network trained on external data could not provide at such high fidelity. What ends up happening against stronger or "drawish" opponents is that the neural network tends to prefer things it cannot itself evaluate properly, instead of focusing on generating play from its own internal strengths.

**"handcrafted alternatives" rely on far more concrete values to evaluate a position, making any small differences in evaluation which might find wins/draws effect magnified. also, the deeper the search, the more false positives which the neural network generates effects how the edges of search behave, especially drawn 50-move rule bound endgames.

@noobpwnftw Being able to distinguish when our handcrafted evaluations are better to use could rely on a table of precalculated values loaded from a file, which would let us determine which evaluation method is better for a given number and type of pieces on the board. We could build such an evaluation-accuracy piece table by using the mean squared error of an evaluation against the result of the game, for which we might have to figure out how the new network's evaluations convert to "actual" win percentage (see the sketch below). One potential downside is that this might get a bit messy if different networks have different strengths. Then again, maybe there's a lot of slowdown in figuring out which pieces are on the board and loading the table. Maybe simply using the number of pieces on the board, or some value measuring how much the tree is branching, is enough.
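
A minimal sketch of that precalculated-table idea, assuming the table is keyed by piece count alone and filled offline from the mean-squared-error comparison described above (all names and the CLASSICAL/NNUE split are invented placeholders):

#include <array>
#include <cmath>

enum EvalMethod { CLASSICAL, NNUE };

// Maps a centipawn score to an expected score in [0, 1], so both evals
// can be compared against game results on the same scale.
double win_prob(int cp) { return 1.0 / (1.0 + std::pow(10.0, -cp / 400.0)); }

// Filled offline: for every piece count, compute for each eval
//   mse = average over positions of (win_prob(eval) - game_result)^2
// and store whichever eval was more accurate. The split below is
// only an illustrative placeholder.
std::array<EvalMethod, 33> bestEval = [] {
    std::array<EvalMethod, 33> t{};
    for (int n = 0; n <= 32; ++n)
        t[n] = (n <= 7) ? CLASSICAL : NNUE;
    return t;
}();

// At search time the decision is then a single table lookup.
EvalMethod chooseEval(int pieceCount) { return bestEval[pieceCount]; }

int main() { return chooseEval(6) == CLASSICAL ? 0 : 1; }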

gekkehenker commented 3 years ago

concerning settings and nets, it would be useful if the nodchip github repo would indicate in the readme what the current optimal settings are, and give a download link to the current best net. I gave up trying to find the info when I wanted to test the fork. I know that there is, of course, a variety of opinions on these topics, but for people that want to get something running quickly, that would be very helpful.

This link contains a few Windows compiles (popcnt, avx2, bmi2) and my current strongest net:

https://workupload.com/file/ggEUrvNVgmH

It seems like the latest binaries (same goes for the binaries on Nodchip's repo) fixed a few bugs. There's no longer any need to adjust slowmover; 100 works perfectly now. Extreme elo gain: on older binaries my nets were always 100+ elo weaker than SF. They now test stronger than SFDev...

It's roughly as simple as SF now. The UCI option "evalfile" has to point to the NN file. In the files above it defaults to "eval\nn.bin", but this can be changed to anything now, as long as it points to the correct net file.

There's sadly not a lot of centralized information, because it was originally nothing more than a quick port to test whether NNUE works in chess too. Whatever I know is built upon quick instructions from Twitter, looking through the learner.cpp code, and Google-translated YaneuraOu docs:

https://twitter.com/nodchip/status/993432774387249153
https://github.com/nodchip/Stockfish/blob/master/src/learn/learner.cpp
https://github.com/yaneurao/YaneuraOu/blob/master/docs/USI%E6%8B%A1%E5%BC%B5%E3%82%B3%E3%83%9E%E3%83%B3%E3%83%89.txt

ssj100 commented 3 years ago

Just thought it'd be important to post some real results in my testing so far.

  1. I've been testing with these conditions for many years, including with SF8, SF9, SF10, SF11, SF12dev, H5-6, K10-14.

  2. These are the general conditions:
     - GUI = cutechess
     - 1 core
     - No TB
     - Time Control = 60 seconds +0.6
     - Book = Balsa_v500.pgn (500 lines, mainly up to 5 moves)

  3. This is the information for each engine:
     - SF = abrok compile from "July 11" 2020, all default settings
     - SF NNUE binary component = nodchip compile from "July 13" 2020, all default settings (it's important to use this binary, as older binaries were 50-100+ elo weaker for some reason)
     [This means both engines are using a very recent version of SF's "search code". As already discussed/mentioned in many places, the functional difference between the two engines is that the abrok SF obviously uses SF's "eval code", while SF NNUE completely disables this "eval code" and instead uses a trained net ("nn.bin")]
     - SF NNUE net component = gekkehenker net from 27 June 2020 (created entirely from SF self-play games with a binary from June 2020)
     - Start position SF speed: ~1800 knps
     - Start position SF NNUE speed: ~1100 knps (~60% of SF speed)

  4. Here is the result so far:
     SF NNUE vs SF: 78 - 53 - 369 [0.525]
     Elo difference: 17.39 +/- 15.54
     500 of 1000 games finished.

I'm going to let it run to 1000 games, mainly just for future consistency. Some musings:

  1. You can already see that SF NNUE is very likely about on par with the latest SF (possibly better)
  2. The NNUE concept has likely only (publicly) been experimented with in computer chess for the last few weeks
  3. @gekkehenker spent literally only a few days creating the "eval net" above, using very limited hardware resources (literally one computer with one CPU - 6 cores/12 threads)
  4. If 1. is true, this effectively means gekkehenker has, by himself, managed to match (or possibly surpass) the elo strength of SF's "eval code" within a few days and with a tiny fraction of fishtest's "CPU hours". That is, he has achieved in a fraction of the time and resources what SF/fishtest (with hundreds of developers, thousands of "CPU-years" and about 12 years of hand-crafted coding/testing) has managed
  5. It remains to be seen whether scaling for SF NNUE is good, but all the data out there so far strongly suggests that it is
  6. I can only imagine what fishtest and the SF community could achieve together with their ample resources and incredible developer talent
  7. One way forward would be to split fishtest resources, something like as follows (assuming a default of about 1500 cores is available):
     - 1000 cores to continue handcrafted search improvement patches
     - 100 cores to continue handcrafted eval improvement patches
     - 400 cores to train "NNUE"
     (Clearly this proportion can be changed as per the optimal needs etc.)

Anyway, thanks to @gekkehenker and nodchip for continuing to share their knowledge publicly!

crocogoat commented 3 years ago

I didn't have much luck with anything I tried so far but with the link from @gekkehenker low TC is testing great for me. Using settings close to fishtest, 10+0.1 same book and default settings:

Score of sf-nnue-bmi2-256halfkp vs stockfish_20071122_x64_bmi2: 2742 - 1735 - 5595 [0.550] Elo difference: 34.85 +/- 4.51

I'm not really sure I understand/trust it completely though. I did try to double check everything but can't see anything obviously wrong. I'm going to test 20+0.2 now.

gekkehenker commented 3 years ago

I didn't have much luck with anything I tried so far but with the link from @gekkehenker low TC is testing great for me. Using settings close to fishtest, 10+0.1 same book and default settings:

Score of sf-nnue-bmi2-256halfkp vs stockfish_20071122_x64_bmi2: 2742 - 1735 - 5595 [0.550] Elo difference: 34.85 +/- 4.51

I'm not really sure I understand/trust it completely though. I did try to double check everything but can't see anything obviously wrong. I'm going to test 20+0.2 now.

Yes, the first time I saw the results of the new binaries I couldn't believe them either. "I must have done something wrong" is what I thought.

In an era where a 5 elo patch is believed as too good to be true, a 30 elo "patch" must be impossible to believe.

ssj100 commented 3 years ago

I didn't have much luck with anything I tried so far but with the link from @gekkehenker low TC is testing great for me. Using settings close to fishtest, 10+0.1 same book and default settings:

Score of sf-nnue-bmi2-256halfkp vs stockfish_20071122_x64_bmi2: 2742 - 1735 - 5595 [0.550] Elo difference: 34.85 +/- 4.51

I'm not really sure I understand/trust it completely though. I did try to double check everything but can't see anything obviously wrong. I'm going to test 20+0.2 now.

Your result is "consistent" with basically every test done so far (including mine) that used nodchip's binaries (or equivalent) from July 11th or later. Again, testing with the newer binaries is crucial (probably stick with the July 13th binary until we're absolutely certain of the strength improvement), as older binaries were for some reason 50-100+ elo weaker; SF is so far ahead of the rest that even those builds were still relatively strong, around the level of Komodo 14.

It appears that the elo difference at 10+0.1 (and likely even shorter TCs) is bigger than at 60+0.6. The elo difference seems to be around 30-50 at the shorter TCs, and around 15-35 at the longer TCs. It'd be interesting to see if fishtest can verify these numbers, ideally at its usual TCs for patches: 10+0.1 and 60+0.6 with 1 thread, and 5+0.05 and 20+0.2 with 8 threads, all to 40,000 games each or similar.

crocogoat commented 3 years ago

Yeah, fishtest tests would be quite something if that is possible. I stopped my own 20+0.2 test when it was giving a similar result:

20+0.2: Score of sf-nnue-bmi2-256halfkp vs stockfish_20071122_x64_bmi2: 506 - 292 - 1224 [0.553] Elo difference: 36.91 +/- 9.47

and then I started the more interesting 60+0.6, which, while with few games so far, shows the same:

60+0.6 hash64: Score of sf-nnue-bmi2-256halfkp vs stockfish_20071122_x64_bmi2: 204 - 105 - 663 [0.551] Elo difference: 35.51 +/- 12.23

ssj100 commented 3 years ago

Just to follow up on my testing from above. The 1-core test finished as follows:

SF NNUE vs SF: 161 - 103 - 736 [0.529]
Elo difference: 20.17 +/- 11.02
1000 of 1000 games finished.

A 2-core test with exactly the same conditions as above is currently showing even better results, although the sample size is too tiny to draw any conclusions about scaling:

SF NNUE vs SF: 81 - 30 - 327 [0.558]
Elo difference: 40.64 +/- 16.14
438 of 1000 games finished.

vondele commented 3 years ago

So, with the net from @gekkehenker (c157e0a5755b63e97c227b09f368876fdfb4b1d104122336e0f3d4639e33a4b1 nn.bin) and current master (https://github.com/nodchip/Stockfish.git 7a13d4ed60b09a9ce1b5aee46aa2a596bc4ca0fd) I get the following results:

STC (10.0+0.1 @ 1 thread)
Score of master vs nnue: 940 - 2206 - 3973  [0.411] 7119
Elo difference: -62.4 +/- 5.3, LOS: 0.0 %, DrawRatio: 55.8 %

LTC (20.0+0.2 @ 8 thread)
Score of master vs nnue: 189 - 463 - 1332  [0.431] 1984
Elo difference: -48.3 +/- 8.7, LOS: 0.0 %, DrawRatio: 67.1 %

That's a bit better than the results posted previously. The cutechess cmdline is quite standard:

./cutechess-cli -repeat -rounds 10000 -games 2 -tournament gauntlet \
    -resign movecount=3 score=400 -draw movenumber=34 movecount=8 score=20 \
    -concurrency 15 \
    -openings file=noob_3moves.epd format=epd order=random plies=16 \
    -engine name=master cmd=stockfish.master \
    -engine name=nnue cmd=stockfish.nnue option.EvalFile=/home/vondele/chess/match/nn.bin \
    -ratinginterval 1 -each tc=10.0+0.1 proto=uci option.Threads=1 \
    -pgnout nnue.pgn

adentong commented 3 years ago

Tests on CCC seem to indicate that NNUE can't handle more than 64 threads, though? Is that true, or is CCC's NNUE set up incorrectly? Anyways, I highly doubt blitz tests represent the true strength difference at VLTC (I'm talking about TCEC conditions). I expect at best +20 elo in those conditions (which, by the way, was my prediction for how much better Leela was back when a horde of Leela fans were claiming +50 at least).

vondele commented 3 years ago

Well, it is unlikely that NNUE would fundamentally show worse threading behavior. After all, this is just changing the eval, which is really threading-independent. However, there could be threading-related bugs, or new threading-related bottlenecks that haven't been found; that could happen in relatively new code. Another thing to consider is that there might be a difference in performance wrt. hyperthreading, as NNUE has different characteristics (e.g. it is AVX2-intensive). A first test at a higher thread count here seems fine:

VLTC (20.0+0.2 @ 16 threads)
Score of master vs nnue: 292 - 698 - 2202  [0.436] 3192
Elo difference: -44.4 +/- 6.6, LOS: 0.0 %, DrawRatio: 69.0 %

gekkehenker commented 3 years ago

On CCC it was previously running on WINE, hence the 64 thread limit. It's running on 90 threads now.

Vizvezdenec commented 3 years ago

well I think that 40+ elo perf is enough to justify putting effort into it :)

crocogoat commented 3 years ago

One possible way of integration would be to keep the UCI option to skip loading the NN, but have the normal eval used as the base, also used in case a NN file is not present. That way people on fishtest could also test against it by setting options. Currently in NNUE, if you skip loading the eval file it will just have no eval at all.

Beyond that, the fact that the normal eval gets about double the nps could mean that it's still more efficient in some form, maybe for endgames? Lazy eval comes to mind as an example of an elo-gaining change in evaluation that depends on the position.

vondele commented 3 years ago

Meanwhile, some limited results at 20.0+0.2 @ 250 threads; they look consistent with the other numbers so far.

Score of master vs nnue: 13 - 34 - 153  [0.448] 200
Elo difference: -36.6 +/- 23.0, LOS: 0.1 %, DrawRatio: 76.5 %

Ipmanchess commented 3 years ago

Played 22,000 games at TC 10s+1s with sf-nnue-bmi2-256halfkp: http://ipmanchess.yolasite.com/i9-7980xe.php - +37 Elo; Ordo shows +39.6 Elo above Stockfish 11!

1) sf-nnue-bmi2-256halfkp 3530.1 : 22000 (+14921,=6644,-435), 82.9 %

vs.                                :  games (     +,    =,   -),   (%) :    Diff,    SD, CFS (%)
Stockfish 11 x64 bmi2              :   1000 (   281,  631,  88),  59.6 :   +39.6,   3.1,  100.0
Stockfish 10 x64 bmi2              :   1000 (   389,  547,  64),  66.3 :   +87.9,   3.0,  100.0
asmFishW 2018-06-12 bmi2           :   1000 (   379,  574,  47),  66.6 :  +113.2,   2.8,  100.0
Komodo 14 64bit bmi2               :   1000 (   499,  457,  44),  72.8 :  +174.1,   3.2,  100.0
Houdini 6.03 Pro x64 bmi2          :   1000 (   536,  431,  33),  75.2 :  +175.1,   3.1,  100.0
Komodo 13.3 64bit bmi2             :   1000 (   531,  427,  42),  74.5 :  +187.5,   3.0,  100.0
Ethereal 12.13  x64 pext           :   1000 (   658,  332,  10),  82.4 :  +285.1,   2.9,  100.0
Ethereal 12.00 x64 pext            :   1000 (   693,  293,  14),  84.0 :  +294.8,   3.0,  100.0
Komodo 13.2.5 x64 bmi2 MCTS        :   1000 (   732,  260,   8),  86.2 :  +308.5,   2.7,  100.0
Komodo 13.3 x64 bmi2 MCTS          :   1000 (   719,  268,  13),  85.3 :  +310.5,   3.2,  100.0
Xiphos-0.6-w64-bmi2                :   1000 (   671,  318,  11),  83.0 :  +310.7,   3.2,  100.0
Fire 7 x64 popcnt                  :   1000 (   700,  293,   7),  84.7 :  +325.1,   3.1,  100.0
Xiphos-0.5.3-w64-bmi2              :   1000 (   699,  291,  10),  84.5 :  +329.8,   3.0,  100.0
rofChade 2.3 bmi2                  :   1000 (   780,  211,   9),  88.5 :  +378.9,   3.1,  100.0
Laser 1.7 bmi2                     :   1000 (   822,  174,   4),  90.9 :  +407.8,   3.0,  100.0
Fire 6.1 x64 popcnt                :   1000 (   818,  176,   6),  90.6 :  +409.1,   3.0,  100.0
rofChade 2.203 bmi2                :   1000 (   799,  199,   2),  89.8 :  +420.9,   3.2,  100.0
Defenchess 2.2 pop                 :   1000 (   825,  170,   5),  91.0 :  +434.6,   3.1,  100.0
Ginkgo 2.18 bmi2                   :   1000 (   844,  150,   6),  91.9 :  +440.2,   3.1,  100.0
Ginkgo 2.1 bmi2                    :   1000 (   840,  154,   6),  91.7 :  +446.7,   3.1,  100.0
Booot 6.4 x64 pop                  :   1000 (   856,  140,   4),  92.6 :  +453.1,   3.2,  100.0
RubiChess 1.7.2                    :   1000 (   850,  148,   2),  92.4 :  +455.3,   3.3,  100.0

vondele commented 3 years ago

@Ipmanchess can you specify exactly which version of the code and which net you used (git sha, sha256sum of the net)? That should help in understanding the difference between the +39 Elo vs SF11 here and the >40 Elo vs SFdev elsewhere. This might also be a book effect (I've been using the noob_3moves.epd book).

MRMikaelJ commented 3 years ago

But the result is +281, =631, -88 (59.65% vs SF 11); isn't that like a +68 elo performance? Or is the elo calculator I use simply doing a different calculation?

Note also that NNUE does not have contempt.
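
For reference, a worked check under the standard logistic Elo model: with score s = (281 + 631/2) / 1000 = 0.5965,

Elo = -400 * log10(1/s - 1) = -400 * log10(0.6764) ≈ +68

so +68 is indeed the head-to-head performance against SF 11. Ordo's +39.6 is fitted over the whole pool of engines simultaneously, which is presumably why the two numbers differ (see gekkehenker's note below about results against the weaker engines).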

jjoshua2 commented 3 years ago

The code currently doesn't work with contempt (changing contempt doesn't change evals at all), so it could just be underperformance against weak opponents?

jjoshua2 commented 3 years ago

I just tested a net with a 384-wide first layer, 30 MB (which is 50% larger), stockfiNN 0.1, with a fixed binary from 7-14 at 10s+0.1s, and despite the even further slowdown it still beats sf-dev over 1000 games. It still gets almost 60% of sf-dev's speed on my Zen2 arch.

Score of stockfinn1 vs stockfish_20070321_x64_modern: 356 - 226 - 418 [0.565]
Elo difference: 45.4 +/- 16.5, LOS: 100.0 %, DrawRatio: 41.8 %
repeated
Score of stockfinn1 vs stockfish_20070321_x64_modern: 376 - 223 - 401 [0.577]
Elo difference: 53.6 +/- 16.7, LOS: 100.0 %, DrawRatio: 40.1 %
Score of stockfinn2 vs stockfish_20070321_x64_modern: 368 - 196 - 435 [0.586]
Elo difference: 60.4 +/- 16.2, LOS: 100.0 %, DrawRatio: 43.5 %
Score of sf-nnue-avx2-256halfkp-Pleomati 7-9 vs stockfish_20070321_x64_modern: 711 - 444 - 845 [0.567]
Elo difference: 46.7 +/- 11.6, LOS: 100.0 %, DrawRatio: 42.3 %

320 games from the drawkiller book, same PC, binary, and 10+0.1s TC

Score of stockfinn1 vs stockfish_20070321_x64_modern: 140 - 79 - 101 [0.595]
Elo difference: 67.1 +/- 31.9, LOS: 100.0 %, DrawRatio: 31.6 %
Score of stockfinn2 vs stockfish_20070321_x64_modern: 139 - 82 - 99 [0.589]
Elo difference: 62.6 +/- 32.0, LOS: 100.0 %, DrawRatio: 30.9 %
sf-nnue-avx2-256halfkp-Pleomati 7-9 bundled with gek 2706
Score of sf-nnue-avx2-256halfkp-Pleomati 7-9 vs stockfish_20070321_x64_modern: 125 - 88 - 107 [0.558]
Elo difference: 40.4 +/- 31.2, LOS: 99.4 %, DrawRatio: 33.4 %

1000 more games with drawkiller book

Score of stockfinn2 vs stockfish_20070321_x64_modern: 448 - 207 - 345 [0.621]
Elo difference: 85.4 +/- 17.7, LOS: 100.0 %, DrawRatio: 34.5 %

Massive elo gains, with both regular and draw-reducing books!

Vizvezdenec commented 3 years ago

Strange, since we calculate the contempt effect in search.cpp, which shouldn't really be changed (?)

MRMikaelJ commented 3 years ago

But it is a bonus added to the actual score in evaluate.cpp. This, for example, never happens with NNUE: https://github.com/official-stockfish/Stockfish/blob/master/src/evaluate.cpp#L834
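
To make the mechanism concrete, a hedged sketch (stub names, not the real Stockfish source) of why changing Contempt moves the classical score but leaves the NNUE score untouched:

#include <iostream>

enum Color { WHITE, BLACK };

struct Position {
    Color stm = WHITE;
    Color side_to_move() const { return stm; }
};

// Stubs standing in for the real evaluation terms.
int material(const Position&)         { return 100; }
int positional_terms(const Position&) { return 20; }
int network_forward(const Position&)  { return 130; }

// Classical path: the contempt bonus is folded into the returned score.
int classical_evaluate(const Position& pos, int contempt) {
    int v = material(pos) + positional_terms(pos);
    v += (pos.side_to_move() == WHITE ? contempt : -contempt);
    return v;
}

// NNUE path: the raw net output is returned; contempt never enters.
int nnue_evaluate(const Position& pos, int /*contempt*/) {
    return network_forward(pos);
}

int main() {
    Position pos;
    std::cout << "classical: " << classical_evaluate(pos, 24)
              << " cp, nnue: " << nnue_evaluate(pos, 24) << " cp\n";
}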

Vizvezdenec commented 3 years ago

Ah, I think they changed where it was. Then yeah, it may be a (lack of) contempt effect. Against SF11, NNUE shows 70 elo in this test, slightly lower than it should, but this is also (I guess) due to lack of contempt/luck/etc.

gekkehenker commented 3 years ago

#1 engine on CCRL blitz currently.

http://ccrl.chessdom.com/ccrl/404/cgi/engine_details.cgi?print=Details&each_game=1&eng=Stockfish%2BNNUE%20150720%2064-bit%204CPU#Stockfish%2BNNUE_150720_64-bit_4CPU

The Elo difference on ipmanchess might be smaller than the H2H because it isn't stomping the weaker engines quite as hard as you'd expect based on its SF11 performance. Probably contempt.

ssj100 commented 3 years ago

So, with the net from @gekkehenker (c157e0a5755b63e97c227b09f368876fdfb4b1d104122336e0f3d4639e33a4b1 nn.bin) and current master (https://github.com/nodchip/Stockfish.git 7a13d4e) I get the following results:

STC (10.0+0.1 @ 1 thread)
Score of master vs nnue: 940 - 2206 - 3973  [0.411] 7119
Elo difference: -62.4 +/- 5.3, LOS: 0.0 %, DrawRatio: 55.8 %

LTC (20.0+0.2 @ 8 thread)
Score of master vs nnue: 189 - 463 - 1332  [0.431] 1984
Elo difference: -48.3 +/- 8.7, LOS: 0.0 %, DrawRatio: 67.1 %

That's a bit better than the results posted previously. The cutechess cmdline is quite standard:

./cutechess-cli -repeat -rounds 10000 -games 2 -tournament gauntlet \
    -resign movecount=3 score=400 -draw movenumber=34 movecount=8 score=20 \
    -concurrency 15 \
    -openings file=noob_3moves.epd format=epd order=random plies=16 \
    -engine name=master cmd=stockfish.master \
    -engine name=nnue cmd=stockfish.nnue option.EvalFile=/home/vondele/chess/match/nn.bin \
    -ratinginterval 1 -each tc=10.0+0.1 proto=uci option.Threads=1 \
    -pgnout nnue.pgn

Thanks for testing this @vondele! The "20.0+0.2 @ 8 thread" result uses conditions identical to fishtest SPRT SMP LTC tests, and I'd gather it would have passed the SPRT bounds in fewer than 1000 games?

And yes, I think the different absolute results are likely at least partly due to different books etc.

vondele commented 3 years ago

and the last number from my side for today, using a bit longer TC (120.0+1.2):

Score of master vs nnue: 364 - 904 - 2798  [0.434] 4066
Elo difference: -46.4 +/- 5.9, LOS: 0.0 %, DrawRatio: 68.8 %

Ipmanchess commented 3 years ago

@vondele, you can always find some comments/info under Testings by choosing the right system: http://ipmanchess.yolasite.com/testings-i9-7980xe.php - and I also use noob_3moves on my i9 7980XE.

jjoshua2 commented 3 years ago

Accidentally ran my 1000-game book test twice, but got the same results; same setup, 10s+0.1s, with net 2706 now:

Score of sf-nnue-avx2-256halfkp-Pleomati 7-9 vs stockfish_20070321_x64_modern: 711 - 444 - 845 [0.567]
Elo difference: 46.7 +/- 11.6, LOS: 100.0 %, DrawRatio: 42.3 %

It finished 1 elo ahead of stockfinn1, obviously within error bars. EDIT: Updated the previous post to have all results!

adentong commented 3 years ago

This is very exciting and all, but what now? Do we just completely abandon handcrafted eval?

adentong commented 3 years ago

Or do we keep trying to improve it? With NNUE being only 60% as fast as regular SF, if the handcrafted eval could be improved to even just 80% of NNUE's strength, regular SF would be on top again.

Vizvezdenec commented 3 years ago

We can leave it as is and maybe have eval patches running with lower priority. But it's up to the maintainers, ofc.