official-stockfish / Stockfish

A free and strong UCI chess engine
https://stockfishchess.org/
GNU General Public License v3.0

NNUE ideas and discussion (post-merge). #2915

Closed - vondele closed this issue 4 years ago

vondele commented 4 years ago

I'll create this new issue to track new ideas and channel early questions.

jjoshua2 commented 4 years ago

I can see how that could be. So how do we know if the Elo gain comes more from the new term, from raising the threshold, or because that net was stronger or weaker in the many-pawn area... I guess the question is how to merge them, or decide what must be retested. Edit: now I think a new net is probably not a factor. But there is still 480 or 600 as the threshold.

vondele commented 4 years ago

With pull requests changing parts of the code that are somehow related, or obviously overlapping, it has been up to the developers to suggest the best path forward, and ultimately up to the maintainer to decide. Note that none of the tests has passed yet, so the question is a bit premature. So far, with the wave of patches after the NNUE merge, I have committed most of them without requests for retesting. Once the new-patch rate goes down a bit, the most suitable patch gets merged first, and the other one is rebased and retested. In this case, assuming all three pass, I would presumably consider merging the new net and one of the patches that changes the bounds. I would have to see precise results, and maybe hear some background info from @Vizvezdenec, to decide which one goes in. The other one could be retested, presumably in a form with tweaked params.

Vizvezdenec commented 4 years ago

I honestly think that my patch is somewhat logically better, because I always preferred some scaling over flat thresholds. An extra exchange without pawns is probably something classical eval can handle (we even have special eval functions for these cases), while positions with the most pawns are usually the hardest for it to evaluate properly, and this patch directly scales the use of the "better" eval with pawn count. My patch starts to use NNUE at 7+ pawns more often compared to jjosh's, and honestly I think this is the area where you want to use NNUE the most. Also I wanted to scale this threshold up with npm, which also looks really logical. Another, unrelated thing: your patch wasn't looking likely to pass the 2nd STC - https://tests.stockfishchess.org/tests/view/5f2f2f249081672066536b48. If anything, I prefer to merge my patch (if it passes, of course) and then test some stuff as simplifications on top of it. Also note that at high pawn counts the threshold rises up to "knight minus pawn", which is important for a lot of tactics.

jjoshua2 commented 4 years ago

Yes, I agree we probably need a lot more eval terms. However, I would like to get the current threshold tuned first, so that more complications are definitely stronger and won't be able to be simplified away right away. I think the complexity term is a good starting point. Maybe it's possible to do a small SPSA tune with just the threshold and your new term and see what it outputs? If the new term ends up close to zero, then a simplification patch would probably pass?

Vizvezdenec commented 4 years ago

I think we need to wait for some patch to pass. If it's mine, you can test flat 600 as a simplification (or maybe even a higher value, why not). If it's yours, I will just stop my test. Note that I've seen everything, like tests not passing LTC after 2.91 LLR... So we need to wait for the patches to converge first :)

jjoshua2 commented 4 years ago

That sounds pretty reasonable. Good luck with your patch, it is close :)

NKONSTANTAKIS commented 4 years ago

Is it likely that adjudication rules contribute to the success of swapping to classical eval, by triggering more adjudicated wins of positions that are/could be drawn despite the high eval?

I think it's alarming and should be investigated, because, for example, a fortress that is more correctly (modestly) evaluated by NNUE will grant a draw, while an incorrect high evaluation will grant a win.

And in general maybe we would like more accurate/less aggressive adjudication?

syzygy1 commented 4 years ago

@NKONSTANTAKIS Interesting example, but to "win" this fortress position by adjudication, it seems hybrid SF would need to have the agreement of its opponent NNUE SF that it is winning. (At least I suppose that both engines have to show a winning/losing eval for adjudication to kick in.)

Assuming NNUE SF's eval is more accurate in a particular game, I don't easily see how it could lose points to hybrid SF because of the adjudication rules. But there may be a scenario I am not thinking of...

Karma-Tron commented 4 years ago

If I might have another go at a UCI option to set NNUE to always on: I think the solution Crystal is using is very elegant, and it satisfies users like me who like to use Stockfish for analysis purposes.

Please take a look at https://github.com/jhellis3/Stockfish/commit/15c515567a4d4d1af207452fed08b2d85fc7f4b0

Vizvezdenec commented 4 years ago

I think we can do something like th.marked in reductions, but for positions that use the classical eval. If a position was already evaluated with classical eval by another thread, evaluate it with NNUE instead - maybe it will be an Elo gain at SMP. I'm trying something in this direction, but my knowledge of SMP is too bad to do it in any sort of robust way.

vondele commented 4 years ago

I doubt this is easy. One would need essentially an additional bit in the transposition table for that, and it is unclear if it would gain.

Sopel97 commented 4 years ago

This could work okayish with a bloom filter. But I'm not sure the computational cost wouldn't just make it better to use NNUE all the way.
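
For what it's worth, a minimal sketch of how such a bloom filter might look (all names and sizes here are hypothetical, not an actual Stockfish API):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Two index functions derived from the 64-bit Zobrist key; false positives
// only flip which eval gets used, so correctness is unaffected.
class EvalBloomFilter {
    static constexpr std::size_t NUM_BITS = 1 << 20;  // 128 KiB of flags

    std::atomic<std::uint64_t> words[NUM_BITS / 64] = {};

    static std::size_t idx1(std::uint64_t key) { return key % NUM_BITS; }
    static std::size_t idx2(std::uint64_t key) { return (key >> 21) % NUM_BITS; }

    void set(std::size_t i) {
        words[i / 64].fetch_or(1ULL << (i % 64), std::memory_order_relaxed);
    }
    bool test(std::size_t i) const {
        return (words[i / 64].load(std::memory_order_relaxed) >> (i % 64)) & 1;
    }

public:
    // Called by a thread after it evaluates a position with the classical eval.
    void mark_classical(std::uint64_t zobristKey) {
        set(idx1(zobristKey));
        set(idx2(zobristKey));
    }

    // A later thread can check this and pick NNUE for the same position.
    bool probably_classical(std::uint64_t zobristKey) const {
        return test(idx1(zobristKey)) && test(idx2(zobristKey));
    }
};
```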

Vizvezdenec commented 4 years ago

Well, I'm just giving a basic idea :) Since we have 2 evals, maybe we can try this sort of stuff. Interaction with the transposition table is, yeah, not obvious, but it's something to work on I guess.

gekkehenker commented 4 years ago

Is it possible that with many threads the hybrid eval is Elo negative? Not the most conclusive results, but maybe concerning.

Score of (Hybrid)stockfish_20080621_x64_bmi2 vs (NonHybrid)stockfish_20080616_x64_bmi2: 179 - 167 - 1654 [0.503]
Elo difference: 2.1 +/- 6.3, LOS: 74.1 %, DrawRatio: 82.7 %

2000 of 2000 games finished.

https://drive.google.com/file/d/1gvEKMrbwLlhRp48Aax1g_ja0gfzuomhV/view?usp=sharing

CPU: i5-10400, TC=5+0.05, 12 threads, Noobbook 3 moves. Games are adjudicated with 6 pieces on board.

The binaries are already outdated... But if LTC passed by 11.30 Elo, then this result may point towards SMP losses inherent to the hybrid.

Vizvezdenec commented 4 years ago

The current regression tests may give a partial answer: currently SMP is lower than single-core, which has never been the case since SF 11.

crocogoat commented 4 years ago

As for 1-core: after the recent patches I thought it was time to investigate NNUE-only use again, but it failed rather quickly: https://tests.stockfishchess.org/tests/view/5f32b00d66a893ef3a025e11

vondele commented 4 years ago

There is little reason to believe SMP influences the effect of the hybrid eval. Who knows, however... IMO the current small difference between single-threaded and SMP could also be attributed to things like sss, and the fact that we're already in 'extreme' territory: the W/L ratio is near 10, and the draw rate difference between both tests is near 5%. To answer the question, I've started an SPRT LTC SMP: https://tests.stockfishchess.org/tests/view/5f32dc4366a893ef3a025e4c

syzygy1 commented 4 years ago

Hmmm, simplification passed? :-)

vondele commented 4 years ago

AVX512 fleet did help :-)

syzygy1 commented 4 years ago

So make the threshold dependent on the cpu's instruction set?

vondele commented 4 years ago

just launch the raspberry pi 3 fleet for next test.

syzygy1 commented 4 years ago

Just change Use NNUE from check to spin and use it as the threshold ;-)

noobpwnftw commented 4 years ago

Or a threshold depending on ARCH: modern - normal, AVX2 - higher, BMI2 - even higher, AVX512 - infinity.

vondele commented 4 years ago

seems like a good plan for most constants in search...

noobpwnftw commented 4 years ago

If you can classify workers in the future, then it is possible to decide optimal values for each; I do have a sufficient number of workers of each kind...

Vizvezdenec commented 4 years ago

So our search will be `probCutBeta = ARCH == #archname1# ? 145 - 40 * improving : ARCH == #archname2# ? 128 - 33 * improving : 110 - 28 * improving;` etc.?

syzygy1 commented 4 years ago

In the future all workers will have at least AVX2. Older ones should remain supported but do not need to be optimised for, I would think.

noobpwnftw commented 4 years ago

There is one thing: I think at some point people will still want to use a standalone, fully functioning binary (which includes a net). This makes integration a lot easier, for example for deployment on devices/browsers.

vondele commented 4 years ago

I've seen somewhere a patch which used a linker hack to embed the net as an object file in the binary. Not sure if somebody can find it again. Not sure if we really want that...

noobpwnftw commented 4 years ago

I mean we can just bin2hex it into a header file and #include it on the fly with a makefile flag, which compiles a self-contained binary including the default net. There are cases where downloading files on the machine that runs the engine is not trivial, or loading an external file is not possible due to platform limitations.
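
A minimal sketch of that bin2hex route, under the assumption of an `EMBEDDED_NNUE` build flag and a hypothetical `load_from_memory()` helper (neither exists in the current code):

```cpp
// embedded_net.h would be generated by a Makefile step such as
//   xxd -i nn-xxxxxxxx.nnue > embedded_net.h
// which emits something like:
//   unsigned char nn_xxxxxxxx_nnue[]   = { 0x18, 0x22, /* ... */ };
//   unsigned int  nn_xxxxxxxx_nnue_len = /* file size */;

#ifdef EMBEDDED_NNUE
#include "embedded_net.h"

// Hypothetical loader: initialize the network from the embedded byte
// array instead of opening the .nnue file from disk.
bool load_default_net() {
    return NNUE::load_from_memory(nn_xxxxxxxx_nnue, nn_xxxxxxxx_nnue_len);
}
#endif
```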

vondele commented 4 years ago

So for reference, that's the patch I've seen: https://tests.stockfishchess.org/tests/view/5f135fc1da64229ef7dc173d i.e. https://github.com/gekkehenker/Stockfish/compare/ffebd6f652...62cbeab1f0

I'm not saying I like this approach, but it is at least the first patch I've seen.

Clearly, `make net` could also write a bin2hex-based source file; compile time probably won't improve with ~20 MB of sources.

We can add https://github.com/nodchip/Stockfish/issues/76 to the stack...

Sopel97 commented 4 years ago

That's how I would do it too. Using the linker is best.

Using #include with bytes for that is... let's just say not ideal. Strings are better, but MSVC has a very small limit on string literal length. https://mort.coffee/home/fast-cpp-embeds/
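
For reference, a rough sketch of the linker route on a GNU toolchain (the symbol names follow mechanically from the file name, so they are illustrative here):

```cpp
// The net is converted into an object file at build time, e.g. with
//   ld -r -b binary -o net.o nn-xxxxxxxx.nnue
// which exports start/end symbols derived from the file name:
extern "C" const unsigned char _binary_nn_xxxxxxxx_nnue_start[];
extern "C" const unsigned char _binary_nn_xxxxxxxx_nnue_end[];

#include <cstddef>

inline const unsigned char* embedded_net_data() {
    return _binary_nn_xxxxxxxx_nnue_start;
}

inline std::size_t embedded_net_size() {
    return static_cast<std::size_t>(_binary_nn_xxxxxxxx_nnue_end
                                    - _binary_nn_xxxxxxxx_nnue_start);
}
```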

noobpwnftw commented 4 years ago

Well, MSVC also doesn't have ld, so it's one way or the other. I'm just trying to simplify things for end users who will just load it and run.

Vizvezdenec commented 4 years ago

So, should we remove hybridisation (at least for now)? If so, I would suggest running truncated pseudo-RTs on LTC / SMP LTC (with, say, 1/4 of the usual games) to see if their results align now.

VoyagerOne commented 4 years ago

Idea: At the root position before entering search, evaluate every legal move with NNUE and sort the moves. This should provide very good move ordering right off the bat.
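
A minimal sketch of the idea, using Stockfish-flavored types loosely (RootMove, StateInfo, and the NNUE::evaluate call are stand-ins here, not the exact API):

```cpp
#include <algorithm>
#include <vector>

// Before the first search iteration: statically evaluate the position after
// each legal root move with NNUE and sort the root move list best-first.
void order_root_moves(Position& pos, std::vector<RootMove>& rootMoves) {
    for (RootMove& rm : rootMoves) {
        StateInfo st;
        pos.do_move(rm.pv[0], st);
        // The eval is from the opponent's point of view after the move,
        // so negate it to score the move for the side making it.
        rm.previousScore = -NNUE::evaluate(pos);
        pos.undo_move(rm.pv[0]);
    }
    std::stable_sort(rootMoves.begin(), rootMoves.end(),
                     [](const RootMove& a, const RootMove& b)
                     { return a.previousScore > b.previousScore; });
}
```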

Vizvezdenec commented 4 years ago

Well, now I'm trying to partially enable some threads to do a clean NNUE search; one version passed both STC and LTC SMP, the other passed STC and is looking good at LTC. The problem is with the hash table: we will have 2 evals writing god only knows what there...

jjoshua2 commented 4 years ago

Can't you write the NNUE evals at depth + 2 or 3 so it will tend to overwrite the worse evals?

crocogoat commented 4 years ago

Maybe there's something to take from these tuning tests too. Tuned at STC, and at STC they pass, but at LTC they fail quickly:

Tuning: https://tests.stockfishchess.org/tests/view/5f3105c190816720665374ca

https://tests.stockfishchess.org/tests/view/5f3396e066a893ef3a02638c https://tests.stockfishchess.org/tests/view/5f33bb9366a893ef3a0263ae

https://tests.stockfishchess.org/tests/view/5f3396fe66a893ef3a02638f https://tests.stockfishchess.org/tests/view/5f33a9db66a893ef3a02639a

So maybe NNUE is much more sensitive to search values depending on TC? Or on the number of threads used? Almost certainly 1-thread has gained much more Elo since the NNUE introduction, as seen in the regression tests:

1-Thread: https://tests.stockfishchess.org/tests/view/5f2c5a25b3ebe5cbfee85b8e https://tests.stockfishchess.org/tests/view/5f32844966a893ef3a025dde

From 83.42 to 125.60 Elo

8-Thread: https://tests.stockfishchess.org/tests/view/5f2c5a52b3ebe5cbfee85b91 https://tests.stockfishchess.org/tests/view/5f32846e66a893ef3a025de0

From 86.10 to 111.76 Elo

noobpwnftw commented 4 years ago

We will unfortunately need to 1) disable HT on all workers, 2) not run NNUE on more than the available physical cores, and 3) take AVX down-clocking the whole CPU into consideration. Currently only 1-core mixed tests are "not affected", since everything balances out at scale, but multi-core tests show obvious corner-cutting from chip vendors.

vondele commented 4 years ago

@noobpwnftw that would be something to investigate first. While the eval is AVX, roughly 70% of the time still goes to search, movegen, etc., which are integer/branching dominated. It could be that these actually complement/interleave? I guess we need to measure the value of hyperthreading, not only for nps but also for playing strength.

NKONSTANTAKIS commented 4 years ago

I think a non-hybrid repeat of the same multithreaded RT would be very useful for identifying how much of its underperformance is due to hybrid and how much is due to scaling, mainly from the search being optimized for 1 core.

Regarding the latter, I'm confident that a wide multithreaded VLTC tune will do wonders, even with just 50K games.

A 3rd factor for underperformance is approaching the limit of winnable book exits. 8-thread SF11 can probably draw a considerable number of openings against any opponent, and the 8-move book has low spread.

noobpwnftw commented 4 years ago

I do not think, however, that it is worth investigating for now, as there may be many non-hardware-related optimizations and bug fixes that could produce convincing positive Elo. Not sure what the matter is with hybrid at this point; removing it is a non-regression at best so far (with the help of the full AVX512 fleet), and it is apparently less affected by HT, ARCH, and multi-core.

As for the inconsistency, it is well understood (at least by me) as due to the different worker types, and I recommended that one can do further optimizations by tuning on specific hardware. It has been done before, for example by porting to C or rewriting in assembly, while for the official branch we should just keep it as generic as possible so it works "good enough" for everyone.

So I don't understand what goal people are trying to achieve by measuring the optimal "NNUE threshold" to decimal accuracy, or by simplifying it away only to wait for someone to reintroduce it. Ego again?

NKONSTANTAKIS commented 4 years ago

@syzygy1 Right, and I'm not worried anymore. I can think of a scenario, but it will be rare, so its effect is minimal. I will expand on it just for theoretical interest.

Fortresses often lead to different types of fortresses, so if NNUE evals the initial type correctly low (due to it appearing more in training, being closer to the start position), it might avoid it in favor of a different variation, especially since sergiovieri's training is mainly outcome-based. A classical high eval will opt for more fortresses, some of them transforming into different types. It will take just a single high-eval NNUE misevaluation of a subsequent fortress for a win to be awarded.

So I'm not worried about this showing in Elo, but at the same time I believe that increased adjudication accuracy can only be healthy. The current rules might introduce a small but very bad type of biased noise in unforeseen ways, and the extra cost should be minimal.

syzygy1 commented 4 years ago

Am I wrong to think that NNUE has endianness problems? For example, ReadParameters() reads a uint32_t by reading 4 chars into the memory location of the uint32_t. The outcome is dependent on the machine's endianness.
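An endianness-independent alternative would be to assemble the value from bytes explicitly; for illustration, a sketch (not the actual NNUE reader):

```cpp
#include <cstdint>
#include <istream>

// Read a little-endian uint32_t regardless of the host machine's byte
// order, instead of copying 4 raw chars over the uint32_t's memory.
std::uint32_t read_little_endian_u32(std::istream& stream) {
    unsigned char b[4];
    stream.read(reinterpret_cast<char*>(b), 4);
    return  std::uint32_t(b[0])
          | (std::uint32_t(b[1]) << 8)
          | (std::uint32_t(b[2]) << 16)
          | (std::uint32_t(b[3]) << 24);
}
```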

vondele commented 4 years ago

@syzygy1 I've turned this in an issue, better discussed there #3007

syzygy1 commented 4 years ago

I have not fully researched this point yet, but it seems to me that skipping the NNUE eval in hybrid mode also skips the incremental-update part of the evaluation. There is some code in the NNUE part to trigger the incremental updates, such as "update_eval()", but it is never called (as far as I can tell).

Am I missing something?

vondele commented 4 years ago

It needs a careful look, which I didn't do yet, but I don't think it is a problem, since during search we don't call evaluate for each node we visit, for several reasons (most obviously the TT). This seems like it would just be another reason. How this works exactly, I haven't checked.

However, maybe this is the 100 Elo bug we've been looking for ;-)

syzygy1 commented 4 years ago

I guess you are right that the same (non?)problem should exist in non-hybrid NNUE mode.

If this problem is real, some speed gain seems possible.

The code in do_move() that sets computed_accumulation to false probably ensures that no incorrect NNUE evals are returned (e.g. by calling nnue_evaluate() on a child node when the current node was not itself nnue_evaluate()d).

syzygy1 commented 4 years ago

At the moment the accumulator is copy-make (part of StateInfo). It might make sense to make it make-unmake (part of Position).
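
To illustrate the distinction with simplified types (not the real code):

```cpp
// Simplified sketch of the two layouts.

struct Accumulator { /* first-layer sums, dirty flags, ... */ };

// Copy-make (current): the accumulator travels with the per-ply StateInfo,
// so each do_move() works on a fresh copy and undo simply discards it.
struct StateInfo {
    Accumulator accumulator;
    // ... other per-ply data (keys, rule50, captured piece, ...) ...
};

// Make-unmake (proposed): one accumulator lives in Position; do_move()
// applies an incremental update and undo_move() reverts it, avoiding the
// per-ply copy at the cost of having to roll changes back.
class Position {
    Accumulator accumulator;
    // ...
};
```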

syzygy1 commented 4 years ago

The first NNUE version of Stockfish calls RefreshAccumulator() from Transform() about 1 in 3 times. This is more than I had expected.