vondele closed this issue 4 years ago
As NNUE makes good use of intrinsics, optimize the availability of SIMD/vector extensions so that appropriate intrinsics are used where possible. This may mean that Fishtest is changed to identify the available instructions for each worker.
Apply lazy threshold before considering NNUE evaluation.
Similar, but I think the value used for the lazy threshold in evaluate is more meaningful, and possibly in rare cases the patch above would use the NNUE eval while the normal eval would return after the first lazy threshold. So this patch could maybe be improved with:

value = evaluate() up to the first lazy threshold
if value > t0 then return value
if value > t1 then return evaluate()
return evalNNUE()
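That flow could be sketched roughly as follows. This is only an illustration, assuming placeholder names and made-up threshold values: `t0`, `t1`, `classical_eval_lazy`, `classical_eval_full` and `nnue_eval` are not Stockfish's actual API.

```cpp
#include <cstdlib>  // std::abs

// Hypothetical thresholds (centipawn-like units), NOT Stockfish's real values.
constexpr int t0 = 1500;  // first lazy threshold
constexpr int t1 = 500;   // unbalanced positions stay with classical eval

// Placeholder evaluators standing in for Stockfish's classical and NNUE evals.
int classical_eval_lazy(int material_balance) { return material_balance; }
int classical_eval_full(int material_balance) { return material_balance + 10; }
int nnue_eval(int material_balance)           { return material_balance + 5; }

// Sketch of the proposed hybrid flow: bail out early on the first lazy
// threshold, fall back to the full classical eval for clearly unbalanced
// positions, and use NNUE only for the remaining (balanced) positions.
int hybrid_eval(int material_balance) {
    int value = classical_eval_lazy(material_balance);
    if (std::abs(value) > t0)
        return value;                                   // first lazy threshold
    if (std::abs(value) > t1)
        return classical_eval_full(material_balance);   // classical eval
    return nnue_eval(material_balance);                 // NNUE for the rest
}
```

The point of the two-threshold arrangement is that the cheap lazy value gates the expensive evaluations, so NNUE is never invoked for positions the classical eval would have cut off anyway.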
@vondele. Yes. Fishtest to collect and show architecture.
I filed an issue about collecting and showing this: https://github.com/glinscott/fishtest/issues/743
Is there an easy way to separate bishop piece types into light square and dark square bishops? The NN might benefit from having OCB information.
This is exclusively a training issue. The proper place is therefore https://github.com/nodchip/Stockfish/
During upload it might be a good idea to state explicitly that only the author of the net is allowed to upload it. (If it doesn't already fall under a CC0 license.)
Add an option to enable or disable lazy eval / hybrid eval.
I've seen it gain Elo under Fishtest conditions. But quoting Dkappe: "it performs worse on nets that don’t conform to stockfish eval, which are all of mine. It also defeats the purpose of training nets that play like other engines."
@gekkehenker the upload page already mentions that.... but yes, we might need a tick box. (Edit: issue https://github.com/glinscott/fishtest/issues/744)
concerning the extra option, no I don't think we want to do that.
Is there an easy way to separate bishop piece types into light square and dark square bishops? The NN might benefit from having OCB information.
This is exclusively a training issue. The proper place is therefore https://github.com/nodchip/Stockfish/
Thanks, but my question was more about the non-training side. The training code seems generic enough that it won’t need to be modified for new piece types. My concern is that it seems hard to split bishops into two piece types without breaking classical eval or movegen.
The training code seems generic enough that it won’t need to be modified for new piece types.
On the contrary, it's the NN which is generic enough as to need no special treatment. The training, OTOH, does, if you want it to treat the differently-colored bishops differently (since, at present, it is heavily biased towards colorblindness).
concerning the extra option, no I don't think we want to do that.
I personally think this is a disappointing decision. Supporting nets generated in all sorts of ways other than Stockfish evals and searches would make Stockfish more valuable as a research tool, and I think it would ultimately also help progress.
@romstad well, that decision is not set in stone. Furthermore, how the nets are generated is of course left fully open (i.e. trained on SF evals or otherwise). However, what I would be reluctant to have is various different ways to 'integrate' the net in SF, i.e. I'd prefer to have one approach (which could be hybrid like we have now, or pure, or additive, etc.), which we pick based on playing strength. So, people should be free to test any (reasonable) net, integrated in any (reasonable) way in SF, and if it passes SPRT we'll be happy to adjust. Having too many choices early on will make it difficult to change in the future (i.e. do we maintain/support N different variants?). At least that's my 2c right now.
concerning the extra option, no I don't think we want to do that.
I don't know what would be the best solution here because on one hand, we don't want to multiply UCI options and complicate code but on the other hand, "pure NNUE" could be useful for analysis. In gameplay new patch is stronger but at the same time it is much more blind to spectacular sacrifices because the classic eval kicks in and the move is pruned into oblivion.
For example: r1b1qr1k/2p3pp/4p3/1pb1PpN1/pn3N1P/8/PPP1QPP1/2KR3R w - - 0 1 Here Rd8 actually wins but Stockfish won't realize it now.
Of course changing this because someone might produce a net that is not compatible with Stockfish eval range is not a good idea. But keeping NNUE in Stockfish relevant for analysis is something to consider IMO.
If Stockfish 12 is not an imminent release, could some official statement (blog.stockfishchess.org) be released to explain what has happened with the NNUE merge over the last few days, and maybe outline how NNUE works and where development is at? A lot of people are using different flavors/branches of SF, and that is leading to some inaccuracies in communication.
@Viceroy-Sam we haven't been communicating much beyond the releases. I think the blog, twitter etc, are @daylen. Note also we're still ironing out the wrinkles, and progress is being made very quickly. I'd expect another 10-20 Elo by the end of the weekend.
@Viceroy-Sam @vondele Yeah, I'm happy to make a blog post/twitter with the contents of this commit message (which I think is a pretty good summary!) https://github.com/official-stockfish/Stockfish/commit/84f3e867903f62480c33243dd0ecbffd342796fc
In gameplay new patch is stronger but at the same time it is much more blind to spectacular sacrifices because the classic eval kicks in and the move is pruned into oblivion.
For example: r1b1qr1k/2p3pp/4p3/1pb1PpN1/pn3N1P/8/PPP1QPP1/2KR3R w - - 0 1 Here Rd8 actually wins but Stockfish won't realize it now.
Yes, I've noticed many positions where "pure" SF NNUE found quickly that it has become totally blind to since the hybrid patch. It's a little disappointing for me, but then "elo gain is elo gain". The way I see it, it means with the hybrid patch, SF finds moves faster in many other positions, which is "equally" important from an objective standpoint. Furthermore, if there is an objective overall elo gain (which we are confident is true given it passed SPRT bounds on fishtest), that by definition means that SF (with hybrid patch) now finds the "better" move in more positions than without the hybrid patch.
Thinking about it more, it doesn't really make sense to add more UCI switches just to better analyze some positions. It's a situation a bit similar to null move pruning and many other features: in general it brings Elo but sometimes hurts. Also, whatever we do, someone will complain that some other obscure option is missing. People analyzing games are free to use a previous build without this change, or even maintain their own fork, which I'm sure a lot of them do.
If at some point pure NNUE becomes stronger, I'm sure someone will simplify it back.
I think we need a huge search retuning, especially for heuristics that use static eval. Look at fishtest now - basically everything is passing/close to passing...
If nothing else, having a UCI option named "Use NNUE" that, even when true, might in fact not use NNUE seems confusing now. It seems there are now two possible evaluations: classical and hybrid.
Maybe replace "Use NNUE" with either "Use Classical Evaluation" or "Use Hybrid Evaluation". One or the other, and simply true or false.
Your understanding of the word "Use" is incorrect. Usage doesn't imply exclusivity.
@Sopel97 "Use Classical Evaluation" is certainly unambiguous (and is true by default). Set it to false to enable whatever chimerical NNUE-hybrid is currently considered "best".
Or, maybe "Enable NNUE" instead of "Use NNUE"?
Should I even keep testing eval patches with non-NNUE settings? I would speculate that the optimal static eval (and the success of patches) is rather different for conventional SF compared to an NNUE version (where static eval is only used to evaluate rather unbalanced positions).
Yes, I propose that patches to the classical evaluation are tested with 'Use NNUE=false', the hybrid mode should not dictate, at least for now, how classical evaluation evolves.
But we somewhat use classical evaluation as a speedup. Wouldn't it be logical to pass [-3;1] once on NNUE = true to not regress there?
I think I would like to avoid that for now, NNUE is evolving so quickly that this hardly matters, but this can be revised later.
The mixed evaluation approach complicates things and has "only" brought 11 Elo (though further adjustments of the threshold value might gain a bit more on top). But wouldn't you like to keep the NNUE eval separate first, at least for a period of time, and retest the hybrid evaluation again later? With the high number of currently passed patches for NNUE, hybrid eval might be obsolete soon, and it interferes with the current rapid development of the NNUE eval. By "hybrid" I also mean the approach of many currently tested patches that try to replace part of the NNUE eval in certain types of positions with the classical eval. That looks like taking steps backwards.
If hybrid eval becomes obsolete it will be simplified.
Isn't it a problem the weak points of the NNUE Net can't be improved (or better: improvements can't be tested) when those weak points are being handed over to classical eval?
There will be a struggle between the (evolved hybrid eval / optimized search) of a specific net architecture and other architectures of potentially higher ceiling. Up to a point it can be up to the external NN trainers to come up with convincingly superior stuff, but resource-wise it's an asymmetric battle.
It would be nice for NN developers to have framework access for researching their potential. The 256halfkp selection might be like taking a 2 liter engine, and evolving it and everything around it (chassis etc) for it. A 4 liter engine will not fit and require different stuff. So once our 2 liter car is very tuned it will be very hard to justify the transition.
Obviously there is no easy solution to this local maximum architecture issue, one has to start from somewhere.
So one has to rely upon intuition on what would offer the highest long term potential and ride it all the way, a crucial decision.
Another idea is to initially offer parallel evolution of different architectures with comparable resources for a modest period and narrow down to the most promising one(s).
The other way is doable too but requires discipline: stop optimization of 256hybrid once Elo gains slow down, and transition to a different architecture using the same gear, altering it as needed. This might be more efficient if the optimal gear is similar, but it will be emotionally hard to step down a few dozen Elo and allocate effort there.
But it's also not a given that "bigger is always better". This might be true for UCT, which needs increased eval accuracy to fill the generalised gaps, but for SF it can well be that the highest ceiling is offered by a challenging synthesis of roles, as certain stuff could be done more efficiently outside of a NN.
Much like a F1 car using a mix of automatic, semi-automatic and manual stuff. If the driver is good he can do some stuff better than automatic modules (or good enough and profit from less weight).
@TonHaver right now training doesn't take hybrid into account. So if the weak point becomes better, we might be able to e.g. change the threshold or simplify it all away. I personally would be very curious to see what happens when one optimizes a net taking into account that its training data only needs to contain positions it will later encounter in the hybrid approach. I suspect this criterion cuts out many really uninteresting positions (very unbalanced, essentially won anyway).
I think we need a huge search retuning, especially for heuristics that use static eval. Look at fishtest now - basically everything is passing/close to passing...
Another run of xoto’s searchconsttune would be nice.
Looks like incredible progress already: https://tests.stockfishchess.org/tests/view/5f2f0ff49081672066536b29
@vondele By the way, it might be useful to ensure the gains persist with SMP and scaling: there are some reported issues that since the hybrid patch, scaling is hurt badly. Any chance to run an 8-core RT?
I wonder if a smaller architecture that's as fast (or faster) than the classic eval has the potential to replace it in unbalanced positions, too.
@ssj100 those reports are very likely small samples only. Right now, it would be a waste of resources to run another SMP. There will hopefully be another wave of patches, and the next RT will be SMP. More importantly, we have an important issue to fix for SMP (https://github.com/official-stockfish/Stockfish/issues/2933), since right now, we (and all other NNUE branches) probably have wrong results (likely with little impact).
Reading a previous post, I know @vondele doesn't like the idea of having an option to turn hybrid mode on/off. Reading some forum posts, people are still discussing it and would love to have that option, so they can run game analysis and/or do some testing. I think that kind of use/testing may later help make SF stronger, since they gain more knowledge along the way. Note that the majority don't know how to code or compile SF.
Perhaps we can help them by using some "hidden" options which won't be listed by the "uci" command; normal users won't know about them or use them, but anyone who really wants to can find out and use them. Thus we can help people and solve the dilemma while still keeping the policy/mainstream behavior.
I don't really understand what it is doing yet, but I noticed that rotate180() really rotates over 180 degrees (^ 0x3f), which seems rather unnatural for chess.
(This may have to do with castling rights still not being taken into account by the NN? Or am I mistaken there.)
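To illustrate the difference being discussed (square indices are the usual 0..63 with A1 = 0 and H8 = 63; the helper names here are mine, not Stockfish's):

```cpp
// Squares indexed 0..63, A1 = 0, B1 = 1, ..., H8 = 63.
// XOR with 0x3F flips both rank and file: a true 180-degree rotation,
// natural for shogi's point-symmetric starting position.
constexpr int rotate180(int sq) { return sq ^ 0x3F; }

// XOR with 0x38 flips only the rank: a vertical mirror, which matches
// the axial (left-right) symmetry of the chess starting position.
constexpr int flip_rank(int sq) { return sq ^ 0x38; }
```

For example, E1 (index 4) goes to D8 (59) under the 180-degree rotation, but to E8 (60) under the rank flip, which is why the rotation looks unnatural for chess.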
On an entirely different note, how will patches to the classic eval now be tested? If in the NNUE/hybrid mode, then I would expect basically any patch that speeds up the classic eval (by removing features) to pass (and to actually gain Elo, at the cost of classic mode).
Classical eval patches will be tested with 'NNUE false', we'd like to keep it in good shape.
Castling rights are not yet taken into account, but there are extensions of the network architecture that do take them into account. Testing those architectures is for the (near?) future, once we have some experience with the current setup and it is stable.
I haven't checked the role of rotate180 yet.
I suspect that the 180 degree rotation might come from the fact that shogi has a point symmetric starting position, so if there is no other reason for that choice I would agree that reflecting the axial symmetry of the chess starting position in that transformation would make sense.
3 LTC tests passing with the new net: lowering the hybrid threshold, adding a new term, and raising the hybrid threshold. Raising the threshold should be good at VLTC, where the speedup isn't as important. With the new net it might not have the pawn blind spot anymore?
Can someone make a branch that loads two different nets for a good base for these type of experiments?
The question is not quite clear to me.
Note that this patch https://github.com/Vizvezdenec/Stockfish/compare/add890a10b...5129aab83a almost certainly also increases the hybrid threshold on average. You can use dbg_mean_of(foo); in the code to see the average value of a term during a bench.
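The idea behind that helper is just to accumulate every value it sees and report the mean once at the end of a bench run. A minimal self-contained sketch of that idea (illustrative only, not Stockfish's actual implementation in misc.cpp):

```cpp
#include <cstdint>

// Sketch of a dbg_mean_of-style debug helper: record every value passed
// in during the search, then query the mean once the run is done.
static std::int64_t dbg_sum   = 0;
static std::int64_t dbg_count = 0;

void dbg_mean_of(std::int64_t v) {
    dbg_sum += v;
    ++dbg_count;
}

double dbg_mean() {
    return dbg_count ? double(dbg_sum) / double(dbg_count) : 0.0;
}
```

Sprinkling such a call at the point where the term is computed, then running a bench, gives a quick estimate of its average value without touching the engine's output format.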
I'll create this new issue to track new ideas and channel early questions.