Record-high draw rate at fishtest

Alayan-stk-2 commented 4 years ago

The merge of NNUE has simultaneously increased the playing strength even more and effectively removed contempt.

Data from the time of SF7-SF8 RT showed about 80% draw rate at fishtest LTC. RT data also showed an upward trend for draw rate from SF version to SF version once corrected for contempt.

Right now, una's uc1 LTC sits at 86.3% draw rate. The completed uc3 LTC is at 86.5%.

It's hard to quantify how much this impacts the search for Elo gainers, but it hurts. Measures to increase the Elo spread would be helpful.

nickolasreynolds commented 4 years ago

At VVLTC, "game length wins" are a very reliable signal of superiority in both drawn and busted openings.

That is, when both engines win an opening, the better engine wins in fewer moves. When both engines draw an opening, the better engine draws in fewer moves (on defense.)

If this remains true at Fishtest time controls, incorporating that information into the statistical models should increase Elo spread, and perhaps even make the testing process much more book agnostic.

vondele commented 4 years ago

Note that the high draw rate will make SPRT tests pass faster for the same Elo gain (see https://github.com/glinscott/fishtest/issues/746). Right now, we have no problems finding Elo gainers, but it might/will change. We have book and TC at our control quite easily, but it is too early to start changing this.

sf-x commented 4 years ago

There is work to re-add contempt.

locutus2 commented 4 years ago

I am currently trying to add contempt for NNUE eval as well. Interestingly, the effect of contempt seems to be much lower than for classical eval. With my best version I get a gain of only 11 Elo against SF 8; for classical eval it was around 30 Elo. I am now testing a non-regression against master.

EDIT: perhaps we also have to change the default contempt value. Currently I have left contempt at 24.

vondele commented 4 years ago

I assume that contempt is already part of the training input of the nets, btw. Missing would also be a non-regression (or similar) test vs master. But, contempt optimization is an endless sink of resources, controversies, etc..

locutus2 commented 4 years ago

@vondele I hadn't considered that training was probably done with contempt. That would at least explain the lower effect.

locutus2 commented 4 years ago

The non-regression test seems to fail badly: https://tests.stockfishchess.org/tests/view/5f2fb3a19081672066536b90

As vondele pointed out, contempt is probably part of learning. But contempt has a static and a dynamic part. The dynamic part is not really contempt-like, because it gives greater magnitude to the eval (independent of which side the engine plays), so this should be fully learnable for the net. The static part is the real contempt part and depends on the side of the engine. The NN can't learn this because it has no input feature that provides this information. So a position gives the same eval independent of the contempt side, and this should average out.

So I will now try to use only the static part of contempt.
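The averaging argument above can be illustrated with a toy model. This is only a sketch under stated assumptions: the function names, the 24 cp static offset (borrowed from the default value mentioned earlier in the thread), and the 10% dynamic scaling are hypothetical and do not reproduce Stockfish's actual contempt code.

```python
def static_term(engine_is_white, contempt_cp=24):
    """Static contempt from White's point of view (toy model).

    The sign depends only on which colour the engine plays, not on the position.
    """
    return contempt_cp if engine_is_white else -contempt_cp


def dynamic_term(eval_cp, scale=0.1):
    """Dynamic part as described above (toy model, hypothetical scale).

    It follows the sign of the eval, so it inflates the eval's magnitude
    regardless of the engine's colour.
    """
    return scale * eval_cp


def training_label(eval_cp, engine_is_white):
    """Eval a net would be shown for one position, White's point of view."""
    return eval_cp + dynamic_term(eval_cp) + static_term(engine_is_white)


def average_label(eval_cp):
    """Target a net converges to when it sees the same position with the
    engine on either side and has no input telling the two cases apart."""
    as_white = training_label(eval_cp, engine_is_white=True)
    as_black = training_label(eval_cp, engine_is_white=False)
    return (as_white + as_black) / 2


if __name__ == "__main__":
    for eval_cp in (-200, 0, 50, 300):
        print(eval_cp, average_label(eval_cp))
```

In this toy model the static offset cancels in the average while the dynamic scaling survives, which is the behaviour the comment predicts for a net without a "contempt side" input.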

Alayan-stk-2 commented 4 years ago

Note that the high draw rate will make SPRT tests pass faster for the same Elo gain

Yes absolutely, but high draw rate is not only associated with strong play, it's also associated with safe play. When the engine goes for simpler positions, it's less likely that the better version will be able to outplay the weaker one.

gekkehenker commented 4 years ago

I assume that contempt is already part of the training input of the nets, btw. Missing would also be a non-regression (or similar) test vs master. But, contempt optimization is an endless sink of resources, controversies, etc..

The original net was trained on SF with contempt.

But I'm not sure training on contempt adds a natural contempt in the net. If anything I'd assume it plays in an anti-contempt way.

The other problem is that the effect of contempt in depth 8-12 data is very hard (if not impossible) to measure.

Sergio's nets are being trained by reinforcement learning on self-play with low lambda (which takes the game result into account more than the eval). Even if the original net had traces of contempt in it, that must be all gone by now.

If you want a net with built-in contempt, the training data would have to show a benefit of playing with contempt.

mstembera commented 4 years ago

You could play SF w/ contempt vs a slightly time handicapped SF to generate data that would show the benefits of contempt.

MJZ1977 commented 4 years ago

Right now draw rates at LTC are close to 90%. That seems too high to me, in the sense that LTC tests can be rejected quickly after a few game losses. Perhaps correcting this would not change anything in the end, but it is a little strange after a long period of draw rates at 75%.

noobpwnftw commented 4 years ago

We just fast-forwarded 2 and a half years worth of "normal" progression so I think it is normal but someone might want to redo the math about bounds, probably book and so on.

dorzechowski commented 4 years ago

The old 2moves_v1 book, while having similar resolution, had much higher bias than the current default one the last time we checked. Maybe we could check what draw rate it produces? It would be a useful data point when thinking about constructing future books for development.

vondele commented 4 years ago

Math concerning SPRT takes drawrate into account. It is expected that high draw rate leads to fewer games needed for the same confidence with SPRT: https://github.com/glinscott/fishtest/issues/746
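A rough way to see why fewer games are needed: for a fixed score difference, the number of games required for a given confidence scales with the per-game variance of the result, and that variance shrinks as the draw ratio grows. The sketch below is not the SPRT computation fishtest uses; it is a simplified trinomial estimate, and the function names are made up for illustration.

```python
def per_game_variance(draw_ratio, score=0.5):
    """Variance of a single game result (win=1, draw=0.5, loss=0)
    for a given draw ratio and expected score, trinomial model."""
    win = score - draw_ratio / 2
    loss = 1.0 - draw_ratio - win
    if win < 0 or loss < 0:
        raise ValueError("draw ratio and score are inconsistent")
    return (win * (1.0 - score) ** 2
            + loss * score ** 2
            + draw_ratio * (0.5 - score) ** 2)


def relative_games_needed(draw_ratio, base_draw_ratio, score=0.5):
    """Games needed relative to a baseline draw ratio, assuming the same
    score difference has to be resolved with the same confidence."""
    return (per_game_variance(draw_ratio, score)
            / per_game_variance(base_draw_ratio, score))


if __name__ == "__main__":
    # Draw ratios quoted in this thread: about 80% earlier, 86.5% now.
    print(relative_games_needed(0.865, 0.80))
```

Near a 50% score the variance is (1 - draw ratio) / 4, so in this simplified picture going from 80% to 86.5% draws cuts the games needed to roughly two thirds.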

Book hasn't been investigated, so far.

MichaelB7 commented 4 years ago

I agree with @vondele. The 30,000-position opening book (link below) will reduce the draw rate dramatically, but its positions are heavily one-sided. It might be useful in a test environment to see if we could gain some efficiency with a dramatically unbalanced book. I am skeptical, but somewhat interested to see how these openings would do in the SF environment. My own testing shows draw rates of around 53% with nearly equal engines, and far lower when the engines are not nearly equal. https://www.sp-cc.de/files/5mvs_30k_analyzed.zip

noobpwnftw commented 4 years ago

The recipe for making a good test book is not well understood. However, I have made a new 3-move test book using the same method as the previous one, but with more developed data. https://www.chessdb.cn/downloads/3moves_v2.zip Note that its performance characteristics and properties are not yet tested, so I am not sure if it is even better.

vondele commented 4 years ago

I've locally run some tests with 127 threads @ 30s+0.3s and the draw rate is >95% (on noob_3moves.epd). It is quite amazing.

zz4032 commented 4 years ago

I can offer my book collection, generated with SF and sorted by cp values. The one-sided books "cp80-112" and "cp113-150" should be a good choice for Fishtest conditions. book_3moves_cp-sorted.zip

vondele commented 4 years ago

@noobpwnftw and @zz4032 and @MichaelB7 Thanks for the book offers. My experience with one-sided books is that they reduce the draw rate but decrease Elo separation. Obviously, there are different ways to generate those books and they could lead to different results. Ideally you can do a test where two SF versions (e.g. SF11 and SFdev) run a match on both the new book and noob_3moves.epd, and compare the Elo difference.

Meanwhile I did some testing with many threads and the results are quite interesting as well. I was in part trying to answer the hyperthreading question, but the results are relevant here too. Running at 30s+0.3 on 2x EPYC 7742 (128 cores, 256 threads, avx2 builds), I ran 3 matches: 2 with different thread counts (127 vs 254) and one with the same (254 vs 254), all with master using either classical or NNUE eval and the normal noob_3moves.epd book:

classical vs classical (note threading difference)

Score of 127 vs 254: 7 - 14 - 110  [0.473] 131
Elo difference: -18.6 +/- 23.7, LOS: 6.3 %, DrawRatio: 84.0 %

nnue vs nnue (note threading difference)

Score of 127 vs 254: 2 - 5 - 175  [0.492] 182
Elo difference: -5.7 +/- 9.9, LOS: 12.8 %, DrawRatio: 96.2 %

So there is a benefit from hyperthreading. The error bars are large, and the results are not inconsistent with the benefit being the same for classical and NNUE (see also https://tests.stockfishchess.org/tests/view/5f34a3039e5f2effc089da81). The relevant part here is the DrawRatio at 96% (despite the short 30s TC).

nnue vs classical (same number of threads, 254)

Score of nnue vs classical: 40 - 0 - 111  [0.632] 151
Elo difference: 94.3 +/- 26.3, LOS: 100.0 %, DrawRatio: 73.5 %

amazingly, not a single loss by nnue in 150 games.
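For readers who want to check figures like the ones above, here is a minimal sketch of how an Elo difference and draw ratio follow from a win/loss/draw count under the usual logistic Elo model. The error margin is computed as a 95% normal interval on the score, which is an assumption about how the match tool derives its "+/-" value; the function names are made up for illustration.

```python
from math import log10, sqrt


def elo_from_score(score):
    """Logistic Elo difference for an expected score strictly between 0 and 1."""
    return -400.0 * log10(1.0 / score - 1.0)


def match_summary(wins, losses, draws, z=1.959964):
    """Return (elo, margin, draw_ratio) for a match result.

    The margin is half the width of the Elo interval obtained by moving the
    score by z standard errors in each direction (assumed method).
    """
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    variance = (wins * (1.0 - score) ** 2
                + losses * score ** 2
                + draws * (0.5 - score) ** 2) / games
    half_width = z * sqrt(variance / games)
    elo = elo_from_score(score)
    margin = (elo_from_score(score + half_width)
              - elo_from_score(score - half_width)) / 2.0
    return elo, margin, draws / games


if __name__ == "__main__":
    results = {
        "classical 127 vs 254": (7, 14, 110),
        "nnue 127 vs 254": (2, 5, 175),
        "nnue vs classical": (40, 0, 111),
    }
    for name, (w, l, d) in results.items():
        elo, margin, draw_ratio = match_summary(w, l, d)
        print(f"{name}: {elo:+.1f} +/- {margin:.1f}, draws {draw_ratio:.1%}")
```

Applied to the three results quoted above, the Elo differences and draw ratios reproduce the reported values to one decimal, and the margins land close to the reported "+/-" figures.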

ssj100 commented 4 years ago

@vondele is that testing with "nnue" with the latest hybrid binary?

vondele commented 4 years ago

of course, would I run forks ? ;-)

ssj100 commented 4 years ago

@vondele Wow, and against the latest "classical" SF12dev? Not SF11?

vondele commented 4 years ago

yes, all with master using either classical or NNUE eval.

ssj100 commented 4 years ago

Right, but as we know, classical master likely has regressed compared to pre-NNUE master if I understood this correctly: https://github.com/official-stockfish/Stockfish/issues/2981 "the classical version of stockfish has lost ~7 elo since the introduction of NNUE due to search changes. This shows that it's infeasible to maintain two versions at the same time using the current developement model."

Regardless, that is still very impressive of SF NNUE hybrid beating it by over 90 elo and unbeaten over 150-games at hyper-bullet conditions! Granted, 254-threads, but still...

However, I actually thought what people were debating was whether latest SF NNUE hybrid was weaker against SF NNUE pure, particularly with SMP. And if not, is the hybrid actually the best way forward? That is, is using hybrid this early the best way to reach the highest "elo ceiling"? Several people think not, and some think it's very obviously not the way.

Your thoughts? @vondele

vondele commented 4 years ago

hybrid seems like a great topic for bikeshedding

It will just evolve like anything else, if we find it is an Elo loss at testable conditions in general, there will be no hesitation to remove it.

I'm still hoping for the first net that exploits this successfully, i.e. only trains on the positions it will actually be called for. One can not imagine the kind of weird positions eval gets called on, and anything will do fine to just reject those. Better focus the net capacity on what matters.

dorzechowski commented 4 years ago

The aforementioned 7 Elo regression is a red herring. It was measured at STC on 1 core, and in the conditions above it has most likely shrunk to nearly zero, without any significance for the +100 Elo result.

With over 96% draw rate we enter draughts territory from 30 years ago. :-)

syzygy1 commented 4 years ago

As some may have already commented, a high draw rate is PERFECT for fishtest. The last thing you want is replacing draws with pairs of wins and losses.

sf-x commented 4 years ago

As some may have already commented, a high draw rate is PERFECT for fishtest. The last thing you want is replacing draws with pairs of wins and losses.

Fishtest now counts them the same as 2 draws, I believe.

syzygy1 commented 4 years ago

As some may have already commented, a high draw rate is PERFECT for fishtest. The last thing you want is replacing draws with pairs of wins and losses.

Fishtest now counts them the same as 2 draws, I believe.

Probably not, at least not when deciding whether a patch should pass.

vondele commented 4 years ago

I think the comment refers to using pentanomial statistics in fishtest. Those are enabled and will estimate the variance a little better.
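To make the difference concrete: a pentanomial model looks at game pairs (the same opening played with colours reversed) and records the pair score, from 0 to 2 in half-point steps. A win plus a loss lands in the same bucket as two draws, which is the sense in which they are "counted the same". The sketch below compares the standard error of the mean score under both views; the counts are invented for illustration and the function names are hypothetical, not fishtest code.

```python
from math import sqrt


def trinomial_stderr(wins, losses, draws):
    """Standard error of the mean game score, treating games as independent."""
    games = wins + losses + draws
    mean = (wins + 0.5 * draws) / games
    variance = (wins * (1.0 - mean) ** 2
                + losses * mean ** 2
                + draws * (0.5 - mean) ** 2) / games
    return sqrt(variance / games)


def pentanomial_stderr(pair_counts):
    """Standard error of the mean game score from game-pair outcomes.

    pair_counts[i] is the number of pairs with total score i / 2,
    for i = 0..4 (i.e. pair scores 0, 0.5, 1, 1.5, 2).
    """
    pairs = sum(pair_counts)
    mean_pair = sum(c * i / 2 for i, c in enumerate(pair_counts)) / pairs
    var_pair = sum(c * (i / 2 - mean_pair) ** 2
                   for i, c in enumerate(pair_counts)) / pairs
    # The per-game mean is half the pair mean, so its standard error is halved.
    return sqrt(var_pair / pairs) / 2.0


if __name__ == "__main__":
    # Invented example: 100 pairs = 80 draw+draw, 10 win+loss,
    # 6 win+draw, 4 loss+draw.
    tri = trinomial_stderr(wins=16, losses=14, draws=170)
    penta = pentanomial_stderr([0, 4, 90, 6, 0])
    print(tri, penta)
```

In this invented example the win+loss pairs inflate the trinomial variance but not the pentanomial one, so the pentanomial standard error is noticeably smaller.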

ssj100 commented 4 years ago

So the SMP RT is doing 10-15 Elo worse than the 1-core RT again. Is this reasonable to expect now?

vondele commented 4 years ago

possibly, and possibly just an artifact of the testing methodology (hyperthreading, elo compression since different effective TC, etc..), a priori there is no reason to expect them to be the same, only similar.

ssj100 commented 4 years ago

Interesting that the difference only came with the hybrid patch though. The first NNUE merge (pure NNUE) had the SMP RT at around +3 over the 1-core RT. Then since the hybrid patch (amongst other changes of course), the 1-core RT demonstrates back-to-back 10-15 elo "superiority".

noobpwnftw commented 4 years ago

So it can also be that patches made the 1-core RT better but don't scale, which seems logical: with such a high draw rate there is little room for improvement.

ssj100 commented 4 years ago

Just saying though, odd that it all scales perfectly until +80 elo over SF11, and then suddenly loses 10-15 elo after +100 or so elo...

noobpwnftw commented 4 years ago

It seems to be getting better: without the help of many AVX512 workers (which I expect would add ~5-10 Elo), the current RT is gaining more on 8-core compared to previous ones.

ssj100 commented 4 years ago

Current RT doesn't quite make sense though - 8-core appears to have gained 1-3 elo, while 1-core appears to have regressed 1-3 elo!

noobpwnftw commented 4 years ago

Previous RTs had those workers, which pushed the 1-core result higher than it normally should be but had little effect on the 8-core RT due to hardware characteristics. This further proves that the behavior is well understood. You can run an 8-core RT with "pure" NNUE; I'm quite sure it wouldn't get any better.

ssj100 commented 4 years ago

Right, that makes sense then. RT probably needs to have consistency in what workers are run to have some reliability I suppose?

noobpwnftw commented 4 years ago

Well, the discrepancy is caused by the following: 1) running NNUE comes at the cost of down-clocking the whole CPU, so when running the 1-core RT, the base SF is therefore weaker. 2) when running the 8-core RT, the base SF is much less affected, or not affected at all, by down-clocking, so the Elo difference is smaller compared to the 1-core RT. 3) the extra AVX512 workers would naturally show a higher Elo diff because they can run NNUE faster, and they are on a newer architecture that seems to suffer less from down-clocking; this explains why they can easily push 1-core results but are still less helpful in the 8-core case.

My conclusion is that the lower-than-expected 8-core RT results are entirely due to such effects, which are in fact very consistent across multiple RT runs, as described by the above scenarios, and in fact more reliable than 1-core tests, which can be influenced by too many factors.

This is unfortunately how CPUs work, and we should just prove the case and have it documented somewhere. Many people have complained about chips having such behavior (being unable to provide reliable speed or isolation given a mixed workload involving AVX).

vondele commented 4 years ago

and, btw, I augmented https://github.com/glinscott/fishtest/wiki/Fishtest-faq#why-is-the-regression-test-bad with a bullet point number 3 before I started the RT...

crocogoat commented 4 years ago

Maybe in a while such a "pure" NNUE RT could still be done to see if the result vs SF11 is about the same. That would be the easiest way to remove most doubt about any potential scaling issue.

vondele commented 4 years ago

we know that 'pure' NNUE regresses at short TC, and is at best equal at SMP LTC.

gekkehenker commented 4 years ago

W/L-ratio of SMP RT is (much) bigger than the W/L-ratio of the LTC RT.

There's a very good chance we're already getting close to the max amount of elo you can reasonably expect to get against SF11 in SMP conditions with the 8 move book. (Elo-compression as mentioned by Vondele)

jhellis3 commented 4 years ago

Given noobpwnftw's point, it makes sense to simplify hybrid away if it can pass a multi-thread non-regression test. Nobody intent on using SF for serious analysis is going to be running at 1 thread.

vondele commented 4 years ago

since the start of this thread, we've changed SPRT bounds, adjusted the scaling of TC, and improved the usage of the book (all unique positions). I'll close this thread for now.

official-stockfish / Stockfish

Record-high draw rate at fishtest #2953