official-stockfish / Stockfish

A free and strong UCI chess engine
https://stockfishchess.org/
GNU General Public License v3.0

Negative regression??? #2531

Closed adentong closed 4 years ago

adentong commented 4 years ago

-2.38 +-4.7 after about 5000 games. obviously still very early, but after 7 elo gainers this is not what I was expecting...

joergoster commented 4 years ago

Look at the number of games taken for most of the passed elo gainers. Totally unreliable, imho.

Alayan-stk-2 commented 4 years ago

SPRT elo estimates are not accurate, but the confidence of superiority is very good. It's designed to stop once there is a very high confidence it doesn't regress and a high confidence it progresses.

The regression test is far from finished, but at the very best it will be something like +1 or +2 elo. That's bad news.

NKONSTANTAKIS commented 4 years ago

It would be valuable to repeat the RT with just the elo gainers after SF11 (master minus the simplifications and non-functional patches). In the past a couple of tests of this kind showed no problem, but that doesn't mean it will stay that way forever. Now we also have different bounds and statistics.

NKONSTANTAKIS commented 4 years ago

Another interesting RT would be with the noob3 book, as the optimization we do for it might not correlate that strongly with 8-moves performance. If that's the case, it's probably a good idea to do RTs on noob3, for better tracking of progress, more accuracy and less panicking.

NKONSTANTAKIS commented 4 years ago

Another thing to examine is whether the {0, 2} LTC is too easy, allowing too many false positives through the sheer number of tests. We basically rely on a single test; the current STC adds very little LTC confidence.

Sopel97 commented 4 years ago

With current settings the probability of a <0 elo patch passing LTC is 5%, according to https://tests.stockfishchess.org/html/SPRTcalculator.html?elo-0=0&elo-1=2&draw-ratio=0.61&rms-bias=0. NKONSTANTAKIS made a good point about the book being different. If it is indeed the book making the difference, then it's gonna be fun...
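
That 5% figure can be sanity-checked with a quick Monte Carlo sketch of the SPRT. This is a simplified trinomial model (fixed draw ratio, independent games), not fishtest's actual pentanomial GSPRT, and the function name and defaults are illustrative:

```python
import math
import random

def sprt_pass_prob(true_elo, elo0=0.0, elo1=2.0, draw_ratio=0.61,
                   alpha=0.05, beta=0.05, trials=1000,
                   max_games=200_000, seed=42):
    """Monte Carlo estimate of the chance that a patch with `true_elo`
    passes an SPRT with bounds [elo0, elo1] (simplified trinomial model)."""
    rng = random.Random(seed)
    lower = math.log(beta / (1.0 - alpha))   # approx. -2.94 for 0.05/0.05
    upper = math.log((1.0 - beta) / alpha)   # approx. +2.94

    def probs(elo):
        s = 1.0 / (1.0 + 10.0 ** (-elo / 400.0))  # expected score
        return s - draw_ratio / 2.0, 1.0 - s - draw_ratio / 2.0

    w0, l0 = probs(elo0)
    w1, l1 = probs(elo1)
    # per-game LLR increments for win / draw / loss (the draw term cancels
    # because the draw ratio is the same under both hypotheses)
    inc = (math.log(w1 / w0), 0.0, math.log(l1 / l0))
    w, _ = probs(true_elo)

    passes = 0
    for _ in range(trials):
        llr, games = 0.0, 0
        while lower < llr < upper and games < max_games:
            u = rng.random()
            outcome = 0 if u < w else (2 if u >= w + draw_ratio else 1)
            llr += inc[outcome]
            games += 1
        passes += llr >= upper
    return passes / trials

# quick demo with a deliberately wide upper bound so it converges fast;
# the real [0, 2] LTC needs tens of thousands of games per run
print(sprt_pass_prob(0.0, elo1=10.0, trials=300))
```

With the thread's settings ([0, 2], 61% draws) the pass rate for a 0-elo patch comes out near the designed alpha of 5%, matching the calculator.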

Rocky640 commented 4 years ago

If we only look at the 15 tests between SF11 and 200128, three things stand out, and they all happened in the five most recently committed tests:

a) bench jumped from 4725546 (test no 14) to 5545845 (test no 15) on 200128

b) 2 LTC finished before 10000 games (test no 12 and test no 14)

c) http://tests.stockfishchess.org/html/live_elo.html?5e30ab0bab2d69d58394fdf1 (test no 11) is currently struggling, although it already passed an STC regression test http://tests.stockfishchess.org/html/live_elo.html?5e30ab0bab2d69d58394fdf1

I therefore suggest to test SF11 against test no 10, which is https://github.com/official-stockfish/Stockfish/commit/6d0eabd5fe2961551477820ab7619e2c31e01ffd

and then we can continue bisecting up or down from there.

Rocky640 commented 4 years ago

https://www.sp-cc.de/ tested Stockfish 11 official 200118 playing against a pool of AB engines and compared with Stockfish 200107 playing against same pool of engines, at 180+1

He says "AB-testrun of Stockfish 11 finished. Sadly a clear regression to Stockfish 200107 (-7 Elo)"

This is only one result, but there is a puzzling coincidence: the first commits tested with the new pentanomial model, the 3-moves book and the new SPRT bounds were made on January 7. Maybe the problem started there?

Alayan-stk-2 commented 4 years ago

The error bars with 5K games against significantly weaker opponents do not allow us to tell with a reasonable degree of confidence that SF11 would be inferior to SF 200107.

31m059 commented 4 years ago

@Rocky640 That is possible, but certainly would be surprising...our own regression testing showed a slight single-threaded gain from January 7 to SF11. While that is close-to-within error bars, we should have detected a 7-Elo regression...

adentong commented 4 years ago

Should we temporarily disallow any new patch submissions to fishtest and stop all currently running ones, so we can focus our attention on this issue, since we'd probably want to run a bunch of regression tests to pinpoint the problem? Seems like this could have some pretty big consequences on the future of SF's improvement.

ddugovic commented 4 years ago

The fishtest server has a priority field (which could be increased for regression tests, if those are a priority).

vondele commented 4 years ago

let's look at this carefully, but without panic ...

first, the results so far (RT not yet finished) are consistent with NCM https://nextchessmove.com/dev-builds where no single patch stands out.

second, as a next step I will test with the book which is now being used for development, out of curiosity to see if this matters.

31m059 commented 4 years ago

I don't think it will be necessary to stop all other ongoing tests, or to elevate priority...those are drastic measures. We can use throughput for a "softer" approach.

But since the new throughput calculation prefers positive LLR, @snicolet's non-regression verification test is going to lose workers as it progresses towards LLR = -2.94 (the more informative result, since failure may mean reversion of that commit). Therefore, I've artificially raised its throughput to 150% for now. Hopefully, this represents a good compromise...

vondele commented 4 years ago

@31m059 no need... failure of that test will not necessarily imply revert. First, I think starting those tests was premature, second, let's not forget the statistical nature of fishtest.

31m059 commented 4 years ago

@vondele My apologies, I've now restored the default throughput. I agree with your approach of exercising restraint here.

Vizvezdenec commented 4 years ago

okay, I didn't notice this topic, I will repost there :)

Vizvezdenec commented 4 years ago

I think it's about time to respin this discussion after a quite disappointing regression test (it's not finished, but it's quite obvious that it most likely won't finish positive). So, we made the STC bounds really loose, and now the probability of a negative patch passing both stages became 18% * 5%, so about 0.9%. It seems that's too much - 7 elo gainers resulted in what seems to be a slightly negative elo gain. I guess we should do something about this.

The most obvious step is that we should probably do simplification attempts for all 7 passed patches that made it into master since the SF11 release, probably just at LTC. It seems that 0.9% is too high a chance for a negative patch to pass.

Since we want loose STC bounds to give more patches a shot at LTC, we should slightly tighten the LTC bounds themselves, because a lot of negative patches go to LTC and each of them has a decent chance to pass. My proposal is to change the LTC SPRT bounds to {0.25, 2.25} or {0.5, 2.5} - the second is closer to the 0.25% regression chance we had forever, the first would allow more patches to pass. I think that the stronger the engine gets, the stricter the non-regression requirement should be (yes, it's sad, because fewer patches will pass), because the percentage of passing patches becomes lower and lower, so more and more patches are tested and more and more of them lie in the "slightly negative" zone.

Maybe we can also slightly shift the lower bound of STC. I think a good compromise overall could be STC {-0.5, 3}, LTC {0.25, 2.25}. The chance for a negative patch would be something like 0.26% - more or less what we used for years - we would have slightly fewer LTCs (which is, imho, a good thing; nowadays we run endless LTCs, most of them not even close to passing), the overall game count wouldn't increase much, and the STC-LTC correlation would be slightly more reliable. I guess that's all from me for now; your opinion is really appreciated :)
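
The 18% * 5% arithmetic above can be made explicit. The per-stage rates are the ones quoted in the discussion, and the independence of STC and LTC outcomes is an assumption:

```python
def combined_pass_rate(stc_rate, ltc_rate):
    """Probability that a patch passes both stages, assuming the STC and
    LTC results are independent: the rates simply multiply."""
    return stc_rate * ltc_rate

# rates quoted above for a non-positive patch: 18% at STC, 5% at LTC
print(f"{combined_pass_rate(0.18, 0.05):.2%}")  # -> 0.90%
```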

pb00068 commented 4 years ago

@Vizvezdenec I appreciate your proposition, but IMO it's too early to take measures. Let us first try to understand (and measure) what went wrong.

Vizvezdenec commented 4 years ago

Well, the first thing that should be done is trying to simplify the passed tests with {-1.5, 0.5}, imho.

xoto10 commented 4 years ago

Trying to simplify the passed patches (but maybe not all at once!) seems like a good idea.

Alternatively, with 15 patches since sf11 how about testing the first 8 patches for progression / regression? If there is a single patch causing a problem this would be a first step towards identifying it.

NKONSTANTAKIS commented 4 years ago

The regression cannot possibly have come from the LTC gainers. Even if we hit the 5% "jackpot" on 3 of the 7, the rest would cover up the tiny elo loss, landing us at +1-2 at worst.

Nevertheless, STC -0.5,3 is definitely agreed, plus 15" for extra correlation. LTC may then be fine as is (viz's suggestion sounds good too, but it might make greens too rare), so STC is the first needed step; then reassess.

More on topic, simplifications with the current -1.5,0.5 feel like they pass more easily than before, and one has to question their value in general. Imo if they are not removing serious code, it's not worth it. For example, we now have a visible OCB weakness (from CCC observation), because the cycle of a small regression followed by a new, better formula hasn't succeeded (yet).

A large number of non-functional patches, untested, holds the risk of random unforeseen side effects, as seems to be the case here. It's probably wise to do "better safe than sorry" tests like sn often does, as neither humans nor compilers can be 100% trusted.

Vizvezdenec commented 4 years ago

Oh, again the simplifications are the culprit, of course. I've heard it a lot of times and have never seen a test of 20 simplifications together come out negative in elo - it was done multiple times and NEVER showed a result worse than -1 elo. But "elo-gaining tests" actually being negative is nothing unheard of even with the old bounds, and it's even more probable with the current ones. Just saying that with the number of patches we test and the percentage of patches passing, we should actually decrease the probability that a passing patch is negative. Nowadays the 0.25% pass threshold sits at -0.3 elo and the 0.1% threshold at -0.5 elo; with the previous bounds we had this at 0 elo (0.5, 0.5 and 0.5, 4.5 and 0, 3.5 respectively). With our goal being to be better at 60+0.6, and since we mostly rely on this test as the indication of a patch being good, imho it's pretty logical to make the lower bound > 0 so that LTC regressions have a really low chance to pass, even at the cost of fewer patches passing in general.

owldw21 commented 4 years ago

imo starting an RT with only simplifications wouldn't hurt too much, and could determine whether the SPRT bounds are too loose.

vdbergh commented 4 years ago

I think we should wait for the tests to finish before discussing anything. Note that from the test by @vondele it may well follow that we are just witnessing a case of selection bias. This could be a confirmation of the existence of such a phenomenon (on which I have speculated in many posts).

Vizvezdenec commented 4 years ago

Well, opening dependency can for sure be a thing. But tbh it's within error bounds. Also I need to say - if the LTC on the 2-moves book ends at 0, it's still pretty bad.

NKONSTANTAKIS commented 4 years ago

Removing too many simplifications at once is an extremely noisy test, as in between a lot of elo gainers were adopted on the basis of those simplifications. Turning them all back on at once would distort the elo gainers' performance unpredictably. 20 is a ridiculous number, a useless test. I don't understand the rationale for suspecting just the 1% chance of elo gainers regressing while at the same time -1.5,0.5 means a -0.5 elo test is more or less a coin flip, so 25% to pass both and be merged.

Edit: An interesting use of a new master - 20 simplifications vs master test would be as a comparison against an old master + 20 simplifications. That way one could measure their relative dependency over that period.

NKONSTANTAKIS commented 4 years ago

Error 2 elo, book 2 elo, linux patch 2 elo, simpl 2 elo; sum 8 elo, hence -3 instead of +5. Easy.

  1. Is unavoidable.
  2. Easily eliminated with a same-book RT.
  3. Requires constant caution for untested changes.
  4. Elo is much harder to get now than it was a decade ago when -3,1 was established. We can adapt our price-to-simplification ratio to current needs, for example with -1.25,0.75.

adentong commented 4 years ago

So from the regression tests it doesn't look like the book is the problem. We also know that the linux large page commit failed to pass non-regression.

adentong commented 4 years ago

Oof. Viz's tests 1, 5, 6 have already passed simplification, 3 is about to pass, 4 and 7 are still neutral, and 2 looks like the only one that's resisting simplification.

31m059 commented 4 years ago

@adentong I have some concerns about simplification 6, since it's not really a simplification (it reverts a complexity-neutral parameter tweak for TrappedRook). But overall you're right, this pattern is striking.

Vizvezdenec commented 4 years ago

well I tested all "elo-gaining" patches since sf11 release regardless of them being param tweak or not. Next thing I want to do is to squash them into a single commit and test it on fixed 60k games against master on 8 moves book. Will do so closer to 17-18 Moscow time since I'm at work now :)

31m059 commented 4 years ago

@Vizvezdenec Which simplifications do you plan to combine for the fixed-games test? Just the ones that pass STC? (That would be my recommendation.)

You might also just run them as a LTC SPRT with simplification bounds...

Vizvezdenec commented 4 years ago

obviously the ones that pass STC :) Yeah, but I want to see some estimate of the elo at fixed games.

31m059 commented 4 years ago

Initial regression test is complete:

ELO: -2.47 +-1.3 (95%) LOS: 0.0% Total: 60000 W: 7490 L: 7917 D: 44593 Ptnml(0-2): 330, 5657, 18424, 5285, 303 https://tests.stockfishchess.org/tests/view/5e307251ab2d69d58394fdb9
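
For reference, the quoted Elo and error bar can be reproduced from the W/L/D counts. This sketch uses the plain trinomial variance, so its error bar comes out slightly wider than the pentanomial +-1.3 that fishtest reports (the pentanomial model accounts for correlated game pairs):

```python
import math

def elo_and_error(wins, losses, draws, z=1.96):
    """Logistic Elo difference and a ~95% error bar from a W/L/D record."""
    n = wins + losses + draws
    w, d = wins / n, draws / n
    s = w + 0.5 * d                        # score rate
    elo = 400.0 * math.log10(s / (1.0 - s))
    var = w + 0.25 * d - s * s             # per-game score variance
    se = math.sqrt(var / n)                # standard error of the mean score
    # convert the score error to Elo via the derivative dElo/ds
    delo_ds = 400.0 / math.log(10.0) / (s * (1.0 - s))
    return elo, z * se * delo_ds

# the regression result quoted above: W 7490, L 7917, D 44593 (60000 games)
elo, err = elo_and_error(7490, 7917, 44593)
print(f"{elo:.2f} +- {err:.2f}")  # -> -2.47 +- 1.41
```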

xoto10 commented 4 years ago

So 5 simplifications by Viz have passed STC; are we going to run LTCs for them individually or combine them somehow?

It seems as if something has changed and patches are passing STC and LTC too easily, allowing elo-losing changes into the codebase. Or is the problem that these simplification tests are passing too easily? Tricky.

vondele commented 4 years ago

@Vizvezdenec's test is interesting IMO. 5 out of 7 'elo gainers' can be simplified away without regression (at least at STC); for quite a few of them, the elo estimate for the removal is positive... I think these can be individually rescheduled as LTC simplifications.

We'll have to reflect a little on what that means for our testing procedure.

Vizvezdenec commented 4 years ago

I think that this test will tell us more. http://tests.stockfishchess.org/tests/view/5e32b470ec661e2e6a340d66 If it also lands in firmly negative zone I think we should definitely rethink our SPRT bounds. :)

vdbergh commented 4 years ago

At least this one failed https://github.com/Vizvezdenec/Stockfish/compare/a910ba7...2907081 . This was what appeared to be an unambiguous Elo gainer: http://tests.stockfishchess.org/html/live_elo.html?5e2f767bab2d69d58394fd04.

Elo: 7.71 [3.37,12.05] (LOS 99.974%).

Vizvezdenec commented 4 years ago

2 tests failing simplification out of 7 is nothing to be proud of, imho...

vondele commented 4 years ago

@Vizvezdenec I started LTC simplifications

Vizvezdenec commented 4 years ago

okay I'm not at home now so can't do it myself anyway :)

xoto10 commented 4 years ago

If we can regularly pass and then simplify away the same test, that suggests our elo gainer and simplification bounds are too close together. I guess the poor regression test results suggest it's the elo gainer bounds that need to get tougher.

Vizvezdenec commented 4 years ago

Let's see how all tests finish, but if it's the case... I already proposed some solutions :P

snicolet commented 4 years ago

and @noobpwnftw has added a heap of new machines to the framework now! Thanks a lot! :-)

Vizvezdenec commented 4 years ago

thx @noobpwnftw Now we can only wait for data to converge and then make something out of it...

locutus2 commented 4 years ago

I repost here my comment from https://github.com/snicolet/Stockfish/commit/b64a9bba9a4f5466c5b4795527684170fd2164a7:

It's difficult. That some tests are false positives is normal, but that most tests fall into this category seems really odd and is not likely. So we should look for a common source. Perhaps the implementation of the new pentanomial model has some bug. What I have seen is that for few games the error bar seems to be only about half of what simple statistics (2 * standardDev) would give. I don't know if this is normal. Perhaps we should also recheck the commits made under the new model before SF 11.

Someone proposed to change the SPRT bounds, but perhaps it's better to decrease the probability of false positives/negatives by reducing alpha and beta.

EDIT: I had not considered the draw ratio, which makes the error bars smaller. So I get around 20-30% more deviation for the first few hundred games than was displayed at fishtest. But I think that could be explained by the better approximation of the pentanomial model.
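
Reducing alpha and beta, as suggested, widens the SPRT stopping thresholds rather than moving the elo bounds. A minimal sketch of the Wald bounds (the -2.94 mentioned earlier in the thread corresponds to alpha = beta = 0.05):

```python
import math

def llr_bounds(alpha, beta):
    """Wald SPRT stopping thresholds: accept H1 (patch passes) once the
    log-likelihood ratio reaches `upper`, accept H0 once it falls to `lower`."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper

lo, hi = llr_bounds(0.05, 0.05)
print(f"alpha=beta=0.05: ({lo:.2f}, {hi:.2f})")   # -> (-2.94, 2.94)
lo2, hi2 = llr_bounds(0.01, 0.01)
print(f"alpha=beta=0.01: ({lo2:.2f}, {hi2:.2f})")  # tighter error rates widen the bounds
```

Tighter error rates mean longer tests on average, which is the trade-off against changing the elo bounds themselves.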

NKONSTANTAKIS commented 4 years ago

It's a bit of both: elo gainers pass more easily than before, and simplifications pass more easily too. As they are almost the reverse of each other (-2,0 is too close to -1.5,0.5!), not only does any elo gainer have a high chance to pass -1.5,0.5, but any simplification also has a high chance to pass 0,2 when reverted.

So besides testing all elo gainers with simplification bounds, I propose testing the revert of all (or some, like the OCB one) simplifications since -1.5,0.5 with elo-gaining bounds.

It's evident that the change to logistic elo altered the balance and lowered confidence (as all tests resolve much faster).

Vizvezdenec commented 4 years ago

well, so far it looks like the combo of 7 "elo gainers" is firmly negative vs SF 11. So stop blaming simplifications for it, at least for now. We tried to reintroduce the OCB scale factor with @locutus2, but it all failed LTC while passing STC a lot of times - there seems to be basically no elo there.

NKONSTANTAKIS commented 4 years ago

How about lowering contempt to reduce jitter in the statistical model? Beforehand we needed extra resolution; now it seems we need more stability of the signal.

Not by a lot, just enough to fulfill the old rule of not regressing vs ct=0. This would also result in more accurate optimization. Optimizing everything for max gain in ct24 vs ct24 self-play inevitably introduces some bias, maybe too much.