official-stockfish / Stockfish

A free and strong UCI chess engine
https://stockfishchess.org/
GNU General Public License v3.0
11.6k stars 2.28k forks source link

Negative regression??? #2531

Closed adentong closed 4 years ago

adentong commented 4 years ago

-2.38 +-4.7 after about 5000 games. obviously still very early, but after 7 elo gainers this is not what I was expecting...

NKONSTANTAKIS commented 4 years ago

The combo of 7 gainers has not been tested yet, @snicolet made 2 files, the one tested atm includes all functional patches. Its basically the initial RT minus non functional.

locutus2 commented 4 years ago

@NKONSTANTAKIS That with the simplifications is indeed a problem. Perhaps a increase of LTC bounds like [0,3] would help. This has several advantages:

Disadvantage is that less LTC passes but if we avoid more regressions it seems reasonable. But this has to be simulated for real anwsers.

NKONSTANTAKIS commented 4 years ago

@locutus2 Logical direction but a bit too much imo. Halfway there with 0,2.5 and assess its safer, and to categorise simplifications into light (borderline parameter tweaks) and heavy (considerable code removal), with different bounds

Vizvezdenec commented 4 years ago

I firmly believe that it's better to increase lower bound to > 0 value. Since it's THE BEST (and the only) way to decrease probablily of patch regressing at LTC. Everything else is half-measures.

NKONSTANTAKIS commented 4 years ago

Considering the longterm perspective, I now agree with @Vizvezdenec. It ensures that added code will be worth its weight. Fewer elo gainers, less need for simplifications, cleaner code. So a lot of hidden economy involved. 0.25,2.25. It will be expensive though, its a bad idea to feed ltc as much as now. Stc can still be easier, just not as much as now. -0.5,3 + tc increase

Edit: Just realised proposed stc its not (much) easier, just more noisy. We would like it to be a tad easier than ltc, as scaling pillow. So -0.5,2.5, as easy as now but less false positives, so more accurate.

Vizvezdenec commented 4 years ago

https://github.com/Vizvezdenec/Stockfish/compare/a910ba7...3e1d8c7 number one https://cdn.discordapp.com/emojis/648699415099605002.png?v=1

vondele commented 4 years ago

I believe it is increasingly clear that a number of 'Elo gainers' passed through our STC [-1, 3] and LTC [0,2] bounds, that turn out all to be slightly negative in true Elo. This is possible in the statistical framework that fishtest is. The most natural and simplest way to deal with this is to adjust the bounds.

However, let's first look carefully if there are other possible causes for what we observe.

Vizvezdenec commented 4 years ago

I'm just saying that I was always concerned about this new bounds. And now we are getting a solid proof that in medium-scale they can introduce regression masked with "elo-gain". I think that our goal at first is to be confident that what we introduce is an elo gain and not to increase number of code changes just for the sake of more passing patches. I honestly liked older bounds more, yes, it was much harder to get positive STC to let LTC run but at least you could've been confident that LTC is not trash with big probability. Nowadays we are (more or less) picking random patches and let them run on LTC to see which one flukes out to be green. It's okay but if we do so we SHOULD make lower passing bound more strict. So at least LTC SPRT will give us quite big confidence that we are not accepting some junk (I'm not trying to be offensive, one of this "simplifications" is my patch, I'm just quite pissed with clean regress we managed to push).

Vizvezdenec commented 4 years ago

As I already sad, I had multiple cases where I had 3-4 passed STCs on the same idea that failed LTC badly and STC retest had shown also negative perf. But with some % one of them could've been positive for sure.

NKONSTANTAKIS commented 4 years ago

Even before sf11 we had statistically strange RT results with new bounds. Maybe its best for stc and ltc to have equal (or similar) confidence, like the old 0,5 0,5 era. Ltc confidence is more valuable but stc confidence is cheaper. But not equal elo bar, stc needs to be easier.

0,2 stc + 0.25,2.25 ltc

Vizvezdenec commented 4 years ago

well ngl this results was also with old bounds. But in my memory in past 3 years it's the first case of clean and not bug-related regression of master.

NKONSTANTAKIS commented 4 years ago

Yes, no clear proof but the problem was probably masked by error margins + strong elo gainers carrying the bad ones. We are very lucky now to not have a +5 elo patch camouflaging the issues

NKONSTANTAKIS commented 4 years ago

I have a full proposition, with synthesis of all viewpoints: to keep adequate widness of 2, and tweak only the elo bar in relation to code.

Code adders: 0,2 + 0.25,2.25 Tweaks: -0.5,1.5 + -0.5,1.5 Trivial simpl: -1,1 + -1,1 Real simpl: -1.5,0.5 + -1.5,0.5

NKONSTANTAKIS commented 4 years ago

Along with 3 more propositions:

  1. Stc to 15+0.15, for more correlation
  2. RT book = patch book , for less noise
  3. Base ct decrease to old rule, for higher optimization coherency of CT0 to CTdef

How about initially voting on each one seperately, in order to get the community signal and then decisions fall to area of responsibility?

snicolet commented 4 years ago

What is the theoretical probability of a test failing SPRT(0..2), given that it has already passed the same SPRT(0..2) once?

Vizvezdenec commented 4 years ago

I can't say that I'm all that for voting to decide stuff, not gonna lie. Votes will be including a lot of people that didn't write a single patch and this people (no offence) naturally know and understand much less of how sf improvement works than people who try to improve sf on a constant basis. Maybe I sound cocky but it's the truth.

vondele commented 4 years ago

@snicolet... I just answered that question elsewhere... If we assume the patch has a true Elo of 1, it fails a second SPRT test with 50% chance.

In general, it depends on the prior distribution of patches you have.

To give an example:

Alayan-stk-2 commented 4 years ago

Some thoughts :

The tests currently running should help identify the patches behind the regression, and remove them.

But moving forward, to avoid this happening again, the solution obviously is to revise the bounds to make a fluke less likely. The 0 elo passing probability is not the only metric to look at ; noisy bounds make -0.5 elo and -1 elo much more likely at equal 0 elo passing probability.

@NKONSTANTAKIS Please avoid publishing several messages in a row, prefer editing your latest message, or wait a bit before pressing the comment button to be sure you didn't forget anything important. It makes the discussion more readable.

EDIT : also, voting is a very bad idea. People who have thoughtful opinions can express them and contribute to the discussion, but in the end the maintainer should take a decision. People with a minor implication in the project or a poor understanding of the issue should not have the power to decide.

xoto10 commented 4 years ago

My thoughts are:

E.g. Something like : STC {-0.5, 3.5} 15+0.15 LTC {0.25,2.25} 60+0.6

I based the increases on simple eyeballing of the sprt graphs linked from fishtest, no doubt the maths guys can come up with more rigorous assessments of the increases required, e.g. using the results of the tests snicolet is currently running

NKONSTANTAKIS commented 4 years ago

Sadly I was misunderstood. By voting I didn't imply any responsibility for those in charge to act upon vote majority, or votes to have equal value. More like as a missing signal for people who don't post. Everyone has his own unique point of view. Somehow those need to interact. If some think that voting info will do more damage than good its a respected opinion. We could make a pre-vote about voting on bounds. The truth is that the effort I put, is not that respected because I don't write patches. Hence what I express is not judged equally, that is natural. AIso I have full awareness that my unconformist character is an annoyance to many. This is the reason I occassionaly dissappear, doing probably more self-rewarding stuff. But I have have hard time abstaining for long. So you could even vote if my involvement is overall positive or negative, I will use it wisely.

vondele commented 4 years ago

@Vizvezdenec could you please merge your 5 passed simplifications/reverts (1, 3, 4, 5, 6) into one branch and perform a regression test wrt. SF11 at our standard conditions ?

vondele commented 4 years ago

@snicolet did the sequence of tests Test how "..." yield some insight that you can share

adentong commented 4 years ago

@vondele What about the simplifications? For example the linux large page commit failed to pass LTC nonregression: http://tests.stockfishchess.org/tests/view/5e30ab0bab2d69d58394fdf1. Though most of the simplifications were non functional, so they probably didn't matter.

adentong commented 4 years ago

Furthermore, do we also want to look at commits before SF11 release and after we switched to pentanomial and new book?

vondele commented 4 years ago

large pages also passed. http://tests.stockfishchess.org/tests/view/5e32c748ec661e2e6a340d96 this was already mentioned, failing to pass a non-regression test doesn't mean a test regresses.

Your second question, I think we can try to run simplification tests on pre-SF11 patches as well, but not right now, to keep some order.

vondele commented 4 years ago

@Vizvezdenec combined 5 patches tested here http://tests.stockfishchess.org/tests/view/5e334098708b13464ceea330

31m059 commented 4 years ago

@vondele @Vizvezdenec The combined simplification has passed quickly:

LLR: 2.94 (-2.94,2.94) {-1.50,0.50} Total: 7829 W: 1079 L: 964 D: 5786 Ptnml(0-2): 52, 690, 2281, 781, 65

vondele commented 4 years ago

and ongoing regression test after the 5-fold revert: http://tests.stockfishchess.org/tests/view/5e334851708b13464ceea33c

Vizvezdenec commented 4 years ago

ugh sorry I was sleeping :)

adentong commented 4 years ago

So even with the 5 patches reverted there's still no elo gain...

Vizvezdenec commented 4 years ago

well you can't be sure there is none, maybe it's let's say like 1 elo. But at least it's not a clean regression :)

31m059 commented 4 years ago

Do we continue trying to write new patches for Stockfish? Or should we halt development until we've made changes? I currently have a test pending, but I'm not sure myself...

The framework is currently mostly idle...

adentong commented 4 years ago

We need to rethink our testing methodology, otherwise it's mostly meaningless when 5 out of 7 elo gainers can be simplified away just as easily.

vondele commented 4 years ago

@Vizvezdenec could you today turn these 5 patches in to 1 PR, with a commit message what has been reverted and with the test results. I'll try to merge tonight.

vondele commented 4 years ago

This exercise (which required more than 1M LTC games to sort out, thanks @noobpwnftw ) has shown that the bounds to filter out(in?) the Elo gainers are not strict enough, which means the lower bounds of our SPRT tests need to go up. At the same time, I don't want to make it more difficult for an 1-Elo patch to pass our testing, on the contrary. This means the average of the two bounds has to stay or reduce. These two requirements imply narrower bounds for testing, and thus more resources need to be invested per patch. This will avoid tests that pass with <10k games, but obviously some will need >100k. I don't want to change at the same time the TC of testing, so we can clearly see the effect of the bounds changing. Yet, by reducing a little the requirement STC, we can facilitate 'good scalers' to pass.

As a result I propose:

Give a thumbs up if this idea makes sense, even if the precise bounds deviate a bit from what you would have picked. If there is some community buy-in, we can merge these changes tonight or tomorrow in fishtest. Meanwhile we continue use fishtest with the current bounds.

31m059 commented 4 years ago

@vondele I think the expected number of games at LTC is a bit lower than your calculations there--unless I've made a mistake of course :)

The expected number of games is at least partly determined by the draw rate, which is very different between STC and LTC, and the default on the SPRT Calc corresponds closely to STC. Using a draw rate of .705 (based on the recent LTC tests), we have a more optimistic estimate of 137k games for a +1.0 Elo patch with those bounds. That would make a big difference in terms of computational feasibility.

Vizvezdenec commented 4 years ago

Eh, I don't have time to do it now. Can probably do it somewhere near 10 hours from now :)

ddobbelaere commented 4 years ago

Doesn't it also make more sense to take [-1, 1] for simplications?

In the proposition of @vondele a +1elo patch has 50% chance of passing and its revert 50% chance of being simplified away.

vondele commented 4 years ago

@ddobbelaere not quite right, a simplification that removes a 1Elo feature has about ~4% chance to pass both STC and LTC (quick check, correct if wrong).

vdbergh commented 4 years ago

@ddobbelaere There is currently already an issue that a neutral patch has relatively high probability of failing a simplification test. SPRT {-1,1} would make that worse.

The probability of simplifying away a 1 Elo patch away is far less than 50%.

Vizvezdenec commented 4 years ago

I prefer STC to be smth like {-0.25, 1.75} 0.75 elo 50%, negative patch overall will have like 0,18% chances of passing, also upper bound being the same looks aesthetically more right :)

ddobbelaere commented 4 years ago

@vondele @vdbergh You're right, sorry

My statement holds for a +0.5 elo patch, (0.5 lies in middle of [-0.5; 1.5], -0.5 lies in middle of [-1.5; 0.5], so 50% chance of passing STC). Actually with the proposed bounds, it's still less likely that a 0.5 elo patch passes both STC and LTC (with higher bounds than STC) than that a -0.5 elo simplification passes both STC and LTC.

Anyway, I think the additional 'constraint' that most simplications don't just revert earlier patches but actually make the code more readable/smaller is helping us here.

xoto10 commented 4 years ago

If we focus on -0.5 Elo patches, my understanding is the numbers change from 10% STC pass and 1.2% LTC pass now, to 5% and 0.3%, is that correct? The change sounds reasonable on that basis, but it doesn't feel like only 1.2% of our -0.5 Elo patches have been passing LTC - is the SPRT calculator underestimating this somehow? If it is, is this change likely to be good anyway, or can we refine it?

Edit: "have been passing LTC"

xoto10 commented 4 years ago

Do we continue trying to write new patches for Stockfish? Or should we halt development until we've made changes? I currently have a test pending, but I'm not sure myself...

Try some simplifications?

Vizvezdenec commented 4 years ago

@vondele so what about new bounds? I'm starting tests with the ones you proposed for now, want to really get back into writing something and not debugging ;) .

vondele commented 4 years ago

preparing a fishtest commit right now...

Vizvezdenec commented 4 years ago

Okay, I just started some tests with adjusted bounds (customly selecting them).

vondele commented 4 years ago

The regressing tests have been reverted https://github.com/official-stockfish/Stockfish/commit/6ccb1cac5aaaf7337da8b1738448793be63fdfdb and the new SPRT bounds introduced https://github.com/glinscott/fishtest/commit/cc84d9c57e5224144b5219605f20f5b82c66c642

This issue has been resolved. I'm sure the process has been useful, and I hope the new bounds help to make progress quickly and in a robust fashion.

mstembera commented 4 years ago

https://github.com/official-stockfish/Stockfish/issues/2538

MJZ1977 commented 4 years ago

I am back after some months of absence, so sorry to react with a lot of delay.

This is just my personal feeling about this huge change in SPRT bounds :

So I will kindly suggest a small adjustment : [-0.5, 2] for STC. This will significantly decrease number of games for ~0 ELO patchs which are not a potential big improvements to SF. The drawback is a small increase in +1ELO failures but after all this should be acceptable if we concentrate our efforts and ressources on better patchs.