
SPRT parameters improvement #1859

Closed mcostalba closed 5 years ago

mcostalba commented 5 years ago

After a very long and interesting discussion https://github.com/official-stockfish/Stockfish/pull/1804#issuecomment-445429885, I think it's time to summarize in a quantitative way what people have found and proceed in a practical / executive fashion.

The topic is the improvement of the SPRT parameters, aimed at:

  1. Making the tests stricter / more sensitive, so that more real Elo gainers pass and fewer neutral or negative patches slip through.

  2. Keeping the testing resources (number of games per patch) under control.

It is immediately clear that the 2 goals go in opposite directions; anyhow, from the discussion there seems to be some consensus to allow a reasonable increase of test resources in exchange for stricter parameters.

To keep the discussion quantitative and to the point, I'd like people to post here only simulation results, in a simple and standardized format:

Limits                SPRT[0, 5] STC + SPRT[0, 5] LTC
0 ELO pass prob       xxx
1 ELO pass prob       xxx
<= 0 ELO pass prob    xxx
>= 1 ELO fail prob    xxx
Avg. STC cost         xxx
Avg. STC + LTC Cost   xxx

The first 2 records give a taste of the sensitivity of the scheme, records 3 and 4 give a taste of its failure rate, and the last 2 of its cost.

Now, on the cost. Patch ELO is not uniformly distributed: the biggest part of tested patches falls within [-2, 0] ELO (neutral or slightly negative). We have to consider this to compute a standardized cost.

I propose the following. Given:

ag(x) = Average number of games for patches with x ELO

We define the cost as:

STC Cost = 10 * ag(-2) + 35 * ag(-1) + 40 * ag(0) + 25 * ag(1) + 5 * ag(2)

For the STC + LTC cost we have to consider that:

LTC Cost = STC Cost * 6
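To make this concrete, here is a minimal sketch (this is not the standard tool, just an illustration): avg_games_stc / avg_games_ltc and stc_pass_prob are hypothetical callables coming from whatever simulator we settle on, and weighting the LTC term by the STC pass probability is just one possible reading of the 6x rule above, since only patches that pass STC go on to LTC.

WEIGHTS = {-2: 10, -1: 35, 0: 40, 1: 25, 2: 5}  # assumed mix of submitted patch ELO

def stc_cost(avg_games_stc):
    # STC Cost = 10*ag(-2) + 35*ag(-1) + 40*ag(0) + 25*ag(1) + 5*ag(2)
    return sum(w * avg_games_stc(elo) for elo, w in WEIGHTS.items())

def stc_plus_ltc_cost(avg_games_stc, avg_games_ltc, stc_pass_prob):
    # Cost in STC-game equivalents: an LTC game costs about 6x an STC game
    # (the TC is 6x longer), and the LTC stage only runs for patches that passed STC.
    return sum(
        w * (avg_games_stc(elo) + 6 * stc_pass_prob(elo) * avg_games_ltc(elo))
        for elo, w in WEIGHTS.items()
    )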

Given that all the above is not trivial to implement, I would split this project into 2 phases:

  1. First find a standard tool to reliably compute all the above. Publish it and let people refer to the same measuring tool.

  2. Once we have an open, shared and validated tool then compute the different schemes.

The standard tool should be sound. I would prefer a tool based on simulations rather than on formulas: it seems to me simpler and easier to review for a wider audience, and also more flexible. Everybody is encouraged to submit their script / simulator. Once the process has stabilized on a shared and validated single tool, we will proceed to the second phase.
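To get the ball rolling, a minimal simulation sketch (not a validated tool, just an illustration of the kind of simulator meant here): it plays out single SPRT runs under the BayesElo model, and repeating it gives Monte Carlo estimates of the pass probability and the average number of games at a given true elo.

import math, random

def bayeselo_probs(elo, draw_elo):
    # Win/draw/loss probabilities under the BayesElo model.
    p_win = 1.0 / (1.0 + 10 ** ((-elo + draw_elo) / 400.0))
    p_loss = 1.0 / (1.0 + 10 ** ((elo + draw_elo) / 400.0))
    return [p_win, 1.0 - p_win - p_loss, p_loss]

def sprt_run(true_elo, draw_elo, elo0, elo1, alpha=0.05, beta=0.05):
    # One simulated SPRT run; returns (passed, number_of_games).
    # All elo values here are in BayesElo units.
    upper = math.log((1.0 - beta) / alpha)
    lower = math.log(beta / (1.0 - alpha))
    p_true = bayeselo_probs(true_elo, draw_elo)
    p0 = bayeselo_probs(elo0, draw_elo)
    p1 = bayeselo_probs(elo1, draw_elo)
    llr_inc = [math.log(p1[i] / p0[i]) for i in range(3)]  # per-game LLR for win/draw/loss
    llr, games = 0.0, 0
    while lower < llr < upper:
        i = random.choices(range(3), weights=p_true)[0]
        llr += llr_inc[i]
        games += 1
    return llr >= upper, games

def estimate(true_elo, draw_elo, elo0, elo1, runs=2000):
    # Monte Carlo estimate of pass probability and average games at a given true elo.
    results = [sprt_run(true_elo, draw_elo, elo0, elo1) for _ in range(runs)]
    passed = sum(1 for p, _ in results if p)
    avg_games = sum(g for _, g in results) / runs
    return passed / runs, avg_games

The real fishtest SPRT has extra details (draw_elo is estimated from the games, trinomial vs pentanomial counting, etc.), so this only illustrates the structure such a tool could have.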

It won't be quick, but this is an engineering approach to the problem: solid, slow and boring :-)

Vizvezdenec commented 5 years ago

Actually my 2 cents, but an important two (imho). Right now we are able to utilize fishtest hardware mostly because I'm writing a crapton of patches. http://tests.stockfishchess.org/users/monthly Vizvezdenec 718 - that's more than a third of the total number of patches on fishtest at this time. With this amount of written patches I'm basically trying almost random things - mainly because I think that testing something is better than testing nothing. With tighter bounds I'm perfectly fine with writing like 3-4 times fewer patches and freeing up a lot of hardware for other devs - with the current bounds that would make fishtest go idle pretty often... which wouldn't be the case with tighter bounds. So, imho, tightening bounds will also have the side effect of increasing the % of elo-positive patches :)

Alayan-stk-2 commented 5 years ago

The elo_gain/STC_game ratio is in my opinion a poor metric, because the optimal point for this ratio is large bounds, which let some easy gains through while quickly rejecting the rest.

For example, with [0.0, 6.0] + [-0.5, 5.5], the tool gives me 79% of the elo gain for 68% of the resource usage. The ratio is better, but the bounds are worse, because they reject more good patches and create more uncertainty about the quality of patches.

It's hard to model what the optimal bounds are, because it depends a lot on how people use fishtest.

We have two fundamental "resources": humans finding ideas to test, and CPUs to test them.

When the bottleneck is CPU resources, we want to increase the quality of the patches submitted / reduce random tests and random tunes, and/or do quicker but less accurate tests.

When the bottleneck is humans finding ideas, we want to increase the testing quality, in order to 1) have a better idea of what a patch is worth, and whether the idea may warrant further exploration; 2) accept more "good but not outstanding" patches.

As @Vizvezdenec points out above, what is bottlenecking SF progress is much more finding good ideas than testing resources, which is a situation where SF would benefit from increased testing accuracy.

EDIT: Also, let me remind everyone that when you want to extract more elo from a pool of ideas, each additional elo becomes harder to get. Hence, bounds with a similar or slightly lower elo/cost ratio but a noticeably higher elo acceptance ratio are much superior.

NKONSTANTAKIS commented 5 years ago

It seems to me that everybody agreed upon those bounds and is just waiting for the implementation.

I'm fine with the proposed [0.5, 4.5] + [0,3.5] : indeed we can evaluate the change in a few months to estimate if it works in practice.

The most impressive feature in the discussion has been the quality of the argumentation, as usual! :-)

@vondele I'd suggest one change at a time; let's start with this one as a "beta", and in some weeks we can eventually promote it to "production" or, in the worst case, revert.

@snicolet how do you suggest we deploy this change? Should we contact the fishtest devs and ask them to change the defaults? Maybe even better would be to open a PR on the fishtest repo.

I think it's very simple: everyone just uses the new bounds. What are we waiting for? @snicolet @mcostalba

vdbergh commented 5 years ago

For example, with [0.0, 6.0] + [-0.5, 5.5], the tool gives me 79% of the elo gain for 68% of the resource usage. The ratio is better, but the bounds are worse, because they reject more good patches and create more uncertainty about the quality of patches.

A surprising fact nonetheless! I too get 6.3e-6 elo/STC_game which is much higher than the current 5.7e-6. But the average elo of an accepted patch drops from 1.44 elo to 1.32 elo. Of course softening the current hard barrier at 0 elo is not a good idea.

EDIT. Indeed it seems one can get very high elo/STC_game by using very permissive bounds. This is achieved by the tests finishing very quickly. So it is only a meaningful metric if CPU resources are the bottleneck (which apparently they are not currently).
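For reference, the structure of such a computation is roughly the following (a sketch only; it assumes the sprt_pass_prob / sprt_cost helpers from the simulation tool used later in this thread and the normal patch-elo prior quoted in a later comment, ignores the STC+LTC combination, and leaves unit conversions to the tool):

from scipy.stats import norm
from scipy.integrate import quad

MU, SIGMA = -1.148511, 1.161688   # normal prior for patch elo (see later comment)

def stc_metrics(lower, upper, draw_elo=226.0, alpha=0.05):
    prior = lambda e: norm.pdf(e, MU, SIGMA)
    p = lambda e: sprt_pass_prob(e, draw_elo, lower, upper, alpha)   # assumed helper
    games = lambda e: sprt_cost(e, draw_elo, lower, upper, alpha)    # assumed helper
    gain = quad(lambda e: e * p(e) * prior(e), -10, 10)[0]     # expected elo gained per submitted patch
    passes = quad(lambda e: p(e) * prior(e), -10, 10)[0]       # probability a submitted patch passes
    cost = quad(lambda e: games(e) * prior(e), -10, 10)[0]     # expected games per submitted patch
    return gain / cost, gain / passes   # elo per game, average elo of an accepted patch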

vdbergh commented 5 years ago

I am wondering about scaling issues. With the normal prior I was using (which seems to fit the STC data very well) the expected elo of a patch that passes STC is 0.70. I have now downloaded the results of 286 LTC tests from fishtest and it seems the average elo of a patch submitted to LTC is only 0.27. It could however be that this is simply the effect of speculative LTC... The LTC data covers a longer period than the STC data which may also make the results less reliable.

EDIT. I redid the calculation with STC and LTC data covering the same period and the results were the same. I wonder now if it is a modeling issue (this has some relevance for the current discussion). There were some scaling tests in the past

http://tests.stockfishchess.org/tests/view/5910497a0ebc59035df34402 http://tests.stockfishchess.org/tests/view/59100d7c0ebc59035df343ef http://tests.stockfishchess.org/tests/view/590ffa180ebc59035df343e7

So the scaling STC->LTC appears to be good. Sadly the tests were for the 8moves book. I think I will do a similar test for the 2moves book (only STC and LTC). I hope it is ok.

EDIT2. Removing the 17% of "Speculative LTC" tests (by filtering on the descriptions) raises the average elo of patches submitted to LTC a bit, but not spectacularly. The discrepancy with the model remains. Next thing to try is the nTries model by @Vondele (people submit several trivial variations of the same patch until one passes STC).

EDIT3. There appear to be no scaling issues for logistic elo for the 2moves book.

http://tests.stockfishchess.org/tests/view/5c1aa9610ebc5902ba127aed http://tests.stockfishchess.org/tests/view/5c1ac93d0ebc5902ba127eba

(I checked that the scaling happens to be perfect for "normalized elo" http://hardy.uhasselt.be/Toga/normalized_elo.pdf but that is a different discussion).

mcostalba commented 5 years ago

@NKONSTANTAKIS we should open a PR against Fishtest repo.

Vizvezdenec commented 5 years ago

@mcostalba so new bounds should be [0.5; 4.5] and [0; 3.5]?

mcostalba commented 5 years ago

I guess so. If someone does not beat me to it, I will open a PR around Christmas time.

FauziAkram commented 5 years ago

Please someone beat him to it :D By the way, what will the bounds for tuning patches become?

Alayan-stk-2 commented 5 years ago

Tuning patches have not been discussed here, but it wouldn't make sense for them to have bounds that are harder to pass than the current ones, or than the new bounds for more complex patches.

[0, 4] is easier to pass than [0.5, 4.5] already, and the somewhat higher probability of a patch being only a very minor gain isn't a concern for value tuning. But LTC should be dropped to [0, 3.5] at least.

NKONSTANTAKIS commented 5 years ago

Since we will use [0, 3.5] LTC for code adders, I think it's not logical to use exactly the same for tuning. [0, 3] LTC seems an easy choice to make, safe and conservative. For STC tuning I would recommend [0.5, 4] as a safe and conservative option. This way we keep an analogy between code adders and tuning, like in the past: the old bounds had the same lower bound and an upper bound 1 lower, the new ones the same lower bound and an upper bound 0.5 lower.

We have said to make one step at a time, but let it be a full step, not some sloppy change. Indeed it makes no sense to use harder bounds for tuning than for adding code at the same time; tuning has to move along too, at least by the smallest possible amount.

@mcostalba @snicolet What do you think?

mcostalba commented 5 years ago

@NKONSTANTAKIS I think that the discussion on possible new tuning patch bounds (currently SPRT[0, 4]) should proceed in the same way as the previous one.

Namely, post the simulation tables using @vondele's / @vdbergh's tool, quantitatively comparing the current specs against the proposed ones, for people to see and evaluate.

vdbergh commented 5 years ago

I have now checked that the nTries model (nTries is the number of trivial variations of the same patch) by @vondele can resolve the apparent contradiction between the expected and observed elo of patches submitted to LTC which I reported on above.

Using the data of 199 LTC tests (manually filtered for speculative LTC), the mean unbiased estimator (MUE) for the elo of an SPRT test gives an estimate for the average elo of patches submitted to LTC of 0.31. The variance of the MUE (as measured from the sample) is 1.67, which yields a 95% confidence interval [0.08, 0.54] for the expected elo.

With nTries=1 the model (with a normal elo prior with mu=-1.148511, sigma=1.161688) predicts the expected elo of patches submitted to LTC to be 0.66, which is outside the above confidence interval (this number does not appear to be very sensitive to the prior). However, repeating the calculation with nTries=5 we find the expected elo to be 0.37, which fits comfortably in the confidence interval.

So at this point it seems more reasonable to me to use a model with nTries>1. Sadly there is probably not enough data to determine the "best" value for nTries.
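For reference, a sketch of the reweighting involved (assuming the sprt_pass_prob helper from the simulation tool used elsewhere in this thread): with nTries attempts, the probability that some variation of a patch with true elo e reaches LTC is 1 - (1 - p_STC(e))^nTries, and the expected elo at LTC is the prior mean reweighted by that probability.

from scipy.stats import norm
from scipy.integrate import quad

MU, SIGMA = -1.148511, 1.161688   # normal prior fitted to the STC data

def expected_elo_at_ltc(ntries, stc_lower, stc_upper, draw_elo=226.0, alpha=0.05):
    # Expected elo of patches submitted to LTC under the nTries model:
    # the prior is reweighted by the probability of clearing STC in <= ntries attempts.
    def reach_ltc(e):
        p = sprt_pass_prob(e, draw_elo, stc_lower, stc_upper, alpha)   # assumed helper
        return 1.0 - (1.0 - p) ** ntries
    w = lambda e: reach_ltc(e) * norm.pdf(e, MU, SIGMA)
    num = quad(lambda e: e * w(e), -10, 10)[0]
    den = quad(w, -10, 10)[0]
    return num / den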

EDIT. My estimate of the confidence interval was incorrect (I had neglected the variance of the underlying elo distribution). However we can simply measure the standard deviation from the sample. I have edited the above text. The confidence interval is now a bit wider but not enough to make nTries=1 work.

EDIT2: Inspecting the descriptions of the LTC tests manually, it appears there are many more speculative LTC tests than those I had filtered out mechanically. The final list contains only 199 regular entries out of the original 282. I also failed to account for the conversion STC elo -> LTC elo. I am using the normalized elo model for this (multiplication by sqrt((1-d_LTC)/(1-d_STC))), but I do not have a lot of empirical evidence that this model is reliable. Anyway, I have updated the text. With all the adjustments the "contradiction" in the case of nTries=1 becomes less and less severe and it could be just a case of bad luck.

EDIT3: Scaling tests (see below) indicate that the normalized elo model is not suitable for predicting scaling STC->LTC. The correct scaling model is anyone's guess but for now just assuming 1:1 does not contradict the data. I have once again updated the above text using this assumption. The contradiction for nTries=1 now becomes more severe again. Note that assuming an average scaling ratio of 1:1 says nothing about possible variation in scaling so the model is incomplete in this respect.

vondele commented 5 years ago

Interesting... yes, I picked nTries=5 in the first version of that model, since that's roughly what I observe on submitted patches. It certainly is a reasonable guess. I guess there is no single optimal value for this number; it probably correlates a bit with the author name ;-)

vdbergh commented 5 years ago

Here is the result of some scaling tests with the 2moves book. 40000 games each.

           sf7->sf8       sf8->sf9       sf9->sf10
elo STC    95.91 +-2.3    58.28 +-2.3    71.03 +-2.4
elo LTC    100.40 +-2.1   68.55 +-2.1    65.55 +-2.2

So we see that the common wisdom that an increased TC causes elo compression (I also believed this) is not always true (look at the columns sf7->sf8 and sf8->sf9). One possible model is to assume that there is a lot of variation in the scalability of individual patches, so much that it totally dominates the average behaviour. Unfortunately it is impossible to quantify this, as at this point the available data is totally insufficient. Note that @NKONSTANTAKIS points out below that sf8->sf9 may be an outlier, since sf9 uses contempt and sf8 does not. I will test this theory by creating a contemptless version of sf9 and redoing the test.

The tests are here.

http://tests.stockfishchess.org/tests/view/5c1fc1540ebc5902ba12b7c6 http://tests.stockfishchess.org/tests/view/5c1fc01a0ebc5902ba12b790 http://tests.stockfishchess.org/tests/view/5c1e0f050ebc5902ba129abc http://tests.stockfishchess.org/tests/view/5c1e0e570ebc5902ba129aaa http://tests.stockfishchess.org/tests/view/5c1ac93d0ebc5902ba127eba http://tests.stockfishchess.org/tests/view/5c1aa9610ebc5902ba127aed

NKONSTANTAKIS commented 5 years ago

It's crazy how the scaling trend is in the opposite direction, and by such elo amounts. I have always been a believer in scaling abnormalities, but I would expect both LTCs to shrink the elo difference. My explanation (and this is justified by early tests) is that contempt scales well, or, another way to put it, "contempt is much riskier at very short TCs". This makes total sense since at low depths you can hardly get a foothold in a draw, so neglecting depth at drawish evals backfires more. Hence contemptless SF8 closes the gap at STC.

A way to verify this is to do sf8 vs sf10 tests and compare STC to LTC. This test could provide useful information about future actions. A smaller STC gap would indicate that my observation is accurate, while a smaller LTC gap would indicate that SF10 has somewhat worse scaling. If the latter is true, then we should probably consider ways to accept more positive scalers and reject more negative ones.

vdbergh commented 5 years ago

@NKONSTANTAKIS Another test of your theory would be sf7->sf8. Since these are both contemptless they should again exhibit the expected scaling behaviour.

NKONSTANTAKIS commented 5 years ago

@vdbergh Indeed, and the expected elo gap there would be smaller at LTC. But it would be contemptless vs contemptless, which could behave at LTC differently than contempt vs contempt, probably with an increased drawrate and compression. It would be interesting. Through SF8->SF10 we would also be comparing the relative scaling behavior of SF10 and SF9 against the same opponent. In general I think it's a good idea to try to net the positive scalers, mostly by not cutting them out with a harsh STC.

Regarding the probable backfiring of contempt at low times I opened a thread: https://groups.google.com/forum/?fromgroups&pli=1#!topic/fishcooking/r_Nv8J34VoA

mcostalba commented 5 years ago

I have opened the PR: https://github.com/glinscott/fishtest/pull/342

Vizvezdenec commented 5 years ago

@NKONSTANTAKIS I replied to it - we have quite a lot of contempt data actually which I measured in fishtest idle times ;)

vondele commented 5 years ago

@vdbergh your elo with TC data has been added to the wiki here:

https://github.com/glinscott/fishtest/wiki/UsefulData

noobpwnftw commented 5 years ago

I'm curious about how many games would be needed to conclude this test: http://tests.stockfishchess.org/tests/view/5c210ab60ebc5902ba12dc20

If the trend is for single patches to provide a smaller elo gain on average, I wonder if this would happen quite often at LTC: each of them taking 4x more resources than a regular regression test, and still growing.

MichaelB7 commented 5 years ago

Looks like we will need to tune the process - my thought is to cap it at x games. Not sure what the right number is - but somewhere near 150,000. After that, the Elo gain would not be worth the resources. Just my $.02. Others may and will differ.

SFfan876 commented 5 years ago

with the old bounds this test would already be red, because wide bounds require fewer games on average

vondele commented 5 years ago

The test is doing exactly what is expected. Somewhat smaller Elo-gaining patches (~1 Elo) have a chance to pass, and patches that are close to that bound will need significantly more games to be resolved. Overall, if all the models discussed in depth above are relevant, we'll need on average 50% more CPU time per submitted patch... which will presumably just mean people submit slightly fewer patches. The additional gain we have is a more precise elo estimate of all patches, which could help direct ideas and tests.

noobpwnftw commented 5 years ago

The point is that we may have more than enough confidence to conclude that this patch is a > +0.5 elo gainer, yet it might still fail to pass in the end.

Alayan-stk-2 commented 5 years ago

The patch is almost certainly an elo gainer at STC and LTC - the STC passed twice quite easily (once with the old bounds, once with the new bounds), while the LTC is struggling but clearly positive. But there can be a scaling concern: does the gain hold at longer TC, considering the STC->LTC trend?

noobpwnftw commented 5 years ago

@Alayan-stk-2 If scaling is of concern, why are people using a smaller bound for LTC? It implies that it is now even easier to pass a patch at LTC than before, compared to STC changes.

I wonder what we can get out of a single patch running the equivalent of 7 regression tests; maybe a max number of games like 160k should be applied to prevent future unlucky runs like this.

vdbergh commented 5 years ago

I think we need a couple of months to judge the new scheme.

People have been calling for truncated SPRTs since forever but they are mathematically a bit more complicated (enough tools exist to handle them though, see e.g. http://hardy.uhasselt.be/Toga/wald.py, with a bit of hacking one can get it also out of http://hardy.uhasselt.be/Toga/sprt_elo.py).

When I was working on GnuChess I used the 2-SPRT (see https://projecteuclid.org/euclid.aos/1176343407) which is optimal for elo=(elo0+elo1)/2 (recall that the SPRT is optimal for elo in {elo0,elo1}). Like the truncated SPRT the 2-SPRT has the property that the number of games is bounded.

Here is a tool that can design and analyze 2-SPRTs http://hardy.uhasselt.be/Toga/sprt2.py.

Example

$ python 2sprt.py --help

This script computes the continuation region for the triangular
approximation to a 2-SPRT sequential test. It also computes
the corresponding approximate probability of rejecting H0 (OC)
and the expected running time (ASN).  Finally it can also compute 
the (OC,ASN) exactly (very slow) or by simulation (slow). 

A game is scored as 1,1/2,0 depending on whether it is a win, draw or loss.
If the upper bound of the continuation region is reached, H1 is 
accepted. If the lower bound is reached then H0 is accepted.

  --alpha       probability of making the wrong decision under H0,H1
  --elo0        elo corresponding to H0
  --elo1        elo corresponding to H1
  --draw_elo    draw parameter from BayesElo model
  --expected    compute the approximate expected running time of the test (ASN)
  --power       compute the approximate probability of rejecting H0 (OC)
  --exact       compute the OC and ASN exactly
  --simulate    compute the OC and ASN by simulation
  --sample      sample size for simulation
  --real_elo    elo at which to compute the OC and ASN 
  --help        print this message

$ python 2sprt.py --elo0 0 --elo1 3.5 --draw_elo 282

Test parameters:
alpha=0.05
beta=0.05
elo0=0.00
elo1=3.50
draw_elo=282

Continuation region:
1.00207932*N-136.827688 < S < 1.00069311*N+136.827688

Worst case expected running time: 83049
Maximal running time: 197413
vdbergh commented 5 years ago

In fact the 2-SPRT may be more suitable for LTC with the new parameters. The model (with nTries=5) predicts that with the new parameters the expected elo of a test submitted to LTC is 0.67 (BayesElo = 1.2). If we ignore variation and assume it is exactly 0.67, then the 2-SPRT(0, 3.5) is more efficient than the SPRT(0, 3.5) (I have not done the calculation taking variation into account). With nTries=1 it would be even better.

Vizvezdenec commented 5 years ago

I'm not sure this patch is an elo gainer at all, tbh. Sure, it passed STC twice with different parameters, but the STCs with the intermediate and the bigger parameter (and a bigger parameter means just less difference, so any elo positivity should fade with it) failed, and pretty fast; also 1 LTC failed pretty fast. Maybe it actually was a fluke and the patch is elo neutral or worth like 0.2 elo.

ghost commented 5 years ago

@Vizvezdenec There is a problem: many patches will turn out to be elo-neutral or slightly positive/negative (near-yellow). This is my understanding of the problem:

1. Move time variability creates random "deep moves" that sometimes prove to be critical (amplified by the large base time allocation for time management), or conversely steal time from "average moves" and lower ELO. I.e. time is extremely unequally distributed, creating random wins/losses. This is the primary random factor.

2. The opening book is fairly unbalanced and not that good for testing engines: the initial idea of using lots of unbalanced positions to detect flaws in eval isn't working anymore; it just measures noise (forced wins/draws), since both engines are smart enough to know which positions are a forced win/draw in most cases, unless the patch is particularly bad (though such a patch would lose in more neutral positions as well).

3. The hash table pressure of having only 4MB for STC also introduces a random factor (there is a PR for fishtest to fix this, though I'd prefer 16MB hash vs 8MB for patches, especially patches that need depth or involve pruning changes). STC's reputation for unreliability needs to go, or people will not consider STC a true filter.

4. The usefulness of most positions: the actual percentage of useful positions that detect subtle flaws in eval/pruning is not that large, and the SPRT depends on them getting randomly selected from the book to create meaningful results. However, if there are lots of forced-win positions, the speed at which the SPRT converges to a bound is reduced. Forced draws reduce the convergence slightly, but more importantly they prolong tests (LTC in particular is very drawish). This synergizes with random move times to make the weaker engine randomly win a few games and break the SPRT gain streaks.

5. The machine factor: since fishtest doesn't run on ideal "standard" computers, the engine time will be affected by CPU speed and thread scheduling.

A. Another task/thread may occupy the CPU shared with an engine and impact its performance, giving the impression that a stronger engine is weaker: due to time lost to scheduling and other threads, there will be a random factor of reduced performance. cutechess-cli timings and resource use for managing games may also be an underrated factor that changes the actual speed.

B. The core CPU speed creates a limit to search depth, which makes some machines search deeper than others and consequently provide a more accurate eval. Patches which win on speed on slower CPUs start losing on depth once on a faster CPU, because both engines reach similarly high depths (e.g. mid-20s). This means "speed gain"/"simplification" patches will look better on slower CPUs but revert to a neutral score once run on a faster CPU (equivalent to extending the time control), which is also why no LTC tests of speed patches are run.

I.e. fishtest actually doesn't run on "time controls", it runs at different nodes/second speeds dependent on CPU speed: even considering that the bulk of fishtest is near-1.5mnps ChessDB cpus, they too demonstrate speed differences leading to different search depths / "time controls". This is particularly noticeable at LTC, since any speed variability drastically changes the allocated nodes per move. Example: machine A searches 1400 knodes/second, machine B 1600 knodes/second. For each 100ms (STC increment) move, machine B searches 20 knodes more. For each 600ms (LTC increment) move, machine B searches 120 knodes more. For a 1200ms move, machine B will search 240 knodes more.

  6. Depth factor: in effect this means game results are coming in from different depths. That wouldn't be that bad if depths were equally allocated, but depths depend on which machines are allocated for the test, how the test changes pruning/eval, and which book positions are randomly selected (due to some positions having critical depths at which the PV changes: faster CPUs and good time management uncover these positions).

The same book position searched slightly deeper (faster CPU, more time allocated due to a different eval) suddenly changes the results for nearly identical engines: this is not due to an ELO gain, it's purely the speed difference and timing allowing the luckier engine to win. Edit: disregard the "depth factor" rant, I've checked the fishtest code and it fixes this by adjusting the time control to account for the NPS of the machine.

Near-neutral patches slightly changing eval/search will make the time allocation for identical book positions different, because the PV is constructed from different searches/depths (especially at STC, where pruning is critical and the speed at which the PV depth is reached is very important). This turns the test into a "book tuning" competition, where patches rely on lucky machine allocation and book positions to pass STC.

  7. Neutral patch problem: a neutral patch will waste resources because the above random factors will create illusions of meaningful ELO changes, but in effect these are random factors at work. These illusory "timing wins", being dominant in neutral patches, force the SPRT to fluctuate and not converge, wasting resources and prolonging tests (especially noticeable with the new bounds).

vdbergh commented 5 years ago

@Chess13234

There is a problem: many patches will prove out to be elo-neutral or slightly positive/negative(near-yellow).

You could have stopped right there. Through the use of statistics (see above) we now know objectively and independently of the testing procedure that the patches that gain enough elo to be detectable form a very small fraction of all patches submitted.

So yes the 'problem' you mention is there. And it is only solvable by throwing large amounts of resources at it. Nothing else will help.

ghost commented 5 years ago

@vdbergh Something should be done about increasing the data quality for SPRT, rather than brute-force solutions like adding more time/hardware. Statistical probabilities are not objective facts and only represent a chance, and of course are dependent on testing procedures, bounds and testing conditions: you will have wildly different results with other time controls and especially other books (fishtest is essentially tuning Stockfish towards performance in the 2moves book). Ultimately of course there are factors beyond our control, such as cache misses and branch prediction misfiring influencing timings, but there are factors that can be improved.

vdbergh commented 5 years ago

@Chess13234

Statistical probabilities are not objective facts and only represent a chance, and of course are dependent on testing procedures, bounds and testing conditions: you will have wildly different results with other time controls and especially other books (fishtest is essentially tuning Stockfish towards performance in the 2moves book).

SPRT tests give indeed a very distorted view of the world and the distortion depends strongly on the chosen bounds. But the good news is that this distortion can be understood mathematically. So if you have enough data you can remove the distortion and look at the underlying elo distribution (for a fixed book and time control). This is precisely what was done. So yes, the conclusion of the statistical analysis is independent of the testing procedures that have been employed on fishtest.

As to scaling to different time controls: so far there is no reason to assume that SF on average scales badly. Indeed, the tests are compatible with the assumption that the scaling ratio is simply 1 from STC to LTC, which is still a 6-fold increase in TC.

As to books: if you want to claim that the 2moves book is bad in some way, then you should give arguments (not hand waving) to back that up. The 2moves book was tested by various people and it turned out to be just as good as or better than other books (the suitability of books for engine testing can be measured objectively, see here: http://hardy.uhasselt.be/Toga/normalized_elo.pdf).

noobpwnftw commented 5 years ago

Actually, the better question is how much accuracy you really need at LTC for those slightly <1 elo patches, SPRT or not. So far, I see demonstrations of those worst case scenarios on a daily basis; let's see if it continues.

ghost commented 5 years ago

@vdbergh

  1. SPRT doesn't magically "remove distortion", it reduces distortion as long as the data is somewhat reliable. If the distortion is significant enough, which it is in neutral patches, it will measure the random win/loss probabilities.

  2. scaling to different time controls: this is easy to observe as most STC passing patches don't scale to LTC. This means all these engine variations behave differently at different TC. I.e. any minor variation of Stockfish is highly likely to behave differently in STC vs LTC.

3.". If you want to claim that the 2moves book is bad in some way then you should give arguments" see https://groups.google.com/forum/#!topic/fishcooking/oljs6EvQ6Iw https://github.com/official-stockfish/Stockfish/issues/1853

noobpwnftw commented 5 years ago

As a part of improvement, what are the chances that those passing patches would also pass [-3, 1], allowing them to be removed again?

vdbergh commented 5 years ago

@vdbergh

  1. SPRT doesn't magically "remove distortion", it reduces distortion as long as the data is somewhat reliable. If the distortion is significant enough, which it is in neutral patches, it will measure the random win/loss probabilities.

Obviously you are making no attempt whatsoever to understand what I write.

  2. scaling to different time controls: this is easy to observe as most STC passing patches don't scale to LTC. This means all these engine variations behave differently at different TC. I.e. any minor variation of Stockfish is highly likely to behave differently in STC vs LTC.

Actually this is very difficult to observe for individual patches (measuring a 1 elo difference takes 160000 games; for multiple measurements one needs many more). But what one can at least do is test the average behavior (see above). And these tests show that, at least on average, there are no scaling problems.

3.". If you want to claim that the 2moves book is bad in some way then you should give arguments" see https://groups.google.com/forum/#!topic/fishcooking/oljs6EvQ6Iw

1853

AFAICS these threads don't mention any tests that show that the resolution of the "cleaned book" is actually better than the original one.

ghost commented 5 years ago

@vdbergh 1. I mean that the probabilities derived from the SPRT are based on purely random factors in neutral patches. It doesn't make sense to ignore these random factors (mentioned earlier) of distortion, because they make the SPRT a meaningless and random game of chance, with only significant elo gain/loss patches being quickly resolved. I'm not critiquing the SPRT itself, but the testing procedure.

  1. " And these tests shows that at least on average there are no scaling problems." Greens are dominated by STC tests which don't scale. http://tests.stockfishchess.org/tests?success_only=1 3.Its better to check some other books, like drawkiller one. I don't think a new 2moves version with <0.1% of positions changed would have much effect(unless testing period is huge to cover these <0.1% positions enough).
Alayan-stk-2 commented 5 years ago
  1. " And these tests shows that at least on average there are no scaling problems." Greens are dominated by STC tests which don't scale. http://tests.stockfishchess.org/tests?success_only=1

Even if the random probability of a neutral patch passing is low, when you have a lot of tests, you'll get a big part of passing patches which are just flukes. This isn't unexpected by itself, though ofc the more noise the worse it is.

3.". If you want to claim that the 2moves book is bad in some way then you should give arguments" see https://groups.google.com/forum/#!topic/fishcooking/oljs6EvQ6Iw

1853

AFAICS these threads don't mention any tests that show that the resolution of the "cleaned book" is actually better than the original one.

Issue #1853 mentions no such test (and considering it wants to remove 37 positions out of many thousands, it would be impossible to measure), but e.g. a +1.5 starting position definitely creates signal noise.

Also, I know that the openings used in fishtest are selected randomly; I assume they are always played with reversed colors?

ghost commented 5 years ago

@Alayan-stk-2 If they weren't repeated with opposite colors that would be very unfair, even if perfectly neutral positions were chosen. Here is the relevant code (see -repeat): https://github.com/glinscott/fishtest/blob/master/worker/games.py#L474

Edit: I noticed something far more disturbing: fishtest uses draw adjudication at 20 centipawns from zero over 8 consecutive moves!!! How is this even allowed? Edit: this parameter problem might be more important than the book. It also resigns at -400 centipawns over three consecutive moves... fishtest is deeply flawed. I added a separate issue for adjudication here:

https://github.com/official-stockfish/Stockfish/issues/1904

vdbergh commented 5 years ago

@vondele

@vdbergh your elo with TC data has been added to the wiki here:

https://github.com/glinscott/fishtest/wiki/UsefulData

Thx! It occurred to me that while these tests are not incompatible with the hypothesis of a scaling ratio of 1:1, they are also not a lot of evidence for it because of selection bias.

If we accept the hypothesis that there are well scaling and badly scaling patches then the well scaling patches are more likely to pass LTC after passing STC, causing the LTC rating to be inflated. For SF this is a good thing of course but it messes up the stats.

The only way I see to measure scaling somewhat objectively is to create a sf_STC which is some version of sf with a number of patches applied which passed STC regardless of whether they passed LTC or not. However to get a statistically significant result in reasonable time the number of patches must be quite large which is difficult because of conflicts. And this would only tell us something about the average behavior. The variation is still another matter.

ghost commented 5 years ago

@vdbergh That actually sounds cool, something like a "Blitzfish" tuned for short time control, but the patches should have at least a +1 elo impact (preferably the largest elo gains) to actually measure something (STC elo is very unreliable).

Alayan-stk-2 commented 5 years ago

The recent experience has not been very positive. We got the multi-cut patch passing properly, so elo gainers can absolutely pass, but the "small gain patches" (at +1 or +1.5 elo) have had a very hard time clearing STC, even with flukes.

This experience at least confirms that SF's progress comes from small gainers, and that the idea that only +2 elo code adders are worth the complexity cost (which was behind the old bounds) is outdated in SF's current state. Going back to the old bounds would not make sense at all considering this.

But - while the experiment has only lasted a few weeks, so it is too short yet to reach definitive conclusions - it appears we can do better.

There is an unfortunate side effect of the new bounds which we overlooked during the discussions. With the old bounds, a patch gaining 1/1.5 elo at STC/LTC had the same chance to pass as a patch gaining 1.5/1 elo.

I just used the simulation tool to evaluate this situation. Now, the 1.5/1 case (bad scaler) has 19.4% chance of passing, compared to 13.87% for the 1/1.5 case (good scaler).

Of course, for individual patches the accuracy of SPRT is way too low to be able to properly know if a patch is a good scaler or not. However, in the long run over hundreds of patches, the way this will shape the prior distribution into the posterior distribution is quite clear, and it's in a way which harms SF's progress and undermines the expected benefits of the new bounds.
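For reference, the computation behind these numbers is just the product of the two stage pass probabilities (a sketch, assuming the sprt_pass_prob helper and the draw_elo values used in the tool-derived script further below; the exact figures depend on the tool's internals):

def combined_pass_prob(elo_stc, elo_ltc):
    # New bounds: [0.5, 4.5] at STC followed by [0, 3.5] at LTC.
    return (sprt_pass_prob(elo_stc, 226.0, 0.5, 4.5, 0.05)
            * sprt_pass_prob(elo_ltc, 288.0, 0.0, 3.5, 0.05))

bad_scaler = combined_pass_prob(1.5, 1.0)    # ~19% per the figures quoted above
good_scaler = combined_pass_prob(1.0, 1.5)   # ~14%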

Alayan-stk-2 commented 5 years ago

To save resources, increase the passing rate of patches around +1.5 elo, keep bad patches in check and favor good scaling, maybe a 3-stage testing scheme could be designed. First, a patch would have to clear a VSTC designed as a cheap filter, then an easier STC than now (which would act as a second filter, of higher quality than VSTC but costing more), then an LTC.

Below is a table with the results of a simulation using such a 3-staged approach, with a 4+0.04 VSTC, compared to the old bounds, the new bounds, and 2 other proposals. The "STC" and "STC+LTC" costs include the VSTC costs.

Limits                    [0,5]+[0,5]   [0.5,4.5]+[0,3.5]   [0.5,4]+[0,3.5]   [0.4,4.4]+[0,3.2]   [-1.0,3.5]+[-0.5,3.5]+[0,3.5]
0 ELO pass prob           0.0025        0.00123             0.0011            0.0014              0.00081
1 ELO pass prob           0.0744        0.1002              0.1183            0.1276              0.1443
1.5 ELO pass prob         0.2460        0.3422              0.4185            0.3946              0.4667
2 ELO pass prob           0.5135        0.6438              0.7442            0.6852              0.7538
total ELO gain ratio      1.0           1.3192              1.5664            1.5523              1.7646
-0 ELO acceptance ratio   2.5e-04       9.1e-05             7.77e-05          1.0e-04             4.7e-05
Avg. STC cost             18431         24456               30936             25110               31590 (?)
Avg. STC + LTC Cost       27931         38039               46093             42990               33418

According to the simulation, this combines the best elo/resource ratio, the best non-regression probability, the best total expected elo gain, and the best pass probability in the [1, 2] elo range.

However, this setup does not completely address the scaling behavior, because it gives additional weight to a very short TC. Let's take a 1.7/1.5/1.0 (vstc/stc/ltc) patch and a 0.8/1.0/1.5 one (so about the same as my previous example, but with vstc added). The first one (1.7 vstc) gets a 23.6% chance of passing while the second one (0.8 vstc) gets 13.5%. Of course, at equal vstc+stc+ltc elo sums, this setup favors better scaling behavior.

I used a modified version of vondele's tool to get the results. Below is the code I changed to compute pass probability and cost for the 3-staged system. It isn't very "clean", but it does the job.

# Bounds (elo0, elo1) for each of the 3 stages.
vstc_lower = -1.0
vstc_upper = 3.5
stc_lower = -0.5
stc_upper = 3.5
ltc_lower =  0.
ltc_upper = 3.5

# sprt_pass_prob(elo, draw_elo, elo0, elo1, alpha) and sprt_cost(...) come from
# vondele's simulation tool; the second argument (196/226/288) is presumably the
# draw_elo for VSTC/STC/LTC. The nTries handling is left commented out below.

def vstc_prop(x):
    # Pass probability of the VSTC stage at true elo x.
    return sprt_pass_prob(x, 196.0, vstc_lower, vstc_upper, 0.05)

def cost_vstc_prop(x):
    # VSTC cost in STC-game equivalents (a 4+0.04 game costs ~0.4 of an STC game).
    return 0.4 * sprt_cost(x, 196.0, vstc_lower, vstc_upper, 0.05)

def stc_prop(x):
    return sprt_pass_prob(x, 226.0, stc_lower, stc_upper, 0.05)

def cost_stc_prop(x):
    return sprt_cost(x, 226.0, stc_lower, stc_upper, 0.05)

def ltc_prop(x):
    return sprt_pass_prob(x, 288.0, ltc_lower, ltc_upper, 0.05)

def cost_ltc_prop(x):
    # LTC cost in STC-game equivalents (an LTC game costs ~6x an STC game).
    return 6 * sprt_cost(x, 288.0, ltc_lower, ltc_upper, 0.05)

def combined_stc_ltc_prop(x, ntries):
    # Overall pass probability: the patch must clear VSTC, then STC, then LTC.
    vstcPass = vstc_prop(x)
    stcPass = stc_prop(x)
    # for i in range(ntries):
    #    stcPass = stcPass + stc_prop(x) * (1 - stc_prop(x))**i
    return ltc_prop(x) * stcPass * vstcPass

def cost_combined_stc_ltc_prop(x, ntries):
    # Overall expected cost: VSTC always runs, STC runs only if VSTC passed,
    # LTC runs only if both VSTC and STC passed.
    vstcPass = vstc_prop(x)
    stcPass = stc_prop(x) * vstc_prop(x)
    #for i in range(ntries):
    #    stcPass = stcPass + stc_prop(x) * (1 - stc_prop(x))**i
    cost_vstcPass = cost_vstc_prop(x)
    cost_stcPass = cost_stc_prop(x) * vstcPass + cost_vstcPass
    #for i in range(ntries):
    #    cost_stcPass = cost_stcPass + cost_stc_prop(x) * (1 - stc_prop(x))**i
    return cost_ltc_prop(x) * stcPass + cost_stcPass

xoto10 commented 5 years ago

I agree that small gainers seem to be rejected at STC now (as before) where we would like to see more of the 1.0 - 1.5 Elo patches getting passed. Also agree that the current bounds appear to discourage patches that scale well. For example, I'm interested in Stephane's current "windmill" tests because I would have thought that was more of a search thing rather than eval ... and the current LTC does seem to be performing worse than the STC but could still pass. Now obviously Stephane is probably right and I am probably wrong regarding these particular tests (and the STC pass was strong), but it does seem to be an illustration of the potential issue.

Not sure what the solution is, but it seems to me that we're maybe over-complicating it. Perhaps a simple change from [0,5] to [0,4] for both STC and LTC is the way forward? If we do use different bounds for STC and LTC, I would be more comfortable if the average (or lower?) bound was higher for LTC than STC because of the possible impact on scaling of the average merged patch.

Alayan-stk-2 commented 5 years ago

Here is a new table with [0, 4] [0, 4] for reference.

Limits                    [0,5]+[0,5]   [0,4]+[0,4]   [0.5,4.5]+[0,3.5]   [0.4,4.4]+[0,3.2]   [-1.0,3.5]+[-0.5,3.5]+[0,3.5]
0 ELO pass prob           0.0025        0.0025        0.00123             0.0014              0.00081
1 ELO pass prob           0.0744        0.1433        0.1002              0.1276              0.1443
1.5 ELO pass prob         0.2460        0.4446        0.3422              0.3946              0.4667
2 ELO pass prob           0.5135        0.7477        0.6438              0.6852              0.7538
total ELO gain ratio      1.0           1.7497        1.3192              1.5523              1.7646
-0 ELO acceptance ratio   2.5e-04       2.0e-04       9.1e-05             1.0e-04             4.7e-05
Avg. STC cost             18431         27886         24456               25110               31590 (?)
Avg. STC + LTC Cost       27931         46381         38039               42990               33418

[0, 4] + [0, 4] actually performs quite well at getting more patches to pass, but at the cost of a big resource usage hit that we're trying to limit, and without increasing confidence that the patch is better than neutral.

noobpwnftw commented 5 years ago

STC tests are meant to be fast and not accurate. We used to have spare time to run speculative LTCs if some potentially good ones did not pass STC, and several patches have passed this way. Now we no longer have extra resources to do that; there is one test that has been waiting for 3 days and has not yet got many games done. In return, we now get a long list of 20k-game failing STCs just waiting for more games to conclude their SPRT. I don't think there is a good chance that the tide can turn for them; even if they eventually turn into some 90k-game yellows, they'd still need to fight for the chance of a speculative LTC run.

Alayan-stk-2 commented 5 years ago

STC tests are meant to be fast and not accurate.

This is definitely true.

However, if they are too inaccurate, either we miss out on too many good patches, or we fill up LTC with poor patches to test, so finding the best balance is not trivial. That's where the VSTC idea above comes from: a really cheap filter with really poor accuracy, but good enough to reject a big proportion of useless patches, and to allow a somewhat costlier STC filter before investing in LTC.