official-stockfish / Stockfish

A free and strong UCI chess engine
https://stockfishchess.org/
GNU General Public License v3.0

SPRT parameters improvement #1859

Closed mcostalba closed 5 years ago

mcostalba commented 5 years ago

After a very long and interesting discussion https://github.com/official-stockfish/Stockfish/pull/1804#issuecomment-445429885 I think it's time to summarize in a quantitative way what people have found and proceed in a practical / executive fashion.

The topic is the improvement of SPRT parameters aimed at:

It is immediately clear that the 2 goals go in opposite directions; anyhow, from the discussion there seems to be some consensus to allow a reasonable increase of test resources in exchange for stricter parameters.

To allow the discussion to be quantitative and to the point, I'd like people to post here only simulation results in a simple and standardized format:

Limits SPRT[0, 5] STC + SPRT[0, 5] LTC
0 ELO pass prob xxx
1 ELO pass prob xxx
<= 0 ELO pass prob xxx
>= 1 ELO fail prob xxx
Avg. STC cost xxx
Avg. STC + LTC Cost xxx

The first 2 records give a taste of the sensitivity of the scheme, the 3rd and 4th of its failure rate, and the last 2 of its cost.

Now, on the cost. Patch ELO is not uniformly distributed: most tested patches fall within [-2, 0] ELO (neutral or slightly negative). We have to consider this to compute a standardized cost.

I propose the following. Given:

ag(x) = Average number of games for patches with x ELO

We define the cost as:

STC Cost = 10 * ag(-2) + 35 * ag(-1) + 40 * ag(0) + 25 * ag(1) + 5 * ag(2)

For the STC + LTC cost we have to consider that:

LTC Cost = STC Cost * 6
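
A minimal sketch of how such a cost could be computed, assuming the ag values come from an SPRT simulator; the function names and the way the factor 6 is applied are my reading of the definition above, not an agreed convention:

def standardized_cost(ag_stc, ag_ltc=None, ltc_game_factor=6):
    # Assumed share of submitted patches at each ELO value (the weights above).
    weights = {-2: 10, -1: 35, 0: 40, 1: 25, 2: 5}
    stc = sum(w * ag_stc(elo) for elo, w in weights.items())
    if ag_ltc is None:
        return stc
    # Count each LTC game as ltc_game_factor STC-game equivalents,
    # reflecting the "LTC Cost = STC Cost * 6" relation above.
    ltc = sum(w * ltc_game_factor * ag_ltc(elo) for elo, w in weights.items())
    return stc + ltc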

Given that all the above is not trivial to implement, I would split this project in 2 phases:

  1. First find a standard tool to reliably compute all the above. Publish it and let people refer to the same measuring tool.

  2. Once we have an open, shared and validated tool then compute the different schemes.

The standard tool should be sound. I would prefer a tool based on simulations rather than on formulas; it seems simpler and easier to review for a wider audience, and also more flexible. Everybody is encouraged to submit their script / simulator. Once the process has stabilized on a single shared and validated tool, we will proceed to the second phase.

It won't be quick, but this is an engineering approach to the problem: solid, slow and boring :-)

vondele commented 5 years ago

@mcostalba did you have a look at the actual tool I made available https://github.com/official-stockfish/Stockfish/pull/1804#issuecomment-442254913 ? Just click the mybinder icon, and have a look.

yes, it took about two weeks to write, so at least the slow bit is there ;-)

mcostalba commented 5 years ago

@vondele Great job! The tool is here: http://nbviewer.jupyter.org/github/vondele/Stockfish/blob/tools/tools/Fishtest_SPRT_opimization.ipynb

Some questions:

Yery commented 5 years ago

In addition to changing the bounds, it would probably also make sense to increase the time control of the STC to make the difference between STC and LTC smaller, and to reduce the need for, and surprises with, "speculative LTCs". I would propose 20 + 0.2.

Alayan-stk-2 commented 5 years ago

@mcostalba Having done tweaks and bound testing with vondele's tool, I think I can give decent answers to your questions:

Also notice that the tool focuses on the cost of 5xSTC + LTC rather than STC + LTC (LTC only costs resources if the STC passes; the tool accounts for this), because it is observed in practice that several similar variations of a tweak often get submitted to fishtest. But this can be changed easily.

@Yery While longer TC for testing is attractive in reducing low-depth flukes, it is also more costly in resources. As much as I'd like fishtest to use longer STC and LTC, we have to be careful with this. A ratio of only 3 between STC (which is supposed to act first as a filter) and LTC is likely not a good compromise between resource use and result quality. STC 15/LTC 75 would likely consume similar or smaller resources than 20/60 with better overall results.

While discussing a TC increase fits here (the optimal bounds depend directly on the time ratio between STC and LTC), the increased precision which is the main focus here will already increase resource usage. Adding another multiplicative resource usage increase on top of it may be too much.

mcostalba commented 5 years ago

@Yery thanks for your comment, but I'd like first to land on a validated and shared tool; once we have it we can have all the discussion we want. For the moment, to avoid losing focus, please let's stick to the tool implementation details.

@Alayan-stk-2 for the real distribution I would suggest an actual array of values (we just need fewer than 10 of them) that correspond to the real distribution. Easy and explicit.

@vondele could you please copy / paste your tool and simplify it down to just compute the values of a single SPRT combination? We will call it the official simulation tool, and it should be simple: no need to compute both new and old SPRT parameters, just one set, whose results we will report here.

Alayan-stk-2 commented 5 years ago

@Alayan-stk-2 for the real distribution I would suggest an actual array of values (we just need fewer than 10 of them) that correspond to the real distribution. Easy and explicit.

Vondele's tool works using continuous functions in many places. Turning it into the evaluation of a few discrete patch elo values would likely be more work for a worse result.

The array idea based on observed fishtest data is still applicable, however. We can get a table with the values hardcoded, then do a linear interpolation between the two closest values in the table.

Trivial example in pseudocode to get the value of the patch elo probability function at x:

array[] = {0.1, 0.15, 0.08} // values at -1, 0, 1 elo

if (x < -1 || x > 1)
    return 0

x++ // shift x to [0, 2] so it indexes the array
for (i = 0; i < 2; i++)
{
    if (x <= i + 1)
    {
        // linear interpolation between array[i] (at i) and array[i+1] (at i+1)
        return ((i + 1 - x) * array[i] + (x - i) * array[i+1])
    }
}
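
For reference, a compact Python version of the same idea (assuming numpy is available; the three table values are the same placeholders as in the pseudocode):

import numpy as np

elo_points = [-1.0, 0.0, 1.0]
probs = [0.1, 0.15, 0.08]   # placeholder values at -1, 0, 1 elo

def patch_prob(x):
    # zero outside the tabulated range, linear interpolation inside
    if x < elo_points[0] or x > elo_points[-1]:
        return 0.0
    return float(np.interp(x, elo_points, probs))
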
vondele commented 5 years ago

@mcostalba

let me have a look, the table can be 'easily' computed with the notebook.

Your STC Cost = 10 * ag(-2) + 35 * ag(-1) + 40 * ag(0) + 25 * ag(1) + 5 * ag(2) is quite close to a slightly different Gaussian model (mu = -0.21, sigma = 1.112) of the patch distribution, which I will use (it is quite realistic, even if optimistic). It is easier to work with distributions.
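
A quick, purely illustrative way to see that correspondence (assuming scipy; this is not the notebook's code) is to compare the normalized discrete weights with the density of that Gaussian at the same ELO values:

from scipy.stats import norm

gauss = norm(loc=-0.21, scale=1.112)
weights = {-2: 10, -1: 35, 0: 40, 1: 25, 2: 5}
total = sum(weights.values())

for elo, w in sorted(weights.items()):
    # normalized discrete weight vs. Gaussian density (1-ELO spacing, so roughly comparable)
    print(elo, round(w / total, 3), round(gauss.pdf(elo), 3))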

ghost commented 5 years ago

What are the arguments against tighter time controls like 1+0.2 and 2+1? They should in theory use less time (shorter games) and have less variation in searched depth (more reliable results for SPRT). https://github.com/official-stockfish/Stockfish/pull/1804#issuecomment-445472297

ghost commented 5 years ago

I'm currently running this test on low throughput to see how variable win-loss records lead to SPRT distortion (random depths searched due to the time slices in 10+0.1 being widely different) http://tests.stockfishchess.org/tests/view/5c0c43c10ebc5902bceed541 and its comments here: https://github.com/Chess13234/Stockfish/commit/23c7a4101afd6e407ced0b2e92ecd5002cc19434

mcostalba commented 5 years ago

@vondele thanks. Actually my numbers are just out of my hat. I'd guess the best ones are the histograms you plotted in the "ELO estimates of STC tests" graph, in case you got them from real tests (indeed I don't know the source of them).

mcostalba commented 5 years ago

@vondele I am not an expert in statistics but maybe we could find a distribution that better resembles the data, for instance a skewed one: https://en.wikipedia.org/wiki/Skewness

vondele commented 5 years ago

@mcostalba yes, I got them from the fishtest analysis (elo estimates of the past X-thousand tests or so). However, there is no need to be overly concerned about fitting that data (it is easy); the part below zero matters little (it is quickly rejected with the testing schemes we have, and most resources go to the positive part of the elo spectrum).

The most important aspect is that the distribution of patches tested on fishtest is not absolute, but depends on the test scheme anyway (i.e. unavoidably, patches close to passing will be tweaked and resubmitted). Therefore the notebook is just testing reasonable assumptions and can guide a policy decision, but can't predict the output of a bounds change (that's more a social experiment than natural science).

I can have the notebook with the table implemented in a couple of minutes, but I will avoid simplifying it; that's more work, and as you mentioned the graphs are IMO more informative than the table.

vondele commented 5 years ago

I've updated the notebook, at the end there is the table Marco asked for. (the links posted previously should still work, but note that these are cached for a few minutes) Feel free to mention possible errors in the notebook.

Now, the table for a couple of bounds is below (rows three and four are slightly redefined, and all values are based on 1 STC try and Marco's patch distribution).

Limits | [0, 5] STC + [0, 5] LTC | [1, 4] + [0, 3] | [0.5, 4] + [0, 3]
0 ELO pass prob | 0.0025 | 0.00036 | 0.001
1 ELO pass prob | 0.0743 | 0.08051 | 0.145
+1 ELO rejectance ratio | 0.705 | 0.613 | 0.532
-0 ELO acceptance ratio | 0.0004 | 3.5e-05 | 0.0001
Avg. STC cost | 23622 | 47822 | 41507
Avg. STC + LTC Cost | 47505 | 80060 | 90712
vdbergh commented 5 years ago

@vondele How did you compute the elo estimates? I am asking since there are basically three options: the MLE one (which is the naive one), the median unbiased estimator (which is the one currently used and which is too high/low in 50% of the cases) and the mean unbiased estimator (whose expectation value is equal to the true elo value). For a fixed length test these three estimators coincide but they are (quite) different for SPRT tests.

The median unbiased estimator (which you proposed!) is intuitively quite satisfying and easy to compute. However, for further statistical processing a mean unbiased estimator is in principle more desirable. I have a tool which computes the mean unbiased estimator for SPRT tests but I have not released it yet since currently it only works robustly for small elo differences. However it could already be useful in this case.

To give an idea of what I am talking about, here are estimates for the expected values of the three estimators for an SPRT(0,5) with elo=-1 (they were obtained by simulation so they are not completely accurate).

mean=-0.99 (2.27) median=-1.52 (2.36) MLE=-2.04 (2.56)

(The values in parentheses are the RMS errors between the estimator and the true value; tautologically, the mean unbiased estimator has minimal RMS error.)

Here is a sample for elo=2. Now the difference between the median/mean unbiased estimators is less dramatic.

mean=2.00 (2.00) median=2.15 (2.31) MLE=2.33 (2.66)

mcostalba commented 5 years ago

@vondele thanks. Interesting results.

I see that we double the cost in terms of Fishtest efficiency. This is not an easy choice.

For complex patches +1 ELO is really borderline, because we may not want to add a lot of complex stuff for a +1 ELO increase. This is just my point of view of course.

Also, going from STC[0, 5] to STC[1, 4] doubles the requested STC resources, so maybe it makes sense to keep STC[0, 5] and go for a stricter window for LTC.

@vondele could you please:

Indeed the +2 ELO patches are the threshold between good stuff and almost useless but burdening stuff. So I am curious about what happens there.

vondele commented 5 years ago

@vdbergh, the elo estimates are based on your script (i.e. sprt_elo.py simply run on all results), so the median unbiased estimator... IIUC. Note that the challenge is even more difficult than the estimation of the elo value of a given test. We would like to get the underlying distribution of all tests.

vondele commented 5 years ago

@mcostalba the notebook is there to do all kinds of experimentation... there are unlimited possibilities. With the mybinder link, it will open in your browser, no need to install anything.

Concerning 0.0743: yes, converted to percent that means 7.43%.

BTW, I think there are essentially no 2 Elo patches that are still being found.

vdbergh commented 5 years ago

@vondele I agree it is an interesting challenge. Do you still have the raw data (or is there an easy way to get it)?

EDIT. With whatever estimator we use we can calculate the transformation

true elo distribution -> (expected) observed elo distribution

So the challenge is to invert this. I wrote earlier that it is a convolution but this is not true. But in any case it seems to be just linear algebra (since we can discretize the problem) which numpy can do very efficiently.
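
A rough numpy sketch of what that discretization could look like (everything here is illustrative: the response matrix would have to be estimated from SPRT simulations, and the identity matrix and uniform histogram are placeholders):

import numpy as np

elo_grid = np.linspace(-5, 3, 33)          # grid for both the true and the observed elo

# R[i, j] ~ P(observed estimate lands in bin i | true elo = elo_grid[j]),
# to be estimated by simulating many SPRT runs at each true elo (not shown).
R = np.eye(len(elo_grid))                  # placeholder response matrix

observed = np.ones(len(elo_grid)) / len(elo_grid)   # placeholder observed histogram

# Least-squares inversion: the true distribution that best explains the observations.
true_dist, *_ = np.linalg.lstsq(R, observed, rcond=None)
true_dist = np.clip(true_dist, 0, None)
true_dist /= true_dist.sum()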

Less naive and probably more robust would be to compute the MLE for the elo distribution directly. I have not thought how hard this would be.

Alayan-stk-2 commented 5 years ago

@vondele thanks. Interesting results.

I see that we double the cost in terms of Fishtest efficiency. This is not an easy choice.

You have to divide the elo gains by the resource costs.

Also, as was pointed out in the other thread, the elo-losers have practically no importance when it comes to the elo impact of passing tests, but they have an importance when it comes to resource usage.

The proposed new bounds would reject elo-losing patches as fast or faster than now, so the relative increase in total resource cost is smaller than in vondele's data in the table.

You can get about +50% elo passing for +80% resource usage (and good patches are the bottleneck much more than resources at fishtest currently, especially as resources sitting unused push people to try random things).

For complex patches +1 ELO is really border line because we may not want to add a lot of complex stuff for a +1 ELO increase. This is just my point of view of course.

This is certainly valid reasoning, but what defines a "complex patch"? Is a one-liner a complex patch?

I think it is a better approach to ask for more tests when a complex change gets submitted and passes testing (which is not often) than to make many simple good patches fail.

7.43% of +1 elo patches passing is quite low, and this also means a majority of +1.5 and +2 elo patches failing. In practice, we will often get multiple minor variations, which increases the likelihood of passing but also the resource usage at testing (which isn't more efficient in the end).

And as SF improves, elo gains get harder. Regression tests are showing that the average elo-gaining patch is much closer to +1 elo than to +3 elo (count the number of elo gainers during SF10's development cycle, subtract some for simplifications, and compare with the total elo gain).

This indicates that the elo-gaining patches merged are very often +1/+2 elo gainers which get lucky during STC/LTC rather than outstanding +3/+4 elo gainers. Having a lottery to decide which good patches get merged and which get rejected isn't great.

Also going from STC[0, 5] to STC[1, 4] doubles the STC requested resources, so maybe makes sense to keep STC[0, 5] and then going for a stricter windows for LTC.

See my points above. The elo/resource ratio goes down with the newer bounds proposal, but not as much as a division by 2, far from it.

* Compute also with the following cases: STC[0, 5] + LTC[1, 5] and STC[0, 5] + LTC[2, 5]

This will reject even more good patches while not improving resource usage (narrower bounds usually mean more games, but higher bounds mean quicker rejection of patches below the lower bound, so I'm not sure which effect dominates). It's 100% guaranteed that those bounds would be awful.

vondele commented 5 years ago

@vdbergh https://github.com/vondele/Stockfish/blob/tools/tools/stc_tests_elo is the raw elo data for 6719 tests.

vdbergh commented 5 years ago

@vondele Thx. This is in principle already useful. But you did not keep the original (W,D,L) data?

EDIT. I guess I can collect this data myself but some indication how you did it might be useful (then I do not have to think).

vondele commented 5 years ago

Here we go: https://github.com/vondele/Stockfish/blob/tools/tools/stc_tests - that was obtained by downloading all results and grepping for the useful part.

vdbergh commented 5 years ago

@vondele Great! Thx.

EDIT: This is fascinating data. I did some preliminary processing. It appears that the average submitted patch is -2.5 elo (this should be a statistically correct result but it is probably heavily influenced by outliers). Next weekend or so I will try to get an estimate for the true underlying elo distribution (I am not yet sure if it is feasible).

xoto10 commented 5 years ago

BTW, I think there are essentially no 2 Elo patches that are still being found.

+1

I don't think there are many "complex" patches, either, although different people might have different views on what counts as "complex".

Vizvezdenec commented 5 years ago

Well, we just had two elo-gaining patches resulting in +4.4 elo in the RT, so at least one of them is 2 elo :)

vondele commented 5 years ago

@Vizvezdenec ... hopefully, but don't forget the error bars on the RT. It is 4.4 +- 2

Vizvezdenec commented 5 years ago

I've just counted, actually. We had 80 elo gainers in sf9->10 that resulted in 55 elo on 8 moves.pgn and probably slightly more than 60 on 2 moves.pgn. So, IMHO, the borderline for a garbage patch should be 1 elo or maybe even lower - since on average our recent elo gainers are passing with this elo or less.

mcostalba commented 5 years ago

Limits | [0, 5] + [0, 5] | [0.5, 4.5] + [0, 3.5] | [1, 4] + [0, 3] | [0.5, 4] + [0, 3]
0 ELO pass prob | 0.0024 | 0.0012 | 0.00036 | 0.001
1 ELO pass prob | 0.0744 | 0.1229 | 0.08051 | 0.145
+2 ELO rejectance ratio | 0.3101 | 0.2034 | 0.1423 | 0.1224
-0 ELO acceptance ratio | 0.0004 | 0.0001 | 3.5e-05 | 0.0001
Avg. STC cost | 23622 | 32401 | 47822 | 41507
Avg. STC + LTC Cost | 47505 | 67451 | 80060 | 90712

I would say that STC[0.5, 4.5] + LTC[0, 3.5] is a good compromise between resources used and performance. It increases average test time by 42% (which is not small) but detects 4 good patches out of 5 instead of the current 2 out of 3 (reject ratio 0.2034 vs 0.310).

Alayan-stk-2 commented 5 years ago

The tool is still using a Gaussian curve to simulate the patch elo distribution

I went ahead and used the array method for the patch elo distribution:

def patch_elo_dist(x):
    # linear interpolation over a table from -5.25 to 5.25 elo, every 0.5
    elo_array = [140, 150, 200, 230, 260, 300, 330, 410, 530, 570, 600, 680, 610, 490, 80, 45, 35, 30, 25, 20, 15, 10]
    z = (x + 5.25) * 2              # map x onto the array index range [0, 21]
    for i in range(21):             # 21 segments between the 22 tabulated points
        if z > i and z <= i + 1:
            return ((i + 1 - z) * elo_array[i] + (z - i) * elo_array[i + 1]) / 5690

    return 0

The granularity is quite problematic around 2 elo, but the result is still interesting.

Limits | [0, 5] + [0, 5] | [0.5, 4.5] + [0, 3.5] | [1, 4] + [0, 3] | [0.5, 4] + [0, 3]
0 ELO pass prob | 0.0025 | 0.00123 | 0.00036 | 0.0011
1 ELO pass prob | 0.0744 | 0.1002 | 0.0805 | 0.1451
+1 ELO rejectance ratio | 0.6647 | 0.5972 | 0.5950 | 0.5207
+1.5 ELO rejectance ratio | 0.3705 | 0.2866 | 0.2563 | 0.2116
+2 ELO rejectance ratio | 0.1589 | 0.0968 | 0.0620 | 0.0553
Total ELO gain ratio | 1.0 | 1.1354 | 1.1436 | 1.2841
-0 ELO acceptance ratio | 0.00022 | 8.1e-05 | 2.0e-05 | 0.0001
Avg. STC cost | 10107 | 13799 | 20293 | 17595
Avg. STC + LTC Cost | 20332 | 28959 | 34261 | 38942

Note that while the elo-loser acceptance ratio goes down, the number of elo-losers goes up with this distribution.

It appears that while most of the potential elo gain in submitted patches is in +1 elo patches, the bulk of the "easy to grab" elo gain is still in the +2 elo patches.

mcostalba commented 5 years ago

@Alayan-stk-2 thanks. It confirms [0.5, 4.5] + [0,3.5] as the most balanced IMHO.

On one hand, with these new parameters we will have more green LTCs, which is the main target behind this change.

I am a bit worried about big +1 ELO patches that are not always worth committing, but this is counterbalanced by simplifications that will eventually remove a weak complex patch later. A weak simple patch, instead, will stay.

Indeed the interplay between new patches and simplifications is a strong one. It is a kind of dialectic process, a live dynamic that strengthens SF in the long term: it is worth more than the sum of the single parts.

Alayan-stk-2 commented 5 years ago

The previously suggested bounds got most of their gain from patches between 2 and 3 elo, but with the array distribution we get a sharp drop in the number of patches above 2 elo (the numbers in the array are eyeballed to match vondele's measured elo estimates graph, with some smoothing for +2 elo gainers). [1, 4] + [0, 3] and [0.5, 4] + [0, 3] are then too costly.

I've tried some bounds tweaking, but it's hard to get a significant improvement, and it's not worth it considering that another set of patch distribution data may change where the optimal point lies. It would likely be better to just apply the [0.5, 4.5] + [0, 3.5] bounds and gather new data about fishtest usage in a few months to evaluate whether another tweak may be worth it.

Meanwhile, maybe we can also get improvements to the opening book used in fishtest as suggested in #1853 and in other elements which add noise to the testing data.

mcostalba commented 5 years ago

@snicolet what do you think?

snicolet commented 5 years ago

I'm fine with the proposed [0.5, 4.5] + [0,3.5] : indeed we can evaluate the change in a few months to estimate if it works in practice.

The most impressive feature in the discussion has been the quality of the argumentation, as usual! :-)

vondele commented 5 years ago

I'm also fine with [0.5, 4.5] + [0,3.5] .

I'd like to make one remark on speculative LTC (or testing how this scales, etc.) in this context... these will be the most expensive tests to run, and have more risk of regressing. As such I would propose a) to approve them only exceptionally, and b) if they pass, to require a second LTC to pass.

I assume we keep the simplifications at [-3, 1].
Do we change bounds on param tweaks? [0, 4] + [0, 3.5] could make sense.

mcostalba commented 5 years ago

@vondele I'd suggest one change at a time; let's start with this one as a "beta", and in a few weeks we can either promote it to "production" or, in the worst case, revert.

@snicolet how do you suggest we deploy this change? Should we contact the fishtest devs and ask them to change the defaults? Maybe even better, open a PR on the fishtest repo.

vdbergh commented 5 years ago

It appears that a skew normal distribution

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.skewnorm.html

is a reasonable model for the elo prior. We can use this prior to compute the theoretical distribution of the median unbiased estimator used by Vondele and compare it with the empirical distribution obtained from Vondele's raw data. I did some visual fitting and it appears parameters a=-4, scale=2.4, loc=0.7 yield a good visual fit (I did not actually calculate the theoretical distribution but used simulation instead). The graph of this prior is here

http://hardy.uhasselt.be/Toga/skew_norm.png

The moments of the prior are given by mu=-1.158, sigma=1.520, skewness=-0.784, excess kurtosis=0.632. Note that this graph is less spread out than Vondele's graph since it aims to correct for the noise in the elo estimator (which is quite large for SPRT tests).
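
For reference, these moments can be checked directly with scipy (assuming the same parametrization as the linked scipy.stats.skewnorm page):

from scipy.stats import skewnorm

prior = skewnorm(a=-4, loc=0.7, scale=2.4)
mean, var, skew, kurt = prior.stats(moments='mvsk')
print(mean, var ** 0.5, skew, kurt)   # roughly -1.158, 1.520, -0.784, 0.632 (excess kurtosis)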

Of course none of this should be used since visual fitting is unreliable. I will make a more formal fit later. One thing I am worried about though is that the fitting will place too much emphasis on the less important negative elo part since there is so much more data there. We'll see.

vondele commented 5 years ago

@vdbergh in addition to the prior, it would be interesting to see how the simulation with that prior matches the data. For a more formal fit you could consider weighting with e.g. the cost of the combined sprt simulation. That should put the emphasis where it is needed most.

vdbergh commented 5 years ago

@vondele Today I am busy the whole day but I will post the simulation result tomorrow. Note that this is just a proof of concept and I did not try to push it very far since running the simulations is very time consuming. For formally matching the theoretical and empirical distributions of the observations I was planning to use this method https://en.wikipedia.org/wiki/Maximum_spacing_estimation since it seems easiest. Some form of weighting might indeed be desirable (informally: we look for a model predicting costs and we want to give more weight to the correct prediction of higher costs). I have to think about it.

vdbergh commented 5 years ago

Here is the result of a simulation of the adhoc visual fit

http://hardy.uhasselt.be/Toga/adhoc_model_sim.png

(blue is the model, green is the data). The simulation was done with this elo prior

http://hardy.uhasselt.be/Toga/skew_norm.png

which is a skew_normal distribution

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.skewnorm.html

with parameters a=-4,scale=2.4,loc=0.7.

To get a feel for the predictive power of such a model I also compared the average length/pass rate of the data with the simulated ones. The result is (with 95% confidence intervals)

data_length=18270(17884,18657) 
sim_length= 18526(18128,18924) 
data_pass=   0.031(0.027,0.036) 
sim_pass=    0.042(0.037,0.047)

So the prediction of the expected length is surprisingly good. The pass rate prediction is a bit less good but still not dramatically bad.

Of course I know it is not theoretically correct to validate a model on the same data that was used to fit it, but since the fitting did not use these two particular benchmarks (length and pass rate) there may be some value in it. Once I have done a proper fit I will use theoretically correct cross-validation.

vondele commented 5 years ago

@vdbergh very nice results and excellent fit... I'll see if I can update the notebook to use this form later tonight.

The elo prior also confirms the statement I've made above, that >2 elo patches are very few.

xoto10 commented 5 years ago

A suggestion for the color-coding: we could halve the size of the existing yellow band, so instead of everything with W>L being yellow or green, only the top half of the current yellow band would be yellow; the bottom half would either join the red "fail" band, or we could create a new "neutral" result (no color?) for results where W~=L. We could store the result in the db as well, which would be useful for later retrieval.

vondele commented 5 years ago

Notebook updated with the Elo prior by @vdbergh

vdbergh commented 5 years ago

Notebook updated with the Elo prior by @vdbergh

Thx! I am doing the formal fitting now. It is an interesting learning experience.

EDIT: Well, after some experimenting it seems that formally fitting a skew normal elo prior to the observations (using some reasonable loss function) does not decisively improve its predictive power (usually the cost prediction is quite good, but the pass prediction is overly optimistic compared to the data we have). So it seems the ad hoc, visually fitted elo prior might be good enough for now (given that fishtest is too erratic anyway to expect a rigid underlying model governing it).

Alayan-stk-2 commented 5 years ago

I ran the tool updated by the efforts of @vdbergh and @vondele; the skew normal distribution is a nicer model than the very rough linear array interpolation. Though, for some reason, it still shows many fewer patches losing more than 2 elo than the linear array interpolation - is it because of the bias of the raw SPRT elo results?

Once again, with a normal-like distribution, we get a smaller number of high elo-gain patches represented and so a higher elo-gain ratio.

I get these results:

Limits | [0, 5] + [0, 5] | [0.5, 4.5] + [0, 3.5] | [1, 4] + [0, 3] | [0.5, 4] + [0, 3] | [0.5, 4] + [0, 3.5] | [0.4, 4.4] + [0, 3.2]
0 ELO pass prob | 0.0025 | 0.00123 | 0.00036 | 0.0011 | 0.0011 | 0.0014
1 ELO pass prob | 0.0744 | 0.1002 | 0.0805 | 0.1451 | 0.1183 | 0.1276
+1 ELO rejectance ratio | 0.8201 | 0.7547 | 0.7625 | 0.6718 | 0.7049 | 0.7137
+1.5 ELO rejectance ratio | 0.6310 | 0.5160 | 0.4865 | 0.3968 | 0.4249 | 0.4679
+2 ELO rejectance ratio | 0.3936 | 0.2709 | 0.2026 | 0.1700 | 0.1819 | 0.2374
Total ELO gain ratio | 1.0 | 1.3192 | 1.2373 | 1.7596 | 1.5664 | 1.5523
-0 ELO acceptance ratio | 2.5e-04 | 9.1e-05 | 2.2e-05 | 1.0e-04 | 7.77e-05 | 1.0e-04
Avg. STC cost | 18431 | 24456 | 34574 | 30936 | 30936 | 25110
Avg. STC + LTC Cost | 27931 | 38039 | 45286 | 50604 | 46093 | 42990

[0.5, 4.5] + [0, 3.5] is clearly superior to [1, 4] + [0, 3], consuming significantly fewer resources while getting more elo (albeit by passing somewhat inferior patches on average).

Meanwhile, [0.5, 4] + [0,3] gets an awesome elo gain ratio with this patch distribution, albeit at the highest overall cost.

This pushed me to try some other variations. [0.5, 4] + [0, 3.5] reduces the overall cost by around 10% but gives a big hit to the elo gain ratio compared to [0.5, 4] + [0, 3]. Instead, [0.4, 4.4] + [0, 3.2] strikes a better balance. Compared to [0.5, 4.5] + [0, 3.5], it increases costs by +12% while the expected elo gain with this patch distribution is +17% (and usually, squeezing more elo out of the same patches requires a disproportionate increase in resource use).

I still think it's fine to do [0.5, 4.5] + [0, 3.5] now, to see how the behavior of people submitting tests adapts to this and to collect new and more relevant data on the patch distribution. But if this proves successful, it will likely be worth pushing further.

vdbergh commented 5 years ago

Though, for some reason, it still shows much less patches losing more than 2 elo than the linear array interpolation - is it because of the bias of the raw SPRT elo results ?

It is not caused by bias but by the noise of the elo estimator (fixed-length tests would have the same issue).

Assume every patch is exactly -1 elo. The elo estimator will return values according to a certain probability distribution whose median is -1. If one pretends this observed distribution is the true underlying elo distribution (which in this example is concentrated at -1), one gets incorrect results.
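
A toy numpy illustration of that point (the Gaussian noise with RMS ~2 is only a stand-in for the real SPRT estimator noise, roughly in line with the RMS errors quoted earlier):

import numpy as np

rng = np.random.default_rng(0)
true_elo = np.full(100_000, -1.0)                            # every patch is exactly -1 elo
observed = true_elo + rng.normal(0.0, 2.0, true_elo.size)    # noisy elo estimates

# The observed values spread over several elo even though the underlying
# distribution is a point mass at -1; treating them as the true distribution misleads.
print(observed.mean(), observed.std())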

So the challenge is to invert the transformation underlying distribution -> observed distribution.
This is what I tried to do.

Alayan-stk-2 commented 5 years ago

So the challenge is to invert the transformation underlying distribution -> observed distribution. This is what I tried to do.

Great, I had concerns about this very issue; it's nice that you were able to address it. This means that those latest estimates and the resulting table I produced above are the most reliable we have for now.

vdbergh commented 5 years ago

@vondele @Alayan-stk-2

In the end it turns out that a simple normal distribution is sufficient to explain the observations!! The very skewed nature of the elo measurements is probably simply due to the behavior of SPRT tests and not to the underlying elo distribution. Here is my proposal for the new prior.

Elo prior

As indicated it has mu=-1.013735, sigma=1.101212. I have done the fit using the cross entropy statistic after first filtering for outliers (although this turns out not to affect the result in the end).

Here are the results of a simulation for the median unbiased elo estimator.

Simulation

Of course a slightly better fit can be obtained by using a skew normal distribution with a non-zero shape parameter, but the optimal value for the shape parameter seems to be rather unstable. I have verified this using resampling https://en.wikipedia.org/wiki/Resampling_(statistics) . In other words, fitting a skew normal distribution to the data is not so much fitting as overfitting. And of course Occam's razor also dictates that one should go for the simplest model that fits the data.

Finally, the prior predicts expectation values of 18220 and 0.037 for, respectively, the length of a test and its pass rate. This is very close to the data, which has 18270 (17884, 18657) and 0.031 (0.027, 0.036).

EDIT. I succeeded in calculating the theoretical distribution of the median unbiased elo estimator. Here is the comparison of the theoretical distribution (assuming the above prior) with the observed data

Theoretical comparison

I think this is even more convincing than the simulation!

EDIT2. Here is a graph illustrating the filtering effect of STC(0,5) and LTC(0,5). The resulting distributions seem to be very close to normal (which is nice).

Patches submitted to LTC

In principle it is testable if the LTC prior corresponds to reality but the common use of speculative LTC obscures the available data.

Note that the model predicts that the expected elo for a patch which passes STC(0,5)+LTC(0,5) is equal to 1.44. This is quite a bit higher than the figure of 55/80=0.69 mentioned above by @Vizvezdenec which is obtained from SF9->SF10 development. One should note however that many things are unmodeled.

Finally the figure 0.69 comes with its own uncertainty which I would not know immediately how to quantify.

vdbergh commented 5 years ago

(Here I am recording a thought, perhaps for later reference.) With a reasonable elo prior the Bayesian approach for estimating elo actually becomes feasible (with the usual - unrealistic - uniform prior it gives heavily biased results for SPRT tests). The Bayesian approach has the potential to be substantially more accurate than the currently available methods which are purely frequentist and hence do not use historical information. Note that as the elo prior is likely to change over time it would have to be continuously updated based on the last N tests. Another complicating issue is that the elo prior depends of course on the type of test (simplification or elo mining).

Estimating elo using an elo prior can be done on the client side (it is trivial mathematically for a normal prior), but updating the hyperparameters of the elo prior(s) (mu, sigma for the normal model) is something that would have to be done on the fishtest side (one can use a Bayesian method for this as well).
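
A sketch of what "trivial mathematically for a normal prior" means in practice: the standard conjugate-normal update, assuming the elo measurement of a single test can be approximated as Gaussian with a known standard error (itself an approximation for SPRT results):

def posterior_elo(prior_mu, prior_sigma, measured_elo, measured_se):
    # Precision-weighted combination of the prior and the measurement.
    w_prior = 1.0 / prior_sigma ** 2
    w_meas = 1.0 / measured_se ** 2
    mu = (w_prior * prior_mu + w_meas * measured_elo) / (w_prior + w_meas)
    sigma = (w_prior + w_meas) ** -0.5
    return mu, sigma

# Example with the fitted prior above and a hypothetical test estimate of +2 elo (se 1.5).
print(posterior_elo(-1.013735, 1.101212, 2.0, 1.5))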

FauziAkram commented 5 years ago

When will the bounds changes take place?

vdbergh commented 5 years ago

For the record: I am not terribly convinced that the new bounds are better. The elo_gain/STC_game ratio actually seems to go down a bit (see e.g. the data by @Alayan-stk-2 above). Of course if one considers fishtest resources to be infinite then the situation is different.

EDIT. A good argument for the change though (proposed by @vondele I think) is that narrower bounds give more accurate elo estimates. This is of scientific value.

EDIT2. Also fewer very low or negative elo patches will make it through. This should also be a good thing since apparently it can be done without affecting the metric elo_gain/STC_game very much.

EDIT3. For those who are wondering: for the normal prior the current elo_gain/STC_game is (if I did not make a mistake) 5.7e-6. After the change it would be 5.5e-6.