kiudee / chess-tuning-tools

A collection of scripts aimed at efficiently tuning chess engine parameters.
https://chess-tuning-tools.readthedocs.io/en/latest/

Feature proposal: confidence intervals of current optima expected scores #84

Closed Claes1981 closed 4 years ago

Claes1981 commented 4 years ago

Are the "Estimated value" outputted after "Current optimum" of each iteration, the average score (where 2 wins is -1 and 2 losses is +1) the tuner expects if you would run a match ("engine1" against "engine2" under current settings) with the "Current optimum" parameters applied to "engine1", for an infinite number of rounds?

I wonder if it would be possible to also compute and print the confidence intervals for this score?

I am thinking that if the confidence intervals of the parameters are large, and at the same time the confidence interval of the "Estimated value" is also large, it would indicate that more samples/games are needed?

However, if the confidence interval of the "Estimated value" is small while the confidence intervals of the parameters are still large, it would indicate that the parameters do not have a strong influence on engine strength?

kiudee commented 4 years ago

Are the "Estimated value" outputted after "Current optimum" of each iteration, the average score (where 2 wins is -1 and 2 losses is +1) the tuner expects if you would run a match ("engine1" against "engine2" under current settings) with the "Current optimum" parameters applied to "engine1", for an infinite number of rounds?

The score in the new version of chess-tuning-tools (≥0.5.0) is the negative Elo divided by 100. So if you multiply the score by -100 you get the (estimated) Elo rating of the optimum.
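For illustration, a minimal sketch of that conversion (a hypothetical helper, not part of the package):

# The relationship stated above: score = -Elo / 100, so Elo = -100 * score.
def score_to_elo(score):
    return -100.0 * score

print(score_to_elo(-0.25))  # a score of -0.25 corresponds to about +25 Elo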

> I wonder if it would be possible to also compute and print the confidence intervals for this score?

Printing the confidence interval for the score is very easy to do. Thanks for the suggestion.

> I am thinking that if the confidence intervals of the parameters are large, and at the same time the confidence interval of the "Estimated value" is also large, it would indicate that more samples/games are needed?

> However, if the confidence interval of the "Estimated value" is small while the confidence intervals of the parameters are still large, it would indicate that the parameters do not have a strong influence on engine strength?

I think in general, if the confidence intervals for the parameters are large, then it always means that you would need more iterations to be sure of the best position. But I completely agree with your assessment that it could also be a region where the configurations are practically equivalent. Another good diagnostic is to look at the range of values on the partial dependence plot. If the range covers only ±5 Elo, then it is clear that the parameters have little effect and it will take a lot of games to converge.

Claes1981 commented 4 years ago

> The score in the new version of chess-tuning-tools (≥0.5.0) is the negative Elo divided by 100. So if you multiply the score by -100 you get the (estimated) Elo rating of the optimum.

I see, and "engine2" with it's fixed parameters and it's given time control is then assumed to have Elo 0?

> Another good diagnostic is to look at the range of values on the partial dependence plot. If the range covers only ±5 Elo, then it is clear that the parameters have little effect and it will take a lot of games to converge.

But it could also mean that you simply have too few samples/games (as explained in the FAQ)? See for example this plot of an experiment with only 90 iterations/180 games so far, but with a total of 6 parameters to tune: [plot attachment: 20200823-191852-90]

I expect at least the number of threads to be worth more than 1 Elo (as indicated in the plot), and I expect that I will need a lot more games to get closer to the parameter's true effect on Elo.

kiudee commented 4 years ago

> The score in the new version of chess-tuning-tools (≥0.5.0) is the negative Elo divided by 100. So if you multiply the score by -100 you get the (estimated) Elo rating of the optimum.

I see, and "engine2" with it's fixed parameters and it's given time control is then assumed to have Elo 0?

The Elo of engine2 in our case is exactly the negative value of the Elo of engine1. At an Elo of 0, we have exactly 50% winrate.
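For reference, the standard logistic Elo model behind that statement (a sketch, not code from the package):

# Expected score (win rate) of engine1 at a given Elo advantage,
# using the standard logistic Elo curve.
def expected_score(elo_diff):
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

print(expected_score(0))   # 0.5, i.e. exactly 50% win rate at Elo 0
print(expected_score(50))  # ~0.571, i.e. a +50 Elo engine scores about 57%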

> Another good diagnostic is to look at the range of values on the partial dependence plot. If the range covers only ±5 Elo, then it is clear that the parameters have little effect and it will take a lot of games to converge.

> But it could also mean that you simply have too few samples/games (as explained in the FAQ)? See for example this plot of an experiment with only 90 iterations/180 games so far, but with a total of 6 parameters to tune: [plot attachment: 20200823-191852-90]

> I expect at least the number of threads to be worth more than 1 Elo (as indicated in the plot), and I expect that I will need a lot more games to get closer to the parameter's true effect on Elo.

Yes, for a 6-parameter tune you need many more iterations (at least 1k).

Claes1981 commented 4 years ago

> The score in the new version of chess-tuning-tools (≥0.5.0) is the negative Elo divided by 100. So if you multiply the score by -100 you get the (estimated) Elo rating of the optimum.

I see, and "engine2" with it's fixed parameters and it's given time control is then assumed to have Elo 0?

> The Elo of engine2 in our case is exactly the negative value of the Elo of engine1. At an Elo of 0, we have exactly 50% winrate.

Thanks for the clarification, good to know.

> Yes, for a 6-parameter tune you need many more iterations (at least 1k).

Thank you for your hint about the magnitude of needed iterations.

However, without a deep understanding of how Bayesian Gaussian process regression works, I would guess that, generally, if some of the parameters in a many-parameter tune have more effect on Elo than the others, they should converge towards their true optima faster (in fewer games) than those with less effect on Elo?

kiudee commented 4 years ago

> However, without a deep understanding of how Bayesian Gaussian process regression works, I would guess that, generally, if some of the parameters in a many-parameter tune have more effect on Elo than the others, they should converge towards their true optima faster (in fewer games) than those with less effect on Elo?

True: as soon as the model has collected enough points to deduce that some parameters are more important than others, it will converge faster in those parameters.
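A toy sketch of the mechanism (illustrative only, using a plain scikit-learn GP rather than the tuner's internals): with an anisotropic (ARD) Matérn kernel, the GP learns one length scale per parameter, and unimportant parameters end up with long length scales.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 2))      # two toy "parameters"
y = np.sin(6 * X[:, 0]) + 0.05 * X[:, 1]  # only the first one matters much

# One length scale per dimension (ARD); a short scale marks an influential parameter.
kernel = Matern(length_scale=[1.0, 1.0], nu=2.5)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
print(gp.kernel_.length_scale)  # expect a short scale for dim 0, a long one for dim 1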

Claes1981 commented 4 years ago

I have now completed 803 iterations/1606 games: [plot attachment: 20201001-162933-803]

The partial dependence plot of the Threads parameter now indicates about 15 Elo difference between 1 and 6 threads. I assume this shows that the tuner has found a probable real Elo effect of the Threads parameter, since the Elo difference is now much larger than the initial 2 Elo scale of the plots?

The plot indicates only 3 Elo difference between 6 and 8 threads so far. Can you at this point say that, according to the plot, it is slightly more probable the true optimum lies at 6 threads than at 8 threads, or is this difference probably only random noise?

(With this experiment I am curious to see, among other things, how well hyper-threading works on my 4-core laptop, and the effect of different numbers of threads on strength, also considering the down-clocking the CPU does when using more threads. I am also curious whether I can find a good setting for the Syzygy bases on my NVMe drive, or whether I should turn them off completely for best performance.)

I really do not expect much effect from the Move Overhead parameter (0 to 5 seconds), since I have not noticed any time losses and I use a time control where the engine gets 15 seconds extra at each move. So the circa 7 Elo difference the plot still shows for this parameter after 1600 games must be noise, although it is greater than the roughly 1 Elo difference in the earlier plot?

Also, I guess the fact that two of the parameters can only take 8 discrete values reduces the required number of games and iterations, compared to if all parameters were continuous?

Part of log:

2020-10-01 16:12:27,932 INFO     Importing 803 existing datapoints. This could take a while...
2020-10-01 16:29:18,070 INFO     Importing finished.
2020-10-01 16:29:18,071 INFO     Starting iteration 803
2020-10-01 16:29:32,335 INFO     Current optimum:
{'Threads': 6, 'SyzygyProbeDepth': 57, 'SyzygyProbeLimit': 2, 'Hash': 5391, 'Slow Mover': 590, 'Move Overhead': 3582}
2020-10-01 16:29:32,335 INFO     Estimated value: -0.4368 +- 0.148
2020-10-01 16:29:32,335 INFO     80.0% confidence interval of the value: (-0.6265, -0.2471)
2020-10-01 16:29:32,495 INFO     80.0% confidence intervals of the parameters:
Parameter         Lower bound  Upper bound
------------------------------------------
Threads                     2            7
SyzygyProbeDepth           24           99
SyzygyProbeLimit            1            6
Hash                     2049         8166
Slow Mover                257          985
Move Overhead            1236         4970
kiudee commented 4 years ago

> The partial dependence plot of the Threads parameter now indicates about 15 Elo difference between 1 and 6 threads. I assume this shows that the tuner has found a probable real Elo effect of the Threads parameter, since the Elo difference is now much larger than the initial 2 Elo scale of the plots?

From this plot, in conjunction with the confidence intervals, I would deduce that it is likely that 1-3 threads and 8 threads are suboptimal.

> The plot indicates only 3 Elo difference between 6 and 8 threads so far. Can you at this point say that, according to the plot, it is slightly more probable the true optimum lies at 6 threads than at 8 threads, or is this difference probably only random noise?

It is still uncertain, but yes, I would say you can rule out 8 threads.

> I really do not expect much effect from the Move Overhead parameter (0 to 5 seconds), since I have not noticed any time losses and I use a time control where the engine gets 15 seconds extra at each move. So the circa 7 Elo difference the plot still shows for this parameter after 1600 games must be noise, although it is greater than the roughly 1 Elo difference in the earlier plot?

I agree this looks like it could be noise. Could you post your data and config file? Then I can take a look.

> Also, I guess the fact that two of the parameters can only take 8 discrete values reduces the required number of games and iterations, compared to if all parameters were continuous?

Not so much: the integer parameters are internally also represented as continuous parameters. Since the Gaussian process infers the smoothness of each parameter (with suitable a priori assumptions), it can also inter- and extrapolate. Thus continuous parameters are not really worse off than integer ones.
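A sketch of that representation using scikit-optimize (which chess-tuning-tools builds on via bayes-skopt); the actual internals may differ:

from skopt.space import Integer

# An integer dimension is warped onto a continuous range for the model
# and rounded back to an integer when proposing concrete engine options.
threads = Integer(1, 8, transform="normalize")
print(threads.transform([1, 4, 8]))      # continuous values in [0, 1]
print(threads.inverse_transform([0.5]))  # back to an integer, e.g. [4]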

By the way, version 0.7.0b0 now features continuous log output from cutechess-cli, which could be useful.

Claes1981 commented 4 years ago

Thanks,

Config file:

{
    "engines": [
        {
            "command": "nice --5 /partitions/Sandisk/xfs/media/data/chess/engines/Stockfish_development_versions/stockfish_20080713_bmi2_pgo",
            "fixed_parameters": {
                "Use NNUE": true,
                "EvalFile": "/partitions/Sandisk/xfs/media/data/chess/engines/Stockfish-NNUE/eval/Sergio-20200728-1442.bin",
                "SyzygyPath": "/partitions/nvme/xfs/chess/Syzygy",
                "Contempt": 11
            }
        },
        {
            "command": "nice --5 /partitions/Sandisk/xfs/media/data/chess/engines/Stockfish_development_versions/stockfish_20080713_bmi2_pgo",
            "fixed_parameters": {
                "Use NNUE": true,
                "EvalFile": "/partitions/Sandisk/xfs/media/data/chess/engines/Stockfish-NNUE/eval/Sergio-20200728-1442.bin",
                "Threads": 8,
                "Hash": 1024,
                "Contempt": 11,
                "Move Overhead": 300
            }
        }
    ],
    "parameter_ranges": {
        "Threads": "Integer(1, 8)",
        "SyzygyProbeDepth": "Integer(1, 100)",
        "SyzygyProbeLimit": "Integer(0, 7)",
        "Hash": "Integer(1, 8192)",
        "Slow Mover": "Integer(10, 1000)",
        "Move Overhead": "Integer(0, 5000)"
    },
    "engine1_tc": "15+15",
    "engine2_tc": "1+1",
    "rounds": 1,
    "opening_file": "empty.pgn",
    "adjudicate_draws": false,
    "adjudicate_resign": false
} 

Data file (zipped to be able to upload): Stockfish_20080713_bmi2_pgo_S200728-1442_b.zip

kiudee commented 4 years ago

I would definitely try to switch to this setting now to let it focus on the optimum:

    "acq_function": "mes",
    "acq_function_samples": 1,
Claes1981 commented 4 years ago

Really? I am more interested in where the optimum is (or at least in parameters that are close in performance to the absolute optimum) than in the exact Elo/performance of that optimum. Is MES better than PVRS for that too, in my case, at this point?

I will test the actual performance of the found parameters in cutechess-cli matches/gauntlets against the old parameters, once the tuner no longer seems to give me any clearer results within an acceptable number of additional games.

Besides, the results after letting the tuner run for several iterations in a row (most recently 12) seem to differ quite a bit from the results right after resuming and re-initializing:

12th iteration after the last resume (iteration 825 in total): [plot attachment: 20201002-182842-825] A really weird 15 Elo difference for Move Overhead here.

First iteration after resuming again (still iteration 825): [plot attachment: 20201002-184941-825]

I have been running with --gp-initial-burnin 200 --gp-burnin 50 (on the command line) lately; I guess the different results might have something to do with the different burn-in values... Maybe it is also just a sign that the number of games is not yet enough for the tuner to be sure?
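For reference, a resume invocation of that shape, assuming the documented tune local entry point ("config.json" stands in for the actual configuration file name):

tune local -c config.json --gp-initial-burnin 200 --gp-burnin 50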

Updated data: Stockfish_20080713_bmi2_pgo_S200728-1442_b.zip

kiudee commented 4 years ago

> Really? I am more interested in where the optimum is (or at least in parameters that are close in performance to the absolute optimum) than in the exact Elo/performance of that optimum. Is MES better than PVRS for that too, in my case, at this point?

Predictive variance reduction search (PVRS, the default) estimates the distribution of the optimum and tries to sample points which reduce the uncertainty at points sampled from that distribution. While this is very robust, in high dimensions it runs into the problem that the optimum distribution is vast, and it takes a long time until that distribution is restricted enough for the search to really focus.

Maximum-value entropy search (MES) only tries to learn about the value of the optimum. That distribution concentrates much faster, and it is also evaluated pointwise; thus, in that sense, it is independent of the dimensionality of the space. Of course it has its own problems and is prone to running into local optima while the landscape is still underexplored.
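A toy illustration of the difference (using a generic scikit-learn GP, not the actual bask implementations): PVRS works with posterior samples of where the optimum lies, while MES works with posterior samples of the optimum's value, a one-dimensional quantity that concentrates faster.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (30, 1))
y = np.sin(6 * X[:, 0]) + 0.1 * rng.standard_normal(30)
gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)

grid = np.linspace(0, 1, 200)[:, None]
samples = gp.sample_y(grid, n_samples=50)   # draws from the GP posterior
argmax_dist = grid[samples.argmax(axis=0)]  # PVRS-style object: optimum locations
max_dist = samples.max(axis=0)              # MES-style object: optimum values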

> Maybe it is also just a sign that the number of games is not yet enough for the tuner to be sure?

Indeed, that’s another way to see how uncertain the model still is here.

Claes1981 commented 4 years ago

Okay, thanks for the explanation. I guess that in high dimensions the best strategy might then be, if you can afford that many games, to run with PVRS until you think it has found the global optimum very roughly, and then switch to MES.

I guess another reason to use MES could be that the region around the optimum might be very steep in some dimensions/parameters, so with PVRS alone, even if you miss the optimum only slightly in parameter values, the miss in Elo might still be big.

Claes1981 commented 4 years ago

After running with PVRS until iteration 848 (except for a few iterations with MES and VR), I then switched to MES from iteration 849. Comparing the confidence intervals at iteration 848 and at iteration 1021:

2020-10-03 17:09:22,624 INFO     Importing 848 existing datapoints. This could take a while...
2020-10-03 17:28:56,520 INFO     Importing finished.
2020-10-03 17:28:56,523 INFO     Starting iteration 848
2020-10-03 17:29:12,646 INFO     Current optimum:
{'Threads': 6, 'SyzygyProbeDepth': 55, 'SyzygyProbeLimit': 2, 'Hash': 5486, 'Slow Mover': 600, 'Move Overhead': 3728}
2020-10-03 17:29:12,646 INFO     Estimated value: -0.4373 +- 0.1474
2020-10-03 17:29:12,646 INFO     80.0% confidence interval of the value: (-0.6262, -0.2484)
2020-10-03 17:29:12,877 INFO     80.0% confidence intervals of the parameters:
Parameter         Lower bound  Upper bound
------------------------------------------
Threads                     3            7
SyzygyProbeDepth           14           91
SyzygyProbeLimit            1            6
Hash                     2610         7958
Slow Mover                205          940
Move Overhead            1320         4845

2020-10-03 17:29:12,878 DEBUG    Starting to compute the next plot.
2020-10-14 18:46:32,198 INFO     Importing 1021 existing datapoints. This could take a while...
2020-10-14 19:13:11,969 INFO     Importing finished.
2020-10-14 19:13:11,970 INFO     Starting iteration 1021
2020-10-14 19:13:31,464 INFO     Current optimum:
{'Threads': 6, 'SyzygyProbeDepth': 56, 'SyzygyProbeLimit': 3, 'Hash': 5452, 'Slow Mover': 579, 'Move Overhead': 3341}
2020-10-14 19:13:31,465 INFO     Estimated value: -0.4105 +- 0.1223
2020-10-14 19:13:31,465 INFO     80.0% confidence interval of the value: (-0.5672, -0.2538)
2020-10-14 19:13:31,661 INFO     80.0% confidence intervals of the parameters:
Parameter         Lower bound  Upper bound
------------------------------------------
Threads                     2            7
SyzygyProbeDepth           18           95
SyzygyProbeLimit            1            6
Hash                     2019         7942
Slow Mover                224          984
Move Overhead              21         3788

2020-10-14 19:13:31,663 DEBUG    Starting to compute the next plot.

I saw something in the changelog about removing noise from the intervals, so I am not sure how reliable the intervals at iteration 848 above are. The confidence interval of the value is in any case narrower at iteration 1021 (as I would also expect it to be). The "80.0% confidence interval of the value" differs from the "+-" interval on the line above it, though, so I guess it is a different kind of interval? What is the difference between the two intervals?
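Side note on the numbers themselves: the logged values are consistent with the "+-" being one standard deviation and the 80% interval being the corresponding normal-approximation interval at the 0.9 quantile (an observation from the log above, not a statement about the tuner's internals):

from scipy.stats import norm

# Iteration 1021: "Estimated value: -0.4105 +- 0.1223"
mean, sd = -0.4105, 0.1223
z = norm.ppf(0.9)                    # ~1.2816, the 0.9 quantile of N(0, 1)
print(mean - z * sd, mean + z * sd)  # ~(-0.5672, -0.2538), matching the log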

Then again, the confidence intervals of the parameters actually seem to be wider at iteration 1021, at least for Threads and Hash. (I think I used --gp-initial-burnin 200 in both cases when resuming at those iterations.) Is that some kind of noise, or is it expected when switching to MES?

Full log: Stockfish_20080713_bmi2_pgo_S200728-1442_b.log Data: Stockfish_20080713_bmi2_pgo_S200728-1442_b.npz.zip