Closed joergoster closed 5 years ago
Seems reasonable. Let us know how it goes :-) If you use GP you should also try playing around with the noise parameter. Helped me.
Thanks!
My fix was a bad one. It uses the same opening for all the games, and accidentally changes the default from 2 to 1.
I get an error when running.
```
Traceback (most recent call last):
  File "tune-jo.py", line 438, in <module>
    asyncio.run(main())
  File "/usr/lib/python3.7/asyncio/runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.7/asyncio/base_events.py", line 584, in run_until_complete
    return future.result()
  File "tune-jo.py", line 287, in main
    m = re.match('count=(\d+)', n)
NameError: name 're' is not defined
```
Edit: this was with set adjudication rule, of course.
Missing import re.
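For reference, the failing line only needs the standard-library `re` module. A minimal, self-contained sketch of that kind of `count=N` option parsing (the helper name is made up for illustration, not tune.py's actual code):

```python
import re  # the missing import that caused the NameError

def parse_adjudication(args):
    """Parse win-adjudication options like ['count=4', 'score=400']
    into a dict of ints. (Illustrative helper, not tune.py's code.)"""
    opts = {}
    for n in args:
        m = re.match(r'(\w+)=(\d+)', n)
        if m:
            opts[m.group(1)] = int(m.group(2))
    return opts

print(parse_adjudication(['count=4', 'score=400']))  # {'count': 4, 'score': 400}
```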
Check commit 1f00dfd9. This time I have actually tested it :-) (aka, we really need unit tests.)
@thomasahle To be honest, I would even prefer the default of 1 game. When measuring elo difference(s) between engines, it makes a lot of sense to play each opening twice with reversed colours. But for tuning purposes we want to gather as much information as we can. However, it probably won't matter much in the end. ;-) Oops, just not in time.
@fsmosca Thanks.
Will give it a try after breakfast. :-)
OK, everything seems to be working as expected now. I do like the normalization of the results into [-1, +1].
What I do observe is that the optimizer seems to settle on some values too early, although these runs were done with only 200 iterations. I think we need to find some combination of settings that tells the optimizer to prefer more exploration, or we need to increase the initial random sampling significantly.
@joergoster Did you try increasing the `-acq-noise`?
Now that you can increase games per encounter, maybe you can go back to using GP.
I like `-acq-noise 10` for optimizing just two parameters.
I don't know if the other methods have similar parameters for forcing them to explore more.
@thomasahle Yes, I tried -acq-noise 10 and 20 with GP. In both cases the values settled to the min or max values of the parameters. The most 'balanced' values are the ones obtained with RF or ET as base estimator and LCB or EI as acquisition function, though. But there are so many possible combinations, one can hardly test them all.
Here https://towardsdatascience.com/hyperparameter-optimization-in-python-part-1-scikit-optimize-754e485d24fe is an interesting article where good results were achieved with ET and EI respectively.
He also mentions that the LCB acquisition function may be tweaked via `kappa`, and the EI and PI acquisition functions via `xi`.
The `forest_minimize` method was the clear winner, but in order to get good results it was crucial to tweak the (hyper)hyperparameters a bit. For the LCB acquisition function, a lower value of kappa (exploitation) was better. However, the best result was achieved with the EI acquisition function. Again, tweaking the xi parameter was needed.
Well, for now I want to know if increasing the number of games per encounter does help or not. I think I will also go with the ET and EI combination for now. I will run 4,000 iterations each, with 2, 8 and 16 games per encounter, validate the results and report back in a few days. ;-)
Sounds good. Meanwhile I've added xi and kappa support :-)
The first results are quite disappointing. :-(
The 4,000 iterations with the default of 2 games per encounter gave these values
```
Summarizing best values
Best expectation (κ=0): [49 66 78 78]
Best expectation (κ=1): [45 77 37 18]
Best expectation (κ=2): [45 72 26 0]
Best expectation (κ=3): [45 72 26 0]
```
which resulted in a loss of about 8 elo after 4,000 games of validation.
```
Elo difference: -7.5 +/- 9.9, LOS: 7.0 %, DrawRatio: 14.9 %
```
And the ones achieved with 8 games per encounter
```
Summarizing best values
Best expectation (κ=0): [79 70 41 8]
Best expectation (κ=1): [64 65 18 1]
Best expectation (κ=2): [73 80 13 13]
Best expectation (κ=3): [73 80 13 13]
```
performed even worse, about -30 elo !!!
```
Elo difference: -29.1 +/- 9.9, LOS: 0.0 %, DrawRatio: 15.1 %
```
Honestly, I don't think I will spend the time and resources to run the test with 16 games per encounter ...
That's indeed disappointing. Did you also run 4 games per encounter? Which acquisition function are you using, and with what parameters? Do you still think this is an issue of overfitting?
No, only 2 and 8 games per encounter so far.
Parameters in use:
```
-base-estimator ET -acq-func EI -n 4000 -n-initial-points 400 -win-adj count=4 score=400
```
Not sure how to continue ...
Do we have to modify the objective function, i.e. what we pass to the optimizer? Currently we pass single results, so a maximum result of +1 at a single point is considered a minimum by the optimizer. Probably THE minimum ...
But we want to maximize over ALL results, so somehow we need to pass the summed-up results, no? I hope you get what I mean. Edit: Not sure, though, if this is how Bayesian Optimization works.
I thought you had decided that GBRT was best in your case?
Maybe try upping kappa (to something like 10?) and reducing n-initial-points.
Also, are you sure EI is the best acq-func? Why not just leave it at the default?
For acq-func other than LCB you need to control exploration with the xi parameter. Not sure whether higher or lower values are better.
I know this whole question is about games-per-encounter, but I still don't think it should be necessary if exploration is configured properly.
Well, that was after some very preliminary tests only. ;-) But I will try increasing kappa and reducing n-initial-points.
Skopt should support multiple different values at the same point. So there shouldn't be a need to pass a sum. It doesn't expect a single value is "the truth".
OK, after some more thinking I guess you're right. Reducing the noise in our evaluations by playing more games per encounter (which probably only helps by a small amount anyway) is likely less important than covering a sufficiently large number of sample points. And finding the right balance between exploration and exploitation also seems very important in the case of chess.
There is still one thing in your script that I do not understand, though. How do you calculate the best values? Do you happen to have a link where this is being explained?
Thanks for this clarification. I didn't know this.
I think I will try lowering kappa, too! It's possible that the optimizer works better with more exploitation. Who knows ...
I have also been thinking that changing 'acq_optimizer' from 'lbfgs' to 'sampling' might be a good idea. This changes the way skopt picks the next point to explore. It needs to find the position with the best LCB, EI or PI value, and it can do this either by trying a lot of random points (sampling) or by "gradient descent"-style function optimization. Since our functions are extremely noisy, the descent-type approach might not find a very good value, even if our LCB/EI/PI settings are actually good.
Sampling random points is also how I calculate the best values at the end of the script.
Well, actually I do a mix of random values and all the values that the optimizer has visited so far.
We currently do `n` random samples. It is possible that increasing this (to, say, 10*n or just a constant 10_000) would improve our final prediction as well.
Interesting.
For further experiments I tweaked my test engine (Stockfish) so that I can run tests with a much smaller parameter space: shrinking the integer interval for the 4 parameters from [0, 80] to [0, 8] reduces the search space from 81^4 to 9^4 points. This will hopefully help to get meaningful results much faster.
@thomasahle Me again. I'm almost convinced that this optimization method simply doesn't work for chess.
Look at this:
```
Loading engines
Parsing options
Unable to open data08-gbrt-lcb-800.log
Starting 2 games 0/800 with {'mRookOnFile[0]': 4, 'eRookOnFile[0]': 1, 'mRookOnFile[1]': 4, 'eRookOnFile[1]': 0}
Starting 2 games 1/800 with {'mRookOnFile[0]': 2, 'eRookOnFile[0]': 4, 'mRookOnFile[1]': 8, 'eRookOnFile[1]': 6}
Starting 2 games 2/800 with {'mRookOnFile[0]': 1, 'eRookOnFile[0]': 1, 'mRookOnFile[1]': 8, 'eRookOnFile[1]': 1}
Starting 2 games 3/800 with {'mRookOnFile[0]': 5, 'eRookOnFile[0]': 0, 'mRookOnFile[1]': 0, 'eRookOnFile[1]': 3}
Best params: [4, 1, 4, 0]
Finished game 0 [4, 1, 4, 0] => -1.0 (1-0, 0-1)
Starting 2 games 4/800 with {'mRookOnFile[0]': 2, 'eRookOnFile[0]': 4, 'mRookOnFile[1]': 3, 'eRookOnFile[1]': 3}
Best params: [5, 0, 0, 3]
Finished game 3 [5, 0, 0, 3] => 0.0 (1-0, 1-0)
Starting 2 games 5/800 with {'mRookOnFile[0]': 8, 'eRookOnFile[0]': 3, 'mRookOnFile[1]': 4, 'eRookOnFile[1]': 0}
Best params: [5, 0, 0, 3]
Finished game 1 [2, 4, 8, 6] => -0.5 (1/2-1/2, 0-1)
Starting 2 games 6/800 with {'mRookOnFile[0]': 5, 'eRookOnFile[0]': 0, 'mRookOnFile[1]': 2, 'eRookOnFile[1]': 5}
Best params: [5, 0, 0, 3]
Finished game 2 [1, 1, 8, 1] => 0.0 (1/2-1/2, 1/2-1/2)
Starting 2 games 7/800 with {'mRookOnFile[0]': 5, 'eRookOnFile[0]': 2, 'mRookOnFile[1]': 8, 'eRookOnFile[1]': 8}
Best params: [5, 0, 0, 3]
Finished game 4 [2, 4, 3, 3] => 0.0 (1-0, 1-0)
Starting 2 games 8/800 with {'mRookOnFile[0]': 4, 'eRookOnFile[0]': 5, 'mRookOnFile[1]': 5, 'eRookOnFile[1]': 5}
Best params: [5, 0, 0, 3]
Finished game 8 [4, 5, 5, 5] => 0.0 (1-0, 1-0)
Starting 2 games 9/800 with {'mRookOnFile[0]': 1, 'eRookOnFile[0]': 0, 'mRookOnFile[1]': 1, 'eRookOnFile[1]': 0}
Best params: [5, 0, 0, 3]
Finished game 7 [5, 2, 8, 8] => 0.0 (1-0, 1-0)
Starting 2 games 10/800 with {'mRookOnFile[0]': 6, 'eRookOnFile[0]': 0, 'mRookOnFile[1]': 8, 'eRookOnFile[1]': 0}
Best params: [5, 0, 2, 5]
Finished game 6 [5, 0, 2, 5] => 0.5 (1/2-1/2, 1-0)
Starting 2 games 11/800 with {'mRookOnFile[0]': 6, 'eRookOnFile[0]': 5, 'mRookOnFile[1]': 4, 'eRookOnFile[1]': 4}
Best params: [5, 0, 2, 5]
Finished game 10 [6, 0, 8, 0] => 0.0 (0-1, 0-1)
Starting 2 games 12/800 with {'mRookOnFile[0]': 3, 'eRookOnFile[0]': 6, 'mRookOnFile[1]': 7, 'eRookOnFile[1]': 3}
Best params: [5, 0, 2, 5]
Finished game 5 [8, 3, 4, 0] => 0.5 (0-1, 1/2-1/2)
Starting 2 games 13/800 with {'mRookOnFile[0]': 8, 'eRookOnFile[0]': 8, 'mRookOnFile[1]': 3, 'eRookOnFile[1]': 8}
Best params: [5, 0, 2, 5]
Finished game 11 [6, 5, 4, 4] => 0.0 (1-0, 1-0)
Starting 2 games 14/800 with {'mRookOnFile[0]': 8, 'eRookOnFile[0]': 2, 'mRookOnFile[1]': 0, 'eRookOnFile[1]': 3}
Best params: [5, 0, 2, 5]
Finished game 9 [1, 0, 1, 0] => 0.0 (1-0, 1-0)
Starting 2 games 15/800 with {'mRookOnFile[0]': 4, 'eRookOnFile[0]': 1, 'mRookOnFile[1]': 2, 'eRookOnFile[1]': 0}
Best params: [5, 0, 2, 5]
Finished game 12 [3, 6, 7, 3] => 0.0 (1-0, 1-0)
Starting 2 games 16/800 with {'mRookOnFile[0]': 4, 'eRookOnFile[0]': 4, 'mRookOnFile[1]': 2, 'eRookOnFile[1]': 0}
Best params: [8, 8, 3, 8]
Finished game 13 [8, 8, 3, 8] => 1.0 (0-1, 1-0)
Starting 2 games 17/800 with {'mRookOnFile[0]': 4, 'eRookOnFile[0]': 8, 'mRookOnFile[1]': 2, 'eRookOnFile[1]': 5}
Best params: [8, 8, 3, 8]
Finished game 15 [4, 1, 2, 0] => 0.0 (1-0, 1-0)
Starting 2 games 18/800 with {'mRookOnFile[0]': 4, 'eRookOnFile[0]': 8, 'mRookOnFile[1]': 2, 'eRookOnFile[1]': 5}
Best params: [8, 8, 3, 8]
Finished game 18 [4, 8, 2, 5] => 0.0 (0-1, 0-1)
```
What do I do? The optimizer tracks the current best values in a result object, which is generated in the tell() method. See https://github.com/scikit-optimize/scikit-optimize/blob/master/skopt/optimizer/optimizer.py#L552-L554 So I simply do

```python
result = opt.tell(x, -y)  # opt is minimizing
print(f'Best params: {result.x}')
```
First, the best params are taken from the 1st iteration. (-1.0) ==> Best params: [4, 1, 4, 0] They immediately change because already the next result is better. (0.0) ==> Best params: [5, 0, 0, 3] Then it takes some iterations until we have a better result (0.5) ==> Best params: [5, 0, 2, 5] Finally, at finished game 13, we get a result of (1.0) ==> Best params: [8, 8, 3, 8]
And after this point they never change again! Even after the completion of 800 iterations. Simply because we can't get a better result than (1.0)! Never!
Here the last iterations:
```
Best params: [8, 8, 3, 8]
Finished game 794 [8, 8, 0, 0] => 0.0 (0-1, 0-1)
Starting 2 games 796/800 with {'mRookOnFile[0]': 8, 'eRookOnFile[0]': 8, 'mRookOnFile[1]': 3, 'eRookOnFile[1]': 6}
Best params: [8, 8, 3, 8]
Finished game 790 [8, 8, 3, 7] => 0.0 (1-0, 1-0)
Starting 2 games 797/800 with {'mRookOnFile[0]': 8, 'eRookOnFile[0]': 8, 'mRookOnFile[1]': 3, 'eRookOnFile[1]': 8}
Best params: [8, 8, 3, 8]
Finished game 793 [8, 8, 0, 0] => 0.5 (1/2-1/2, 1-0)
Starting 2 games 798/800 with {'mRookOnFile[0]': 8, 'eRookOnFile[0]': 2, 'mRookOnFile[1]': 3, 'eRookOnFile[1]': 0}
Best params: [8, 8, 3, 8]
Finished game 796 [8, 8, 3, 6] => 0.5 (0-1, 1/2-1/2)
Starting 2 games 799/800 with {'mRookOnFile[0]': 7, 'eRookOnFile[0]': 2, 'mRookOnFile[1]': 4, 'eRookOnFile[1]': 0}
Best params: [8, 8, 3, 8]
Finished game 797 [8, 8, 3, 8] => 1.0 (0-1, 1-0)
Best params: [8, 8, 3, 8]
Finished game 795 [8, 8, 4, 5] => 0.5 (0-1, 1/2-1/2)
Best params: [8, 8, 3, 8]
Finished game 799 [7, 2, 4, 0] => 1.0 (0-1, 1-0)
Best params: [8, 8, 3, 8]
Finished game 798 [8, 2, 3, 0] => -0.5 (1-0, 1/2-1/2)
```
```
Summarizing best values
Best expectation (κ=0): [6 6 4 8]
Best expectation (κ=1): [0 0 1 6]
Best expectation (κ=2): [0 0 1 6]
Best expectation (κ=3): [0 0 1 6]
Final params: [8, 8, 3, 8]
```
Conclusion: This optimization method simply looks for the single best minimum and doesn't work for chess.
P.S. However, I already have an idea for a simple weighting system of the positive result of a mini-match and their corresponding params. And I intend to simply use the dummy_minimize() method for basic random sampling.
Interesting. Your "Best params" seem different from the "Best expectation". Can you send me the command you use? I'll try to see if I can figure out why it's not working.
You also can't trust `result.x`. If you notice what `create_value` does, it is simply `best = np.argmin(yi); res.x = Xi[best]`. That is, it'll always just be the parameters of the first 1.0 value. If you want a running 'best', you should call `summarize` in tune.py.
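A tiny self-contained illustration of why that `argmin` is misleading with ±1 game outcomes (made-up numbers in the spirit of the log above):

```python
import numpy as np

# Negated match results as skopt sees them (it minimizes):
yi = np.array([1.0, 0.0, -0.5, -1.0, 0.0, -1.0])
Xi = [[4, 1], [5, 0], [5, 2], [8, 8], [1, 0], [7, 2]]

# What result.x effectively is: the argmin of the raw noisy observations.
best = int(np.argmin(yi))
print(Xi[best])  # [8, 8] -- the FIRST perfect score, however lucky it was
```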
Indeed, I simply output `result.x`.
I couldn't give up, so here is some small update. :-)
I increased the number of games per configuration to 64 and 128, and my impression is that this might indeed work better. A similar experience has been reported by Nicu Ionita on talkchess, who even played 1k to 2.5k games per configuration! Yet that seems rather impractical!
However, the optimizer still very aggressively focuses on early good results, even if I increase kappa. I will now try changing `xi` with EI as acquisition function.
As you already mentioned, `res.x` cannot be trusted since it simply sticks to the minimum value.
So I think there are three points on which we need to improve:
I've attached the log file of my test run with 64 games, GBRT and kappa=5.0: data64-gbrt-k5-100.log
And here are the summarized results followed by the sample points with the best results.
```
Summarizing best values
Best expectation (κ=0): [8 6 7 6]
Best expectation (κ=1): [4 3 7 6]
Best expectation (κ=2): [8 8 0 4]
Best expectation (κ=3): [8 8 0 4]
```

```
Finished game 26 [3, 4, 4, 0] => -0.1875
Finished game 55 [5, 0, 8, 0] => 0.203125
Finished game 58 [3, 4, 6, 5] => 0.1875
Finished game 71 [2, 0, 4, 2] => 0.1875
Finished game 84 [2, 1, 3, 0] => 0.1875
Finished game 94 [2, 1, 4, 2] => 0.1875
```
@thomasahle Finally success!
Here https://github.com/joergoster/fastchess/commit/875a466e12cae7278f227bc070d72a8630b3d066 you can find what worked for me. This was inspired by
> Return a solution: either the point evaluated with the largest f(x), or the point with the largest posterior mean.
which you can find on top of page 3 in this paper: https://arxiv.org/abs/1807.02811#
Here is the result of a tuning of 8 parameters at once with 400 iterations (GP, LCB, kappa=10.0, 16 games per encounter, -n-initial-points 200):
```
Summarizing best values
Best expectation (κ= 0.0): [23, 28, 67, 10, 5, 80, 6, 24] = -0.177 ± 0.136 (ELO-diff -62.028 ± 50.310)
Best expectation (κ= 0.1): [32, 11, 33, 0, 135, 104, 6, 23] = -0.091 ± 0.095 (ELO-diff -31.670 ± 33.609)
Best expectation (κ= 0.3): [ 8, 37, 69, 34, 7, 26, 0, 1] = -0.215 ± 0.129 (ELO-diff -75.819 ± 48.583)
Best expectation (κ= 1.0): [62, 33, 69, 0, 56, 125, 11, 1] = -0.116 ± 0.104 (ELO-diff -40.344 ± 37.059)
Best expectation (κ= 3.2): [24, 0, 24, 0, 97, 124, 35, 29] = -0.110 ± 0.140 (ELO-diff -38.341 ± 50.309)
Best expectation (κ=10.0): [10, 0, 8, 41, 1, 25, 25, 10] = -0.207 ± 0.164 (ELO-diff -72.828 ± 62.412)
Best params from Result Object: [18, 9, 67, 5, 60, 18, 24, 2]
Best params from opt.ask() (mean of 10 points): [27, 25, 31, 39, 71, 72, 26, 18]
Best params from game results: [31, 29, 32, 31, 73, 63, 24, 19]
```
This resulted in

```
Elo difference: 12.3 +/- 10.0, LOS: 99.2 %, DrawRatio: 14.1 %
Finished match
```

instead of

```
Elo difference: 7.6 +/- 10.0, LOS: 93.1 %, DrawRatio: 14.4 %
Finished match
```

with 4k games each.
P.S. 1: My initially chosen parameters for testing turned out to be a bad choice, because they only were better compared to zero values by a very small margin.
```
Elo difference: 0.8 +/- 10.0, LOS: 56.1 %, DrawRatio: 13.2 %
Finished match
```
P.S. 2: Support of time controls for testing would be highly appreciated. P.S. 3: Fitting a new model took about 2 minutes in the end.
Very nice! So is the 12.3 Elo increase from the "mean of 10 points" approach?
The 2 minutes per model fit is certainly bad. Something I tried in the latest version is using ET, or another fast model, for choosing the points to search, and then fitting a new GP model only when we need to output the best parameters, since GP seems to have more precise interpolation, and output time is when we really need that.
What do you mean by "your initial chosen parameters"? Are they the tune arguments, or have you added a feature to give a specific set of parameters to try?
> Very nice! So is the 12.3 Elo increase from the "mean of 10 points" approach?
Yes.
> What do you mean by "your initial chosen parameters"? Are they the tune arguments, or have you added a feature to give a specific set of parameters to try?
The tune arguments (eval parameters) with which I tried most of my experiments. As it turned out, this was unfortunately a bad choice, because there is likely not much to gain from them. Probably every tuning method would have failed to find good values for them. :-(
I would like to do some more sessions to confirm this "mean of 10 points" approach is really working. This one good result could well be just by luck.
I started a new tuning run this afternoon, trying to tune 17 pawn eval parameters of current Stockfish master at the same time! Probably too optimistic ...
After the initial 300 iterations of random sampling, out of a planned total of 600 iterations, progress slowed down noticeably because a new model is fitted at each iteration. Or is it the ask() method? Probably both. I wonder if it would be sufficient to only fit a new model every 10th iteration or so.
However, I think I tend to agree that GP as base estimator is giving better results than the other methods.
I wonder how the kappa=0 to 10 results worked in that case where you got a good bonus from "mean of 10".
We can also try exchanging skopt for some other GP library. Any you think might be good?
- https://github.com/fmfn/BayesianOptimization
- https://github.com/GPflow/GPflowOpt
- https://github.com/SheffieldML/GPyOpt
You probably know them already. At least the 1st one looks like it is still being maintained. The other ones were referenced in the above-mentioned paper. I have no idea whether any of them would be better than skopt, though.
is almost impossible now. At least, I don't see an easy way. ;-) The main culprit seems to be the handling of the opening book. I don't want to play with the same opening over and over again, which produces many identical games/outcomes.
I wanted to do some tests with an increasing number of games per encounter (8, 16, 64) to see whether this would help convergence.