Closed joergoster closed 5 years ago
Seems reasonable. Let us know how it goes :-) If you use GP you should also try playing around with the noise parameter. Helped me.
Thanks!
My fix was a bad one. It uses the same opening for all the games, and accidentally changes the default from 2 to 1.
I get an error when running.
```
Traceback (most recent call last):
  File "tune-jo.py", line 438, in <module>
    asyncio.run(main())
  File "/usr/lib/python3.7/asyncio/runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.7/asyncio/base_events.py", line 584, in run_until_complete
    return future.result()
  File "tune-jo.py", line 287, in main
    m = re.match('count=(\d+)', n)
NameError: name 're' is not defined
```
Edit: this was with set adjudication rule, of course.
Missing import re.
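For reference, the failing line only needs the standard-library `re` module. A minimal, self-contained sketch of that kind of `count=N` option parsing (the helper name is made up for illustration, not tune.py's actual code):

```python
import re  # the missing import that caused the NameError

def parse_adjudication(args):
    """Parse win-adjudication options like ['count=4', 'score=400']
    into a dict of ints. (Illustrative helper, not tune.py's code.)"""
    opts = {}
    for n in args:
        m = re.match(r'(\w+)=(\d+)', n)
        if m:
            opts[m.group(1)] = int(m.group(2))
    return opts

print(parse_adjudication(['count=4', 'score=400']))  # {'count': 4, 'score': 400}
```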
Check commit 1f00dfd9. This time I have actually tested it :-) (aka, we really need unit tests.)
@thomasahle To be honest, I would even prefer the default of 1 game. When measuring elo difference(s) between engines, it makes a lot of sense to play each opening twice with reversed colours. But for tuning purposes we want to gather as much information as we can. However, it probably won't matter much in the end. ;-) Oops, just not in time.
@fsmosca Thanks.
Will give it a try after breakfast. :-)
OK, everything seems to be working as expected now. I do like the normalization of the results into [-1, +1].
What I do observe is that the optimizer seems to settle on some values too early, although these runs were done with only 200 iterations. I think we need to find some combination of settings that tells the optimizer to prefer more exploration, or we need to increase the initial random sampling significantly.
@joergoster Did you try increasing the `-acq-noise`?
Now that you can increase games per encounter, maybe you can go back to using GP.
I like `-acq-noise 10` for optimizing just two parameters.
I don't know if the other methods have similar parameters for forcing them to explore more.
@thomasahle Yes, I tried -acq-noise 10 and 20 with GP. In both cases the values settled to the min or max values of the parameters. The most 'balanced' values are the ones obtained with RF or ET as base estimator and LCB or EI as acquisition function, though. But there are so many possible combinations, one can hardly test them all.
Here https://towardsdatascience.com/hyperparameter-optimization-in-python-part-1-scikit-optimize-754e485d24fe is an interesting article where good results were achieved with ET and EI respectively.
He also mentions that the LCB acquisition function may be tweaked via `kappa`, and the EI and PI acquisition functions via `xi`.
The `forest_minimize` method was the clear winner, but in order to get good results it was crucial to tweak the (hyper)hyperparameters a bit. For the LCB acquisition function, a lower value of kappa (exploitation) was better. However, the best result was achieved with the EI acquisition function. Again, tweaking the xi parameter was needed.
Well, for now I want to know if increasing the number of games per encounter does help or not. I think I will also go with the ET and EI combination for now. I will run 4,000 iterations each, with 2, 8 and 16 games per encounter, validate the results and report back in a few days. ;-)
Sounds good. Meanwhile I've added xi and kappa support :-)
The first results are quite disappointing. :-(
The 4,000 iterations with the default of 2 games per encounter gave these values
```
Summarizing best values
Best expectation (κ=0): [49 66 78 78]
Best expectation (κ=1): [45 77 37 18]
Best expectation (κ=2): [45 72 26 0]
Best expectation (κ=3): [45 72 26 0]
```
which resulted in a loss of about 8 elo after 4,000 games of validation.
```
Elo difference: -7.5 +/- 9.9, LOS: 7.0 %, DrawRatio: 14.9 %
```
And the ones achieved with 8 games per encounter
```
Summarizing best values
Best expectation (κ=0): [79 70 41 8]
Best expectation (κ=1): [64 65 18 1]
Best expectation (κ=2): [73 80 13 13]
Best expectation (κ=3): [73 80 13 13]
```
performed even worse, about -30 elo !!!
```
Elo difference: -29.1 +/- 9.9, LOS: 0.0 %, DrawRatio: 15.1 %
```
Honestly, I don't think I will spend the time and resources to run the test with 16 games per encounter ...
That's indeed disappointing. Did you also run 4 games per encounter? Which acquisition function are you using, and with what parameters? Do you still think this is an issue of overfitting?
No, only 2 and 8 games per encounter so far.
Parameters in use:
```
-base-estimator ET -acq-func EI -n 4000 -n-initial-points 400 -win-adj count=4 score=400
```
Not sure how to continue ...
Do we have to modify the objective function, i.e. what we pass to the optimizer? Currently we pass single results, so a maximum result of +1 at a single point is considered a minimum by the optimizer. Probably THE minimum ...
But we want to maximize over ALL results, so somehow we need to pass the summed-up results, no? I hope you get what I mean. Edit: Not sure, though, if this is how Bayesian Optimization works.
I thought you had decided that GBRT was best in your case?
Maybe try upping kappa (to something like 10?) and reducing n-initial-points.
Also, are you sure EI is the best acq-func? Why not just leave it at the default?
For acq-func other than LCB you need to control exploration with the xi parameter. Not sure whether higher or lower values are better.
I know this whole question is about games-per-encounter, but I still don't think it should be necessary if exploration is configured properly.
Well, that was after some very preliminary tests only. ;-) But I will try increasing kappa and reducing n-initial-points.
Skopt should support multiple different values at the same point. So there shouldn't be a need to pass a sum. It doesn't expect a single value is "the truth".
OK, after some more thinking I guess you're right. Reducing the noise in our evaluations by playing more games per encounter (which probably only helps by a small amount anyway) is likely less important than covering a sufficiently large number of sample points. And finding the right balance between exploration and exploitation also seems very important in the case of chess.
There is still one thing in your script that I do not understand, though. How do you calculate the best values? Do you happen to have a link where this is being explained?
Thanks for this clarification. I didn't know this.
I think I will try lowering kappa, too! It's possible that the optimizer works better with more exploitation. Who knows ...
I have also been thinking that changing 'acq_optimizer' from 'lbfgs' to 'sampling' might be a good idea. This changes the way skopt picks the next point to explore. It needs to find the position with the best LCB, EI or PI value, and it can do this either by trying a lot of random points (sampling) or by "gradient descent"-style function optimization. Since our functions are extremely noisy, the descent-type approach might not find a very good value, even if our LCB/EI/PI settings are actually good.
Sampling random points is also how I calculate the best values at the end of the script.
Well, actually I do a mix of random values and all the values that the optimizer has visited so far.
We currently do `n` random samples. It is possible that increasing this (to, say, 10*n or just a constant 10_000) would improve our final prediction as well.
Interesting.
For further experiments I tweaked my test engine (Stockfish) so that I can run tests with a much smaller parameter space: shrinking the integer interval for the 4 parameters from [0, 80] to [0, 8] reduces the search space from 81^4 to 9^4 points. This will hopefully help to get meaningful results much faster.
@thomasahle Me again. I'm almost convinced that this optimization method simply doesn't work for chess.
Look at this:
```
Loading engines
Parsing options
Unable to open data08-gbrt-lcb-800.log
Starting 2 games 0/800 with {'mRookOnFile[0]': 4, 'eRookOnFile[0]': 1, 'mRookOnFile[1]': 4, 'eRookOnFile[1]': 0}
Starting 2 games 1/800 with {'mRookOnFile[0]': 2, 'eRookOnFile[0]': 4, 'mRookOnFile[1]': 8, 'eRookOnFile[1]': 6}
Starting 2 games 2/800 with {'mRookOnFile[0]': 1, 'eRookOnFile[0]': 1, 'mRookOnFile[1]': 8, 'eRookOnFile[1]': 1}
Starting 2 games 3/800 with {'mRookOnFile[0]': 5, 'eRookOnFile[0]': 0, 'mRookOnFile[1]': 0, 'eRookOnFile[1]': 3}
Best params: [4, 1, 4, 0]
Finished game 0 [4, 1, 4, 0] => -1.0 (1-0, 0-1)
Starting 2 games 4/800 with {'mRookOnFile[0]': 2, 'eRookOnFile[0]': 4, 'mRookOnFile[1]': 3, 'eRookOnFile[1]': 3}
Best params: [5, 0, 0, 3]
Finished game 3 [5, 0, 0, 3] => 0.0 (1-0, 1-0)
Starting 2 games 5/800 with {'mRookOnFile[0]': 8, 'eRookOnFile[0]': 3, 'mRookOnFile[1]': 4, 'eRookOnFile[1]': 0}
Best params: [5, 0, 0, 3]
Finished game 1 [2, 4, 8, 6] => -0.5 (1/2-1/2, 0-1)
Starting 2 games 6/800 with {'mRookOnFile[0]': 5, 'eRookOnFile[0]': 0, 'mRookOnFile[1]': 2, 'eRookOnFile[1]': 5}
Best params: [5, 0, 0, 3]
Finished game 2 [1, 1, 8, 1] => 0.0 (1/2-1/2, 1/2-1/2)
Starting 2 games 7/800 with {'mRookOnFile[0]': 5, 'eRookOnFile[0]': 2, 'mRookOnFile[1]': 8, 'eRookOnFile[1]': 8}
Best params: [5, 0, 0, 3]
Finished game 4 [2, 4, 3, 3] => 0.0 (1-0, 1-0)
Starting 2 games 8/800 with {'mRookOnFile[0]': 4, 'eRookOnFile[0]': 5, 'mRookOnFile[1]': 5, 'eRookOnFile[1]': 5}
Best params: [5, 0, 0, 3]
Finished game 8 [4, 5, 5, 5] => 0.0 (1-0, 1-0)
Starting 2 games 9/800 with {'mRookOnFile[0]': 1, 'eRookOnFile[0]': 0, 'mRookOnFile[1]': 1, 'eRookOnFile[1]': 0}
Best params: [5, 0, 0, 3]
Finished game 7 [5, 2, 8, 8] => 0.0 (1-0, 1-0)
Starting 2 games 10/800 with {'mRookOnFile[0]': 6, 'eRookOnFile[0]': 0, 'mRookOnFile[1]': 8, 'eRookOnFile[1]': 0}
Best params: [5, 0, 2, 5]
Finished game 6 [5, 0, 2, 5] => 0.5 (1/2-1/2, 1-0)
Starting 2 games 11/800 with {'mRookOnFile[0]': 6, 'eRookOnFile[0]': 5, 'mRookOnFile[1]': 4, 'eRookOnFile[1]': 4}
Best params: [5, 0, 2, 5]
Finished game 10 [6, 0, 8, 0] => 0.0 (0-1, 0-1)
Starting 2 games 12/800 with {'mRookOnFile[0]': 3, 'eRookOnFile[0]': 6, 'mRookOnFile[1]': 7, 'eRookOnFile[1]': 3}
Best params: [5, 0, 2, 5]
Finished game 5 [8, 3, 4, 0] => 0.5 (0-1, 1/2-1/2)
Starting 2 games 13/800 with {'mRookOnFile[0]': 8, 'eRookOnFile[0]': 8, 'mRookOnFile[1]': 3, 'eRookOnFile[1]': 8}
Best params: [5, 0, 2, 5]
Finished game 11 [6, 5, 4, 4] => 0.0 (1-0, 1-0)
Starting 2 games 14/800 with {'mRookOnFile[0]': 8, 'eRookOnFile[0]': 2, 'mRookOnFile[1]': 0, 'eRookOnFile[1]': 3}
Best params: [5, 0, 2, 5]
Finished game 9 [1, 0, 1, 0] => 0.0 (1-0, 1-0)
Starting 2 games 15/800 with {'mRookOnFile[0]': 4, 'eRookOnFile[0]': 1, 'mRookOnFile[1]': 2, 'eRookOnFile[1]': 0}
Best params: [5, 0, 2, 5]
Finished game 12 [3, 6, 7, 3] => 0.0 (1-0, 1-0)
Starting 2 games 16/800 with {'mRookOnFile[0]': 4, 'eRookOnFile[0]': 4, 'mRookOnFile[1]': 2, 'eRookOnFile[1]': 0}
Best params: [8, 8, 3, 8]
Finished game 13 [8, 8, 3, 8] => 1.0 (0-1, 1-0)
Starting 2 games 17/800 with {'mRookOnFile[0]': 4, 'eRookOnFile[0]': 8, 'mRookOnFile[1]': 2, 'eRookOnFile[1]': 5}
Best params: [8, 8, 3, 8]
Finished game 15 [4, 1, 2, 0] => 0.0 (1-0, 1-0)
Starting 2 games 18/800 with {'mRookOnFile[0]': 4, 'eRookOnFile[0]': 8, 'mRookOnFile[1]': 2, 'eRookOnFile[1]': 5}
Best params: [8, 8, 3, 8]
Finished game 18 [4, 8, 2, 5] => 0.0 (0-1, 0-1)
```
What do I do? The optimizer tracks the current best values in a result object, which is generated in the tell() method. See https://github.com/scikit-optimize/scikit-optimize/blob/master/skopt/optimizer/optimizer.py#L552-L554 So I simply do

```python
result = opt.tell(x, -y)  # opt is minimizing
print(f'Best params: {result.x}')
```
First, the best params are taken from the 1st iteration. (-1.0) ==> Best params: [4, 1, 4, 0] They immediately change because already the next result is better. (0.0) ==> Best params: [5, 0, 0, 3] Then it takes some iterations until we have a better result (0.5) ==> Best params: [5, 0, 2, 5] Finally, at finished game 13, we get a result of (1.0) ==> Best params: [8, 8, 3, 8]
And after this point they never change again! Even after the completion of 800 iterations. Simply because we can't get a better result than (1.0)! Never!
Here the last iterations:
```
Best params: [8, 8, 3, 8]
Finished game 794 [8, 8, 0, 0] => 0.0 (0-1, 0-1)
Starting 2 games 796/800 with {'mRookOnFile[0]': 8, 'eRookOnFile[0]': 8, 'mRookOnFile[1]': 3, 'eRookOnFile[1]': 6}
Best params: [8, 8, 3, 8]
Finished game 790 [8, 8, 3, 7] => 0.0 (1-0, 1-0)
Starting 2 games 797/800 with {'mRookOnFile[0]': 8, 'eRookOnFile[0]': 8, 'mRookOnFile[1]': 3, 'eRookOnFile[1]': 8}
Best params: [8, 8, 3, 8]
Finished game 793 [8, 8, 0, 0] => 0.5 (1/2-1/2, 1-0)
Starting 2 games 798/800 with {'mRookOnFile[0]': 8, 'eRookOnFile[0]': 2, 'mRookOnFile[1]': 3, 'eRookOnFile[1]': 0}
Best params: [8, 8, 3, 8]
Finished game 796 [8, 8, 3, 6] => 0.5 (0-1, 1/2-1/2)
Starting 2 games 799/800 with {'mRookOnFile[0]': 7, 'eRookOnFile[0]': 2, 'mRookOnFile[1]': 4, 'eRookOnFile[1]': 0}
Best params: [8, 8, 3, 8]
Finished game 797 [8, 8, 3, 8] => 1.0 (0-1, 1-0)
Best params: [8, 8, 3, 8]
Finished game 795 [8, 8, 4, 5] => 0.5 (0-1, 1/2-1/2)
Best params: [8, 8, 3, 8]
Finished game 799 [7, 2, 4, 0] => 1.0 (0-1, 1-0)
Best params: [8, 8, 3, 8]
Finished game 798 [8, 2, 3, 0] => -0.5 (1-0, 1/2-1/2)
```
```
Summarizing best values
Best expectation (κ=0): [6 6 4 8]
Best expectation (κ=1): [0 0 1 6]
Best expectation (κ=2): [0 0 1 6]
Best expectation (κ=3): [0 0 1 6]
Final params: [8, 8, 3, 8]
```
Conclusion: This optimization method simply looks for the single best minimum and doesn't work for chess.
P.S. However, I already have an idea for a simple weighting system of the positive result of a mini-match and their corresponding params. And I intend to simply use the dummy_minimize() method for basic random sampling.
Interesting. Your "Best params" seem different from the "Best expectation". Can you send me the command you use? I'll try to see if I can figure out why it's not working.
You also can't trust `result.x`. If you notice what `create_value` does, it is simply `best = np.argmin(yi); res.x = Xi[best]`. That is, it'll always just be the parameters of the first 1.0 value. If you want a running 'best', you should call `summarize` in tune.py.
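A tiny self-contained illustration of why that `argmin` is misleading with ±1 game outcomes (made-up numbers in the spirit of the log above):

```python
import numpy as np

# Negated match results as skopt sees them (it minimizes):
yi = np.array([1.0, 0.0, -0.5, -1.0, 0.0, -1.0])
Xi = [[4, 1], [5, 0], [5, 2], [8, 8], [1, 0], [7, 2]]

# What result.x effectively is: the argmin of the raw noisy observations.
best = int(np.argmin(yi))
print(Xi[best])  # [8, 8] -- the FIRST perfect score, however lucky it was
```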
Indeed, I simply output `result.x`.
I couldn't give up, so here is some small update. :-)
I increased the number of games per configuration to 64 and 128, and my impression is that this might indeed work better. A similar experience has been reported by Nicu Ionita on talkchess, who even played 1k to 2.5k games per configuration! Yet that seems rather impractical!
However, the optimizer still very aggressively focuses on early good results, even if I increase kappa. I will now try changing `xi` with EI as acquisition function.
As you already mentioned, `res.x` cannot be trusted since it simply sticks to the minimum value.
So I think there are three points on which we need to improve:
I've attached the log file of my test run with 64 games, GBRT and kappa=5.0: data64-gbrt-k5-100.log
And here are the summarized results followed by the sample points with the best results.
```
Summarizing best values
Best expectation (κ=0): [8 6 7 6]
Best expectation (κ=1): [4 3 7 6]
Best expectation (κ=2): [8 8 0 4]
Best expectation (κ=3): [8 8 0 4]
```

```
Finished game 26 [3, 4, 4, 0] => -0.1875
Finished game 55 [5, 0, 8, 0] => 0.203125
Finished game 58 [3, 4, 6, 5] => 0.1875
Finished game 71 [2, 0, 4, 2] => 0.1875
Finished game 84 [2, 1, 3, 0] => 0.1875
Finished game 94 [2, 1, 4, 2] => 0.1875
```
@thomasahle Finally success!
Here https://github.com/joergoster/fastchess/commit/875a466e12cae7278f227bc070d72a8630b3d066 you can find what worked for me. This was inspired by
> Return a solution: either the point evaluated with the largest f(x), or the point with the largest posterior mean.
which you can find on top of page 3 in this paper: https://arxiv.org/abs/1807.02811#
Here is the result of a tuning of 8 parameters at once with 400 iterations (GP, LCB, kappa=10.0, 16 games per encounter, -n-initial-points 200):
```
Summarizing best values
Best expectation (κ= 0.0): [23, 28, 67, 10, 5, 80, 6, 24] = -0.177 ± 0.136 (ELO-diff -62.028 ± 50.310)
Best expectation (κ= 0.1): [32, 11, 33, 0, 135, 104, 6, 23] = -0.091 ± 0.095 (ELO-diff -31.670 ± 33.609)
Best expectation (κ= 0.3): [ 8, 37, 69, 34, 7, 26, 0, 1] = -0.215 ± 0.129 (ELO-diff -75.819 ± 48.583)
Best expectation (κ= 1.0): [62, 33, 69, 0, 56, 125, 11, 1] = -0.116 ± 0.104 (ELO-diff -40.344 ± 37.059)
Best expectation (κ= 3.2): [24, 0, 24, 0, 97, 124, 35, 29] = -0.110 ± 0.140 (ELO-diff -38.341 ± 50.309)
Best expectation (κ=10.0): [10, 0, 8, 41, 1, 25, 25, 10] = -0.207 ± 0.164 (ELO-diff -72.828 ± 62.412)
Best params from Result Object: [18, 9, 67, 5, 60, 18, 24, 2]
Best params from opt.ask() (mean of 10 points): [27, 25, 31, 39, 71, 72, 26, 18]
Best params from game results: [31, 29, 32, 31, 73, 63, 24, 19]
```
This resulted in

```
Elo difference: 12.3 +/- 10.0, LOS: 99.2 %, DrawRatio: 14.1 %
Finished match
```

instead of

```
Elo difference: 7.6 +/- 10.0, LOS: 93.1 %, DrawRatio: 14.4 %
Finished match
```

with 4k games each.
P.S. 1: My initially chosen parameters for testing turned out to be a bad choice, because they only were better compared to zero values by a very small margin.
```
Elo difference: 0.8 +/- 10.0, LOS: 56.1 %, DrawRatio: 13.2 %
Finished match
```
P.S. 2: Support of time controls for testing would be highly appreciated. P.S. 3: Fitting a new model took about 2 minutes in the end.
Very nice! So is the 12.3 Elo increase from the "mean of 10 points" approach?
The 2 minutes per model fit is certainly bad. Something I tried in the latest version is using ET, or another fast model, for choosing the points to search, and then fitting a new GP model only when we need to output the best parameters, since GP seems to have more precise interpolation, and output time is when we really need that.
What do you mean by "your initial chosen parameters"? Are they the tune arguments, or have you added a feature to give a specific set of parameters to try?
> Very nice! So is the 12.3 Elo increase from the "mean of 10 points" approach?
Yes.
> What do you mean by "your initial chosen parameters"? Are they the tune arguments, or have you added a feature to give a specific set of parameters to try?
The tune arguments (eval parameters) with which I tried most of my experiments. As it turned out, this was unfortunately a bad choice, because there is likely not much to gain from them. Probably every tuning method would have failed to find good values for them. :-(
I would like to do some more sessions to confirm this "mean of 10 points" approach is really working. This one good result could well be just by luck.
I started a new tuning run this afternoon, trying to tune 17 pawn eval parameters of current Stockfish master at the same time! Probably too optimistic ...
After the initial 300 iterations of random sampling, out of a planned total of 600 iterations, progress slowed down noticeably because a new model is fitted at each iteration. Or is it the ask() method? Probably both. I wonder if it would be sufficient to only fit a new model every 10th iteration or so.
However, I think I tend to agree that GP as base estimator is giving better results than the other methods.
I wonder how the kappa=0 to 10 results worked in that case where you got a good bonus from "mean of 10".
We can also try exchanging skopt for some other GP library. Any you think might be good?
- https://github.com/fmfn/BayesianOptimization
- https://github.com/GPflow/GPflowOpt
- https://github.com/SheffieldML/GPyOpt
You probably know them already. At least the 1st one looks like it is still being maintained. The other ones were referenced in the above-mentioned paper. I have no idea whether any of them would be better than skopt, though.
is almost impossible now. At least, I don't see an easy way. ;-) The main culprit seems to be the handling of the opening book. I don't want to play with the same opening over and over again, which produces many identical games/outcomes.
I wanted to do some tests with an increasing number of games per encounter (8, 16, 64) to see whether this would help convergence.