ppigazzini opened 4 years ago
From my experience with SPSA, the main problem is the high level of noise in the results. If any proposal reduces this noise, I agree with it :-) You said:
"one iteration should be set to 2 games per match, but our worker code cannot support this, so we set one iteration to 2*N_cores games per match"
Can we choose the number N, and in particular increase it? I think that below 100 games the result can be completely wrong and lead to bad convergence.
@MJZ1977 the companion code of the seminal paper asks for the number of averaged SP gradients to be used per iteration. List updated, thank you :)
The experimental options "careful clipping" and "randomized rounding" don't seem to have a first-order effect, so we could keep only one method for clipping and one for rounding.
@ppigazzini: what are the effects of these options? Did they change the number N of games before updating parameters?
@MJZ1977 "careful clipping" https://github.com/glinscott/fishtest/commit/7eebda7e6d1f47f2672aefe46db35baee7cb5b1f and randomized rounding https://github.com/glinscott/fishtest/commit/5f63500db3f40569ea406a8b8b4b987f054ee79f are theoretical improvements with little/no effect on SPSA convergence compared to the other parameters. People stuck to the defaults, so the GUI was simplified by dropping the possibility to choose them. I will do some other tests and then simplify the code, dropping the options that are not useful.
From what I'm finding online, alpha is usually 0.602, gamma at 0.101 is OK, and A is ~10% of the number of iterations. Would these be good defaults for the SPSA fields?
Sources: https://hackage.haskell.org/package/spsa-0.2.0.0/docs/Math-Optimization-SPSA.html https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4769712/ https://www.chessprogramming.org/SPSA https://www.jhuapl.edu/SPSA/PDF-SPSA/Spall_Implementation_of_the_Simultaneous.PDF
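As a hedged sketch (the names are illustrative, not fishtest's), those defaults plug into the standard SPSA gain sequences like this:

```python
# Sketch of the standard SPSA gain sequences a_k and c_k using the
# commonly cited defaults: alpha = 0.602, gamma = 0.101, and the
# stability constant A at ~10% of the expected number of iterations.
def spsa_gains(a, c, num_iterations, alpha=0.602, gamma=0.101):
    A = 0.10 * num_iterations  # ~10% of iterations, per the sources above
    schedule = []
    for k in range(1, num_iterations + 1):
        a_k = a / (A + k) ** alpha  # step size: decays slowly thanks to A
        c_k = c / k ** gamma        # perturbation magnitude
        schedule.append((a_k, c_k))
    return schedule
```

Both sequences decay monotonically; A keeps the early steps from being disproportionately large.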
@linrock it definitely makes sense to have defaults for the fields (actually, I was thinking they had defaults...). Also, @ppigazzini suggests having A depend on the number of games. Shouldn't we call the field 'A [in %]' and give it a default of 10%, so that the field doesn't need to be adjusted when the number of games is changed?
ah yea, i removed the SPSA defaults in the "create new test" redesign PR when all that should've been removed was the list of hard-coded params in the SPSA parameter list.
A as a percentage of # games makes sense. From what I'm reading, A is typically less than or equal to 10% of the expected # of iterations (2 games per iteration). So maybe it could be either:
Haha, in all this time I never realised that A was (/ should be) related to the number of games! :)
Regarding SPSA at very low tc, does that stress the server a lot because workers are continually returning small batches of data?
@xoto10 the SPSA at very low tc can be also done locally :)
@linrock either percentage seems fine to me. Probably games, since we specify #games for SPSA and not number of iterations. In the future, I could imagine that an iteration contains more than 2 games (i.e. batching for SPSA, @vdbergh?), to reduce server load, and because it presumably makes sense (but I don't know the SPSA details).
@vondele I am working on a small PR to allow the server to set a batch_size. It is mainly for sprt but it will also work for spsa and fixed games although for those one may consider leaving it to the worker. We can see.
@ppigazzini: I am trying to understand how the SPSA code works and my knowledge is very weak. Never mind, I am trying. In the file rundb.py, I find the following:
# Generate the next set of tuning parameters
iter_local = spsa['iter'] + 1  # assume at least one completed,
                               # and avoid division by zero
for param in spsa['params']:
    c = param['c'] / iter_local ** spsa['gamma']
    flip = 1 if random.getrandbits(1) else -1
    result['w_params'].append({
        'name': param['name'],
        'value': self.spsa_param_clip_round(param, c * flip,
                                            spsa['clipping'], spsa['rounding']),
        'R': param['a'] / (spsa['A'] + iter_local) ** spsa['alpha'] / c ** 2,
        'c': c,
        'flip': flip,
    })
    result['b_params'].append({
        'name': param['name'],
        'value': self.spsa_param_clip_round(param, -c * flip, spsa['clipping'], spsa['rounding']),
    })
# Update the current theta based on the results from the worker
# Worker wins/losses are always in terms of w_params
result = spsa_results['wins'] - spsa_results['losses']
summary = []
w_params = self.get_params(run['_id'], worker)
for idx, param in enumerate(spsa['params']):
    R = w_params[idx]['R']
    c = w_params[idx]['c']
    flip = w_params[idx]['flip']
    param['theta'] = self.spsa_param_clip_round(param, R * c * result * flip,
                                                spsa['clipping'],
                                                'deterministic')
    if grow_summary:
        summary.append({
            'theta': param['theta'],
            'R': R,
            'c': c,
        })
My questions are:
And sorry for these "technical questions" ...
Update : latest version of code
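To make the quoted snippets easier to reason about, here is a toy, self-contained re-implementation of one iteration with a synthetic match in place of real games. Everything except the update rule (which follows the snippet above) is illustrative.

```python
import random

# Toy re-implementation of one SPSA iteration mirroring the quoted
# server code; play_match is a stand-in for a real 2-game pair and must
# return wins - losses from the "+" side's point of view.
def spsa_iteration(theta, a, c, A, k, play_match, alpha=0.602, gamma=0.101):
    c_k = c / k ** gamma                       # perturbation at iteration k
    flip = 1 if random.getrandbits(1) else -1  # Rademacher sign
    R = a / (A + k) ** alpha / c_k ** 2        # same R as in the snippet
    # "white" plays theta + c_k*flip, "black" plays theta - c_k*flip
    result = play_match(theta + c_k * flip, theta - c_k * flip)
    return theta + R * c_k * result * flip     # theta update from the snippet
```

With a play_match that merely reports which side sits closer to some hidden optimum, repeated calls drift theta toward that optimum, which is the behaviour the server code aims for over many worker batches.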
@MJZ1977 I think it is great somebody is looking at the implementation of SPSA. I'm still puzzled why our tuning attempts have such a low success rate (@linrock's recent experience). I do think we need a very large number of games, as the Elo differences we're looking for are so small, and the parameters of SPSA are not obvious or automatic, but I also think we need to critically audit the actual implementation, just in case.
@MJZ1977 You should also look at the worker code to get the complete picture, at and below this line https://github.com/glinscott/fishtest/blob/db94846a0db8788fe8a8724678798dcc91d201e8/worker/games.py#L386
See https://github.com/zamar/spsa for the original implementation
- Do the results correspond to a specified number of games for a worker?
@MJZ1977 A worker plays batches of 2*N-CPU games (white/black alternating) and requests a parameter update from the server after every batch.
@vondele SPSA claims to minimize the number of function evaluations. Classic SPSA evaluates the function only at "variables_values_k+delta; variables_values_k-delta" for the gradient estimation, so SPSA obviously diverges with a wrong delta. This is why I suggest testing the SPSA parameters locally at USTC before submitting to fishtest.
One-sided SPSA computes the gradient with "variables_values_k+delta; variables_values_k", so, having a CPU-cost-free function evaluation at variable_value_k, it's possible to implement:
Neither policy can guarantee convergence with a bad delta, though. SPSA (like all gradient descent algorithms) works only to refine the starting values within the starting basin; to find better local maxima we should switch to global optimization algorithms based on function evaluations (Nelder-Mead, genetic, etc.) to explore the variable space. https://en.wikipedia.org/wiki/Global_optimization
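The two gradient estimators being contrasted can be sketched as follows (f is a noisy objective; all names are illustrative, this is not fishtest code):

```python
import random

# Two-sided SPSA gradient estimate (the classic form) and the one-sided
# variant mentioned above, which reuses a "free" evaluation at x.
def spsa_gradient(f, x, c, one_sided=False):
    # Rademacher perturbation: each coordinate is +1 or -1
    delta = [1 if random.getrandbits(1) else -1 for _ in x]
    x_plus = [xi + c * di for xi, di in zip(x, delta)]
    if one_sided:
        diff = f(x_plus) - f(x)          # one new evaluation per estimate
        return [diff / (c * di) for di in delta]
    x_minus = [xi - c * di for xi, di in zip(x, delta)]
    diff = f(x_plus) - f(x_minus)        # two evaluations per estimate
    return [diff / (2 * c * di) for di in delta]
```

Either way the cost per estimate is independent of the number of parameters, which is SPSA's selling point; a wrong c (delta) scales the estimate badly in both forms.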
@tomtor : thank you for the links ! Update : removed
@ppigazzini concerning Nelder-Mead, I did work on interfacing cutechess games to the nevergrad suite of optimizers: https://github.com/vondele/nevergrad4sf and picked TBPSA, which seems to be the recommended optimizer for noisy functions. I found it robust if given enough games (literally millions). Unfortunately, the optimized parameters seem very good at the TC they have been optimized at (VSTC), but not transferable. Since I can't optimize at STC or LTC, it would need to be integrated in fishtest... but I'm not able to do that (time and experience with the framework lacking atm)... if somebody wants to pick it up, I would be happy to help.
After making some tests, I think that one of the principal problems is that the random parameter "flip" only takes the values +1 or -1 (please correct me if I am wrong). So basically, fishtest always tries to change all variables at the same time. One improvement could be to take flip values from [+1, +0.1, -0.1, -1], for example. It corresponds to a random division by 10. In this case, we will have some tests with only 1 or 2 variables changing. I think it is also easy to implement, even if I don't have the knowledge to do it!
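A minimal sketch of this suggestion (illustrative, not fishtest code): drawing each parameter's perturbation from the proposed four-valued set instead of Rademacher's two values. Note that SPSA's convergence theory wants a symmetric perturbation distribution with finite inverse moments, which this candidate satisfies since it excludes zero.

```python
import random

# Draw one perturbation factor per parameter from the suggested set,
# so some parameters move 10x less than others in a given iteration.
def draw_flips(num_params, choices=(1.0, 0.1, -0.1, -1.0)):
    return [random.choice(choices) for _ in range(num_params)]
```

With the plain Rademacher case, choices would simply be (1.0, -1.0).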
If we want to tune 1 constant, it would be nice if tuning could simply test the start value and N values either side (3? 5?) and then display a bar chart of the resulting performance. That might give us an easy-to-read clue as to whether there's a trend in which values are better. We tend to do this manually atm, but it seems easy for the tuner to do?
@MJZ1977
So basically, fishtest always tries to change all variables at the same time.
SPSA = Simultaneous perturbation stochastic approximation
One improvement can be to take flip values from [+1, +0.1, -0.1, -1], for example. It corresponds to a random division by 10. In this case, we will have some tests with only 1 or 2 variables changing. I think it is also easy to implement, even if I don't have the knowledge to do it!
Random [+1, -1] is the Rademacher distribution. You can use other distributions, but the result IMO will not change: we can get good fishtest gains from SPSA only for badly tuned parameters or when SPSA finds by serendipity a new local maximum.
SPSA, like other gradient algorithms, is a local optimization, useful to refine the starting values "without hopping from the starting basin".
@xoto10 you are talking about a global optimization algorithm; take a look at the work of @vondele.
While TBPSA might also work for global optimization (that's always hard), I don't think we're typically stuck in local minima. At least, I have never seen evidence of that. TBPSA seems to be just rather good at doing the right thing in the presence of noise, also in a (relatively small) number of dimensions. @xoto10 the bar chart will tell almost nothing in most cases, unless we do on the order of 240000 games per point (that's roughly 1 Elo error, i.e. the typical gain from a tune).
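The 1/sqrt(n) scaling behind that figure can be sketched with a tiny helper (illustrative, not fishtest code): given a measured Elo error bar at some reference number of games, estimate the games needed for a target error bar.

```python
# Elo error bars shrink like 1/sqrt(number_of_games), so the games
# needed scale with the square of the error-bar ratio.
def games_for_elo_error(target_error, ref_games, ref_error):
    return ref_games * (ref_error / target_error) ** 2
```

For example, the ~4.4 Elo error bars reported below for 10k-game points imply roughly 190k games for ~1 Elo resolution, the same order as the ~240k quoted.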
I once did a scan of one of the search parameters, and the graph is somewhere in a thread on github, which I can't find right now; it looks like this:
I don't think we're typically stuck in local minima. At least, I have never seen evidence of that.
In that case a (properly implemented) SPSA should be able to find a better value, but in my first post I collected all my doubts about our SPSA implementation.
A simple proof is to set a blatantly wrong value for a parameter (e.g. Queen = 0.1 pawn; sorry, I'm not an SF developer :) and see if our SPSA is able to recover a good value.
I made some tests since yesterday and came to the conclusion that SPSA is not actually working well because of too much noise in the individual results. As an example to explain my thought, I take this simple case: SPSA beginning with KnightSafeCheck = 590 https://tests.stockfishchess.org/tests/view/5ea9b5c469c5cb4e2aeb82fd SPRT master vs KnightSafeCheck = 590 https://tests.stockfishchess.org/tests/view/5eaa93b769c5cb4e2aeb8370 The best value should be KnightSafeCheck = ~790, as in master. SPSA is oscillating, even if it seems to be increasing at the end. I use only 1 variable to avoid any bias.
The only solution to this is to make iterations of at least 200 games instead of 2*N games. For example, if the results are 60-40-100, it gives +20 to multiply by the same gradient. That is very different from multiplying "60" and "-40" by different gradients, which clearly increases the noise. This is my opinion, but I cannot be sure without making tests, which are impossible now.
An improvement could be to add an SPSA parameter = minimum number of games per iteration instead of the default 2*N.
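The suggested aggregation can be sketched as follows (illustrative helper, not fishtest code): accumulate the per-pair results of a whole batch at a fixed perturbation and produce one aggregated result per batch, instead of one update per 2-game pair.

```python
# pair_results: per-pair scores in {+1, 0, -1} (win/draw/loss for the
# "+" perturbation side); returns one aggregated wins - losses value
# per batch of batch_size pairs.
def batched_result(pair_results, batch_size=100):
    batches = []
    for i in range(0, len(pair_results), batch_size):
        chunk = pair_results[i:i + batch_size]
        batches.append(sum(chunk))  # wins - losses over the batch
    return batches
```

Using the 60-40-100 example above: batched_result([1]*60 + [-1]*40, 100) gives [20], i.e. the single "+20" that would multiply one common gradient.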
I think one cannot say anything without first measuring the Elo difference between 590 and 790. If it takes many games just to detect the difference, one cannot expect spsa to magically wander from 590 to 790.
@MJZ1977 try with a blatantly wrong value, e.g. KnightSafeCheck = 790000. If SPSA is not able to recover a value that makes sense (e.g. 2000) then:
that might not be such a good test... this could be so far off that e.g. the local gradient is zero, so progress won't be made. Maybe it could be started from 0. But I agree it is good to have an estimate of the Elo importance of the term as well, and probably picking a term with e.g. ~10Elo impact makes sense. (Maybe scaling initiative would be a candidate?)
I launched a test beginning with a value of 10. So, you confirm that SPSA can't detect a +/-5 Elo difference. It will not be easy to find a +5 Elo patch :-).
that might not be such a good test... this could be so far off that e.g. the local gradient is zero, so progress won't be made.
@vondele you know exactly how much you are off, so, setting a proper "c_k_end", the gradient should not be 0 (if that parameter has an effect at all). And here is one (of many) problems with our SPSA implementation: we ask for "c_k_end" and not for "c_k_start", making it very hard for developers to control the SPSA behaviour.
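The c_k_end vs c_k_start point can be made concrete with a small sketch (assuming the schedule c_k = c / k**gamma quoted from rundb.py above; the helper name is illustrative): if the interface asks for the final perturbation after N iterations, the starting constant is recovered by inverting the schedule at k = N.

```python
# Invert c_k = c / k**gamma at k = num_iterations to recover the
# starting constant from the requested final perturbation size.
def c_start_from_c_end(c_end, num_iterations, gamma=0.101):
    return c_end * num_iterations ** gamma
```

With gamma = 0.101 the two differ by a factor of roughly 2.5 over 10000 iterations, so confusing the two noticeably changes the perturbations actually played.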
I give up on the KnightSafeCheck tests because the sensitivity to this parameter is not obvious. I took another parameter for the tests: the weak-squares multiplier in the king danger formula, which I will call "Coef1". The master value is 185, and I begin with a low value of 20.
SPRT finished quickly with 850 wins (18.4%), 1064 losses (23%); broad Elo estimate -15 Elo. https://tests.stockfishchess.org/tests/view/5eaafe2209d25e8e5058167e
SPSA with default values, 60k games: Coef1 = 30 https://tests.stockfishchess.org/tests/view/5eab049c09d25e8e505816a0
SPSA with higher "c" and "a", 60k games: Coef1 = 40 https://tests.stockfishchess.org/tests/view/5eab049c09d25e8e505816a0
SPSA with higher "c" and "a" and a closer initial value of 140 stalled quickly?! https://tests.stockfishchess.org/tests/view/5eab146409d25e8e505816fe
So, as a first conclusion, it seems that SPSA is going in the right direction but convergence is slow and not linear (many oscillations). I don't think there is a major problem with our SPSA implementation, but I hope we can improve it to get finer-tuned parameters. I will repeat my improvement suggestions (which are the topic of this thread): 1- increase the size of game batches to at least ~200 games: this will decrease the random dispersion; 2- take another distribution than [+1, -1] (the Rademacher distribution); [+1, +0.1, -0.1, -1] can be a good candidate: it will mainly help separate the parameters' sensitivities and give a finer multi-parameter tuning.
Nobody can say how much this can improve SPSA without testing on chess games. I hope this helps :-)
I have looked a bit at the spsa implementation in fishtest and I see no obvious problems with it. Of course the devil is sometimes in the details.
In first approximation one can use simulation to study spsa behaviour. One can even extract the spsa code from Fishtest to make sure one runs exactly the same code (in particular the batching behavior).
The difficulty is to have a realistic loss function. One could start with the one provided by @vondele https://github.com/glinscott/fishtest/issues/535#issuecomment-621363937.
Some time ago I started thinking about SPSA and wrote this document
support_ornstein_uhlenbeck.pdf
My initial hope was to get information on good choices of hyper parameters but nothing obvious came out. So the above document is incomplete and in addition it says nothing about batching.
BTW, I agree with doing simulation to study spsa behavior. I'll report later today some up-to-date results for the coef1 variable of @MJZ1977, i.e. what the Elo estimates are for various values, so that we can in principle make an accurate loss function. I guess that actually needs @vdbergh's input, to do that correctly (i.e. so that the noise is realistic as well). I'll also use it as a test for nevergrad4sf so we can compare.
@vdbergh take a look at my corrections here: https://github.com/glinscott/fishtest/compare/master...ppigazzini:spsa_fix_clean_up
- R = a/c**2 makes it very difficult to set a starting value. The seminal SPSA paper suggests using a = 0.16, and this is working fine in my tests (I'm using a = 0.2).
- function_value = games_result/2 (?), but I'm using function_value = games_result*2*c: in this way, for the same wins-losses difference, the gradient is independent of c (as a derivative IMO should be).
So, if we want to construct a loss function for experimenting: the coef1 from above, measured with 10k STC games per point, gives me this:
evaluated: {'coef1': 7} score : 47.515 +- 0.642 Elo : -17.285 +- 4.475
evaluated: {'coef1': 11} score : 47.607 +- 0.644 Elo : -16.642 +- 4.484
evaluated: {'coef1': 14} score : 47.437 +- 0.638 Elo : -17.826 +- 4.447
evaluated: {'coef1': 20} score : 48.233 +- 0.639 Elo : -12.283 +- 4.449
evaluated: {'coef1': 25} score : 48.252 +- 0.640 Elo : -12.148 +- 4.455
evaluated: {'coef1': 28} score : 48.932 +- 0.640 Elo : -7.422 +- 4.452
evaluated: {'coef1': 33} score : 48.306 +- 0.641 Elo : -11.777 +- 4.460
evaluated: {'coef1': 38} score : 48.718 +- 0.635 Elo : -8.907 +- 4.418
evaluated: {'coef1': 43} score : 48.748 +- 0.633 Elo : -8.705 +- 4.401
evaluated: {'coef1': 46} score : 48.155 +- 0.637 Elo : -12.824 +- 4.434
evaluated: {'coef1': 46} score : 48.820 +- 0.630 Elo : -8.198 +- 4.383
evaluated: {'coef1': 47} score : 48.621 +- 0.631 Elo : -9.582 +- 4.388
evaluated: {'coef1': 51} score : 49.214 +- 0.635 Elo : -5.465 +- 4.413
evaluated: {'coef1': 54} score : 49.461 +- 0.632 Elo : -3.744 +- 4.393
evaluated: {'coef1': 56} score : 49.228 +- 0.632 Elo : -5.364 +- 4.392
evaluated: {'coef1': 61} score : 49.049 +- 0.634 Elo : -6.612 +- 4.408
evaluated: {'coef1': 65} score : 49.044 +- 0.627 Elo : -6.646 +- 4.360
evaluated: {'coef1': 67} score : 48.922 +- 0.626 Elo : -7.490 +- 4.355
evaluated: {'coef1': 83} score : 49.670 +- 0.622 Elo : -2.294 +- 4.319
evaluated: {'coef1': 83} score : 49.922 +- 0.622 Elo : -0.540 +- 4.323
evaluated: {'coef1': 89} score : 49.345 +- 0.633 Elo : -4.554 +- 4.398
evaluated: {'coef1': 91} score : 48.951 +- 0.629 Elo : -7.287 +- 4.376
evaluated: {'coef1': 92} score : 50.126 +- 0.623 Elo : 0.877 +- 4.331
evaluated: {'coef1': 93} score : 49.204 +- 0.626 Elo : -5.532 +- 4.353
evaluated: {'coef1': 98} score : 49.840 +- 0.630 Elo : -1.113 +- 4.380
evaluated: {'coef1': 102} score : 50.019 +- 0.627 Elo : 0.135 +- 4.357
evaluated: {'coef1': 102} score : 50.078 +- 0.623 Elo : 0.540 +- 4.331
evaluated: {'coef1': 103} score : 49.956 +- 0.628 Elo : -0.304 +- 4.362
evaluated: {'coef1': 103} score : 50.189 +- 0.619 Elo : 1.316 +- 4.304
evaluated: {'coef1': 105} score : 49.995 +- 0.632 Elo : -0.034 +- 4.390
evaluated: {'coef1': 107} score : 50.039 +- 0.624 Elo : 0.270 +- 4.334
evaluated: {'coef1': 107} score : 50.228 +- 0.621 Elo : 1.585 +- 4.317
evaluated: {'coef1': 109} score : 50.257 +- 0.628 Elo : 1.788 +- 4.361
evaluated: {'coef1': 115} score : 49.558 +- 0.610 Elo : -3.070 +- 4.242
evaluated: {'coef1': 115} score : 49.709 +- 0.621 Elo : -2.024 +- 4.314
evaluated: {'coef1': 115} score : 50.194 +- 0.618 Elo : 1.349 +- 4.292
evaluated: {'coef1': 116} score : 50.485 +- 0.625 Elo : 3.373 +- 4.343
evaluated: {'coef1': 117} score : 50.087 +- 0.619 Elo : 0.607 +- 4.301
evaluated: {'coef1': 121} score : 49.816 +- 0.614 Elo : -1.282 +- 4.266
evaluated: {'coef1': 122} score : 50.413 +- 0.622 Elo : 2.867 +- 4.320
evaluated: {'coef1': 124} score : 50.170 +- 0.620 Elo : 1.181 +- 4.306
evaluated: {'coef1': 125} score : 50.039 +- 0.618 Elo : 0.270 +- 4.293
evaluated: {'coef1': 126} score : 50.282 +- 0.616 Elo : 1.956 +- 4.281
evaluated: {'coef1': 130} score : 50.252 +- 0.610 Elo : 1.754 +- 4.238
evaluated: {'coef1': 131} score : 50.272 +- 0.614 Elo : 1.889 +- 4.265
evaluated: {'coef1': 135} score : 49.558 +- 0.623 Elo : -3.070 +- 4.332
evaluated: {'coef1': 137} score : 50.126 +- 0.618 Elo : 0.877 +- 4.298
evaluated: {'coef1': 138} score : 50.000 +- 0.618 Elo : -0.000 +- 4.297
evaluated: {'coef1': 139} score : 50.597 +- 0.620 Elo : 4.149 +- 4.312
evaluated: {'coef1': 139} score : 50.660 +- 0.614 Elo : 4.588 +- 4.265
evaluated: {'coef1': 141} score : 50.073 +- 0.617 Elo : 0.506 +- 4.284
evaluated: {'coef1': 143} score : 49.835 +- 0.627 Elo : -1.147 +- 4.357
evaluated: {'coef1': 144} score : 49.874 +- 0.621 Elo : -0.877 +- 4.313
evaluated: {'coef1': 145} score : 50.587 +- 0.617 Elo : 4.082 +- 4.289
evaluated: {'coef1': 147} score : 50.131 +- 0.608 Elo : 0.911 +- 4.222
evaluated: {'coef1': 150} score : 49.830 +- 0.616 Elo : -1.181 +- 4.279
evaluated: {'coef1': 150} score : 50.112 +- 0.619 Elo : 0.776 +- 4.298
evaluated: {'coef1': 156} score : 49.539 +- 0.621 Elo : -3.205 +- 4.319
evaluated: {'coef1': 158} score : 50.607 +- 0.616 Elo : 4.217 +- 4.280
evaluated: {'coef1': 159} score : 49.587 +- 0.620 Elo : -2.867 +- 4.307
evaluated: {'coef1': 160} score : 49.709 +- 0.612 Elo : -2.024 +- 4.253
evaluated: {'coef1': 160} score : 50.539 +- 0.622 Elo : 3.744 +- 4.326
evaluated: {'coef1': 162} score : 49.893 +- 0.615 Elo : -0.742 +- 4.275
evaluated: {'coef1': 165} score : 50.083 +- 0.621 Elo : 0.573 +- 4.313
evaluated: {'coef1': 168} score : 50.374 +- 0.616 Elo : 2.597 +- 4.282
evaluated: {'coef1': 168} score : 50.607 +- 0.621 Elo : 4.217 +- 4.317
evaluated: {'coef1': 168} score : 50.772 +- 0.622 Elo : 5.364 +- 4.324
evaluated: {'coef1': 169} score : 50.485 +- 0.623 Elo : 3.373 +- 4.327
evaluated: {'coef1': 170} score : 49.583 +- 0.616 Elo : -2.901 +- 4.280
evaluated: {'coef1': 171} score : 50.583 +- 0.616 Elo : 4.048 +- 4.280
evaluated: {'coef1': 172} score : 50.233 +- 0.619 Elo : 1.619 +- 4.303
evaluated: {'coef1': 172} score : 50.286 +- 0.618 Elo : 1.990 +- 4.292
evaluated: {'coef1': 173} score : 49.680 +- 0.617 Elo : -2.226 +- 4.286
evaluated: {'coef1': 173} score : 50.189 +- 0.616 Elo : 1.316 +- 4.282
evaluated: {'coef1': 173} score : 50.320 +- 0.617 Elo : 2.226 +- 4.291
evaluated: {'coef1': 174} score : 49.743 +- 0.616 Elo : -1.788 +- 4.280
evaluated: {'coef1': 177} score : 50.058 +- 0.621 Elo : 0.405 +- 4.318
evaluated: {'coef1': 178} score : 49.767 +- 0.618 Elo : -1.619 +- 4.297
evaluated: {'coef1': 178} score : 50.471 +- 0.622 Elo : 3.272 +- 4.326
evaluated: {'coef1': 178} score : 50.602 +- 0.619 Elo : 4.183 +- 4.304
evaluated: {'coef1': 179} score : 50.495 +- 0.624 Elo : 3.441 +- 4.334
evaluated: {'coef1': 181} score : 50.024 +- 0.616 Elo : 0.169 +- 4.279
evaluated: {'coef1': 183} score : 49.879 +- 0.610 Elo : -0.843 +- 4.241
evaluated: {'coef1': 183} score : 50.476 +- 0.623 Elo : 3.306 +- 4.327
evaluated: {'coef1': 186} score : 50.262 +- 0.620 Elo : 1.822 +- 4.311
evaluated: {'coef1': 186} score : 50.714 +- 0.620 Elo : 4.959 +- 4.311
evaluated: {'coef1': 191} score : 50.175 +- 0.621 Elo : 1.214 +- 4.315
evaluated: {'coef1': 192} score : 50.267 +- 0.616 Elo : 1.855 +- 4.280
evaluated: {'coef1': 199} score : 49.786 +- 0.613 Elo : -1.484 +- 4.260
evaluated: {'coef1': 203} score : 49.927 +- 0.620 Elo : -0.506 +- 4.312
evaluated: {'coef1': 204} score : 49.694 +- 0.624 Elo : -2.125 +- 4.339
evaluated: {'coef1': 205} score : 49.617 +- 0.613 Elo : -2.665 +- 4.263
evaluated: {'coef1': 208} score : 49.767 +- 0.622 Elo : -1.619 +- 4.322
evaluated: {'coef1': 212} score : 49.709 +- 0.621 Elo : -2.024 +- 4.315
evaluated: {'coef1': 228} score : 49.529 +- 0.622 Elo : -3.272 +- 4.324
evaluated: {'coef1': 230} score : 49.262 +- 0.621 Elo : -5.128 +- 4.316
evaluated: {'coef1': 232} score : 49.228 +- 0.623 Elo : -5.364 +- 4.333
evaluated: {'coef1': 242} score : 48.903 +- 0.621 Elo : -7.625 +- 4.316
evaluated: {'coef1': 252} score : 48.922 +- 0.626 Elo : -7.490 +- 4.354
evaluated: {'coef1': 275} score : 48.248 +- 0.631 Elo : -12.182 +- 4.388
evaluated: {'coef1': 391} score : 42.015 +- 0.652 Elo : -55.968 +- 4.647
For the nevergrad4sf optimizer this yields the following convergence:
optimal at iter 1 after 1 evaluation and 10300 games : {'coef1': 20}
optimal at iter 2 after 5 evaluations and 51500 games : {'coef1': 56}
optimal at iter 3 after 10 evaluations and 103000 games : {'coef1': 103}
optimal at iter 4 after 15 evaluations and 154500 games : {'coef1': 115}
optimal at iter 5 after 20 evaluations and 206000 games : {'coef1': 92}
optimal at iter 6 after 25 evaluations and 257500 games : {'coef1': 168}
optimal at iter 7 after 30 evaluations and 309000 games : {'coef1': 107}
optimal at iter 8 after 35 evaluations and 360500 games : {'coef1': 115}
optimal at iter 9 after 40 evaluations and 412000 games : {'coef1': 173}
optimal at iter 10 after 45 evaluations and 463500 games : {'coef1': 169}
optimal at iter 11 after 53 evaluations and 545900 games : {'coef1': 176}
optimal at iter 12 after 61 evaluations and 628300 games : {'coef1': 143}
optimal at iter 13 after 69 evaluations and 710700 games : {'coef1': 169}
optimal at iter 14 after 77 evaluations and 793100 games : {'coef1': 184}
optimal at iter 15 after 85 evaluations and 875500 games : {'coef1': 153}
optimal at iter 16 after 101 evaluations and 1040300 games : {'coef1': 155}
I'll try to update as I get more data (Edit: update 2).
To verify if 115 could be a good parameter, I've launched a test here: https://tests.stockfishchess.org/tests/view/5eac0b636ffeed51f6e321f4 idem for 153: https://tests.stockfishchess.org/tests/view/5eac1d2c6ffeed51f6e321fe
BTW, I agree with doing simulation to study spsa behavior. I'll report later today some up-to-date results for the coef1 variable of @MJZ1977, i.e. what the Elo estimates are for various values, so that we can in principle make an accurate loss function. I guess that actually needs @vdbergh's input, to do that correctly (i.e. so that the noise is realistic as well). I'll also use it as a test for nevergrad4sf so we can compare.
@vondele Probably you have already moved on.
In any case, for simulation, the only thing we have to do is to supply a realistic function (params) --> Elo.
With such a function one can simulate the outcome of the games that are used as input to spsa.
To do this one needs an Elo model to translate Elo differences into w,d,l. In first approximation a fixed draw ratio and no opening book bias would do I think.
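That first approximation (fixed draw ratio, no opening book bias) can be sketched as follows; the function name and default draw ratio of 0.61 (the value used with simul later in the thread) are illustrative.

```python
# Convert an Elo difference into win/draw/loss probabilities under a
# fixed draw ratio and no opening bias. Only a first approximation,
# valid for small Elo differences (where p_win stays positive).
def wdl_probs(elo_diff, draw_ratio=0.61):
    expected_score = 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))
    p_draw = draw_ratio
    p_win = expected_score - p_draw / 2.0
    p_loss = 1.0 - p_win - p_draw
    return p_win, p_draw, p_loss
```

Sampling game results from these probabilities gives the simulated wins/losses stream that the spsa update code consumes.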
More advanced would be to use the BayesElo model. But then one has to do the translation (Elo, draw_ratio, bias) --> (BayesElo, draw_elo, advantage). The SPRT simulator https://github.com/vdbergh/simul does this, but the code is in C. I can extract it, but not immediately.
I haven't moved on... you're the expert for Elo -> game-result simulation :-) Also, I don't know SPSA at all, so I'd be more than happy for you, @ppigazzini, or @MJZ1977 to look into this.
So, that's the model from the latest data points. It is pretty accurately quadratic (cubic and quartic fits were equivalent). Elo(x) = 1.49643 - 1./2 * ((x - 151.148) / 23.6133) ** 2
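For convenience, the quoted fit as an executable loss function for simulation experiments (this only restates the formula above):

```python
# Quadratic Elo model fitted to the coef1 scan data:
# maximum of ~1.5 Elo at coef1 ~ 151, width ~ 23.6.
def elo_model(x):
    return 1.49643 - 0.5 * ((x - 151.148) / 23.6133) ** 2
```

Its maximum sits at x = 151.148 with Elo(151.148) = 1.49643, and it drops below master (Elo = 0) outside roughly [111, 192].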
From the two tests mentioned above:
test 1: DrawElo (BayesElo) = 250.93, RMS bias (Elo) = 31.670
test 2: DrawElo (BayesElo) = 248.71, RMS bias (Elo) = -0.000 (?!)
I have looked a bit at the spsa implementation in fishtest and I see no obvious problems with it. Of course the devil is sometimes in the details.
IMO you described well the devil in your paper: p(k+1) = p(k) + a/c. IMO it is more correct to use p(k+1) = p(k) + a.
@vondele The statistical measurement of the RMS bias needs a lot of games (and even then there are outliers). But for noob_3moves it is safe to take 30 (*) (I have been observing it for a long time).
Running simul ./simul --elo 0 --bias 30 --draw_ratio 0.61
gives
draw_elo = 250.3990
advantage = 48.3115
When using paired games it should be safe to consider advantage as the advantage for white in the BayesElo model (although in reality it will not be).
For converting Elo to BayesElo I would multiply with the scale factor (de = draw_elo):
def scale(de):
    return (4*10**(-de/400))/(1+10**(-de/400))**2
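Restating the function so the snippet runs on its own, the draw_elo of roughly 250 reported above gives a scale factor of about 0.62:

```python
# Scale factor between Elo and BayesElo as a function of draw_elo
# (restated from the definition above so this block is self-contained).
def scale(de):
    return (4 * 10 ** (-de / 400)) / (1 + 10 ** (-de / 400)) ** 2

print(round(scale(250.0), 4))  # ~0.62
```

So at these draw rates, Elo differences are compressed to a bit over 60% of the corresponding BayesElo differences.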
(*) Actually for a placebo parameter it will be much higher. I do not know if this is relevant or not.
@vondele Actually I now realize the bias/advantage is irrelevant. spsa only uses scores. So one can just set the bias to zero.
One more data drop. I wanted to see what the effect of using VSTC (2+0.02) in this case was, so very similar data:
evaluated: {'coef1': 1} score : 48.694 +- 0.761 Elo : -9.076 +- 5.295
evaluated: {'coef1': 22} score : 49.107 +- 0.766 Elo : -6.207 +- 5.322
evaluated: {'coef1': 24} score : 49.544 +- 0.761 Elo : -3.171 +- 5.287
evaluated: {'coef1': 27} score : 48.752 +- 0.765 Elo : -8.671 +- 5.322
evaluated: {'coef1': 31} score : 49.481 +- 0.758 Elo : -3.609 +- 5.268
evaluated: {'coef1': 31} score : 49.961 +- 0.765 Elo : -0.270 +- 5.315
evaluated: {'coef1': 40} score : 49.893 +- 0.764 Elo : -0.742 +- 5.312
evaluated: {'coef1': 45} score : 49.233 +- 0.761 Elo : -5.330 +- 5.290
evaluated: {'coef1': 49} score : 49.451 +- 0.763 Elo : -3.812 +- 5.302
evaluated: {'coef1': 53} score : 49.252 +- 0.756 Elo : -5.195 +- 5.253
evaluated: {'coef1': 53} score : 49.451 +- 0.764 Elo : -3.812 +- 5.308
evaluated: {'coef1': 61} score : 49.806 +- 0.761 Elo : -1.349 +- 5.292
evaluated: {'coef1': 63} score : 49.956 +- 0.762 Elo : -0.304 +- 5.295
evaluated: {'coef1': 68} score : 50.403 +- 0.765 Elo : 2.800 +- 5.316
evaluated: {'coef1': 69} score : 50.587 +- 0.756 Elo : 4.082 +- 5.255
evaluated: {'coef1': 71} score : 49.723 +- 0.762 Elo : -1.923 +- 5.299
evaluated: {'coef1': 82} score : 50.417 +- 0.752 Elo : 2.901 +- 5.227
evaluated: {'coef1': 108} score : 50.311 +- 0.757 Elo : 2.159 +- 5.260
evaluated: {'coef1': 110} score : 50.238 +- 0.756 Elo : 1.653 +- 5.251
evaluated: {'coef1': 114} score : 50.544 +- 0.754 Elo : 3.778 +- 5.243
evaluated: {'coef1': 114} score : 50.587 +- 0.753 Elo : 4.082 +- 5.237
evaluated: {'coef1': 118} score : 50.058 +- 0.753 Elo : 0.405 +- 5.234
evaluated: {'coef1': 121} score : 51.053 +- 0.751 Elo : 7.321 +- 5.219
evaluated: {'coef1': 123} score : 50.442 +- 0.755 Elo : 3.070 +- 5.248
evaluated: {'coef1': 124} score : 50.000 +- 0.754 Elo : -0.000 +- 5.240
evaluated: {'coef1': 125} score : 50.524 +- 0.750 Elo : 3.643 +- 5.210
evaluated: {'coef1': 125} score : 50.854 +- 0.757 Elo : 5.937 +- 5.265
evaluated: {'coef1': 126} score : 50.505 +- 0.757 Elo : 3.508 +- 5.261
evaluated: {'coef1': 126} score : 51.024 +- 0.753 Elo : 7.118 +- 5.233
evaluated: {'coef1': 127} score : 49.354 +- 0.754 Elo : -4.487 +- 5.242
evaluated: {'coef1': 128} score : 50.087 +- 0.756 Elo : 0.607 +- 5.254
evaluated: {'coef1': 128} score : 50.714 +- 0.752 Elo : 4.959 +- 5.224
evaluated: {'coef1': 128} score : 51.184 +- 0.752 Elo : 8.232 +- 5.232
evaluated: {'coef1': 129} score : 49.461 +- 0.758 Elo : -3.744 +- 5.268
evaluated: {'coef1': 129} score : 50.083 +- 0.751 Elo : 0.573 +- 5.219
evaluated: {'coef1': 129} score : 50.772 +- 0.755 Elo : 5.364 +- 5.249
evaluated: {'coef1': 129} score : 51.350 +- 0.750 Elo : 9.380 +- 5.219
evaluated: {'coef1': 130} score : 50.262 +- 0.752 Elo : 1.822 +- 5.226
evaluated: {'coef1': 130} score : 50.296 +- 0.759 Elo : 2.058 +- 5.272
evaluated: {'coef1': 130} score : 50.563 +- 0.754 Elo : 3.913 +- 5.242
evaluated: {'coef1': 130} score : 50.718 +- 0.754 Elo : 4.993 +- 5.239
evaluated: {'coef1': 130} score : 51.083 +- 0.754 Elo : 7.523 +- 5.242
evaluated: {'coef1': 131} score : 50.282 +- 0.752 Elo : 1.956 +- 5.225
evaluated: {'coef1': 131} score : 50.364 +- 0.754 Elo : 2.530 +- 5.239
evaluated: {'coef1': 131} score : 50.874 +- 0.755 Elo : 6.072 +- 5.252
evaluated: {'coef1': 131} score : 50.995 +- 0.760 Elo : 6.916 +- 5.281
evaluated: {'coef1': 131} score : 51.403 +- 0.753 Elo : 9.751 +- 5.237
evaluated: {'coef1': 132} score : 50.330 +- 0.759 Elo : 2.294 +- 5.274
evaluated: {'coef1': 132} score : 50.515 +- 0.751 Elo : 3.576 +- 5.217
evaluated: {'coef1': 132} score : 50.621 +- 0.755 Elo : 4.318 +- 5.250
evaluated: {'coef1': 132} score : 50.801 +- 0.756 Elo : 5.566 +- 5.256
evaluated: {'coef1': 132} score : 51.233 +- 0.757 Elo : 8.570 +- 5.261
evaluated: {'coef1': 133} score : 50.005 +- 0.756 Elo : 0.034 +- 5.256
evaluated: {'coef1': 133} score : 50.024 +- 0.751 Elo : 0.169 +- 5.222
evaluated: {'coef1': 133} score : 50.092 +- 0.749 Elo : 0.641 +- 5.207
evaluated: {'coef1': 133} score : 50.403 +- 0.757 Elo : 2.800 +- 5.259
evaluated: {'coef1': 133} score : 50.485 +- 0.752 Elo : 3.373 +- 5.224
evaluated: {'coef1': 133} score : 50.893 +- 0.751 Elo : 6.207 +- 5.223
evaluated: {'coef1': 134} score : 49.869 +- 0.751 Elo : -0.911 +- 5.222
evaluated: {'coef1': 134} score : 50.160 +- 0.750 Elo : 1.113 +- 5.213
evaluated: {'coef1': 134} score : 50.272 +- 0.755 Elo : 1.889 +- 5.244
evaluated: {'coef1': 134} score : 50.277 +- 0.757 Elo : 1.923 +- 5.264
evaluated: {'coef1': 134} score : 50.282 +- 0.754 Elo : 1.956 +- 5.243
evaluated: {'coef1': 134} score : 51.092 +- 0.754 Elo : 7.591 +- 5.244
evaluated: {'coef1': 135} score : 50.141 +- 0.755 Elo : 0.978 +- 5.245
evaluated: {'coef1': 135} score : 50.296 +- 0.754 Elo : 2.058 +- 5.238
evaluated: {'coef1': 135} score : 50.340 +- 0.755 Elo : 2.361 +- 5.247
evaluated: {'coef1': 135} score : 50.519 +- 0.753 Elo : 3.609 +- 5.232
evaluated: {'coef1': 135} score : 51.447 +- 0.746 Elo : 10.055 +- 5.186
evaluated: {'coef1': 136} score : 48.971 +- 0.749 Elo : -7.152 +- 5.209
evaluated: {'coef1': 136} score : 50.267 +- 0.754 Elo : 1.855 +- 5.241
evaluated: {'coef1': 136} score : 50.316 +- 0.751 Elo : 2.193 +- 5.218
evaluated: {'coef1': 136} score : 50.466 +- 0.752 Elo : 3.238 +- 5.228
evaluated: {'coef1': 136} score : 50.665 +- 0.749 Elo : 4.622 +- 5.203
evaluated: {'coef1': 136} score : 50.835 +- 0.750 Elo : 5.802 +- 5.212
evaluated: {'coef1': 136} score : 51.097 +- 0.749 Elo : 7.625 +- 5.209
evaluated: {'coef1': 137} score : 50.083 +- 0.753 Elo : 0.573 +- 5.236
evaluated: {'coef1': 137} score : 50.505 +- 0.756 Elo : 3.508 +- 5.253
evaluated: {'coef1': 138} score : 50.286 +- 0.757 Elo : 1.990 +- 5.260
evaluated: {'coef1': 138} score : 50.291 +- 0.752 Elo : 2.024 +- 5.225
evaluated: {'coef1': 138} score : 50.369 +- 0.752 Elo : 2.564 +- 5.227
evaluated: {'coef1': 138} score : 50.607 +- 0.752 Elo : 4.217 +- 5.225
evaluated: {'coef1': 138} score : 50.961 +- 0.757 Elo : 6.680 +- 5.265
evaluated: {'coef1': 139} score : 50.612 +- 0.755 Elo : 4.250 +- 5.249
evaluated: {'coef1': 139} score : 50.650 +- 0.750 Elo : 4.520 +- 5.210
evaluated: {'coef1': 139} score : 50.699 +- 0.752 Elo : 4.858 +- 5.228
evaluated: {'coef1': 140} score : 50.000 +- 0.754 Elo : -0.000 +- 5.243
evaluated: {'coef1': 141} score : 50.578 +- 0.753 Elo : 4.014 +- 5.230
evaluated: {'coef1': 142} score : 50.447 +- 0.752 Elo : 3.103 +- 5.230
evaluated: {'coef1': 143} score : 49.840 +- 0.755 Elo : -1.113 +- 5.247
evaluated: {'coef1': 143} score : 50.083 +- 0.757 Elo : 0.573 +- 5.259
evaluated: {'coef1': 143} score : 50.330 +- 0.751 Elo : 2.294 +- 5.216
evaluated: {'coef1': 144} score : 50.461 +- 0.750 Elo : 3.205 +- 5.211
evaluated: {'coef1': 144} score : 50.820 +- 0.759 Elo : 5.701 +- 5.275
evaluated: {'coef1': 145} score : 50.388 +- 0.753 Elo : 2.699 +- 5.233
evaluated: {'coef1': 145} score : 50.748 +- 0.760 Elo : 5.195 +- 5.282
evaluated: {'coef1': 146} score : 50.034 +- 0.755 Elo : 0.236 +- 5.247
evaluated: {'coef1': 146} score : 50.937 +- 0.752 Elo : 6.511 +- 5.227
evaluated: {'coef1': 148} score : 50.718 +- 0.752 Elo : 4.993 +- 5.227
evaluated: {'coef1': 149} score : 49.699 +- 0.750 Elo : -2.091 +- 5.214
evaluated: {'coef1': 149} score : 50.000 +- 0.751 Elo : -0.000 +- 5.219
evaluated: {'coef1': 149} score : 50.587 +- 0.751 Elo : 4.082 +- 5.217
evaluated: {'coef1': 162} score : 50.631 +- 0.753 Elo : 4.385 +- 5.236
evaluated: {'coef1': 174} score : 50.233 +- 0.752 Elo : 1.619 +- 5.225
evaluated: {'coef1': 178} score : 50.340 +- 0.755 Elo : 2.361 +- 5.248
evaluated: {'coef1': 196} score : 50.029 +- 0.752 Elo : 0.202 +- 5.225
evaluated: {'coef1': 240} score : 49.476 +- 0.754 Elo : -3.643 +- 5.240
evaluated: {'coef1': 332} score : 45.500 +- 0.761 Elo : -31.354 +- 5.329
convergence of nevergrad4sf
optimal at iter 1 after 1 evaluation and 10300 games : {'coef1': 20}
optimal at iter 2 after 5 evaluations and 51500 games : {'coef1': 69}
optimal at iter 3 after 10 evaluations and 103000 games : {'coef1': 82}
optimal at iter 4 after 15 evaluations and 154500 games : {'coef1': 114}
optimal at iter 5 after 20 evaluations and 206000 games : {'coef1': 130}
optimal at iter 6 after 28 evaluations and 288400 games : {'coef1': 140}
optimal at iter 7 after 36 evaluations and 370800 games : {'coef1': 129}
optimal at iter 8 after 44 evaluations and 453200 games : {'coef1': 140}
optimal at iter 9 after 52 evaluations and 535600 games : {'coef1': 128}
optimal at iter 10 after 60 evaluations and 618000 games : {'coef1': 133}
optimal at iter 11 after 76 evaluations and 782800 games : {'coef1': 134}
optimal at iter 12 after 92 evaluations and 947600 games : {'coef1': 133}
optimal at iter 13 after 108 evaluations and 1112400 games : {'coef1': 135}
and the updated graph showing both data sets:
Edit: a VSTC SPRT test run on fishtest shows that 135 is indeed a better value at that TC: https://tests.stockfishchess.org/tests/view/5eac30ca6ffeed51f6e32208
@ppigazzini Well I find it hard to read what I wrote myself (it was just a quick draft).
However the second display on page 2 seems to suggest that the Fishtest implementation is correct. Compare with (6) in https://www.jhuapl.edu/SPSA/PDF-SPSA/Spall_Stochastic_Optimization.PDF
We want a to be the learning rate. Of course there is a rather obscure extra factor u_1, which can however be explicitly computed (see Example 1.1).
EDIT: this is assuming that the batching doesn't do any harm.
@vdbergh SPSA is a simple gradient descent. The problem is the function we want to optimize and how we compute the derivative.
Fishtest now uses this function and derivative:
The problem is that 1. has values (for a pair of games) bounded in [-2; +2]: a derivative should not depend on the delta used to compute it. A minor problem is that 2. lacks a division by 2.
I propose to use this function and derivative:
@ppigazzini We want to do gradient ascent for the function f: params -> Elo. On average wins − losses will be proportional to c·f′, so we are really computing the derivative of f. Stochastically.
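For readers following along, here is a minimal, self-contained sketch of the textbook SPSA iteration under discussion (decaying gains, a Rademacher perturbation, and the division by 2·c_k in the gradient estimate). It is an illustrative toy under invented assumptions, not the fishtest implementation; elo_of(), the noise model, and all constants are made up.

```python
import random

# Illustrative sketch of a one-parameter SPSA run (NOT the fishtest code).
# The defaults alpha = 0.602, gamma = 0.101 and A ~ 10% of the iterations
# follow the guidelines quoted earlier in this thread.

def elo_of(x):
    """Hypothetical true Elo curve with its optimum at x = 135."""
    return -0.002 * (x - 135) ** 2

def match_result(x, games=1000):
    """Noisy Elo estimate: the true value plus match noise that shrinks
    with the number of games played, as with batching on fishtest."""
    return elo_of(x) + random.gauss(0, 5.0 / games ** 0.5)

def spsa(theta, iters=2000, a=40.0, c=4.0, alpha=0.602, gamma=0.101):
    A = 0.1 * iters  # A as ~10% of the number of iterations
    for k in range(iters):
        a_k = a / (A + k + 1) ** alpha      # decaying learning rate
        c_k = c / (k + 1) ** gamma          # decaying perturbation size
        delta = random.choice((-1, 1))      # Rademacher perturbation
        y_plus = match_result(theta + c_k * delta)
        y_minus = match_result(theta - c_k * delta)
        # Note the division by 2 * c_k * delta discussed in this thread.
        ghat = (y_plus - y_minus) / (2 * c_k * delta)
        theta += a_k * ghat                 # gradient *ascent* on Elo
    return theta

random.seed(42)
theta_final = spsa(100.0)
print(round(theta_final))  # should end up close to the optimum at 135
```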
Some time ago I started thinking about SPSA and wrote this document
support_ornstein_uhlenbeck.pdf
My initial hope was to get information on good choices of hyper parameters but nothing obvious came out. So the above document is incomplete and in addition it says nothing about batching.
@vdbergh: I tried to understand your paper, and I hope I caught the ideas. In the end, the perturbed system converges to something equal to "unperturbed" + "integral of perturbations". The integral of perturbations is not necessarily equal to zero because a(t) and c(t) are varying (that is what I have called numerical dispersion). But it is clear that if you decrease the perturbation, this integral decreases as well. That is what can happen if we increase the batch.
@ppigazzini: I understand what you wrote, but I think that our SPSA is just asking "should I increase or decrease the parameter?" and then moving it by a/c. In the end, what matters most is the direction and the ratio a/c.
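A minimal sketch of this reading, assuming a perturbed two-game pair whose result is bounded in [-2; +2] (all names and constants below are hypothetical): the classic update theta += a_k * (y_plus - y_minus) / (2 * c_k * delta) then collapses to a move of exactly a_k/c_k in the winning direction.

```python
# Hedged illustration: when the per-pair result effectively only carries
# its sign, each SPSA step is a fixed move of a_k / c_k.

def sign_step(theta, result, delta, a_k, c_k):
    """result = y_plus - y_minus for one perturbed pair; delta is +1 or -1."""
    if result == 0:                 # drawn pair: no information, no move
        return theta
    direction = 1 if result * delta > 0 else -1
    return theta + (a_k / c_k) * direction

print(sign_step(100.0, 2, 1, 3.0, 1.5))   # moves up by a_k/c_k = 2.0
```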
@vondele: impressive curves! I don't know how you did all this in just a few hours.
Batching is OK, it seems. But I wonder whether it would be worthwhile to consider the whole batch as a single iteration and to normalise the result (dividing by games/2). Basically we would be doing a more accurate measurement of the gradient of the Elo(params) function. I think batching is discussed in one of the SPSA papers, but I can't find it now.
What I am proposing is called gradient averaging and it is discussed a bit in this paper.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.4562&rep=rep1&type=pdf
See Table 1. For very noisy observations, as in our case, gradient averaging is apparently advantageous.
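Gradient averaging as described there can be sketched as follows; this is a toy under assumed constants and a made-up noisy_elo(), not Spall's or fishtest's code:

```python
import random

# Hedged sketch of gradient averaging: instead of one +/- measurement per
# iteration, average n_grads simultaneous-perturbation gradient estimates
# before making a single parameter update.

def noisy_elo(x):
    """Hypothetical Elo measurement with additive match noise."""
    return -0.002 * (x - 135) ** 2 + random.gauss(0, 0.3)

def averaged_gradient(theta, c_k, n_grads=8):
    total = 0.0
    for _ in range(n_grads):
        delta = random.choice((-1, 1))
        y_plus = noisy_elo(theta + c_k * delta)
        y_minus = noisy_elo(theta - c_k * delta)
        total += (y_plus - y_minus) / (2 * c_k * delta)
    # The variance of the averaged estimate shrinks roughly as 1/n_grads,
    # which is why it helps with very noisy observations.
    return total / n_grads
```

A single update would then be theta += a_k * averaged_gradient(theta, c_k), so the whole batch counts as one iteration, matching the normalisation suggested above.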
Issue opened to collect info about possible future SPSA improvements.
SPSA references
SPSA is a fairly simple algorithm for local optimization (not global optimization). The wiki now has simple documentation explaining the SPSA implementation in fishtest. Here is other documentation:
SPSA implementation problems/improvements
SPSA testing process (aka Time Control)
EDIT_000 this paragraph is outdated, I kept it to avoid disrupting the chain of posts:
that SPSA using LTC or even ULTC has a high signal/noise ratio that helps convergence. A ULTC match is very drawish, so in SPSA one side will win a pair of games only if the random parameter increments are somehow aligned with the gradient direction
I suggest this process to optimize the developer time and the framework CPU.
I took an SPSA from fishtest and ran it locally changing only the TC; the results are similar: