ppigazzini opened this issue 4 years ago
In principle it would be easy to make the spsa batches fixed size (say 4 or 8) like for sprt and then treat these as single iterations.
EDIT. Hmm. It seems the gradient averaging idea and the reference were already discussed in the beginning of this thread. I missed this.
@vdbergh batch/iteration with 100 or 200 games and we have gradient averaging for free.
@ppigazzini I am adding some thoughts.
SPSA implementation problems/improvements
I didn't even know this. Asking for the final values is very strange indeed.
I have already replied to this. I think the Fishtest implementation is theoretically correct.
Maybe.
I think this is at the very least not very elegant, especially since N differs between the workers.
Yes. I think this might help a lot and would be trivial to do (just treat a batch as a single iteration and divide result by games/2). However the idea should be first validated by simulation (the reasonable values for R would change).
Not sure. The A parameter is just to regularize things in the beginning. I do not think its actual size matters a lot.
batch/iteration with 100 or 200 games and we have gradient averaging for free.
Yes that's what I have been saying. It is basically free. Batch sizes of 100 seems extreme but who knows? Maybe it is possible to say something theoretically about the best batch sizes (with the methods in my notes). Our observations are extremely noisy, but on the other hand the kind of noise we face is extremely well understood theoretically.
Simulation maybe easier though.
If we can use the central limit theorem https://en.wikipedia.org/wiki/Central_limit_theorem, we are dividing the noise by sqrt(N). If N=100, dividing the noise by 10 is not a bad idea. After that, we must test and see whether the theory actually holds :-)
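A quick toy Monte Carlo (not fishtest code; the 52% win probability and the sample counts are made up) illustrating the sqrt(N) reduction:

```python
import random
import statistics

def batch_score(p_win=0.52, n_pairs=1):
    """Average +1/-1 result over n_pairs decisive game pairs."""
    return sum(1 if random.random() < p_win else -1
               for _ in range(n_pairs)) / n_pairs

def stderr(n_pairs, trials=20000):
    """Empirical standard deviation of the batch-averaged score."""
    return statistics.stdev(batch_score(n_pairs=n_pairs) for _ in range(trials))

random.seed(42)
s1, s100 = stderr(1), stderr(100)
# s1 / s100 should come out close to sqrt(100) = 10
```

The ratio of the two standard deviations is close to 10, as the central limit theorem predicts for batches of 100.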
@vdbergh I'm wondering if the SPSA test page shows wins/losses for the "test SF", like SPRT/fixed-games tests do, rather than wins/losses for the "white SF"
- we use "r_k = a_k / c_k^2" instead of "a_k"
I have already replied to this. I think the Fishtest implementation is theoretically correct.
@vdbergh you are right about the stochastic gradient f'=(wins-losses)/(2c). I don't understand why fishtest uses that artificial r_k instead of a_k. r_k=a_k/c^2, so the devs waste time trying random values without knowing that dependency.
@ppigazzini I do not find r_k so strange. We match p+c against p-c and then we change p to p ± r*c (for a single non-draw game).
@vdbergh
The simple path was:
param = {
'name': chunks[0],
'theta': float(chunks[1]),
'min': float(chunks[2]),
'max': float(chunks[3]),
'c': float(chunks[4]),
'a': float(chunks[5]),
}
'a_k': param['a'] / (spsa['A'] + iter_local) ** spsa['alpha'],
param['theta'] = self.spsa_param_clip_round(param, a_k * result / (2 * c_k * flip))
def spsa_param_clip_round(self, param, increment):
return min(max(param['theta'] + increment, param['min']), param['max'])
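Putting the snippets above together, the per-parameter update can be sketched roughly as follows. This is a sketch, not the actual fishtest code: the c_k schedule and the example numbers are assumptions (only the a_k line appears above).

```python
def spsa_update(param, spsa, iter_local, result, flip):
    """One update of a single parameter, mirroring the snippets above.

    result is wins - losses over the mini-batch, flip is the +1/-1
    perturbation sign used for this parameter.
    """
    a_k = param['a'] / (spsa['A'] + iter_local) ** spsa['alpha']
    # NOTE: this c_k schedule is the textbook SPSA one and is an
    # assumption here; only the a_k line appears in the snippets above.
    c_k = param['c'] / iter_local ** spsa['gamma']
    increment = a_k * result / (2 * c_k * flip)
    # Clip to the allowed interval, as in spsa_param_clip_round.
    param['theta'] = min(max(param['theta'] + increment, param['min']), param['max'])
    return param['theta']

# Made-up example values, purely illustrative.
spsa = {'A': 5000, 'alpha': 0.602, 'gamma': 0.101}
param = {'theta': 100.0, 'min': 0.0, 'max': 200.0, 'c': 4.0, 'a': 2.0}
theta = spsa_update(param, spsa, iter_local=1, result=1, flip=1)
```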
For reference, when tuning within SF the dev will typically only specify the interval for the variable via SetRange(first, second), or, if not specified, implicitly SetRange(0, 2 * initial). This is automatically output with:
// Print formatted parameters, ready to be copy-pasted in fishtest
std::cout << n << ","
<< v << ","
<< r(v).first << "," << r(v).second << ","
<< (r(v).second - r(v).first) / 20.0 << ","
<< "0.0020"
<< std::endl;
So, actually, most devs might not really know the details of c_end, r_end, and just assume that the defaults are reasonable.
@ppigazzini I think Joona's idea was that the semantics of rk is very clear (it is dimensionless) while for ak it is less so. Normally ak is the learning rate, but in our case we are not measuring the gradient of the loss function directly, only multiplied by a strange constant.
Just guessing.
Edit. Of course whether r or a is used internally by Fishtest does not matter for the user. So it might be indeed more sensible to use a since that is what is used in the literature.
Hmm. I was wondering what the purpose is of letting ck go to zero. After all this just increases the noise in the gradient measurement (proportional to 1/ck it seems). Then I realized that SPSA also has intrinsic noise when there is more than one parameter (it is after all stochastic). This intrinsic noise goes down when ck->0. So decreasing ck is necessary.
EDIT The claims about the intrinsic noise are wrong. See https://github.com/glinscott/fishtest/issues/535#issuecomment-622910640 below.
It seems not unreasonable to try to keep the measurement noise constant. This would mean that the batch size should be variable. More precisely the batch size should be proportional to 1/ck**2.
By letting the worker do the gradient averaging we could also reduce the intrinsic noise. So the larger batch size wouldn't be wasted. Unfortunately this would be tricky since we could no longer rely on the concurrency in cutechess-cli (every two games would have different parameters).
These two ideas are basically orthogonal to each other.
Actually for SPSA we could make games_to_play in the worker equal to 2. This would mean new parameters every two games. But the information that the server has to send (the new parameters) could be greatly compressed.
The advantage of this is that the server could do the gradient averaging (over batches proportional to 1/ck**2).
EDIT. Ah no. Then we would not be doing concurrency in the worker. So putting some limited multithreading in the worker so that cutechess-cli would be restricted to playing single game pairs seems tempting. This is BTW how my own tester worked when I was still working on GnuChess.
I have a hard time understanding why the batch sizes would need to be so small. In nevergrad4sf, I use 10000 or more games per batch. Clearly, too many games per batch will eventually be poor (no need to do e.g. >>1M games in our context), but as long as the noise >> signal, additional averaging can hardly be inefficient. A batch of just 2 games doesn't contain enough information to meaningfully adjust the parameters.
Edit: Basically, what I'm trying to say, we can use SPSA to optimize parameters for an objective function that is 'average score of a game pair' or 'average score of a 100 game match' and we'll find the same optimal parameters, and probably a similar efficiency.
@vondele A batch of just 2 games doesn't contain information to meaningfully adjust the parameters.
This is not what I am saying. What I am saying is to select stochastically a different flip (for every parameter) after every game pair.
Then after batch_size games have been completed you would do the gradient averaging and compute new parameters. This would greatly reduce the intrinsic noise compared to the current implementation (which keeps the flips constant).
A second idea I was proposing was to make batch_size variable as ck-->0 (to keep the measurement noise constant).
EDIT. What I am trying to say is that when discussing SPSA we should not confuse the noise intrinsic in the algorithm when there is more than one parameter (the S from "stochastic" in SPSA) and the measurement noise in our specific domain of chess engines testing. Sadly the current implementation deals poorly with both kinds of noise. For each kind of noise I am proposing a different remedy.
@vdbergh the two sources of noise are probably good to keep as separate concepts, but I'll need to read up on SPSA to understand what we're really talking about.
@vondele
I will try to explain it step by step (sorry if things below sound a bit too pedantic).
Suppose you have a function f(p) of one variable. To estimate its derivative we calculate
(f(p+c)-f(p-c))/(2c) (*)
where only the absolute value of c matters (the "flip" discussed below is irrelevant).
If we can only measure f noisily then (*) is also noisy and the noise is proportional to 1/c. This is the measurement noise.
In our domain we do not know f (the function params-->Elo) but we can (noisily) measure difference(f(p+c), f(p-c)). This amounts to the same, so below I neglect this distinction.
Assume now f depends on two variables f(p,q) and denote the partial derivatives by fp and fq. Assume that, as part of calculating gradient(f), we want to estimate fp. We choose c1=c*flip1, c2=c*flip2 where flip1, flip2 are chosen uniformly randomly in the 2-element set {-1,1}. To estimate fp we calculate
(f(p+c1,q+c2)-f(p-c1,q-c2))/(2*c1) (**)
The first order approximation of this is
fp+(c2/c1)*fq
So we see that (c2/c1)*fq now appears as a noise term in the estimation of fp. This is the intrinsic noise. It is present even if we can measure f accurately. The clever observation in SPSA is that the expectation value of the noise term (c2/c1)*fq is zero, so the effect of the errors will average out.
By computing (**) multiple times for different flips (without moving p,q) and averaging we can reduce the intrinsic noise. The current Fishtest implementation is equivalent to (noisily) evaluating (**) 2*N times but without changing the flips. This seems like a wasted opportunity, but it would require multithreading in the worker to fix this.
Contrary to what I claimed above, the intrinsic noise does not go down if c-->0. So it seems the actual reason for letting c-->0 is to progressively reduce the error caused by neglecting higher order terms in the approximations.
EDITED (made the label (**) point to a different equation).
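A toy numerical check of the averaging claim (a made-up quadratic test function with no measurement noise, so only the intrinsic noise is visible): redrawing the flips for each evaluation of (**) lets the (c2/c1)*fq term average out, while constant flips leave it untouched.

```python
import random

def f(p, q):
    # Quadratic test function; partial derivatives are known exactly.
    return -((p - 1.0) ** 2) - 2.0 * (q - 2.0) ** 2

def estimate_fp(p, q, c, n, fresh_flips):
    """Average n SPSA estimates (**) of fp at (p, q).

    With fresh_flips=False the flips are drawn once and reused,
    mimicking the current constant-flip batches.
    """
    flip1, flip2 = random.choice([-1, 1]), random.choice([-1, 1])
    total = 0.0
    for _ in range(n):
        if fresh_flips:
            flip1, flip2 = random.choice([-1, 1]), random.choice([-1, 1])
        c1, c2 = c * flip1, c * flip2
        total += (f(p + c1, q + c2) - f(p - c1, q - c2)) / (2 * c1)
    return total / n

random.seed(0)
p, q, c = 0.3, 0.7, 0.1
true_fp = -2.0 * (p - 1.0)  # exact partial derivative: 1.4
err_fresh = abs(estimate_fp(p, q, c, 400, True) - true_fp)
# Constant flips: the (c2/c1)*fq term never averages out; the error
# stays at about |fq| = 5.2 no matter how many evaluations we average.
errs_const = [abs(estimate_fp(p, q, c, 400, False) - true_fp) for _ in range(50)]
```

For a quadratic f the estimate equals fp + (c2/c1)*fq exactly, so the constant-flip error is always |fq| while the fresh-flip error shrinks with the number of redraws.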
Thanks for explaining! A few comments:
- f can just be a match of 200 games run on a worker for fixed values of p, q, c1, c2 in f(p+c1, q+c2), and the server can take care of the full SPSA algorithm, without loss of efficiency. Different values of p, q, flip1, flip2, or c could all be handled by the server on this 'multi game f'; the score of the 200 games match is enough info.
- c-->0 sounds very risky. For finite difference estimates of derivatives, there is definitely a smallest value for c (I have often seen it called h) that can be used. I've seen it discussed in https://aip.scitation.org/doi/abs/10.1063/1.4822971 in the context of roundoff, but roundoff and noise are not too different. I guess we could use the total number of games to estimate roundoff/noise, and pick a reasonable fixed value of c. Also, so far, I have not seen higher order terms in f be very important (the STC data example above shows this: quartic and cubic fits are pretty close, and even with ~1M games the difference between both is not statistically relevant).
- For me an optimizer gets only very little input, namely an initial value and an interval [a,b] which is (a few times) the characteristic length scale of the problem, and the interval where the optimum is expected to be. So for example for the coef1 above, Elo(x) = 1.49643 - 1./2 * ((x - 151.148) / 23.6133) ** 2, I would say that [185 - 100, 185 + 100] should be the user input (an interval around the current value, consistent with the 23.6 length scale), and all the rest should be automatic.
I assume that if we find the optimal settings for SPSA for quadratic functions, in the presence of noise of known magnitude, we should be able to have a near-optimal solution.
@vdbergh: in the expression fp+(c2/c1)*fq, you count only on noise to reduce (c2/c1)*fq. In the current state of SPSA, p and q are moving, and even c and a. This reduction is then not perfect and can take many iterations.
With my second proposition of flip in [-1, -0.1, 0.1, 1], you will have more iterations with fp+(c2/c1)*fq close to fp. My feeling is that it will converge faster at the end.
@vondele
Maybe. But to me running a 100 game match with constant flips seems to be sacrificing accuracy for no benefit (varying the flips would reduce the intrinsic error in the gradient estimation by sqrt(50) with no downside). The only limitation here is technical (the worker would have to be able to run cutechess-cli matches in parallel).
I agree with your misgivings about c-->0 but it is what the "official" SPSA algorithm does. Note that the effect is compensated by the fact that the learning rate (a) also goes to zero, and much faster. So in the end things turn out to be stable even in the presence of noise (one can see this clearly in the Fishtest implementation). Still, I think in our setting there is little benefit in letting c-->0 since it basically just increases the measurement noise in the gradient estimation.
@MJZ1977
in the expression fp+(c2/c1)fq , you count only on noise to reduce (c2/c1)fq. In current state of SPSA, p and q are moving and even c and a.
Nothing is moving during a worker batch which currently has size 2*concurrency.
With my second proposition of flip in [-1, -0.1, 0.1, 1], you will have more iterations with fp+(c2/c1)*fq close to fp. My feeling is that it will converge faster at the end.
If I understand you correctly then that might be dangerous. You may end up with c2=c, c1=0.1*c so that c2/c1=10, i.e. 10 times more noise in the fp estimation. Using binary values for ci is strongly encouraged in the SPSA articles.
Nothing is moving during a worker batch which currently has size 2*concurrency.
Yes I know, I am speaking outside batches. Inside batches, it will not be easy to change the flips randomly (for technical reasons as you said before).
With my second proposition of flip in [-1, -0.1, 0.1, 1], you will have more iterations with fp+(c2/c1)*fq close to fp. My feeling is that it will converge faster at the end.
If I understand you correctly then that might be dangerous. You may end up with c2=c, c1=0.1*c so that c2/c1=10. I.e. 10 times more noise in the fp estimation. Using binary values for ci is strongly encouraged in the SPSA articles.
It will not be dangerous because the gradient is re-multiplied by the flip, so by 0.1. At the end we will have sums of: fp + fc, fp - fc, fp/10 + fc, fp/10 - fc, fp + fc/10, fp - fc/10. But OK, it is just a feeling that with random sums this converges faster to fp. Let's forget it for now.
@MJZ1977 It is dangerous because you blow up the intrinsic error in the gradient estimation. Gradient estimation is the theoretical basis for SPSA. However for a discrete choice of flips the algorithm will still converge, albeit slower. What is for example not allowed is to use a uniformly distributed flip in [-1,1]. Then the algorithm will not converge (at least the proof does not work, this is stated in the papers). But your argument (erroneous I think) why it is not dangerous would still apply.
Something else I do not really understand is the following. SPSA seems to be a serial algorithm in a rather essential way. But on Fishtest many workers cooperate. How does this work?
Something else I do not really understand is the following. SPSA seems to be a serial algorithm in a rather essential way. But on Fishtest many workers cooperate. How does this work?
Without reading the code I thought that in each round all workers have the same parameters in order to work serially. If this is not true then it's a greedy implementation that could find good values near the local optimum but can hardly converge to the local optimum itself. Perhaps Joona used SPSA only serially.
Something else I do not really understand is the following. SPSA seems to be a serial algorithm in a rather essential way. But on Fishtest many workers cooperate. How does this work?
For each SPSA run the fishtest server stores the current parameter values. When it receives a mini-batch (currently 2*concurrency games) of results from a worker it adjusts the parameters, see:
and returns the adjusted values to the same worker for use in the next mini-batch.
So it is very simplistic. Increasing the size of mini-batches to the full 200 games would have the effect that workers play more games with an outdated set of parameters. By outdated I mean not adjusted by the most recent worker updates. Also, slower workers will less frequently see the updated results of other, faster workers.
Perhaps Joona used SPSA only serially.
If I remember correctly Joona had a 4-core machine, which would be considered high-end in those days.
Looking at: https://github.com/zamar/spsa/blob/10b2a8266bbf1cdf9011f83d38e0d292767d07eb/spsa.pl#L250
one can see that the shared global parameters were adjusted after every played game pair for each thread.
Edit: that would mean that it was much more serial than the current fishtest implementation, with hundreds of workers each playing a mini-batch of e.g. 16 games.
Edit 2:
If this is not true then it's a greedy implementation that could find good values near the local optimum but that very hardly can converge to the local optimum.
Yes, it is a greedy implementation. Every worker is working on a different round.
@tomtor @ppigazzini Thanks. So it seems difficult to transplant the "pure" SPSA algorithm to the Fishtest setting. Some things that could be tried are
Treat a mini-batch as an iteration (hence a,c would decrease much slower).
Do not change the parameters after every minibatch but only if enough games have been received for these parameters (but what if these games are for outdated parameters?).
@tomtor I'm wondering if we could easily (without extra code to deal with old data) revert these PRs:
(f(clip(p+c))-f(clip(p-c)))/(2c)
@vondele
For me an optimizer gets only very little input namely an initial value and an interval [a,b] which is (a few times) the characteristic length scale of the problem, and the interval where the optimum is expected to be.
I called this "internal normalization of the variable values based upon the starting values and the bounds". This has several advantages:
@ppigazzini this 'internal normalization' would in that case be the right direction IMO.
concerning reverts etc. I think we should base changes on simulator results, for a simulator that is as close as possible to the real case, i.e. starting from something that generates game outcomes given Elo(x)/DrawElo/Bias (i.e. the model of our objective function). Probably also take into account the parallelism issue mentioned.
BTW, the nevergrad framework has standalone benchmarks, and I know also includes SPSA (but I have not tried it yet, since they find TBPSA to be best for noisy objective functions) https://facebookresearch.github.io/nevergrad/benchmarks.html.
I'm wondering if we could easily (without extra code to deal with old data) revert these PRs
@ppigazzini From a quick inspection I conclude that these PRs can be reverted without a need for fixup code to display these tests.
@vondele thanks to your suggestion here https://github.com/glinscott/fishtest/pull/572#issuecomment-611967091 it's no longer possible to use "careful clipping" and "randomized rounding", so we are simply talking about dropping dead code :)
@vdbergh Joona's SPSA README https://github.com/zamar/spsa has some interesting info, take a look. By the way, here is the genesis of Rk (and IMO it makes no sense wrt Gk=ak/ck). Joona's code plots all the variables during the iterations; fishtest asks for R_end and plots a_k ...
2. Modifications for chess engine testing
=========================================
Please see matlab pseudocode by Spall (document, p. 821) as a starting point.
For chess engine testing the following lines in the pseudocode
yplus = loss(thetaplus)
yminus = loss(thetaminus)
ghat = (yplus - yminus) / (2 * ck * delta)
are replaced with:
ghat = match(thetaplus, thetaminus) / (ck * delta)
match() plays a game between two identical engines.
If engine 1 (= thetaplus) wins, it returns 1.
If engine 2 (= thetaminus) wins, it returns -1.
If game is drawn, it returns 0.
For engine testing it makes sense to define a new variable "relative apply factor" Rk:
Rk := ak / ck^2.
Its interpretation is the following:
theta := theta + ak * ghat
<=>
theta := theta + ak * match(thetaplus, thetaminus) / (ck * delta)
<=>
theta := theta + Rk * ck * match(thetaplus, thetaminus) / delta
Instead of trying to find ideal values for ak, we can try to find an ideal value for Rk.
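The equivalence of the two update rules in the README can be verified numerically (made-up example values for a single non-draw game):

```python
def update_with_ak(theta, ak, ck, delta, match):
    # theta := theta + ak * ghat, where ghat = match / (ck * delta)
    return theta + ak * match / (ck * delta)

def update_with_rk(theta, rk, ck, delta, match):
    # theta := theta + Rk * ck * match / delta, with Rk := ak / ck**2
    return theta + rk * ck * match / delta

# Made-up values: one win for thetaplus (match = 1), flip delta = 1.
ak, ck, delta, theta, match = 0.12, 3.0, 1, 50.0, 1
rk = ak / ck ** 2
t1 = update_with_ak(theta, ak, ck, delta, match)
t2 = update_with_rk(theta, rk, ck, delta, match)
assert abs(t1 - t2) < 1e-12  # the two forms give the same theta
```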
What I am saying is to select stochastically a different flip (for every parameter) after every game pair.
Then after batch_size games have been completed you would do the gradient averaging and compute new parameters. This would greatly reduce the intrinsic noise compared to the current implementation (which keeps the flips constant).
@vdbergh recent paper about Parallel SPSA (only for gradient estimation) https://arxiv.org/pdf/1704.00223
The main problem with fishtest is how to deal with slow/stopped workers.
@ppigazzini I also have been thinking about the slow/stopped workers case. If we're averaging over flips or have larger batch sizes to compute, we can do a little redundant calculation, e.g. give these tasks to 1.1 * N workers instead of N, and proceed as soon as we have N results back. There still might need to be a timeout mechanism, since there is no guarantee we get N results back.
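The "issue 1.1*N tasks, proceed on the first N results" idea can be sketched like this, with threads standing in for workers (all names and numbers here are hypothetical, not fishtest code):

```python
import concurrent.futures as cf
import random
import time

def play_batch(task_id):
    """Stand-in for a worker playing one mini-batch; some 'workers' are slow."""
    time.sleep(random.uniform(0.0, 0.05))
    return task_id

n_needed = 10
n_issued = int(1.1 * n_needed) + 1  # small redundancy for stragglers

with cf.ThreadPoolExecutor(max_workers=n_issued) as pool:
    futures = [pool.submit(play_batch, i) for i in range(n_issued)]
    results = []
    # Proceed as soon as n_needed results are back; a timeout is still
    # needed because there is no guarantee that n_needed ever return.
    for fut in cf.as_completed(futures, timeout=30.0):
        results.append(fut.result())
        if len(results) >= n_needed:
            break
```

The redundant stragglers are simply ignored once enough results have arrived, which mirrors the proposal above.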
@vondele I think your proposal is reasonable.
I read it like this.
{'task_alive' : False}
). For this to be not too wasteful the workers would have to call in frequently, e.g. after each completed game pair, as reported by cutechess-cli. Alternatively we may just accept that some workers will send results for outdated parameters, hoping that this does not affect the results too much.
If a worker goes away then its task will migrate to another worker. So maybe a timeout is not necessary?
@ppigazzini Thanks for the reference. It is basically like Vondele's proposal. Except that it contains a smarter way to combine the results for different flips (I still think averaging is good enough, but...).
I am not sure about using the Hessian. There is so much noise in our measurements that we can hardly estimate the gradient. So estimating the Hessian seems daunting. On the other hand I have not looked at the algorithm so maybe they are smart.
If a worker goes away then its task will migrate to another worker. So maybe a timeout is not necessary?
If a worker is terminated cleanly (ctrl-c) it will report the shutdown:
But hard killed workers are cleaned up much later after 30 minutes by the server:
So the scavenge should be made a bit more aggressive/smarter, e.g. cleanup after twice the time between the previous updates.
@ppigazzini I thought a bit more. The method they use for combining the result for different flips is a least square solution to a system of linear equations. It seems quite simple and we could easily use it in Fishtest (as we have numpy anyway).
@tomtor, concerning the scavenging, should we be more careful now with the batching, i.e. a slow (nps) 1 core worker with a batch might actually take some time. Also in light of the longest game I found on fishtest (902 plies, > 30min) https://github.com/protonspring/Stockfish/commit/ac61d37d733b65afb8716cfb011aa5982c1cb7e3#commitcomment-38734921 .
@ppigazzini I thought a bit more. The method they use for combining the result for different flips is a least square solution to a system of linear equations. It seems quite simple and we could easily use it in Fishtest (as we have numpy anyway).
@vdbergh Correct, it's a simple least squares solution (and the dual version), and switching the sign of a single flip per column/row guarantees that the matrix is not singular. Very clever :) With totally random flips you should use ridge regression to regularize the matrix.
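A minimal numpy sketch of the least squares combination (an assumed setup, not the paper's exact construction): each measured central difference satisfies y_i = d_i · g for a locally linear f, where d_i is the i-th flip vector scaled by c, and lstsq recovers the gradient g from the batch.

```python
import numpy as np

# Hypothetical gradient to recover and perturbation size.
true_grad = np.array([1.5, -0.7, 0.3])
c = 0.2

# One base flip row plus rows with a single sign switched, which
# guarantees a non-singular system (cf. the remark above).
flips = np.array([[ 1.0,  1.0,  1.0],
                  [-1.0,  1.0,  1.0],
                  [ 1.0, -1.0,  1.0],
                  [ 1.0,  1.0, -1.0]])
D = c * flips

# Each measurement is the central difference (f(x+d) - f(x-d)) / 2,
# which for (locally) linear f equals d . grad.
y = D @ true_grad

# Least squares estimate of the gradient from the batch of flips.
g_hat, *_ = np.linalg.lstsq(D, y, rcond=None)
```

With noisy y this same call gives the least squares gradient estimate; fully random flips would occasionally make D ill-conditioned, which is where the ridge regularization mentioned above comes in.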
@vondele I think for spsa we should set batch_size=2 (call update_task every game pair; currently batch_size=2*concurrency) and keep games_to_play=2*concurrency like it is now. This should be suitable on average.
If we move to a new set of parameters we have to accept that occasionally some workers will be left behind (like the poor 1-core worker churning away at its 900-move LTC game which will ultimately turn out to be useless). Unless we create a side channel for the server to talk to a worker independently of update_task.
Creating such a side channel would be rather easy in principle. While parsing cutechess-cli output a worker could also call a separate server api every minute or so (or whatever interval is reasonable) to get side channel messages.
@ppigazzini It is actually a good question whether there is any theoretical advantage in choosing the flips randomly (instead of following a predetermined pattern as in FDSA) if we do not move the parameters (except for ease of programming, of course). I'd like to think there is...
@vdbergh This should be the reference paper for SPSA and deterministic perturbation: https://www.researchgate.net/publication/234814464_Two-Timescale_Simultaneous_Perturbation_Stochastic_Approximation_using_Deterministic_Perturbation_Sequences
Another paper (2006) with a smart remark: https://www.researchgate.net/publication/220343758_Universal_parameter_optimisation_in_games_based_on_SPSA
"The above claims are asymptotic by their nature. In this paper, on the other hand, we are primarily interested in the transient behaviour of the algorithms. This is because we do not expect them to converge within the allocated time-frame to a small neighbourhood of a stationary point, due to the inherent complexity of the optimisation task and the long running times that are typical for game-simulations. In the transient regime, according to some authors, FDSA might actually perform better than SPSA."
I will continue my review on recent practical works on stochastic recursive algorithms for optimization after some food :)
@ppigazzini Thanks. I haven't looked at the references yet but I did a quick check for 2 parameters and it seems that the propagation of the measurement noise to the gradient estimate, when choosing a random sequence, is actually slightly worse (but asymptotically for many flips it does not matter). For many parameters (>> batch size) the advantages of a random sequence are clear (there is simply no good deterministic sequence).
of course we're normally using pseudo-random and thus deterministic sequences... for (quasi) monte carlo sampling, people have been using Sobol sequences with some luck. BTW, I doubt we can really do SPSA optimizations on fishtest in very high dimensions... would already be cool if it worked for <10 params reliably.
@ppigazzini I looked at your second reference and it describes exactly the algorithm we are proposing to implement! See §3.1. In particular we should vary the flips as much as we can and the batch size should increase with c!
I looked at your second reference and it describes exactly the algorithm we are proposing to implement! See §3.1. In particular we should vary the flips as much as we can and the batch size should increase with c!
Is it possible to implement this in an agile way, changing the batch size and parameter (flip) calculations but leaving the current simple-minded concurrency implementation as is in a first implementation?
The reason I propose this is that optimal efficiency is not our first concern. The current implementation seems to be unable to tune parameters unless the tune results in a gain of e.g. at least 5 Elo; even playing millions of games will not change that. If that is caused by the current concurrency implementation then we should obviously fix that first, but from the discussion I get the impression that that is not the main issue.
@tomtor If we don't mind that some outdated results are used in the gradient calculation then I don't think one needs a lot of code. Worker changes aren't even necessary. Basically on the server we would not change parameters immediately after every worker update but wait till a larger contingent of results is collected and then do a more refined gradient estimation. I can try to think about it but right now I am extremely busy in my professional life (the issue is not writing the code, which is usually trivial, but validating it so that one is confident that there are no bugs).
Issue opened to collect info about possible future SPSA improvements.
SPSA references
SPSA is a fairly simple algorithm to be used for local optimization (not global optimization). The wiki now has simple documentation explaining the SPSA implementation in fishtest. Here is other documentation:
SPSA implementation problems/improvements
SPSA testing process (aka Time Control)
EDIT_000 this paragraph is outdated, I kept it to avoid disrupting the chain of posts:
that SPSA using LTC or even ULTC has a high signal/noise ratio that helps convergence. A ULTC match is very drawish, so in SPSA one side will win a game pair only if the random parameter increments are somehow aligned with the gradient direction
I suggest this process to optimize the developer time and the framework CPU.
I took an SPSA from fishtest and ran it locally changing only the TC; the results are similar: