ppigazzini opened this issue 4 years ago
In principle it would be easy to make the spsa batches fixed size (say 4 or 8) like for sprt and then treat these as single iterations.
EDIT. Hmm. It seems the gradient averaging idea and the reference were already discussed in the beginning of this thread. I missed this.
@vdbergh batch/iteration with 100 or 200 games and we have gradient averaging for free.
@ppigazzini I am adding some thoughts.
SPSA implementation problems/improvements
I didn't even know this. Asking for the final values is very strange indeed.
I have already replied to this. I think the Fishtest implementation is theoretically correct.
Maybe.
I think this is at the very least not very elegant, especially since N differs between the workers.
Yes. I think this might help a lot and would be trivial to do (just treat a batch as a single iteration and divide result by games/2). However the idea should be first validated by simulation (the reasonable values for R would change).
Not sure. The A parameter is just to regularize things in the beginning. I do not think its actual size matters a lot.
batch/iteration with 100 or 200 games and we have gradient averaging for free.
Yes that's what I have been saying. It is basically free. Batch sizes of 100 seems extreme but who knows? Maybe it is possible to say something theoretically about the best batch sizes (with the methods in my notes). Our observations are extremely noisy, but on the other hand the kind of noise we face is extremely well understood theoretically.
Simulation maybe easier though.
If we can use the central limit theorem https://en.wikipedia.org/wiki/Central_limit_theorem, we are dividing the noise by sqrt(N). If N=100, dividing the noise by 10 is not a bad idea. After that, we must test and see whether the theory actually holds :-)
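A quick toy Monte Carlo (not fishtest code; the 52% win probability and the sample counts are made up) illustrating the sqrt(N) reduction:

```python
import random
import statistics

def batch_score(p_win=0.52, n_pairs=1):
    """Average +1/-1 result over n_pairs decisive game pairs."""
    return sum(1 if random.random() < p_win else -1
               for _ in range(n_pairs)) / n_pairs

def stderr(n_pairs, trials=20000):
    """Empirical standard deviation of the batch-averaged score."""
    return statistics.stdev(batch_score(n_pairs=n_pairs) for _ in range(trials))

random.seed(42)
s1, s100 = stderr(1), stderr(100)
# s1 / s100 should come out close to sqrt(100) = 10
```

The ratio of the two standard deviations is close to 10, as the central limit theorem predicts for batches of 100.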
@vdbergh I'm wondering if the SPSA test page shows wins/losses for the "test SF", like SPRT/fixed-games tests do, rather than wins/losses for the "white SF"
- we use "r_k = a_k / c_k^2" instead of "a_k"
I have already replied to this. I think the Fishtest implementation is theoretically correct.
@vdbergh you are right about the stochastic gradient f'=(wins-losses)/(2c). I don't understand why fishtest uses that artificial r_k instead of a_k. r_k=a_k/c^2, so the devs waste time trying random values without knowing that dependency.
@ppigazzini I do not find r_k so strange. We match p+c against p-c and then we change p to p ± r*c (for a single non-draw game).
@vdbergh
The simple path was:
param = {
'name': chunks[0],
'theta': float(chunks[1]),
'min': float(chunks[2]),
'max': float(chunks[3]),
'c': float(chunks[4]),
'a': float(chunks[5]),
}
'a_k': param['a'] / (spsa['A'] + iter_local) ** spsa['alpha'],
param['theta'] = self.spsa_param_clip_round(param, a_k * result / (2 * c_k * flip))
def spsa_param_clip_round(self, param, increment):
return min(max(param['theta'] + increment, param['min']), param['max'])
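Putting the snippets above together, the per-parameter update can be sketched roughly as follows. This is a sketch, not the actual fishtest code: the c_k schedule and the example numbers are assumptions (only the a_k line appears above).

```python
def spsa_update(param, spsa, iter_local, result, flip):
    """One update of a single parameter, mirroring the snippets above.

    result is wins - losses over the mini-batch, flip is the +1/-1
    perturbation sign used for this parameter.
    """
    a_k = param['a'] / (spsa['A'] + iter_local) ** spsa['alpha']
    # NOTE: this c_k schedule is the textbook SPSA one and is an
    # assumption here; only the a_k line appears in the snippets above.
    c_k = param['c'] / iter_local ** spsa['gamma']
    increment = a_k * result / (2 * c_k * flip)
    # Clip to the allowed interval, as in spsa_param_clip_round.
    param['theta'] = min(max(param['theta'] + increment, param['min']), param['max'])
    return param['theta']

# Made-up example values, purely illustrative.
spsa = {'A': 5000, 'alpha': 0.602, 'gamma': 0.101}
param = {'theta': 100.0, 'min': 0.0, 'max': 200.0, 'c': 4.0, 'a': 2.0}
theta = spsa_update(param, spsa, iter_local=1, result=1, flip=1)
```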
For reference, when tuning within SF the dev will typically only specify the interval for the variable via SetRange(first, second), or, if not specified, implicitly SetRange(0, 2 * initial). This is automatically output with:
// Print formatted parameters, ready to be copy-pasted in fishtest
std::cout << n << ","
<< v << ","
<< r(v).first << "," << r(v).second << ","
<< (r(v).second - r(v).first) / 20.0 << ","
<< "0.0020"
<< std::endl;
So, actually, most devs might not really know the details of c_end, r_end, and just assume that the defaults are reasonable.
@ppigazzini I think Joona's idea was that the semantics of rk is very clear (it is dimensionless) while for ak it is less so. Normally ak is the learning rate, but in our case we are not measuring the gradient of the loss function directly, only multiplied by a strange constant.
Just guessing.
Edit. Of course whether r or a is used internally by Fishtest does not matter for the user. So it might be indeed more sensible to use a since that is what is used in the literature.
Hmm. I was wondering what the purpose is of letting ck go to zero. After all this just increases the noise in the gradient measurement (proportional to 1/ck it seems). Then I realized that SPSA also has intrinsic noise when there is more than one parameter (it is after all stochastic). This intrinsic noise goes down when ck->0. So decreasing ck is necessary.
EDIT The claims about the intrinsic noise are wrong. See https://github.com/glinscott/fishtest/issues/535#issuecomment-622910640 below.
It seems not unreasonable to try to keep the measurement noise constant. This would mean that the batch size should be variable. More precisely the batch size should be proportional to 1/ck**2.
By letting the worker do the gradient averaging we could also reduce the intrinsic noise. So the larger batch size wouldn't be wasted. Unfortunately this would be tricky since we could no longer rely on the concurrency in cutechess-cli (every two games would have different parameters).
These two ideas are basically orthogonal to each other.
Actually for SPSA we could make games_to_play in the worker equal to 2. This would mean new parameters every two games. But the information that the server has to send (the new parameters) could be greatly compressed.
The advantage of this is that the server could do the gradient averaging (over batches proportional to 1/ck**2).
EDIT. Ah no. Then we would not be doing concurrency in the worker. So putting some limited multithreading in the worker so that cutechess-cli would be restricted to playing single game pairs seems tempting. This is BTW how my own tester worked when I was still working on GnuChess.
I have a hard time understanding why the batch sizes would need to be so small. In nevergrad4sf, I use 10000 or more games per batch. Clearly, too many games per batch will eventually be poor (no need to do e.g. >>1M games in our context), but as long as the noise >> signal, additional averaging can hardly be inefficient. A batch of just 2 games doesn't contain enough information to meaningfully adjust the parameters.
Edit: Basically, what I'm trying to say, we can use SPSA to optimize parameters for an objective function that is 'average score of a game pair' or 'average score of a 100 game match' and we'll find the same optimal parameters, and probably a similar efficiency.
@vondele A batch of just 2 games doesn't contain information to meaningfully adjust the parameters.
This is not what I am saying. What I am saying is to select stochastically a different flip (for every parameter) after every game pair.
Then after batch_size games have been completed you would do the gradient averaging and compute new parameters. This would greatly reduce the intrinsic noise compared to the current implementation (which keeps the flips constant).
A second idea I was proposing was to make batch_size variable as ck-->0 (to keep the measurement noise constant).
EDIT. What I am trying to say is that when discussing SPSA we should not confuse the noise intrinsic in the algorithm when there is more than one parameter (the S from "stochastic" in SPSA) and the measurement noise in our specific domain of chess engines testing. Sadly the current implementation deals poorly with both kinds of noise. For each kind of noise I am proposing a different remedy.
@vdbergh the two sources of noise are probably good to keep as separate concepts, but I'll need to read up on SPSA to understand what we're really talking about.
@vondele
I will try to explain it step by step (sorry if things below sound a bit too pedantic).
Suppose you have a function f(p) of one variable. To estimate its derivative we calculate
(f(p+c)-f(p-c))/(2c) (*)
where only the absolute value of c matters (the "flip" discussed below is irrelevant).
If we can only measure f noisily then (*) is also noisy and the noise is proportional to 1/c. This is the measurement noise.
In our domain we do not know f (the function params-->Elo) but we can (noisily) measure difference(f(p+c), f(p-c)). This amounts to the same, so below I neglect this distinction.
Assume now f depends on two variables f(p,q) and denote the partial derivatives by fp and fq. Assume that, as part of calculating gradient(f), we want to estimate fp. We choose c1=c*flip1, c2=c*flip2 where flip1, flip2 are chosen uniformly randomly in the 2-element set {-1,1}. To estimate fp we calculate
(f(p+c1,q+c2)-f(p-c1,q-c2))/(2*c1) (**)
The first order approximation of this is
fp+(c2/c1)*fq
So we see that (c2/c1)*fq now appears as a noise term in the estimation of fp. This is the intrinsic noise. It is present even if we can measure f accurately. The clever observation in SPSA is that the expectation value of the noise term (c2/c1)*fq is zero, so the effect of the errors will average out.
By computing (**) multiple times for different flips (without moving p,q) and averaging we can reduce the intrinsic noise. The current Fishtest implementation is equivalent to (noisily) evaluating (**) 2*N times but without changing the flips. This seems like a wasted opportunity, but it would require multithreading in the worker to fix this.
Contrary to what I claimed above, the intrinsic noise does not go down if c-->0. So it seems the actual reason for letting c-->0 is to progressively reduce the error caused by neglecting higher order terms in the approximations.
EDITED (made the label (**) point to a different equation).
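A toy numerical check of the averaging claim (a made-up quadratic test function with no measurement noise, so only the intrinsic noise is visible): redrawing the flips for each evaluation of (**) lets the (c2/c1)*fq term average out, while constant flips leave it untouched.

```python
import random

def f(p, q):
    # Quadratic test function; partial derivatives are known exactly.
    return -((p - 1.0) ** 2) - 2.0 * (q - 2.0) ** 2

def estimate_fp(p, q, c, n, fresh_flips):
    """Average n SPSA estimates (**) of fp at (p, q).

    With fresh_flips=False the flips are drawn once and reused,
    mimicking the current constant-flip batches.
    """
    flip1, flip2 = random.choice([-1, 1]), random.choice([-1, 1])
    total = 0.0
    for _ in range(n):
        if fresh_flips:
            flip1, flip2 = random.choice([-1, 1]), random.choice([-1, 1])
        c1, c2 = c * flip1, c * flip2
        total += (f(p + c1, q + c2) - f(p - c1, q - c2)) / (2 * c1)
    return total / n

random.seed(0)
p, q, c = 0.3, 0.7, 0.1
true_fp = -2.0 * (p - 1.0)  # exact partial derivative: 1.4
err_fresh = abs(estimate_fp(p, q, c, 400, True) - true_fp)
# Constant flips: the (c2/c1)*fq term never averages out; the error
# stays at about |fq| = 5.2 no matter how many evaluations we average.
errs_const = [abs(estimate_fp(p, q, c, 400, False) - true_fp) for _ in range(50)]
```

For a quadratic f the estimate equals fp + (c2/c1)*fq exactly, so the constant-flip error is always |fq| while the fresh-flip error shrinks with the number of redraws.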
Thanks for explaining! A few comments:
- f can just be a match of 200 games run on a worker for fixed values of p, q, c1, c2 in f(p+c1, q+c2), and the server can take care of the full SPSA algorithm, without loss of efficiency. Different values of p, q, flip1, flip2, or c could all be handled by the server on this 'multi game f'; the score of the 200 games match is enough info.
- c-->0 sounds very risky. For finite difference estimates of derivatives, there is definitely a smallest value for c (I have often seen it called h) that can be used. I've seen it discussed in https://aip.scitation.org/doi/abs/10.1063/1.4822971 in the context of roundoff, but roundoff and noise are not too different. I guess we could use the total number of games to estimate roundoff/noise, and pick a reasonable fixed value of c. Also, so far, I have not seen higher order terms in f be very important (the STC data example above shows this: quartic and cubic fits are pretty close, and even with ~1M games the difference between both is not statistically relevant).
- For me an optimizer gets only very little input, namely an initial value and an interval [a,b] which is (a few times) the characteristic length scale of the problem, and the interval where the optimum is expected to be. So for example for the coef1 above, Elo(x) = 1.49643 - 1./2 * ((x - 151.148) / 23.6133) ** 2, I would say that [185 - 100, 185 + 100] should be the user input (an interval around the current value, consistent with the 23.6 length scale), and all the rest should be automatic.
I assume that if we find the optimal settings for SPSA for quadratic functions, in the presence of noise of known magnitude, we should be able to have a near-optimal solution.
@vdbergh: in the expression fp+(c2/c1)*fq, you count only on noise to reduce (c2/c1)*fq. In the current state of SPSA, p and q are moving, and even c and a. This reduction is then not perfect and can take many iterations.
With my second proposition of flip in [-1, -0.1, 0.1, 1], you will have more iterations with fp+(c2/c1)*fq close to fp. My feeling is that it will converge faster at the end.
@vondele
Maybe. But to me running a 100 game match with constant flips seems to be sacrificing accuracy for no benefit (varying the flips would reduce the intrinsic error in the gradient estimation by sqrt(50) with no downside). The only limitation here is technical (the worker would have to be able to run cutechess-cli matches in parallel).
I agree with your misgivings about c-->0 but it is what the "official" SPSA algorithm does. Note that the effect is compensated by the fact that the learning rate (a) also goes to zero, and much faster. So in the end things turn out to be stable even in the presence of noise (one can see this clearly in the Fishtest implementation). Still, I think in our setting there is little benefit in letting c-->0 since it basically just increases the measurement noise in the gradient estimation.
@MJZ1977
in the expression fp+(c2/c1)fq , you count only on noise to reduce (c2/c1)fq. In current state of SPSA, p and q are moving and even c and a.
Nothing is moving during a worker batch which currently has size 2*concurrency.
With my second proposition of flip in [-1, -0.1, 0.1, 1], you will have more iterations with fp+(c2/c1)*fq close to fp. My feeling is that it will converge faster at the end.
If I understand you correctly then that might be dangerous. You may end up with c2=c, c1=0.1*c so that c2/c1=10, i.e. 10 times more noise in the fp estimation. Using binary values for ci is strongly encouraged in the SPSA articles.
Nothing is moving during a worker batch which currently has size 2*concurrency.
Yes I know, I am speaking outside batches. Inside batches, it will not be easy to change the flips randomly (for technical reasons as you said before).
With my second proposition of flip in [-1, -0.1, 0.1, 1], you will have more iterations with fp+(c2/c1)*fq close to fp. My feeling is that it will converge faster at the end.
If I understand you correctly then that might be dangerous. You may end up with c2=c, c1=0.1*c so that c2/c1=10. I.e. 10 times more noise in the fp estimation. Using binary values for ci is strongly encouraged in the SPSA articles.
It will not be dangerous because the gradient is re-multiplied by the flip, so by 0.1. At the end we will have sums of: fp + fc, fp - fc, fp/10 + fc, fp/10 - fc, fp + fc/10, fp - fc/10. But OK, it is just a feeling that with random sums this converges faster to fp. Let's forget it for now.
@MJZ1977 It is dangerous because you blow up the intrinsic error in the gradient estimation. Gradient estimation is the theoretical basis for SPSA. However for a discrete choice of flips the algorithm will still converge, albeit slower. What is for example not allowed is to use a uniformly distributed flip in [-1,1]. Then the algorithm will not converge (at least the proof does not work, this is stated in the papers). But your argument (erroneous I think) why it is not dangerous would still apply.
Something else I do not really understand is the following. SPSA seems to be a serial algorithm in a rather essential way. But on Fishtest many workers cooperate. How does this work?
Something else I do not really understand is the following. SPSA seems to be a serial algorithm in a rather essential way. But on Fishtest many workers cooperate. How does this work?
Without reading the code I thought that in each round all workers have the same parameters in order to work serially. If this is not true then it's a greedy implementation that could find good values near the local optimum but can hardly converge to the local optimum itself. Perhaps Joona used SPSA only serially.
Something else I do not really understand is the following. SPSA seems to be a serial algorithm in a rather essential way. But on Fishtest many workers cooperate. How does this work?
For each SPSA run the fishtest server stores the current parameter values. When it receives a mini-batch (currently 2*concurrency games) of results from a worker it adjusts the parameters, see:
and returns the adjusted values to the same worker for use in the next mini-batch.
So it is very simplistic. Increasing the size of mini-batches to the full 200 games would have the effect that workers play more games with an outdated set of parameters. By outdated I mean not adjusted by the most recent worker updates. Also, slower workers will less frequently see the updated results of other, faster workers.
Perhaps Joona used SPSA only serially.
If I remember correctly Joona had a 4-core machine, which would be considered high-end in those days.
Looking at: https://github.com/zamar/spsa/blob/10b2a8266bbf1cdf9011f83d38e0d292767d07eb/spsa.pl#L250
one can see that the shared global parameters were adjusted after every played game pair for each thread.
Edit: that would mean that it was much more serial than the current fishtest implementation, with hundreds of workers each playing a mini-batch of e.g. 16 games.
Edit 2:
If this is not true then it's a greedy implementation that could find good values near the local optimum but that very hardly can converge to the local optimum.
Yes, it is a greedy implementation. Every worker is working on a different round.
@tomtor @ppigazzini Thanks. So it seems difficult to transplant the "pure" SPSA algorithm to the Fishtest setting. Some things that could be tried are
Treat a mini-batch as an iteration (hence a,c would decrease much slower).
Do not change the parameters after every minibatch but only if enough games have been received for these parameters (but what if these games are for outdated parameters?).
@tomtor I'm wondering if we could easily (without extra code to deal with old data) revert these PRs:
(f(clip(p+c))-f(clip(p-c)))/(2c)
@vondele
For me an optimizer gets only very little input namely an initial value and an interval [a,b] which is (a few times) the characteristic length scale of the problem, and the interval where the optimum is expected to be.
I called this "internal normalization of the variable values based upon the starting values and the bounds". This has several advantages:
@ppigazzini this 'internal normalization' would in that case be the right direction IMO.
concerning reverts etc. I think we should base changes on simulator results, for a simulator that is as close as possible to the real case, i.e. starting from something that generates game outcomes given Elo(x)/DrawElo/Bias (i.e. the model of our objective function). Probably also take into account the parallelism issue mentioned.
BTW, the nevergrad framework has standalone benchmarks, and I know also includes SPSA (but I have not tried it yet, since they find TBPSA to be best for noisy objective functions) https://facebookresearch.github.io/nevergrad/benchmarks.html.
I'm wondering if we could easily (without extra code to deal with old data) revert these PRs
@ppigazzini From a quick inspection I conclude that these PRs can be reverted without a need for fixup code to display these tests.
@vondele thanks to your suggestion here https://github.com/glinscott/fishtest/pull/572#issuecomment-611967091 it's no longer possible to use "careful clipping" and "randomized rounding", so we are simply talking about dropping dead code :)
@vdbergh Joona's SPSA README https://github.com/zamar/spsa has some interesting info, take a look. By the way, here is the genesis of Rk (and IMO it makes no sense wrt Gk=ak/ck). Joona's code plots all the variables during the iterations; fishtest asks for R_end and plots a_k ...
2. Modifications for chess engine testing
=========================================
Please see matlab pseudocode by Spall (document, p. 821) as a starting point.
For chess engine testing the following lines in the pseudocode
yplus = loss(thetaplus)
yminus = loss(thetaminus)
ghat = (yplus - yminus) / (2 * ck * delta)
are replaced with:
ghat = match(thetaplus, thetaminus) / (ck * delta)
match() plays a game between two identical engines.
If engine 1 (= thetaplus) wins, it returns 1.
If engine 2 (= thetaminus) wins, it returns -1.
If game is drawn, it returns 0.
For engine testing it makes sense to define a new variable "relative apply factor" Rk:
Rk := ak / ck^2.
Its interpretation is the following:
theta := theta + ak * ghat
<=>
theta := theta + ak * match(thetaplus, thetaminus) / (ck * delta)
<=>
theta := theta + Rk * ck * match(thetaplus, thetaminus) / delta
Instead of trying to find ideal values for ak, we can try to find an ideal value for Rk.
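The equivalence of the two update rules in the README can be verified numerically (made-up example values for a single non-draw game):

```python
def update_with_ak(theta, ak, ck, delta, match):
    # theta := theta + ak * ghat, where ghat = match / (ck * delta)
    return theta + ak * match / (ck * delta)

def update_with_rk(theta, rk, ck, delta, match):
    # theta := theta + Rk * ck * match / delta, with Rk := ak / ck**2
    return theta + rk * ck * match / delta

# Made-up values: one win for thetaplus (match = 1), flip delta = 1.
ak, ck, delta, theta, match = 0.12, 3.0, 1, 50.0, 1
rk = ak / ck ** 2
t1 = update_with_ak(theta, ak, ck, delta, match)
t2 = update_with_rk(theta, rk, ck, delta, match)
assert abs(t1 - t2) < 1e-12  # the two forms give the same theta
```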
What I am saying is to select stochastically a different flip (for every parameter) after every game pair.
Then after batch_size games have been completed you would do the gradient averaging and compute new parameters. This would greatly reduce the intrinsic noise compared to the current implementation (which keeps the flips constant).
@vdbergh recent paper about Parallel SPSA (only for gradient estimation) https://arxiv.org/pdf/1704.00223
The main problem with fishtest is how to deal with slow/stopped workers.
@ppigazzini I also have been thinking about the slow/stopped workers case. If we're averaging over flips or have larger batch sizes to compute, we can do a little redundant calculation, e.g. give these tasks to 1.1 * N workers instead of N, and proceed as soon as we have N results back. There still might need to be a timeout mechanism, since there is no guarantee we get N results back.
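The "issue 1.1*N tasks, proceed on the first N results" idea can be sketched like this, with threads standing in for workers (all names and numbers here are hypothetical, not fishtest code):

```python
import concurrent.futures as cf
import random
import time

def play_batch(task_id):
    """Stand-in for a worker playing one mini-batch; some 'workers' are slow."""
    time.sleep(random.uniform(0.0, 0.05))
    return task_id

n_needed = 10
n_issued = int(1.1 * n_needed) + 1  # small redundancy for stragglers

with cf.ThreadPoolExecutor(max_workers=n_issued) as pool:
    futures = [pool.submit(play_batch, i) for i in range(n_issued)]
    results = []
    # Proceed as soon as n_needed results are back; a timeout is still
    # needed because there is no guarantee that n_needed ever return.
    for fut in cf.as_completed(futures, timeout=30.0):
        results.append(fut.result())
        if len(results) >= n_needed:
            break
```

The redundant stragglers are simply ignored once enough results have arrived, which mirrors the proposal above.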
@vondele I think your proposal is reasonable.
I read it like this.
{'task_alive' : False}
). For this to be not too wasteful the workers would have to call in frequently, e.g. after each completed game pair, as reported by cutechess-cli. Alternatively we may just accept that some workers will send results for outdated parameters, hoping that this does not affect the results too much.
If a worker goes away then its task will migrate to another worker. So maybe a timeout is not necessary?
@ppigazzini Thanks for the reference. It is basically like Vondele's proposal. Except that it contains a smarter way to combine the results for different flips (I still think averaging is good enough, but...).
I am not sure about using the Hessian. There is so much noise in our measurements that we can hardly estimate the gradient. So estimating the Hessian seems daunting. On the other hand I have not looked at the algorithm so maybe they are smart.
If a worker goes away then its task will migrate to another worker. So maybe a timeout is not necessary?
If a worker is terminated cleanly (ctrl-c) it will report the shutdown:
But hard killed workers are cleaned up much later after 30 minutes by the server:
So the scavenge should be made a bit more aggressive/smarter, e.g. cleanup after twice the time between the previous updates.
@ppigazzini I thought a bit more. The method they use for combining the result for different flips is a least square solution to a system of linear equations. It seems quite simple and we could easily use it in Fishtest (as we have numpy anyway).
@tomtor, concerning the scavenging, should we be more careful now with the batching, i.e. a slow (nps) 1 core worker with a batch might actually take some time. Also in light of the longest game I found on fishtest (902 plies, > 30min) https://github.com/protonspring/Stockfish/commit/ac61d37d733b65afb8716cfb011aa5982c1cb7e3#commitcomment-38734921 .
@ppigazzini I thought a bit more. The method they use for combining the result for different flips is a least square solution to a system of linear equations. It seems quite simple and we could easily use it in Fishtest (as we have numpy anyway).
@vdbergh Correct, it's a simple least squares solution (and the dual version), and switching the sign of a single flip per column/row guarantees that the matrix is not singular. Very clever :) With totally random flips you should use ridge regression to regularize the matrix.
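A minimal numpy sketch of the least squares combination (an assumed setup, not the paper's exact construction): each measured central difference satisfies y_i = d_i · g for a locally linear f, where d_i is the i-th flip vector scaled by c, and lstsq recovers the gradient g from the batch.

```python
import numpy as np

# Hypothetical gradient to recover and perturbation size.
true_grad = np.array([1.5, -0.7, 0.3])
c = 0.2

# One base flip row plus rows with a single sign switched, which
# guarantees a non-singular system (cf. the remark above).
flips = np.array([[ 1.0,  1.0,  1.0],
                  [-1.0,  1.0,  1.0],
                  [ 1.0, -1.0,  1.0],
                  [ 1.0,  1.0, -1.0]])
D = c * flips

# Each measurement is the central difference (f(x+d) - f(x-d)) / 2,
# which for (locally) linear f equals d . grad.
y = D @ true_grad

# Least squares estimate of the gradient from the batch of flips.
g_hat, *_ = np.linalg.lstsq(D, y, rcond=None)
```

With noisy y this same call gives the least squares gradient estimate; fully random flips would occasionally make D ill-conditioned, which is where the ridge regularization mentioned above comes in.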
@vondele I think for spsa we should set batch_size=2 (call update_task every game pair; currently batch_size=2*concurrency) and keep games_to_play=2*concurrency like it is now. This should be suitable on average.
If we move to a new set of parameters we have to accept that occasionally some workers will be left behind (like the poor 1-core worker churning away at its 900-move LTC game which will ultimately turn out to be useless). Unless we create a side channel for the server to talk to a worker independently of update_task.
Creating such a side channel would be rather easy in principle. While parsing cutechess-cli output a worker could also call a separate server api every minute or so (or whatever interval is reasonable) to get side channel messages.
@ppigazzini It is actually a good question whether there is any theoretical advantage in choosing the flips randomly (instead of following a predetermined pattern as in FDSA) if we do not move the parameters (except for ease of programming, of course). I'd like to think there is...
@vdbergh This should be the reference paper for SPSA and deterministic perturbation: https://www.researchgate.net/publication/234814464_Two-Timescale_Simultaneous_Perturbation_Stochastic_Approximation_using_Deterministic_Perturbation_Sequences
Another paper (2006) with a smart remark: https://www.researchgate.net/publication/220343758_Universal_parameter_optimisation_in_games_based_on_SPSA
"The above claims are asymptotic by their nature. In this paper, on the other hand, we are primarily interested in the transient behaviour of the algorithms. This is because we do not expect them to converge within the allocated time-frame to a small neighbourhood of a stationary point, due to the inherent complexity of the optimisation task and the long running times that are typical for game-simulations. In the transient regime, according to some authors, FDSA might actually perform better than SPSA."
I will continue my review on recent practical works on stochastic recursive algorithms for optimization after some food :)
@ppigazzini Thanks. I haven't looked at the references yet but I did a quick check for 2 parameters and it seems that the propagation of the measurement noise to the gradient estimate, when choosing a random sequence, is actually slightly worse (but asymptotically for many flips it does not matter). For many parameters (>> batch size) the advantages of a random sequence are clear (there is simply no good deterministic sequence).
of course we're normally using pseudo-random and thus deterministic sequences... for (quasi) monte carlo sampling, people have been using Sobol sequences with some luck. BTW, I doubt we can really do SPSA optimizations on fishtest in very high dimensions... would already be cool if it worked for <10 params reliably.
@ppigazzini I looked at your second reference and it describes exactly the algorithm we are proposing to implement! See §3.1. In particular we should vary the flips as much as we can and the batch size should increase with c!
I looked at your second reference and it describes exactly the algorithm we are proposing to implement! See §3.1. In particular we should vary the flips as much as we can and the batch size should increase with c!
Is it possible to implement this in an agile way, changing the batch size and parameter (flip) calculations but leaving the current simple-minded concurrency implementation as is in a first implementation?
The reason I propose this is that optimal efficiency is not our first concern. The current implementation seems to be unable to tune parameters unless the tune results in a gain of e.g. at least 5 Elo; even playing millions of games will not change that. If that is caused by the current concurrency implementation then we should obviously fix that first, but from the discussion I get the impression that that is not the main issue.
@tomtor If we don't mind that some outdated results are used in the gradient calculation then I don't think one needs a lot of code. Worker changes aren't even necessary. Basically on the server we would not change parameters immediately after every worker update but wait till a larger contingent of results is collected and then do a more refined gradient estimation. I can try to think about it but right now I am extremely busy in my professional life (the issue is not writing the code, which is usually trivial, but validating it so that one is confident that there are no bugs).
Issue opened to collect info about possible future SPSA improvements.
SPSA references
SPSA is a fairly simple algorithm to be used for local optimization (not global optimization). The wiki now has simple documentation explaining the SPSA implementation in fishtest. Here is other documentation:
SPSA implementation problems/improvements
SPSA testing process (aka Time Control)
EDIT_000 this paragraph is outdated, I kept it to avoid disrupting the chain of posts:
that SPSA using LTC or even ULTC has a high signal/noise ratio that helps convergence. A ULTC match is very drawish, so in SPSA one side will win a game pair only if the random parameter increments are somehow aligned with the gradient direction
I suggest this process to optimize the developer time and the framework CPU.
I took an SPSA from fishtest and ran it locally changing only the TC; the results are similar: