SPSA improvements [RFC]

ppigazzini commented 4 years ago

Issue opened to collect info about possible future SPSA improvements.

SPSA references

SPSA is a fairly simple algorithm to be used for local optimization (not global optimization). The wiki has now a simple documentation to explain the SPSA implementation in fishtest Here is other documentation:

SPSA seminal paper cited by @zamar in the forum
SPSA website
wikipedia SPSA page

SPSA implementation problems/improvements

we ask for "c_k_end" and "r_k_end" (final parameters values), but IMO we should ask for for "c" and "r" (starting values) if those are too big the SPSA diverges
we use "r_k = a_k / c_k^2" instead of "r_k = a_k" (I searched unsuccessfully a reference in some SPSA papers)
we set "c_k_end" and "r_k_end" for any single variable to be optimized (the original SPSA uses global values): this makes sense to account for the different sensitivity of the variables, but IMO this should be dealt with an internal normalization of the variables values based upon the starting values and the bounds.
one iteration should be set to a 2 games for match, but our worker code cannot support this, so we set one iteration to a 2*N_cores games for match
compute an averaged SP gradient per iteration to lower the noise
we have experimental code (special rounding and clipping) that nobody use: I'm afraid that it's theoretically correct but not very useful for the rough way we use SPSA
"A" parameter should be computed from the number of games
the worker passes rounded values to cutechess-cli: we should normalize the variables values to have the same resolution for all the variables

SPSA testing process (aka Time Control)

EDIT_000 this paragraph is outdated, I kept it to avoid disrupting the chain of posts:

read the wiki for a SPSA description https://github.com/glinscott/fishtest/wiki/Creating-my-first-test#tuning-with-spsa
the experience on these last years has shown that a very short time control on fishtest is not working:
- with NNUE, workers running on dual CPU have time losses at ultra short time control (USTC)
- that SPSA using or LTC or even ULTC has a high Signal/Noise ratio that helps the convergence. A ULTC match is very drawish, so in SPSA one side will win a pair of games only if the parameters random increments are somehow aligned with the gradient direction

I suggest this process to optimize the developer time and the framework CPU.

first steps: run some SPSAs at Ultra STC (e.g. 1+0.01) to find good "c_k_end", "r_k_end" values and some good variables starting values. This can be done or locally with a recent CPU or in fishtest.
last step: run a final SPSA in fishtest to optimize the variables for a longer TC (e.g. STC, 20+0.2, LTC etc.)

I took a SPSA from fishtest and run it locally changing only the the TC, the results are similar:

20+0.2 - original test on fishtest: https://tests.stockfishchess.org/tests/view/5e2dade6ab2d69d58394fb5e

20+02

2+0.02:

2+002

1+0.01::

1+001

0.5+0.01:

05+001

ppigazzini commented 4 years ago

The reason I propose this, is that optimal efficiency is not our first concern. The current implementation seems to be unable to tune parameters unless it results in a gain of eg at least 5 Elo, even playing millions of games will not change that. If that is caused by the current concurrency implementation then we should obviously fix that first, but from the discussion I get the impression that that is not the main issue.

IMO we have two problems:

greedy concurrency implementation
outdated SPSA vanilla with very slow convergence

We should improve our SPSA:

switch to a serial SPSA implementation using concurrency to make robust gradient estimations (mini-batch, mean gradient or least square estimation)
speedup SPSA convergence with:
- or second order SPSA: we should be able to compute the hessian (I changed my mind after studying https://arxiv.org/pdf/1704.00223 under the sunlight :) according to the algorithm 2 pag 9
- or using a very simple adaptive step like this (ie change alpha according to the variation of the gradient direction) https://dl.acm.org/doi/pdf/10.1145/3242840.3242845

We could start implementing the easier fixes.

vdbergh commented 4 years ago

I should specify that I don't know if there are fundamental issues with the Fishtest implementation of SPSA (we could improve it if the worker were multi-threaded).

The only thing I am a bit worried about is that our gradient measurements are very noisy. Moreover the measurement noise (sigma) is proportional to 1/c so it goes up as the algorithm progresses. This seems worrisome. But perhaps SPSA sort of magically compensates for this.

vdbergh commented 4 years ago

I actually have some Python SPSA simulation code (single threaded, single parameter) lying around.

I can try to do quick check tonight for a quadratic loss function to see if batching seems beneficial.

ppigazzini commented 4 years ago

I actually have some Python SPSA simulation code (single threaded, single parameter) lying around.

I can try to do quick check tonight for a quadratic loss function to see if batching seems beneficial.

Try the adaptive step too :)

vdbergh commented 4 years ago

@ppigazzini The adaptive step size does not apply in the single parameter case. But it seems to be rather adhoc anyway. Using the Hessian could be considered later (once we have larger batches we have better control of the noise).

ppigazzini commented 4 years ago

@ppigazzini The adaptive step size does not apply in the single parameter case. But it seems to be rather adhoc anyway. Using the Hessian could be considered later (once we have larger batches we have better control of the noise).

I missed the single parameter thing :)

By the way I have understood why Joona inserted r_k: the increment for a win is the product of the two parameters inserted in the "New" form (c_end and r_end): product easier that division in mental calculation.

vdbergh commented 4 years ago

@ppigazzini This is exactly what I wrote above :) https://github.com/glinscott/fishtest/issues/535#issuecomment-622543841

ppigazzini commented 4 years ago

@ppigazzini This is exactly what I wrote above :) #535 (comment)

The Joona's "makes some sense" not in the math world, but in the "try another couple of values" world. EDIT but "r_k" makes little sense "wrt gk=ak/ck", no mental calculation at all :)

Anyway, "c_end" and "r_end" should be automatically (and easily) computed from the variable range [min;, max], and any p value (p_start, p_k) should be limited in [min+c_k; max-c_k] in order to makes always a proper gradient estimation.

vdbergh commented 4 years ago

I am browsing Spall's papers and he says that c should be 1-sigma noise. This is something we can check to see if we have too much noise or not.

@ppigazzini What is a typical value for c that is used?

EDIT. Ah it depends on the function params-->Elo but we can try Vondele's example above.

vdbergh commented 4 years ago

1-sigma noise is of course not well defined. If we multiply our loss function by 10 then the noise goes up by 10 but the optimization problem does not change. What does he mean?

ppigazzini commented 4 years ago

@vdbergh "c" drives the gradient estimation and depends on the function sensibility wrt p (I'm playing with c_end=40 on DEV), "c*r" drives the rate of change.

EDIT_000 From Joona's README

6. Practical Guidelines
=======================

To be able to successfully use SPSA for tuning, one needs to determine good values for coefficients ck and Rk.
In most cases, it's impossible to determine ideal values. 
For evaluation related parameters, we've had success with the following values:
Rk for the last iteration (CSV-file, column 5): 0.002
ck for the last iteration (CSV-file, column 4): 4 centipawns (= 8 in Stockfish's internal scale)

However if parameter is very insensitive ELO-wise, one needs to use a larger value for ck.
Also if one has even some sort of idea about the ELO-sensitivity of the parameter and how far from optimum
it might be at maximum, one can use the built-in simulator, to find good values for Rk and ck.
See files 'simul_1var.conf' and 'simul_1var.var' as an example.

vdbergh commented 4 years ago

No I do not understand the 1-sigma recommendation. Sad because it seemed like something quantitative.

ppigazzini commented 4 years ago

@vdbergh the last Spall paper Efficient Implementation of Second-Order Stochastic Approximation Algorithms in High-Dimensional Problems(August 2019) has public Matlab code https://github.com/jingyi-zhu/Fast2SPSA

vdbergh commented 4 years ago

@ppigazzini I guess I am more of a theoretical person. I would prefer to understand our noise situation first.

The convergence proof of SPSA works in the presence of noise. This is because it is suppressed by the learning rate (a) which goes to zero much faster than c. So from that perspective we cannot have too much noise. But then why the (nonsensical) 1-sigma recommendation for c? Presumably because noise has an effect on convergence speed. This I would like to understand.

ppigazzini commented 4 years ago

@vdbergh please let me know the Spall's paper with the c=1-sigma suggestion. In the seminal paper Spall wrote to chose "c at a level approximately equal to the standard deviation of the measurement noise in y(theta) in order to keep the elements of the gradient estimation form getting excessively large in magnitude".

vdbergh commented 4 years ago

@ppigazzini Yes you quoted it correctly. But the suggestion makes no sense. If I put y1(theta)=10*y(theta) then the noise in y1 is 10 times bigger so c should be 10 times bigger as well. But then this means that we use a different finite difference formula for the derivative of y1 than for y. But the optimization problems for y and y1 are exactly equivalent.

Edit: assume my y is expressed in cm and yours in mm.

ppigazzini commented 4 years ago

@vdbergh IMO Spall is suggesting that c should be proportional to y_inv(noise_level) "the std can be estimated by collecting several y(tetha) values at the initial guess theta_0, a precise estimate is not required."

I guess I'm more a practical person, I prefer to follow the later papers that often report errata-corrige indications :)

vdbergh commented 4 years ago

@ppigazzini The suggestion only makes sense if y and theta are expressed in the same units. I guess in our case this would mean that y and theta have to be expressed in evaluation units (pawns). That actually makes a lot of sense in our domain.

EDIT I recall that as a rule of thumb 1 pawn = 100 Elo.

vdbergh commented 4 years ago

@ppigazzini When you said c_end=40 above. How many evaluation units (pawns) does this 40 represent?

vondele commented 4 years ago

I'm trying to follow the exchange. If I have a model like Elo(x) = 1.49643 - 1./2 * ((x - 151.148) / 23.6133) ** 2 (above data derived), I presumably would use c ~23.61 and that would be 0.5 Elo, right ? This c will be different for any parameter, and that's why presumably that should derive from the interval specified (e.g. 1/6 interval), assuming that dev will give bounds that roughly reflect a window in which Elo changes are expected to be e.g. +- 5 Elo.

ppigazzini commented 4 years ago

@ppigazzini The suggestion only makes sense if y and theta are expressed in the same units. I guess in our case this would mean that y and theta have to be expressed in evaluation units (pawns). That actually makes a lot of sense in our domain.

EDIT I recall that as a rule of thumb 1 pawn = 100 Elo.

@vdbergh I wrote c proportional to the inverse of the function c: E[y(theta_0+c)]>=E[y(theta_0)]+std[y(theta_0)] EDIT: more precise this formula c: E[(y(theta_0+c)-y(theta_0-c))/2c]>=std[y(theta_0)] EDIT_1: no, was right the first formula c: E[y(theta_0+c)]>=E[y(theta_0)]+std[y(theta_0)]

This is the only interpretation of Spall's sentence that makes sense to me.

ppigazzini commented 4 years ago

@vdbergh I'm testing with KnightSafeCheck (master value 790) with a range [0,1500] and starting point 100. I set c_end=40 to hope to have a gradient able to drive the convergence with a starting point so far away form the (supposed) optimal value , and r_end=0.1 to have a r_end*c_end=4 in order to move with good speed.

IMO c_end and r_end*c_end depend from the task.

vondele commented 4 years ago

@ppigazzini KnightSafeCheck is worth about 1.5 Elo in total (https://tests.stockfishchess.org/tests/view/5eaadb0109d25e8e505805d1) so that is a tough (but realistic) testcase.

vdbergh commented 4 years ago

@ppigazzini The suggestion only makes sense if y and theta are expressed in the same units. I guess in our case this would mean that y and theta have to be expressed in evaluation units (pawns). That actually makes a lot of sense in our domain. EDIT I recall that as a rule of thumb 1 pawn = 100 Elo.

@vdbergh I wrote c proportional to the inverse of the function c: E[y(theta_0+c)]>=E[y(theta_0)]+std[y(theta_0)]

This is the only interpretation of Spall's sentence that makes sense to me.

I do not really believe this as it would not solve the problem I was pointing out: y1=10*y. I think my interpretation is more likely. theta and y should be expressed in the same units. In our case this would be pawns. KnightSafeChecks is tricky since it not very obvious how the evaluation depends on it so converting it to pawns is not obvious. The formula is very convoluted (one would need to gather some statistics). Traditional terms such as PSQT or piece values can be expressed directly in pawns.

ppigazzini commented 4 years ago

@ppigazzini The suggestion only makes sense if y and theta are expressed in the same units. I guess in our case this would mean that y and theta have to be expressed in evaluation units (pawns). That actually makes a lot of sense in our domain. EDIT I recall that as a rule of thumb 1 pawn = 100 Elo.

@vdbergh I wrote c proportional to the inverse of the function c: E[y(theta_0+c)]>=E[y(theta_0)]+std[y(theta_0)] This is the only interpretation of Spall's sentence that makes sense to me.

I do not really believe this as it would not solve the problem I was pointing out: y1=10*y. I think my interpretation is more likely. theta and y should be expressed in the same units. In our case this would be pawns. KnightSafeChecks is tricky since it not very obvious how the evaluation depends on it so converting it to pawns is not obvious. The formula is very convoluted (one would need to gather some statistics). Traditional terms such as PSQT or piece values can be expressed directly in pawns.

@vdbergh I edited the formula in c: E[(y(theta_0+c)-y(theta_0-c))/2c]>=std[y(theta_0)] This works for y and y1=10*y and find the same c value for both functions.

With my (bad) English: find a "c" that permit to estimate a gradient with a bigger value than the local measurement noise. IMO is a complete non sense to mix the dependent variable with the independent variable (?!?)

vdbergh commented 4 years ago

I'm trying to follow the exchange. If I have a model like Elo(x) = 1.49643 - 1./2 * ((x - 151.148) / 23.6133) ** 2 (above data derived), I presumably would use c ~23.61 and that would be 0.5 Elo, right ? This c will be different for any parameter, and that's why presumably that should derive from the interval specified (e.g. 1/6 interval), assuming that dev will give bounds that roughly reflect a window in which Elo changes are expected to be e.g. +- 5 Elo.

@vondele I do not immediately understand how you arrive at c=23.61. The Elo function is unknown.

I think it now to some extent decidable if we have too much noise or not... Tomorrow...

vdbergh commented 4 years ago

c: E[(y(theta_0+c)-y(theta_0-c))/2c]>=std[y(theta_0)]

@ppigazzini I still don’t understand it. You seem to be imposing a lower bound on the expectation value of the gradient...

I really don’t understand why you have a problem with theta and y being expressed in the same units. It is true for us if we express parameters in pawns. Elo (the value of y) can be converted to pawns as well.

vdbergh commented 4 years ago

In fact Joona in his guidelines expresses c also in pawns. This only makes sense if theta is also expressed in pawns....

ppigazzini commented 4 years ago

c: E[(y(theta_0+c)-y(theta_0-c))/2c]>=std[y(theta_0)]

@ppigazzini I still don’t understand it. You seem to be imposing a lower bound on the expectation value of the gradient...

@vdbergh you are right, I wrote the first time c: E[y(theta_0+c)]>=E[y(theta_0)]+std[y(theta_0)] but we are working with the gradient and c: E[(y(theta_0+c)-y(theta_0-c))/2c]>=std[y(theta_0)] makes no sense...

Why cannot we chose a_0 and c_0 very little?

ppigazzini commented 4 years ago

Why cannot we chose a_0 and c_0 very little?

Because cutechess-cli accepts only integer, we use round(p+c) and round(p-c). f'=(f(round(p+c))-f(round(p-c)))/c

vdbergh commented 4 years ago

@ppigazzini cutechess-cli option handling is not a real problem. The tuning branch could scale the options. Or it could use string representations of floats, combined with string options.

But a genuine issue could be that the internal resolution of SF is not infinite (1/256 pawn). The stochastic rounding is meant as a remedy for this. But I feel that an evaluation change of 1/256 pawn is far too low to be detectable. So maybe the internal resolution is enough. Of course there is also loss of precision in the internal calculations in SF.

vdbergh commented 4 years ago

Ok I finally got somewhere combining Joona's recommendations with Spall's assuming that parameters are expressed in evaluation units (like Joona does). Still need to do simulations :) Output

Joona recommendation : c_end=4 centipawns (=4 elo)
==================================================
N=200000 c_end=4.000000 c=13.723473

Spall recommendation : c=1 sigma measurement noise
==================================================
draw_ratio=0.610000 sigma_elo(1 game)=216.973454
batch_size=249.968430

So for an STC tune of 200000 games a batch_size of 250 seems reasonable (flips should change more frequently of course as these are related to the intrinsic noise). Here is the small Python program

draw_ratio=0.61

# Joona

c_end= 4 # 4 centipawn=4 elo

N=200000   # tune length

c=c_end*(N**0.101)

print("Joona recommendation : c_end=4 centipawns (=4 elo)")
print("==================================================")
print("N=%d c_end=%f c=%f" % (N,c_end,c))

# Spall: c=1-sigma noise

def L(x):
    return 1/(1+10**(-x/400.0))

# conversion factor Elo <-> score

conv_elo_score=(L(0.001)-L(0.000))/0.001

sigma_score=(1-draw_ratio)**0.5/2.0

sigma_elo=sigma_score/conv_elo_score

print("")
print("Spall recommendation : c=1 sigma measurement noise")
print("==================================================")
print("draw_ratio=%f sigma_elo(1 game)=%f" % (draw_ratio,sigma_elo))

batch_size=(sigma_elo/c)**2

print("batch_size=%f" % batch_size)

vondele commented 4 years ago

I would prefer parameters to remain integer honestly. The final eval is in PawnValueEg (213 right now). I would agree that changes of 1 internal unit (1/213) are too small to be really detectable. Furthermore, when tuning devs often write foo p / 128, so that p effectively becomes higher resolution. I must say that not a parameters will be in pawn units, there are expressions that are quadratic in the parameters, or even mixed like score(kingDanger kingDanger / 4096, kingDanger / 16) where kingDanger depends on multiple parameters.

@vdbergh concerning Elo(x) = 1.49643 - 1./2 * ((x - 151.148) / 23.6133) ** 2 leading me to a c=23.61 note that this is just based on the second derivative of Elo(x).. d^2 / dx^2 Elo(x) = 1/(23.61) ^2 so the characteristic scale for x ~ 23.61. IMO, c should be proportional to 1/sqrt(d^2 / dp^2 Elo(p)) (which in general we don't have, and I'm pretty skeptical we can estimate the Hessian from our very noisy measurements). This form of writing things might also be the meaning of 1-sigma by Joona? With the quadratic form written as I did, 23.6133 would be sigma if we look at Exp(Elo(x)) like a Gauss curve.

BTW, the more I think about the limit c-->0 the less I like it for practical work... yes, you need this limit for the algorithm to be exact (i.e. only in that limit will the derivative be correct in the presence of higher order terms), but for all practical purposes we're unable to overcome the noise necessary to measure accurately.

vdbergh commented 4 years ago

@vondele I just posted something about c.

vdbergh commented 4 years ago

@vondele Even if a parameter is not in pawn units one can still convert it to pawn units but it will be statistical. Essentially you want something like E(d eval_white/ d theta) as a conversion factor. It is not important that it is very precise.

vondele commented 4 years ago

yes, I agree with that last statement, but it will be the second derivative need for the conversion c \propto 1/sqrt(d^2 / dp^2 Elo(p)) IMO. (This is like standard preconditioning, with diagonal terms of the Hessian matrix).

vdbergh commented 4 years ago

@vondele I do not think so. A parameter acts on SF through its influence on the evaluation function. So a natural unit for a parameter is expressible in the first derivative of the evaluation function (for one side) with respect to it.

Of course the sensitivity of a parameter is related to the second derivative of the (unknown) function params-->Elo.

vondele commented 4 years ago

At the minimum the first derivative is always zero ?

vondele commented 4 years ago

wait evaluation function..., that's different, could work.

vdbergh commented 4 years ago

@vondele No. You are confusing the evaluation function with the loss function (I am probably not explaining it clearly enough).

The evaluation function is the one built into SF. The loss function (params->Elo) is unknown.

The first derivative of the loss function is zero at the maximum. But not the first derivative of the evaluation function.

vondele commented 4 years ago

yes, I agree. I'm talking about the second derivative of the loss function that is what is needed (and obviously that is unknown and thus harder to get, for the one example with the data above we actually have it Elo(x) = 1.49643 - 1./2 * ((x - 151.148) / 23.6133) ** 2 based on the measurements, which is why I keep referring to it.

If we go in detail it also shows how difficult it is to use the first derivative, this parameter (coef1) goes into the eval function as:

    kingDanger +=        kingAttackersCount[Them] * kingAttackersWeight[Them]
                 + coef1 * popcount(kingRing[Us] & weak)
                 + 148 * popcount(unsafeChecks)
                 +  98 * popcount(pos.blockers_for_king(Us))
                 +  69 * kingAttacksCount[Them]
                 +   3 * kingFlankAttack * kingFlankAttack / 8
                 +       mg_value(mobility[Them] - mobility[Us])
                 - 873 * !pos.count<QUEEN>(Them)
                 - 100 * bool(attackedBy[Us][KNIGHT] & attackedBy[Us][KING])
                 -   6 * mg_value(score) / 8
                 -   4 * kingFlankDefense
                 +  37;
    // Transform the kingDanger units into a Score, and subtract it from the evaluation
    if (kingDanger > 100)
        score -= make_score(kingDanger * kingDanger / 4096, kingDanger / 16);

so via kingDanger linear in the endgame, quadratic in the midgame. I would not really know (from this code) what the c should be when optimizing. From the (unknown) loss function, it is quite direct.

vondele commented 4 years ago

Nevertheless, I think we might have a statement we can both agree on. If we're playing about 200k games, the c_end should be roughly such that one expect it to cause approx. 4 Elo change in results. How one gets to that c_end value is unclear, either by an unknown loss function, or some assumption that a change in a parameter worth 4cp is roughly 4 Elo. Agree?

vdbergh commented 4 years ago

@vondele I do not think c_end refers to the amount of elo that should be detectable. It is just about the finite difference approximation (to get rid of the third derivatives in the loss function). But there is a connection through the measurement noise as I wrote in my post above.

Concerning conversion to pawns. I think the correct scale for 1 parameter is

sqrt(E((d eval/d theta)**2))

(starting from the observation that a parameter acts on SF by changing the evaluation function).

This can be very easily determined statistically. Just let sf search and compute regularly

(eval(theta+epsilon)-eval(theta-epsilon))/(2 epsilon)

and then compute the root mean square of this.

Dreaming:

For multiple parameters one should compute (statistically) the matrix

E( (d eval(theta_i)/d theta_i) x (d eval(theta_j)/d theta_j))   (*)

and then diagonalize it. But I suspect that (*) usually will be close to diagonal (otherwise there is interaction between the parameters, which is bad).

EDITED: changed ii to ij in (*)

vondele commented 4 years ago

@vdbergh c_end is indeed strongly connected to the measurement noise, in the absence of measurement noise, clearly c_end -> 0. But we don't care about the absence of measurement noise, right? That's why I said (based on your previous post) for about 200k games (and it's implied measurement noise), the change should be such that ~4 Elo is observed.

I agree your method of looking at the change of eval statistically, is actually rather practical, and interesting to try and set the scale. I'll do it for the coef1 case in kd and see what we get.

Meanwhile, I started a couple of measurements for the 4cp ~ 4Elo hypothesis: https://tests.stockfishchess.org/tests/view/5eafbcd8a460c0e4b8b9ba33 https://tests.stockfishchess.org/tests/view/5eafbccba460c0e4b8b9ba30 https://tests.stockfishchess.org/tests/view/5eafbcbfa460c0e4b8b9ba2e

Edit: none of the tests showed a large Elo impact 0.61 (Pawns), -0.05 (Knight), -0.41 (Queen) all +- 2.1 Elo error.

vdbergh commented 4 years ago

@vondele Good luck (but I now think that the SPSA implementation in Fishtest might be fundamentally broken because too small batch sizes).

I also realized that my method for defining a natural scale for a parameter does not directly correspond to intuition. E.g. if we try to apply it to piece values then the expression I gave would only receive contributions from materially unbalanced positions and not from all positions. So the scale would be less than 1/1 even though piece values are really expressed in pawns :(

One might also argue that intuition is not important but I think Joona really meant 4 centipawns in an intuitive sense.

EDIT: I wonder if it would make sense to take the average only over positions where the finite difference expression is non-zero. This would give something that corresponds to intuition but somehow if feels wrong...

ppigazzini commented 4 years ago

I also realized that my method for defining a natural scale for a parameter does not directly correspond to intuition. E.g. if we try to apply it to piece values then the expression I gave would only receive contributions from materially unbalanced positions and not from all positions. So the scale would be less than 1/1 even though piece values are really expressed in pawns :(

@vdbergh It's should easy to check if your natural scale is working at least for the pieces values, just compute the scales for some pieces and control if the relative values are respected. In that case we can easily restore the pawn scale.

Perhaps the paper about game parameters optimization has some suggestions: https://www.researchgate.net/publication/220343758_Universal_parameter_optimisation_in_games_based_on_SPSA

vdbergh commented 4 years ago

@ppigazzini Thanks for the paper. I'll have a look at.

Concerning scales. The intuition would be that all the scales would be 1/1 (since piece values are expressed in pawns, and we want to arrive at a scale in pawns). However I suspect that the scales for the different pieces as measured by my method will be different from each other, simply because e.g. the average number of positions with queen imbalances will be different from the average number of positions with knight imbalances.

I still suspect that my scales will be best for SPSA, but one has to be careful to connect them to Joona's recommendation.

vondele commented 4 years ago

@ppigazzini the (Elo) calculations are running for 3 piece values...

meanwhile I have some numbers for the coef1 case. What I compute is sqrt(average( (100 * d Eval / d coef1) ** 2 )) (the precise code is below), this is in 'internal units' and scaled by a factor 100, to enable calculations in ints.

The result is ~49 in that case. The finite difference approximation is very accurate for this term (i.e. eval is roughly quadratic in coef1, as it should be). But the averaging is a little sensitive on what input configurations are used / what depth is being used. For example:

average as a function of depth on default bench

bench default, depth test
13 1739.78 -> sqrt = 41.7
20 2143.78 -> sqrt = 46.29
26 2398.81 -> sqrt = 49
30 2881.17 -> sqrt = 54

average as a function of delta on default bench depth 26

10 2427.49 -> 49.269
23 2398.81 -> 48.977
46 2395.3  -> 48.941

code diff:

diff --git a/src/evaluate.cpp b/src/evaluate.cpp
index 67e059210..3f411842b 100644
--- a/src/evaluate.cpp
+++ b/src/evaluate.cpp
@@ -153,6 +153,8 @@ namespace {

 #undef S

+  int coef1 = 185;
+
   // Evaluation class computes and stores attacks tables and other working data
   template<Tracing T>
   class Evaluation {
@@ -448,7 +450,7 @@ namespace {
     int kingFlankDefense = popcount(b3);

     kingDanger +=        kingAttackersCount[Them] * kingAttackersWeight[Them]
-                 + 185 * popcount(kingRing[Us] & weak)
+                 + coef1 * popcount(kingRing[Us] & weak)
                  + 148 * popcount(unsafeChecks)
                  +  98 * popcount(pos.blockers_for_king(Us))
                  +  69 * kingAttacksCount[Them]
@@ -862,7 +864,18 @@ namespace {
 /// evaluation of the position from the point of view of the side to move.

 Value Eval::evaluate(const Position& pos) {
-  return Evaluation<NO_TRACE>(pos).value();
+  int x_orig = coef1;
+  int delta = 10;
+  coef1 = x_orig + delta;
+  Value eval_p1 = Evaluation<NO_TRACE>(pos).value();
+  coef1 = x_orig - delta;
+  Value eval_m1 = Evaluation<NO_TRACE>(pos).value();
+  coef1 = x_orig;
+  Value eval    = Evaluation<NO_TRACE>(pos).value();
+  int x = (eval_p1 - eval_m1) * 100 / (2 * delta);
+  // dbg_mean_of(x);
+  dbg_mean_of(x*x);
+  return eval;
 }

vdbergh commented 4 years ago

@vondele Thanks! This is interesting. From the examples you give I would say that the value is not particularly sensitive to the testing conditions (41.7-54). After all, for now, it would only be a method to define a "reasonable" scale for a parameter. If I understand correctly then one unit of theta would correspond to ~5 SF internal units.

ppigazzini commented 4 years ago

@ppigazzini the (Elo) calculations are running for 3 piece values...

@vondele not easy to keep the pace :)

Dreaming For multiple parameters one should compute (statistically) the matrix
E( (d eval(theta_i)/d theta_i) x (d eval(theta_j)/d theta_j))   (*)
and then diagonalize it. But I suspect that (*) usually will be close to diagonal (otherwise there is interaction between the parameters, which is bad).

@vdbergh Dreaming on :) If you mean that it's bad for SPSA (because we want to optimize using orthogonal directions) it's seems applicable a PCA:

computing the SVD of the symmetric matrix to get the orthogonal "U" directions and the singular value "s" to be used to normalize the scale of the directions
select only the subset of not correlated directions (e.g. take the directions that add up the 99% of energy of the data, or drop directions with abs(s_i)/abs(s_0) < epsilon etc.)

PS: that is the paper "that describes exactly the algorithm we are proposing to implement"

official-stockfish / fishtest