noobpwnftw / chessdb

ChessDB

Score not propagating between moves #26

Open CivilizationalAgency opened 1 month ago

CivilizationalAgency commented 1 month ago

Since yesterday I've noticed that scores are no longer accurately propagating back between moves the way they used to (whilst also allowing for some loss/regression to 0 for uncertainty). The score now contradicts itself between moves, and the move ranking is completely wrong, since the score of a move is no longer given by the evaluation of the final move of the best line.

noobpwnftw commented 1 month ago

I've changed the score backup function to a more well-defined weighted averaging scheme. It is expected to propagate leaf scores back to the root more accurately; however, this change can take some time to reach every line.
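
Roughly, the idea is that a position's backed-up evaluation becomes a weighted average over all of its move scores rather than just the score of the single best reply, and each move score is in turn minus the evaluation of the position it leads to, so leaf scores work their way back to the root. A minimal sketch of such a backup (illustrative only, not the actual cdb code; the exponential weighting and the temperature of 10 used below are placeholders):

```python
import math
from dataclasses import dataclass, field

# Toy tree of positions: a node either has children or a stored leaf score
# (in centipawns, from the side to move). Sketch only, not the cdb code.
@dataclass
class Node:
    leaf_score: int = 0
    children: list = field(default_factory=list)

def evaluate(node: Node, temperature: float = 10.0) -> float:
    """Back up leaf scores to the root via a weighted average instead of a plain max."""
    if not node.children:
        return node.leaf_score
    # Score of a move = minus the evaluation of the position it leads to.
    move_scores = [-evaluate(child, temperature) for child in node.children]
    best = max(move_scores)
    # Weights decay exponentially with the gap to the best move.
    weights = [math.exp((s - best) / temperature) for s in move_scores]
    return sum(w * s for w, s in zip(weights, move_scores)) / sum(weights)
```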

Bratish971 commented 1 month ago

> I've changed the score backup function to a more well-defined weighted averaging scheme. It is expected to propagate leaf scores back to the root more accurately; however, this change can take some time to reach every line.

Does that mean that if the score at the end of the main line is 0, the score at the start of the line can be different from 0?

CivilizationalAgency commented 1 month ago

It does not seem to be propagating even between consecutive moves. For instance, the strongest first move in the database at the time of writing has a score of 6, but the strongest responses from Black have a score of -1. Previously the largest discrepancy between consecutive moves was 2 points, if I recall correctly.

robertnurnberg commented 1 month ago

Note that the score of the best move is no longer equal to the "evaluation" of that position on cdb. The evaluation of the position is now based on https://en.wikipedia.org/wiki/Softmax_function. For the position after 1. d4, we get this weighted average:

> python cdbeval.py --san "1. d4"
move:  g8f6, score:   -1, weight: 1.000000
move:  d7d5, score:   -1, weight: 1.000000
move:  e7e6, score:   -3, weight: 0.818731
move:  c7c6, score:   -9, weight: 0.449329
move:  d7d6, score:  -13, weight: 0.301194
move:  g7g6, score:  -17, weight: 0.201897
move:  f7f5, score:  -21, weight: 0.135335
move:  a7a6, score:  -26, weight: 0.082085
move:  c7c5, score:  -29, weight: 0.060810
move:  b8c6, score:  -29, weight: 0.060810
move:  h7h6, score:  -71, weight: 0.000912
move:  a7a5, score:  -72, weight: 0.000825
move:  b8a6, score:  -75, weight: 0.000611
move:  b7b6, score:  -79, weight: 0.000410
move:  g8h6, score: -105, weight: 0.000030
move:  h7h5, score: -114, weight: 0.000012
move:  b7b5, score: -126, weight: 0.000004
move:  e7e5, score: -140, weight: 0.000001
move:  f7f6, score: -143, weight: 0.000001
move:  g7g5, score: -227, weight: 0.000000
Weighted eval:  -5.971027695491816

If you want to test this for other positions as well, you can use this script: cdbeval.py.
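
For reference, the printed weights are consistent with weight = exp((score - best_score) / T) for a temperature T of 10 (e.g. exp(-2/10) ≈ 0.818731), and the weighted eval is then the weight-normalised average of the move scores. A small stand-alone snippet that reproduces the number above from the listed scores (this is not cdbeval.py itself, which queries cdb for the move list):

```python
import math

# Move scores for the position after 1. d4, copied from the output above.
scores = [-1, -1, -3, -9, -13, -17, -21, -26, -29, -29,
          -71, -72, -75, -79, -105, -114, -126, -140, -143, -227]

T = 10.0  # temperature inferred from the printed weights above
best = max(scores)
weights = [math.exp((s - best) / T) for s in scores]
weighted_eval = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
print(weighted_eval)  # ≈ -5.971, matching the "Weighted eval" printed above
```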

CivilizationalAgency commented 1 month ago

> Note that the score of the best move is no longer equal to the "evaluation" of that position on cdb. The evaluation of the position is now based on https://en.wikipedia.org/wiki/Softmax_function. For the position after 1. d4, we get this weighted average: [...]

Thank you for the response! I understand that giving a greater weighting to suboptimal moves makes the score more robust to an incorrectly calculated best response, so scores should be more stable; the trade-off is that the weighting of the strongest move is diluted. Intuitively this would be most useful for moves evaluated to a shallower depth, where there is greater uncertainty, and conversely for moves with greater depth you would just use the best response. I see the temperature parameter in the script; where does it come from? It would make sense to me if it were inversely related to evaluation depth, but that doesn't seem to be the case, since even for the evaluation of the first moves (e.g. 1. d4) the weighting of the best response is diluted. Or is it just that it hasn't been updated yet, as @noobpwnftw mentioned?

robertnurnberg commented 1 month ago

Yes, the script uses the same (global) temperature as cdb. For a more detailed discussion of the pros and cons, you could join the chessdb channel on the Stockfish Discord server: https://discord.com/channels/435943710472011776/1101022188313772083

CivilizationalAgency commented 1 month ago

Has the use of a dynamic temperature as a function of PV depth been considered to restore a more useful score for positions with a high eval depth/low uncertainty?
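
To make the question concrete, here is a purely hypothetical sketch of what a depth-dependent temperature could look like (none of this is implemented in cdb; the function and constants are made up): shrinking the temperature for positions with a deep, settled PV would push the weighted eval back towards the best move's score, while shallow, uncertain positions would keep a broad average.

```python
import math

def weighted_eval(scores, temperature):
    """Softmax-weighted average of move scores, as in the eval discussed above."""
    best = max(scores)
    weights = [math.exp((s - best) / temperature) for s in scores]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def dynamic_temperature(pv_depth, base=10.0, min_temp=1.0):
    """Hypothetical: temperature shrinks as the PV gets deeper / less uncertain."""
    return max(min_temp, base - 0.2 * pv_depth)

scores = [-1, -1, -3, -9, -13]  # truncated example move scores
print(weighted_eval(scores, dynamic_temperature(pv_depth=5)))   # broad average
print(weighted_eval(scores, dynamic_temperature(pv_depth=60)))  # close to the best move's score
```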

noobpwnftw commented 1 month ago

Don't have a way to make estimations of that; I guess given time it'll solve the problem by itself.