Winning evaluation with tablebases in cursed win

dav1312 commented 7 months ago

Describe the issue

Winning evaluation in a cursed win even when using tablebases

https://lichess.org/analysis/standard/8/8/6k1/3B4/3K4/4N3/8/8_w_-_-_54_106

Expected behavior

An evaluation (ideally at depth 1?) of 0.00

Steps to reproduce

Stockfish dev-20240413-c55ae376 by the Stockfish developers (see AUTHORS file)
setoption name SyzygyPath value tb345
info string Found 145 tablebases
position fen 8/8/6k1/3B4/3K4/4N3/8/8 w - - 54 106
go infinite
info string NNUE evaluation using nn-ae6a388e4a1a.nnue
info string NNUE evaluation using nn-baff1ede1f90.nnue
info depth 1 seldepth 2 multipv 1 score cp 20000 nodes 1 nps 333 hashfull 0 tbhits 26 time 3 pv d4e5
info depth 2 seldepth 2 multipv 1 score cp 20000 nodes 2 nps 666 hashfull 0 tbhits 26 time 3 pv d4e5
info depth 3 seldepth 2 multipv 1 score cp 20000 nodes 3 nps 1000 hashfull 0 tbhits 26 time 3 pv d4e5
info depth 4 seldepth 2 multipv 1 score cp 20000 nodes 4 nps 1333 hashfull 0 tbhits 26 time 3 pv d4e5
info depth 5 seldepth 3 multipv 1 score cp 20000 nodes 11 nps 3666 hashfull 0 tbhits 26 time 3 pv d4e5
info depth 6 seldepth 4 multipv 1 score cp 20000 nodes 41 nps 10250 hashfull 0 tbhits 26 time 4 pv d4e5 g6g7 e3g2
info depth 7 seldepth 7 multipv 1 score cp 20000 nodes 194 nps 48500 hashfull 0 tbhits 26 time 4 pv d4e5 g6g7 e3f5 g7f8
info depth 8 seldepth 7 multipv 1 score cp 20000 nodes 442 nps 110500 hashfull 0 tbhits 26 time 4 pv d4e5 g6g7 e3g2 g7f8 d5c4
info depth 9 seldepth 8 multipv 1 score cp 20000 nodes 1119 nps 223800 hashfull 0 tbhits 26 time 5 pv d4e5 g6h6 e5e4 h6g7 e4d3 g7f8
info depth 10 seldepth 9 multipv 1 score cp 20000 nodes 1518 nps 303600 hashfull 0 tbhits 26 time 5 pv d4e5 g6g7 e5f5 g7h8

Operating system

All

Stockfish version

master

Disservin commented 7 months ago

(Apparently present since sf6, according to discord)

jhellis3 commented 7 months ago

I translated the code to pencil & paper math, and AFAICS it all checks out. It is simply getting the wrong result from the DTZ probe. Why that is.... IDK.

dav1312 commented 7 months ago

I can test later but there are some alternative dtz tablebases called "nr" which seemed to fix the problem for crafty and pere. It makes more sense that the issue was not in Stockfish in the first place. https://tablebase.lichess.ovh/tables/standard/

dav1312 commented 7 months ago

Using dtz_nr tablebases

Stockfish dev-20240413-c55ae376 by the Stockfish developers (see AUTHORS file)
setoption name SyzygyPath value tb345;tb345_dtz_nr
info string Found 145 tablebases
position fen 8/8/6k1/3B4/3K4/4N3/8/8 w - - 54 106
go depth 10
info string NNUE evaluation using nn-ae6a388e4a1a.nnue
info string NNUE evaluation using nn-baff1ede1f90.nnue
info depth 1 seldepth 2 multipv 1 score cp 25 nodes 1 nps 333 hashfull 0 tbhits 26 time 3 pv d4e5
info depth 2 seldepth 2 multipv 1 score cp 25 nodes 2 nps 666 hashfull 0 tbhits 26 time 3 pv d4e5
info depth 3 seldepth 2 multipv 1 score cp 25 nodes 3 nps 750 hashfull 0 tbhits 26 time 4 pv d4e5
info depth 4 seldepth 2 multipv 1 score cp 25 nodes 4 nps 1000 hashfull 0 tbhits 26 time 4 pv d4e5
info depth 5 seldepth 3 multipv 1 score cp 25 nodes 11 nps 2750 hashfull 0 tbhits 26 time 4 pv d4e5
info depth 6 seldepth 4 multipv 1 score cp 25 nodes 41 nps 10250 hashfull 0 tbhits 26 time 4 pv d4e5 g6g7 e3g2
info depth 7 seldepth 7 multipv 1 score cp 25 nodes 194 nps 38800 hashfull 0 tbhits 26 time 5 pv d4e5 g6g7 e3f5 g7f8
info depth 8 seldepth 7 multipv 1 score cp 25 nodes 442 nps 88400 hashfull 0 tbhits 26 time 5 pv d4e5 g6g7 e3g2 g7f8 d5c4
info depth 9 seldepth 8 multipv 1 score cp 25 nodes 1119 nps 186500 hashfull 0 tbhits 26 time 6 pv d4e5 g6h6 e5e4 h6g7 e4d3 g7f8
info depth 10 seldepth 9 multipv 1 score cp 25 nodes 1518 nps 253000 hashfull 0 tbhits 26 time 6 pv d4e5 g6g7 e5f5 g7h8
bestmove d4e5 ponder g6g7

Using dtz tablebases

setoption name SyzygyPath value tb345;tb345_dtz
info string Found 145 tablebases
position fen 8/8/6k1/3B4/3K4/4N3/8/8 w - - 54 106
go depth 10
info string NNUE evaluation using nn-ae6a388e4a1a.nnue
info string NNUE evaluation using nn-baff1ede1f90.nnue
info depth 1 seldepth 2 multipv 1 score cp 20000 nodes 1 nps 142 hashfull 0 tbhits 26 time 7 pv d4e5
info depth 2 seldepth 2 multipv 1 score cp 20000 nodes 2 nps 250 hashfull 0 tbhits 26 time 8 pv d4e5
info depth 3 seldepth 2 multipv 1 score cp 20000 nodes 3 nps 375 hashfull 0 tbhits 26 time 8 pv d4e5
info depth 4 seldepth 2 multipv 1 score cp 20000 nodes 4 nps 500 hashfull 0 tbhits 26 time 8 pv d4e5
info depth 5 seldepth 3 multipv 1 score cp 20000 nodes 11 nps 1375 hashfull 0 tbhits 26 time 8 pv d4e5
info depth 6 seldepth 4 multipv 1 score cp 20000 nodes 41 nps 4555 hashfull 0 tbhits 26 time 9 pv d4e5 g6g7 e3g2
info depth 7 seldepth 7 multipv 1 score cp 20000 nodes 194 nps 19400 hashfull 0 tbhits 26 time 10 pv d4e5 g6g7 e3f5 g7f8
info depth 8 seldepth 7 multipv 1 score cp 20000 nodes 442 nps 40181 hashfull 0 tbhits 26 time 11 pv d4e5 g6g7 e3g2 g7f8 d5c4
info depth 9 seldepth 8 multipv 1 score cp 20000 nodes 1119 nps 93250 hashfull 0 tbhits 26 time 12 pv d4e5 g6h6 e5e4 h6g7 e4d3 g7f8
info depth 10 seldepth 9 multipv 1 score cp 20000 nodes 1518 nps 126500 hashfull 0 tbhits 26 time 12 pv d4e5 g6g7 e5f5 g7h8
bestmove d4e5 ponder g6g7

Disservin commented 7 months ago

> I can test later but there are some alternative dtz tablebases called "nr" which seemed to fix the problem for crafty and pere. It makes more sense that the issue was not in Stockfish in the first place. https://tablebase.lichess.ovh/tables/standard/

Mh.. is there some information on how those were generated/fixed/edited? Maybe niklasf knows something?

whelanh commented 7 months ago

As I understand it, dtz tables don't assume a 50 move rule, but dtr does. So neither is wrong, just different assumptions. https://chess.stackexchange.com/questions/28520/what-do-dtm-dtz-dtc-dtr-dtz50-and-dtzr-mean

dav1312 commented 7 months ago

> > I can test later but there are some alternative dtz tablebases called "nr" which seemed to fix the problem for crafty and pere. It makes more sense that the issue was not in Stockfish in the first place. tablebase.lichess.ovh/tables/standard
>
> Mh.. is there some information on how those were generated/fixed/edited? Maybe niklasf knows something?

@niklasf said on Discord: some normal dtz tables store rounded values (described on https://syzygy-tables.info/metrics). it's a bit confusing, so i generated no rounding tables, at least up to 6 pieces

robertnurnberg commented 7 months ago

So my understanding of the situation is that sf does not report correct tb results in the affected positions when the user provides the widely available standard 3-6men syzygy files.

Only if the tb files were generated with this patch does sf work correctly.

Is that correct?

If that is so, should we warn the user on TB load if they use outdated TB files (if that is possible, e.g. by probing the position from this issue)?

And what is the situation for the 7men files?

niklasf commented 7 months ago

To clarify: Stockfish works correctly in game play. That is, after the capture that crosses into tablebase territory, moves will be ranked correctly and Stockfish will achieve the best possible outcome. That's what the tablebases are designed for - they are not broken and they don't need to be fixed.

The issue occurs only in analysis, when setting up arbitrary positions that do not arise from optimal play following a capture. In that case there are really 7 possible probe results:

Loss
Loss or blessed loss
Blessed loss
Draw
Cursed win
Cursed win or win
Win

Avoiding this, by patching the table generator, produces larger tables for no playing strength gain. Handling the ambiguous results will require more code for no playing strength gain.

For the analysis board on Lichess, I did both: 6 piece tables with no rounding, and (because 7 piece tables are too much effort to regenerate) a user interface that correctly displays ambiguous results for 7 piece tables. For Stockfish, I am not sure what to do, if anything.
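The seven probe results listed above can be sketched as a small classifier. This is only a minimal sketch, not Stockfish or Syzygy code: the names `ProbeResult` and `classify` are hypothetical, and it assumes that a rounded DTZ value n may stand for a true distance of n or n + 1 plies, and that zeroing on half-move clock 100 still counts as a win.

```cpp
// Hypothetical names; not part of Stockfish or the Syzygy probing code.
enum class ProbeResult { Loss, MaybeLoss, BlessedLoss, Draw, CursedWin, MaybeWin, Win };

// dtz:   DTZ as read from a table that may store rounded values
//        (> 0: side to move wins, < 0: side to move loses, 0: draw)
// cnt50: current half-move clock
// Assumption: the true distance is |dtz| or |dtz| + 1 plies.
constexpr ProbeResult classify(int dtz, int cnt50) {
    return dtz == 0 ? ProbeResult::Draw
         : dtz > 0  ? (dtz + cnt50 < 100    ? ProbeResult::Win
                       : dtz + cnt50 == 100 ? ProbeResult::MaybeWin   // win or cursed win
                                            : ProbeResult::CursedWin)
                    : (-dtz + cnt50 < 100    ? ProbeResult::Loss
                       : -dtz + cnt50 == 100 ? ProbeResult::MaybeLoss // loss or blessed loss
                                             : ProbeResult::BlessedLoss);
}

// Worked examples with half-move clock 54 (illustrative DTZ values only).
static_assert(classify(45, 54) == ProbeResult::Win, "99 plies even if rounded up to 100");
static_assert(classify(46, 54) == ProbeResult::MaybeWin, "true total is 100 or 101");
static_assert(classify(47, 54) == ProbeResult::CursedWin, "at least 101 plies");
```

Under these assumptions only the case where the stored DTZ plus the half-move clock is exactly 100 is ambiguous; with unrounded tables that case would resolve to a plain win.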

peregrineshahin commented 7 months ago

It's very hard for me to consider them not broken. Moves will be ranked correctly at this point, but we stopped caring about optimal play or guiding SF to the correct win long ago, as SF is more or less expected to find these wins by itself. For me, only the eval matters. Also, it's pretty easy to notice that this must be a bug/laziness/some oversight turned into a feature..

robertnurnberg commented 7 months ago

I believe we should re-open this issue. (At least to highlight to end-users that it exists, and to remind ourselves.)

At present stockfish returns incorrect analysis results for some FENs with nonzero half move counters when it is supplied with the default (and probably most widely used) 6men syzygy EGTBs, or with the only available format for 7men syzygy EGTBs.

vondele commented 7 months ago

I think we can reopen, though niklasf probably answered this quite clearly.

dubslow commented 7 months ago

> Also, it's pretty easy to notice that this must be a bug/laziness/some oversight turned into a feature..

This is clearly false, as niklasf repeatedly stated this is specifically and exactly aimed at improving compression and reducing overall TB size. Syzygy was specifically designed to be the smallest TB, and this compression feature is a noticeable step towards this goal (to the tune of several percent, or so I've heard).

I do concur with re-opening in the short run. Although bestmove selection is unaffected, it is indeed deeply confusing for users to see a winning evaluation for a drawn position.

The best idea I've seen is that we should adjust the evaluation of such ambiguous probes to be less than a proven win. Maybe 100 instead of 200 or something?
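A minimal sketch of that idea, assuming the probe has already been sorted into one of three winning-side buckets. The names `WinKind` and `score_cp` are hypothetical, and the middle value is just the "100 instead of 200" suggestion expressed in centipawns; the other two values are the scores that appear in the logs earlier in this thread.

```cpp
// Hypothetical names; not Stockfish code.
enum class WinKind { CursedWin, MaybeWin, Win };

constexpr int score_cp(WinKind k) {
    return k == WinKind::Win      ? 20000  // proven TB win ("score cp 20000" in the logs)
         : k == WinKind::MaybeWin ? 10000  // assumption: ambiguous probe shown as 100.00
                                  : 25;    // cursed win ("score cp 25" with the dtz_nr tables)
}

static_assert(score_cp(WinKind::MaybeWin) < score_cp(WinKind::Win),
              "an ambiguous probe scores below a proven win");
```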

peregrineshahin commented 7 months ago

It's impossible to imagine that, design-wise, you need to mess this requirement up while you do everything else right. We could produce two, three, or four more bugs like this so that the so-called Syzygy efficiency is optimized even more.

> This is clearly false, as niklasf repeatedly stated this is specifically and exactly aimed at improving compression and reducing overall TB size. Syzygy was specifically designed to be the smallest TB, and this compression feature is a noticeable step towards this goal (to the tune of several percent, or so I've heard).

Disservin commented 7 months ago

What is actually returned after probing? I.e. a WDLCursedWin or a WDLWin?

Torom commented 7 months ago

If we continue to play the game from the original FEN optimally, we reach 8/8/8/8/6B1/4N3/5K1k/8 w - - 98 128. Giving Stockfish this position we get: info depth 245 seldepth 3 multipv 1 score cp 20000 nodes 490 nps 163333 hashfull 0 tbhits 20 time 3 pv e3f1 h2h1. So we output a two move PV that ends in a 50-move draw, but still output 200.00.

AndyGrant commented 6 months ago

> To clarify: Stockfish works correctly in game play. That is, after the capture that crosses into tablebase territory, moves will be ranked correctly and Stockfish will achieve the best possible outcome. That's what the tablebases are designed for - they are not broken and they don't need to be fixed.

For even more clarity @niklasf ... You are saying that Stockfish avoids this. This is because Stockfish will rank the root moves using this code, and refuse to play any root move with a rank worse than optimal?

        // Better moves are ranked higher. Certain wins are ranked equally.
        // Losing moves are ranked equally unless a 50-move draw is in sight.
        int r    = dtz > 0 ? (dtz + cnt50 <= 99 && !rep ? MAX_DTZ : MAX_DTZ - (dtz + cnt50))
                 : dtz < 0 ? (-dtz * 2 + cnt50 < 100 ? -MAX_DTZ : -MAX_DTZ + (-dtz + cnt50))
                           : 0;
        m.tbRank = r;

The important part here is the <= 99 condition, which is intentionally not <= 100, in order to avoid the off-by-one rounding issue (at least when delivering the mate?). Also, is the !rep condition there for a similar purpose, to avoid accidentally letting a WIN become a CURSED WIN?

Restated: If a repetition has been made, then Stockfish will only play moves with equal-DTZ in winning positions. If a repetition has not been made, then Stockfish will play any move which wins -before- the 100th, avoiding moves which zero on the 100th ply?

I don't explicitly see how this guarantees protection from the stated issue in ALL cases. Can a case exist where all moves appear to have the same DTZ=99 or DTZ=100, and then despite best intentions from above, you end up in the same ambiguous situation? IE all moves fall into the "WIN or CURSED WIN" ambiguous bucket?

Ref: https://github.com/official-stockfish/Stockfish/commit/108f0da4d7f993732aa2e854b8f3fa8ca6d3b46c
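To make the effect of the `<= 99` and `!rep` conditions concrete, the ranking expression quoted above can be wrapped in a standalone function and evaluated for a few inputs. This is a sketch for experimentation only: `tb_rank` is a hypothetical wrapper name, and the value used for `MAX_DTZ` is an assumption (any constant well above 100 gives the same ordering).

```cpp
constexpr int MAX_DTZ = 1 << 18;  // assumed value, only needs to be well above 100

// Same expression as in the quoted snippet, taking dtz, cnt50 and rep as parameters.
constexpr int tb_rank(int dtz, int cnt50, bool rep) {
    return dtz > 0 ? (dtz + cnt50 <= 99 && !rep ? MAX_DTZ : MAX_DTZ - (dtz + cnt50))
         : dtz < 0 ? (-dtz * 2 + cnt50 < 100 ? -MAX_DTZ : -MAX_DTZ + (-dtz + cnt50))
                   : 0;
}

// Comfortable wins are ranked equally, so the search may pick freely among them.
static_assert(tb_rank(10, 0, false) == tb_rank(40, 0, false), "certain wins tie");
// At dtz + cnt50 == 100 the move is no longer a "certain win" and is ranked by distance.
static_assert(tb_rank(46, 54, false) == MAX_DTZ - 100, "ranked by dtz + cnt50");
static_assert(tb_rank(45, 54, false) > tb_rank(46, 54, false), "shorter DTZ preferred");
// After a repetition, every winning move is ranked by distance.
static_assert(tb_rank(45, 54, true) == MAX_DTZ - 99, "rep disables the equal ranking");
```

With a half-move clock of 54, a move with DTZ 45 still gets the top rank, while DTZ 46 drops below it. That gap is the one-ply safety margin the `<= 99` threshold leaves for rounding.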

niklasf commented 6 months ago

~~@AndyGrant I think you found a bug in Stockfish's ranking.~~

The intended effect is to give Stockfish some freedom, but reliably switch to nearly (i.e. 1 ply may be squandered due to rounding) DTZ-optimal play before it's too late.

This works when we approach the threshold <= 99 and eventually switch. Note that zeroing or mating on half-move clock 100 is still a win.

~~But~~ And there are indeed endgames that are so tight that immediate DTZ-optimal play is required and even losing 1 ply to rounding would change the outcome. Rounding is turned off for these endgames, so we've got precise DTZ values like 99 and 100 on our hands. ~~But Stockfish does not distinguish those by rank.~~

Edit: With regard to rep: After switching to optimal play, we may have to repeat a position from Stockfish's previous failed conversion attempts one more time. That's safe if there's only ever been one repetition, so we better switch to optimal play before allowing a second repetition. ~~But then ranking all moves equally, regardless of DTZ seems wrong.~~

Edit 2: I misread the implementation and it does not have the problem that I think I saw.

AndyGrant commented 6 months ago

Thank you for your response, Niklas. I'll take your word that the below code is intended for this purpose. Seems sensible -- and explains why your NR tables hijacked the line above to get the desired result.

  if (total_stats_w[DRAW_RULE] || total_stats_b[STAT_MATE - DRAW_RULE])
    ply_accurate_w = 1;

Not sure what this all will do to benefit Stockfish, since it seems Stockfish does it all correctly -- but this will fix some things in Ethereal and Torch, and inevitably many more engines in the openbench space as a result of the Ethereal changes.

vondele commented 4 months ago

@niklasf while working on #5414 I think I realized how this could be fixed without TB regeneration. If the TB win is on the edge from the DTZ / r50c, walk the TB moves until mate or draw, and score accordingly.

niklasf commented 4 months ago

That works, but I think it would have to be min-max rather than a linear walk. Among moves with equal DTZ, there may be some that were precise and some that have been rounded.

vondele commented 4 months ago

OK, I see, added complication.
