glinscott / leela-chess

**MOVED TO https://github.com/LeelaChessZero/leela-chess** A chess adaptation of GCP's Leela Zero
http://lczero.org
GNU General Public License v3.0

Explore every move twice before normal training self-play search #698

Open Mardak opened 6 years ago

Mardak commented 6 years ago

This is a bit of a departure from AZ's Dirichlet noise, but it's still "zero": it adds another kind of "noise" in the hope of better directing self-play's search. The idea is to get a better eval for each root move with a quick check of how, say, White would judge the board after the likeliest reply to each of its candidate moves, i.e., 2 visits per move.

In the first CCLS SCTR vs id359 game (lczero playing Black), lczero, when evaluating the position from White SCTR's side, would not consider the winning move Rxh4: https://clips.twitch.tv/NimbleLazyNewtPRChase

./lczero -w id359
position startpos moves d2d4 d7d5 c1f4 g7g6 e2e3 g8f6 c2c4 c7c5 d4c5 f8g7 b1c3 d8a5 c4d5 f6d5 d1d5 g7c3 b2c3 a5c3 e1e2 c3a1 f4e5 a1b1 e5h8 c8e6 d5d3 b1a2 e2f3 f7f6 h8g7 b8d7 f3g3 a8c8 c5c6 c8c6 d3d4 c6d6 d4b4 d6b6 b4h4 d7c5 h2h3 b6b2 g1e2 a2d5 g3h2 d5e5 e2g3 h7h5 h4d4 e5d4 e3d4 c5b3 g7h6 h5h4 g3e4 g6g5 f1d3 b3d4 h1a1 a7a6 e4c5 b2f2 d3e4 e6f5 e4b7 f2c2 a1a4 d4e2 c5e4 f5e4 b7e4 c2c1 e4d3 e2f4 d3a6 f4h5
go nodes 8000
info string   Rf4 ->       0   (V: 61.48%) (N:  0.17%) PV: Rf4 
info string  Rxh4 ->       0   (V: 61.48%) (N:  0.33%) PV: Rxh4 
info string   Bf1 ->       1   (V: 12.60%) (N:  0.72%) PV: Bf1 Rxf1
info string   Bf8 ->       1   (V: 17.43%) (N:  0.39%) PV: Bf8 Kxf8
info string   Bc8 ->       1   (V: 17.88%) (N:  0.36%) PV: Bc8 Rxc8
info string   Ra1 ->       1   (V: 17.90%) (N:  0.50%) PV: Ra1 Rxa1
info string   Bg7 ->       1   (V: 22.25%) (N:  0.39%) PV: Bg7 Nxg7
info string   Re4 ->       1   (V: 45.71%) (N:  0.61%) PV: Re4 Ng3
info string   Ra5 ->       2   (V: 34.07%) (N:  0.71%) PV: Ra5 Ng3 Bb5+
info string   Rd4 ->       2   (V: 34.34%) (N:  0.88%) PV: Rd4 Ng3 Bb5+
info string   Ra3 ->       3   (V: 35.48%) (N:  0.76%) PV: Ra3 Ng3 Rxg3 hxg3+
info string   Bb7 ->       4   (V: 29.36%) (N:  2.51%) PV: Bb7 Ng3 Ra8+ Kd7 Bc8+
info string   Rb4 ->       4   (V: 31.39%) (N:  1.17%) PV: Rb4 Ng3 Bb5+ Kf7 Bxg5
info string   Ra2 ->       4   (V: 31.98%) (N:  1.06%) PV: Ra2 Ng3 Bb5+ Kf7 Bxg5
info string   Rg4 ->       7   (V: 34.97%) (N:  1.78%) PV: Rg4 Ng3 Rxg3 hxg3+ Kxg3 Kf7 h4 Rc3+
info string   Bc4 ->       8   (V: 26.98%) (N:  5.49%) PV: Bc4 Ng3 Bb5+ Kf7 Bc4+ Kg6 Bxg5
info string  Bxg5 ->      10   (V: 32.77%) (N:  3.35%) PV: Bxg5 fxg5 Ra5 Ng3 Bb5+ Kf7 Bf1
info string   Bd3 ->      11   (V: 33.66%) (N:  3.54%) PV: Bd3 Ng3 Bg6+ Kd7 Rd4+ Ke6 Re4+
info string    g4 ->      28   (V: 36.22%) (N:  2.12%) PV: g4 hxg3+ Kg2 Rc2+ Kf3 Rf2+ Ke3 Nf4 h4 Ng2+ Kd3
info string    g3 ->      44   (V: 37.00%) (N:  1.79%) PV: g3 hxg3+ Kg2 Rc2+ Kf3 Rf2+ Ke3 Nf4 h4 Ng2+ Kd3 Nxh4
info string  Bb5+ ->      65   (V: 30.34%) (N: 39.66%) PV: Bb5+ Kf7 Rc4 Rb1 Rc5 Ng3 Bc4+ Kg6 Bxg5 Rh1+
info string   Rc4 ->     517   (V: 40.06%) (N:  7.90%) PV: Rc4 Rb1 Rc8+ Kd7 Rc3 Ng3 Rxg3 hxg3+ Kxg3 Rb4 Be2 Ke6 Bf8 Kf7 Bh6 Kg6 Bf8 Kf7 Bh6 Kg6 Bf8
info string   Be2 ->    3903   (V: 58.33%) (N: 23.83%) PV: Be2 Nf4 Bf3 Kf7 Rxf4 gxf4 Bxf4 Rc3 Bd2 Rd3 Be1 Rd4 Bh5+ Kg7 Bg4 Kg6 Bxh4 f5 Bf3
info string stm White winrate 55.20%

Even with noise, search is unlikely to find the move:

./lczero -w id359 -n
…
info string  Rxh4 ->       1   (V: 30.53%) (N:  0.92%) PV: Rxh4 gxh4
…
info string   Be2 ->    3904   (V: 58.33%) (N: 21.73%) PV: Be2 Nf4 Bf3 Kf7 Rxf4 gxf4 Bxf4 Rc3 Bd2 Rd3 Be1 Rd4 Bh5+ Kg7 Bg4 Kg6 Bxh4 f5 Bf3
info string stm White winrate 54.80%

However, forcing 2 visits with something like:

diff --git a/src/UCTNode.cpp b/src/UCTNode.cpp
--- a/src/UCTNode.cpp
+++ b/src/UCTNode.cpp
@@ -370,10 +370,15 @@ UCTNode* UCTNode::uct_select_child(Color color, bool is_root) {
     for (const auto& child : m_children) {
         if (!child->active()) {
             continue;
         }

+        // For training, visit each root move twice to get a better initial eval
+        if (is_root && cfg_noise && child->get_visits() < 2) {
+            return child.get();
+        }
+
         float winrate = fpu_eval;
         if (child->get_visits() > 0) {
             winrate = child->get_eval(color);
         }
         auto psa = child->get_score();

…quickly finds the move:

./lczero-twice -w id359 -n
…
go nodes 80
info string   Rf4 ->       2   (V:  6.37%) (N:  0.50%) PV: Rf4 Nxf4 Bb5+
info string   Bf8 ->       2   (V: 14.84%) (N:  2.37%) PV: Bf8 Kxf8 Bb7
info string   Bc8 ->       2   (V: 15.59%) (N:  0.46%) PV: Bc8 Rxc8 Kg1
info string   Bg7 ->       2   (V: 19.93%) (N:  0.30%) PV: Bg7 Nxg7 Bc4
info string    g3 ->       2   (V: 45.20%) (N:  2.46%) PV: g3 hxg3+ Kg2
info string  Rxh4 ->       2   (V: 55.38%) (N:  6.66%) PV: Rxh4 gxh4 Bxc1
info string   Bf1 ->       3   (V:  9.65%) (N:  0.63%) PV: Bf1 Rxf1 Ra8+ Kf7
info string   Ra1 ->       3   (V: 10.29%) (N:  2.83%) PV: Ra1 Rxa1 Bb5+ Kf7
info string   Re4 ->       3   (V: 31.15%) (N:  0.47%) PV: Re4 Ng3 Bb5+ Kf7
info string   Bb7 ->       3   (V: 34.53%) (N:  8.61%) PV: Bb7 Ng3 Ra8+ Kd7
info string   Rd4 ->       3   (V: 34.85%) (N:  0.66%) PV: Rd4 Ng3 Bb5+ Kf7
info string   Ra3 ->       3   (V: 35.48%) (N:  0.57%) PV: Ra3 Ng3 Rxg3 hxg3+
info string   Rb4 ->       3   (V: 35.78%) (N:  0.88%) PV: Rb4 Ng3 Bb5+ Kf7
info string   Rg4 ->       3   (V: 35.98%) (N:  2.69%) PV: Rg4 Ng3 Rxg3 hxg3+
info string   Ra5 ->       3   (V: 36.43%) (N:  2.30%) PV: Ra5 Ng3 Bb5+ Kf7
info string   Ra2 ->       3   (V: 36.58%) (N:  0.80%) PV: Ra2 Ng3 Bb5+ Kf7
info string    g4 ->       3   (V: 39.15%) (N:  1.70%) PV: g4 hxg3+ Kg2 Rc2+
info string   Bd3 ->       3   (V: 40.53%) (N:  2.80%) PV: Bd3 Ng3 Bg6+ Kd7
info string   Be2 ->       3   (V: 40.78%) (N: 18.30%) PV: Be2 Ng3 Bb5+ Kf7
info string   Bc4 ->       3   (V: 42.25%) (N:  5.43%) PV: Bc4 Ng3 Bb5+ Kf7
info string  Bxg5 ->       3   (V: 47.75%) (N:  2.91%) PV: Bxg5 fxg5 Ra5 Ng3
info string   Rc4 ->       4   (V: 59.62%) (N:  5.92%) PV: Rc4 Rxc4 Bxc4 Ng3
info string  Bb5+ ->      12   (V: 55.12%) (N: 29.75%) PV: Bb5+ Kf7 Bxg5 fxg5 Rc4
info string stm White winrate 37.97%

go nodes 800
…
info string   Rc4 ->      13   (V: 55.54%) (N:  8.03%) PV: Rc4 Rb1 Rc8+ Kd7 Rc2 Ng3
info string  Bb5+ ->      20   (V: 47.51%) (N: 33.84%) PV: Bb5+ Kf7 Rc4 Rb1 Ba4 Ng3
info string  Rxh4 ->     366   (V: 88.55%) (N:  0.54%) PV: Rxh4 gxh4 Bxc1 Ng3 Bc4 Kd7 Bf4 Nf5 g3 e5
info string stm White winrate 78.40%

At least for this position, the 2 visits for Rxh4 were enough for normal search to drive the majority of visits to the move. In this case, training data would teach the network to boost the prior closer to 80% instead of the current 0.33%.
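The selection rule in the patch can be sketched in isolation. This is a minimal, self-contained illustration, not the actual `UCTNode` code: the `Child` struct, the `select_child` function, and the `forced_visits` parameter are hypothetical names, and the PUCT expression is a simplified form of the usual AlphaZero-style rule.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for a root child; not the real UCTNode.
struct Child {
    float prior;  // policy prior for the move
    float eval;   // mean value for the side to move (meaningful once visits > 0)
    int visits;   // visit count so far
};

// Returns the index of the child to visit next.
// When forced_visits > 0, the first child with fewer visits than that is
// returned immediately, mirroring the `child->get_visits() < 2` check in the patch.
std::size_t select_child(const std::vector<Child>& children,
                         float c_puct, float fpu_eval, int forced_visits) {
    assert(!children.empty());
    int parent_visits = 0;
    for (const Child& c : children) parent_visits += c.visits;
    const float numerator = std::sqrt(static_cast<float>(parent_visits));

    std::size_t best = 0;
    float best_value = -1e9f;
    for (std::size_t i = 0; i < children.size(); ++i) {
        const Child& c = children[i];
        if (c.visits < forced_visits) return i;  // forced exploration
        const float q = c.visits > 0 ? c.eval : fpu_eval;
        const float u = c_puct * c.prior * numerator / (1.0f + c.visits);
        if (q + u > best_value) {
            best_value = q + u;
            best = i;
        }
    }
    return best;
}
```

With a heavily visited favourite and an unvisited move whose prior is 0.3%, `select_child(children, 0.6f, 0.3f, 0)` keeps picking the favourite, while `forced_visits = 2` returns the unvisited move until it has two visits, after which normal selection resumes.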

@jkiliani has pointed out in https://github.com/gcp/leela-zero/issues/1408#issuecomment-388626060 that with lower visit counts, we should be careful to subtract these forced visits from the training data when search decides not to put any more visits into the move. It is also unclear whether these forced visits should count towards the total visit limit that stops the search.
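One possible reading of "subtracting forced visits" is sketched below. It assumes that a root move whose count never rose above the forced amount received only forced visits, so it contributes nothing to the policy target, while moves that search kept visiting keep their full count. The function name `policy_target` and this exact rule are illustrative assumptions, not something specified in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Turns root visit counts into a normalized policy training target.
// Assumption: a move with visits <= forced_visits got only forced visits,
// so its count is dropped; every other move keeps its full count.
std::vector<float> policy_target(const std::vector<int>& visits, int forced_visits) {
    std::vector<int> adjusted(visits.size());
    long total = 0;
    for (std::size_t i = 0; i < visits.size(); ++i) {
        adjusted[i] = visits[i] > forced_visits ? visits[i] : 0;
        total += adjusted[i];
    }
    assert(total > 0);  // at least one move must have earned visits beyond the forced ones

    std::vector<float> target(visits.size());
    for (std::size_t i = 0; i < visits.size(); ++i) {
        target[i] = static_cast<float>(adjusted[i]) / static_cast<float>(total);
    }
    return target;
}
```

For visit counts `{2, 2, 2, 2, 2, 90}`, passing `forced_visits = 0` leaves the last move at 90% of the target, while `forced_visits = 2` gives it the whole target.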

Mardak commented 6 years ago

It sounds like for chess there are typically at most ~50 legal moves, so forcing 2 visits per move out of 800 might not be too bad at roughly 10% of visits (12.5% at a full 50 moves). For Go, there could be ~300 legal moves, so even with 3200 visits the forced share is close to 20%.

In particular, if the forced visits are not removed from the training data, a move that should have a 100% prior would appear with only 90% of the visits when 10% were given to other moves.
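The overhead estimates above are simple arithmetic, sketched here with a hypothetical helper `forced_visit_fraction`. The move counts are the rough figures from the comment, not measured averages.

```cpp
#include <cassert>

// Share of a search budget consumed by forcing `forced_per_move` visits
// on each of `legal_moves` root moves.
double forced_visit_fraction(int legal_moves, int forced_per_move, int total_visits) {
    assert(total_visits > 0);
    return static_cast<double>(legal_moves) * forced_per_move / total_visits;
}
```

`forced_visit_fraction(50, 2, 800)` gives 0.125 for the chess upper bound, `forced_visit_fraction(40, 2, 800)` gives 0.1, and `forced_visit_fraction(300, 2, 3200)` gives 0.1875 for the Go case.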

Mardak commented 6 years ago

Here are some more positions from CCLS games, run with ./lczero-twice -w id351 -n, then position … and go nodes 800, where the prior for the opponent's move was extremely low, so forcing 2 visits allows it to be searched:

CCLS Season 3 - Elite League /// Leela Chess Zero v0.10 ID 351 Gauntlet GTX https://www.twitch.tv/videos/267202896


id351 vs EXchess game 1
position startpos moves g1f3 g8f6 g2g3 e7e6 f1g2 f8e7 c2c4 d7d5 e1g1 e8g8 d2d4 d5c4 f3e5 c7c5 d4c5 d8c7 e5c4 c7c5 b2b3 f8d8 b1d2 c5c7 c1b2 b8c6 a1c1 a8b8 a2a3 f6d5 b3b4 b7b5 c4a5 c8b7 c1c2 e7f8 d1b1 b8c8 f1c1 c7d7 a5b7 d7b7 d2b3 a7a6 e2e3 b7d7 c2d2 d7e7 h2h4 e7b7 b3c5 b7a8 c5e4 h7h6 e4c5 a6a5 c5e6 f7e6 b1g6 d8d6 g2e4 c6e7 g6h7 g8f7 c1d1 a5b4 e4f3

info string   Re8 ->       2   (V: 15.82%) (N:  0.12%) PV: Re8 Bh5+ Ng6
info string   Qa5 ->       2   (V: 17.50%) (N:  2.64%) PV: Qa5 Bh5+ Ng6
info string   Ng8 ->       2   (V: 19.34%) (N:  0.32%) PV: Ng8 Bh5+ Ke7
info string   Qa4 ->       2   (V: 20.47%) (N:  4.02%) PV: Qa4 Bh5+ Ng6
info string   Qb7 ->       2   (V: 20.89%) (N:  0.99%) PV: Qb7 Bh5+ Ng6
info string   Rd7 ->       2   (V: 21.46%) (N:  0.44%) PV: Rd7 Bh5+ Ng6
info string   Nc6 ->       2   (V: 22.91%) (N:  0.13%) PV: Nc6 Bh5+ Ke7
info string   Qb8 ->       2   (V: 23.01%) (N:  0.80%) PV: Qb8 Bh5+ Ng6
info string  Rcd8 ->       2   (V: 24.84%) (N:  1.95%) PV: Rcd8 Bh5+ Ng6
info string   Nc7 ->       3   (V: 14.34%) (N:  3.58%) PV: Nc7 Bxa8 Rxd2 Rxd2
info string   Nf4 ->       3   (V: 14.90%) (N:  0.12%) PV: Nf4 Bxa8 Rxd2 Rxd2
info string   Nb6 ->       3   (V: 15.00%) (N:  0.19%) PV: Nb6 Bxa8 Rxd2 Rxd2
info string   Rc1 ->       3   (V: 17.67%) (N:  0.04%) PV: Rc1 Rxc1 bxa3 Bh5+
info string  Qxa3 ->       3   (V: 17.81%) (N:  0.50%) PV: Qxa3 Bxa3 bxa3 Bh5+
info string   Rc2 ->       3   (V: 17.98%) (N:  0.04%) PV: Rc2 Rxc2 bxa3 Bh5+
info string   Rc7 ->       3   (V: 18.12%) (N:  0.20%) PV: Rc7 Bh5+ Ng6 Qxg6+
info string  Nxe3 ->       3   (V: 18.56%) (N:  0.16%) PV: Nxe3 Bh5+ Ng6 Qxg6+
info string   Qa7 ->       3   (V: 18.57%) (N:  0.38%) PV: Qa7 Bh5+ Ng6 Qxg6+
info string    e5 ->       3   (V: 20.05%) (N:  0.49%) PV: e5 Bxe5 Rcd8 Bxd6
info string    h5 ->       3   (V: 20.12%) (N:  0.58%) PV: h5 axb4 Qa4 Bxh5+
info string    b3 ->       3   (V: 20.82%) (N:  1.64%) PV: b3 Bh5+ Ng6 Qxg6+
info string   Nc3 ->       3   (V: 26.96%) (N:  0.22%) PV: Nc3 Bxa8 Nxd1 Rxd6
info string   Rb8 ->       3   (V: 27.33%) (N:  0.30%) PV: Rb8 axb4 Rbd8 Bh5+
info string   Rc5 ->       3   (V: 27.50%) (N:  0.21%) PV: Rc5 axb4 Rc4 e4
info string   Ng6 ->       3   (V: 27.60%) (N:  4.19%) PV: Ng6 axb4 Nge7 Bh5+
info string   Qc6 ->       3   (V: 29.26%) (N:  3.59%) PV: Qc6 axb4 Qc4 Bh5+
info string   Nf5 ->       3   (V: 29.26%) (N:  1.10%) PV: Nf5 Bh5+ Ke7 e4
info string   Qa6 ->       3   (V: 29.89%) (N:  2.87%) PV: Qa6 axb4 Rcd8 Bh5+
info string   Ke8 ->       3   (V: 30.07%) (N:  1.64%) PV: Ke8 axb4 Qa4 Bh5+
info string   Rc4 ->       3   (V: 31.22%) (N:  3.05%) PV: Rc4 axb4 Qc8 e4
info string   Rb6 ->       3   (V: 31.95%) (N:  0.68%) PV: Rb6 axb4 Ra6 Bh5+
info string  Rdc6 ->       3   (V: 31.97%) (N:  0.32%) PV: Rdc6 axb4 Qa4 Bh5+
info string  Rdd8 ->       3   (V: 32.21%) (N:  0.40%) PV: Rdd8 axb4 Qa4 Bh5+
info string  Rcc6 ->       3   (V: 32.35%) (N:  0.86%) PV: Rcc6 axb4 Qc8 Bh5+
info string   Rc3 ->       3   (V: 32.89%) (N:  1.77%) PV: Rc3 axb4 Rc4 e4
info string   Ra6 ->       3   (V: 34.36%) (N:  3.10%) PV: Ra6 axb4 Ra2 Bh5+
info string  bxa3 ->      22   (V: 28.91%) (N: 56.33%) PV: bxa3 Bh5+ Ng6 Bxg6+ Ke7 Bxg7 Kd7 Be5+ Be7
info string   Nf6 ->     352   (V: 79.84%) (N:  0.05%) PV: Nf6 Bxa8 Rxd2 Rxd2 Nxh7 Bf3 bxa3 Bxa3 Nf6 Rb2 Rb8 Bd6 Rb6 Bc7 Ra6 Rxb5 Ned5
info string stm Black winrate 65.75%


id351 vs Hakkapeliitta game 1
position startpos moves e2e4 c7c5 g1f3 e7e6 d2d4 c5d4 f3d4 b8c6 b1c3 g8f6 d4c6 b7c6 e4e5 f6d5 c3e4 d8c7 f2f4 c7b6 a2a3 f8e7 c2c4 d5e3 d1d3 e3f1 h1f1 c6c5 f1f2 f7f5 e4d6 e7d6 d3d6 b6d6 e5d6 e8f7 b2b4 c8a6 b4b5 a6b7 a3a4 h7h5 a4a5 h5h4 a5a6 b7e4 c1e3 h4h3 g2g3 h8c8 f2a2 a8b8 a1c1 f7g6 e1f1 g6h5 f1f2 h5g4 a2e2 b8b6 e2d2 e4f3 c1c3 g7g6 c3c1 f3g2 c1c3 g2f3 c3a3 f3e4 a3a1 e4f3 a1c1 f3g2 c1a1 g2e4 a1a3 e4f3 a3c3 f3e4 f2g1 e4f3 c3d3 b6b8 d3c3 f3e4 g1f2 b8b6 f2e2 e4g2 e3g1 g2e4 e2e1 e4f3 d2d3 f3g2 e1e2 g2e4 d3d2 b6b8 d2a2 b8b6 a2d2 b6b8 g1e3 b8b6 c3c1 e4f3 e2f1 b6b8 f1f2 b8b6 c1e1 f3e4 e1d1 e4f3 d1a1 b6b8 f2g1 f3e4 g1f2 b8b6 a1a2 e4f3 a2b2 b6b8 b2b3 b8b6 b3d3 f3e4 d3b3 e4f3 b3b1 f3e4 b1e1 e4f3 f2g1 f3e4 e1f1 e4g2 f1e1 g2f3 g1f2 f3e4 e1f1 e4f3 f2g1 f3e4 g1f2 e4f3 f1g1 f3e4 g1d1 e4f3 f2e1

info string    e5 ->       2   (V: 23.71%) (N:  0.18%) PV: e5 fxe5 Bxd1
info string  Rcc6 ->       2   (V: 26.86%) (N:  1.25%) PV: Rcc6 bxc6 Rxc6
info string   Bb7 ->       2   (V: 29.86%) (N:  2.96%) PV: Bb7 axb7 Rxb7
info string   Rh8 ->       2   (V: 35.81%) (N:  0.01%) PV: Rh8 Bxc5 Bxd1
info string   Ra8 ->       2   (V: 36.23%) (N:  1.25%) PV: Ra8 Bxc5 Bxd1
info string   Rg8 ->       2   (V: 36.59%) (N:  0.05%) PV: Rg8 Bxc5 Bxd1
info string   Re8 ->       2   (V: 38.72%) (N:  0.02%) PV: Re8 Bxc5 Bxd1
info string   Ba8 ->       2   (V: 45.76%) (N:  0.04%) PV: Ba8 Ke2 Rcb8
info string   Be2 ->       2   (V: 46.09%) (N:  0.18%) PV: Be2 Bxc5 Rxc5
info string  Rxd6 ->       3   (V:  7.95%) (N:  1.35%) PV: Rxd6 Rxd6 Bxd1 Rxd1
info string  Rxb5 ->       3   (V: 14.47%) (N:  1.89%) PV: Rxb5 cxb5 Bxd1 Rxd1
info string   Rb7 ->       3   (V: 17.79%) (N:  0.37%) PV: Rb7 axb7 Bxb7 Ra1
info string  Rbc6 ->       3   (V: 18.47%) (N:  2.89%) PV: Rbc6 bxc6 Bxc6 Kf2
info string   Rc7 ->       3   (V: 18.98%) (N:  5.79%) PV: Rc7 dxc7 Bxd1 Rxd1
info string    g5 ->       3   (V: 24.52%) (N:  1.74%) PV: g5 Rc1 gxf4
info string   Rf8 ->       3   (V: 27.02%) (N:  5.09%) PV: Rf8 Bxc5 Bxd1 Bxb6
info string   Bc6 ->       3   (V: 28.07%) (N:  2.29%) PV: Bc6 bxc6 Rbxc6 Kf2
info string   Bh1 ->       3   (V: 39.29%) (N:  2.26%) PV: Bh1 Ke2 Rcb8 Bf2
info string   Be4 ->       3   (V: 41.37%) (N:  0.60%) PV: Be4 Ke2 Rcb8 Bf2
info string   Bg2 ->       3   (V: 42.27%) (N:  0.44%) PV: Bg2 Ke2 Rcb8 Bf2
info string   Rd8 ->       4   (V: 23.94%) (N:  7.68%) PV: Rd8 Bxc5 Rbb8 Rc1
info string  Rbb8 ->       4   (V: 42.26%) (N:  0.61%) PV: Rbb8 Bf2 Rc7 dxc7
info string   Kh5 ->       4   (V: 45.17%) (N:  0.76%) PV: Kh5 g4+ Kxg4 Rc1
info string  Rxa6 ->      10   (V: 21.05%) (N: 23.09%) PV: Rxa6 bxa6 Bxd1 Rxd1 Rc6 Ke2 Rxa6 Bxc5
info string   Bd5 ->      10   (V: 46.75%) (N:  0.21%) PV: Bd5 Bf2 Rcb8 Bg1 e5
info string  Rcb8 ->      19   (V: 25.37%) (N: 36.73%) PV: Rcb8 Bxc5 Bxd1 Rxd1 Rc8 Bxb6 axb6 c5 Rxc5
info string  Bxd1 ->     360   (V: 49.82%) (N:  0.27%) PV: Bxd1 Rxd1 Kf3 Bf2 Rbb8 Kf1 Ra8 Rd3+ Ke4 Ke2 e5 fxe5 Kxe5 Rd5+ Ke6 Bxc5 g5 Kd3 Rf8
info string stm Black winrate 45.49%


iCE vs id351 game 1
position startpos moves e2e4 c7c6 g1f3 d7d5 e4e5 c6c5 f1e2 b8c6 e1g1 c8g4 c2c4 d5c4 b1a3 e7e6 a3c4 f8e7 d2d3 g8h6 c1h6 g7h6 d1d2 h6h5 d2f4 h8g8 f1e1 d8d7 a1d1 e8c8 f4f7 h7h6 f7h7 h5h4 h7h6 d8f8 h6h7 c8b8 c4e3 g4f3 e2f3 c6e5 f3e4 e5f7 f2f4 d7c7 e1f1 e7f6 d1e1 f7d6 h7c7 b8c7 b2b3 f6d4 g1h1 b7b5 e1e2 a7a5 e3c2 d4b2 e4f3 c7d7 c2e3 b2d4 a2a4 b5a4 b3a4 f8f4

info string   Rb2 ->       2   (V:  5.78%) (N:  0.04%) PV: Rb2 Bxb2 Rb1
info string   Ra1 ->       2   (V:  7.16%) (N:  0.44%) PV: Ra1 Bxa1 h3
info string   Rd2 ->       2   (V:  8.20%) (N:  1.45%) PV: Rd2 Bxe3 Re2
info string   Ra2 ->       2   (V:  8.42%) (N:  0.03%) PV: Ra2 Bxe3 Re1
info string  Ref2 ->       2   (V:  8.94%) (N:  0.07%) PV: Ref2 Bxe3 Re2
info string   Ba8 ->       2   (V:  9.29%) (N:  0.35%) PV: Ba8 Rxf1+ Nxf1
info string   Rc2 ->       2   (V:  9.36%) (N:  0.48%) PV: Rc2 Bxe3 Re1
info string   Nf5 ->       2   (V:  9.49%) (N:  0.81%) PV: Nf5 Nxf5 Rfe1
info string   Bb7 ->       2   (V: 10.89%) (N:  0.47%) PV: Bb7 Rxf1+ Nxf1
info string   Bd5 ->       2   (V: 12.00%) (N:  6.27%) PV: Bd5 Rxf1+ Nxf1
info string   Be4 ->       2   (V: 14.71%) (N:  0.19%) PV: Be4 Rxf1+ Nxf1
info string   Bg4 ->       2   (V: 14.87%) (N:  0.50%) PV: Bg4 Rxf1+ Nxf1
info string   Ng4 ->       2   (V: 16.25%) (N:  1.88%) PV: Ng4 Rgxg4 Bxg4
info string   Nd5 ->       2   (V: 16.69%) (N:  1.61%) PV: Nd5 exd5 Bxd5
info string  Rff2 ->       2   (V: 17.90%) (N:  0.33%) PV: Rff2 Rb8 Rf1
info string    g4 ->       2   (V: 18.32%) (N:  0.80%) PV: g4 h3 Nc2
info string   Bh5 ->       2   (V: 19.33%) (N:  0.17%) PV: Bh5 Rxf1+ Nxf1
info string   Kg1 ->       2   (V: 20.09%) (N:  1.90%) PV: Kg1 Rxf3 Rxf3
info string   Rc1 ->       2   (V: 22.75%) (N:  1.81%) PV: Rc1 h3 Nc2
info string   Rg1 ->       2   (V: 24.19%) (N:  0.30%) PV: Rg1 h3 Rf1
info string   Nd1 ->       2   (V: 24.95%) (N:  0.38%) PV: Nd1 Rgf8 Ne3
info string   Rd1 ->       2   (V: 24.95%) (N:  1.00%) PV: Rd1 h3 Nc2
info string    g3 ->       2   (V: 25.96%) (N:  1.31%) PV: g3 hxg3 Bg2
info string  Ree1 ->       2   (V: 32.81%) (N:  4.58%) PV: Ree1 Rgf8 Nc4
info string  Rfe1 ->       3   (V: 25.55%) (N:  9.17%) PV: Rfe1 h3 Nc2 hxg2+
info string    h3 ->       3   (V: 27.96%) (N:  7.26%) PV: h3 Rgf8 Rb1
info string   Rb1 ->       4   (V: 29.83%) (N: 11.13%) PV: Rb1 h3 Nc2 hxg2+ Bxg2
info string   Nc4 ->       7   (V: 31.62%) (N: 17.86%) PV: Nc4 Nxc4 dxc4 Rb8 g3 Rf6
info string   Nc2 ->      11   (V: 31.42%) (N: 27.31%) PV: Nc2 Rgf8 Rb1 h3 Nxd4 Rxd4 Re3
info string  Bc6+ ->     369   (V: 61.86%) (N:  0.06%) PV: Bc6+ Kxc6 Rxf4 Rb8 Nc2 Bc3 Ne3 Rb1+ Rf1 Rb4 Nc4 Nxc4 dxc4 e5 g4
info string stm White winrate 54.87%

A slightly different case, where lczero didn't consider a better move for itself:

id351 vs Bobcat game 2
position startpos moves d2d4 d7d5 g1f3 c7c6 c2c4 g8f6 b1c3 d5c4 a2a4 c8f5 e2e3 e7e6 f1c4 b8d7 d1b3 d8b6 a4a5 b6b3 c4b3 f5d3 b3d1 f8d6 d1e2 d3g6 e1g1 e8g8 c1d2 h7h6 f1c1 a7a6 c3a4 f6e4 d2e1 f8e8 g1f1 a8d8 f3d2 e4d2 e1d2 e6e5 d4e5 d7e5 d2c3 e5d7 c1d1 d6e7 a1c1 d7f6 c3d4 f6d7 d4c3 d7f6 c3d4 f6d7 h2h3 g6f5 e2d3 f5e6 d3c4 e6f5 f2f3 c6c5 d4c3 e7g5 g2g4 f5e6 c4e6 e8e6 f3f4 g5e7 f1e2 e6c6 d1d5 c6d6 d5d6 e7d6 c1d1 d6e7 b2b3 f7f6 h3h4 g8f7 h4h5 f7e8 e3e4 d8c8 e2d3 c8c6 d3c4 c6e6 d1e1 e6c6 e4e5 f6e5 f4e5 d7f8 a4b6 f8e6 c4d5 e6c7 d5e4 e8f7 e4f5 g7g6 f5e4 c7b5 c3d2 b5d4 e1b1 d4e2 b1f1 f7e8 f1f3 g6h5 g4h5 e2d4 f3g3 e7f8 b6c4 e8f7 e4d5 d4b5 g3d3 f7e8 d2e3 c6c7 c4d6 b5d6 e5d6 f8d6

info string  Bxc5 ->       2   (V: 20.39%) (N:  2.83%) PV: Bxc5 Rxc5+ Kxd6
info string   Bg5 ->       2   (V: 39.42%) (N:  3.49%) PV: Bg5 hxg5 Kxd6
info string   Rd2 ->       2   (V: 57.29%) (N:  0.18%) PV: Rd2 Bf8 Rf2
info string   Bf2 ->       2   (V: 58.96%) (N:  0.06%) PV: Bf2 Be7 Rg3
info string   Rd4 ->       3   (V:  5.77%) (N:  0.03%) PV: Rd4 cxd4 Kxd6 Rd7+
info string   Bf4 ->       3   (V: 11.89%) (N:  0.05%) PV: Bf4 Bxf4 Rf3 Bd2
info string   Bd4 ->       3   (V: 24.87%) (N:  2.02%) PV: Bd4 cxd4 Kxd6 Rc3
info string   Ke6 ->       3   (V: 47.23%) (N:  0.50%) PV: Ke6 Bf8 Bf4 Rc6+
info string   Bc1 ->       3   (V: 52.15%) (N:  1.99%) PV: Bc1 Bf8 Re3+ Kf7
info string   Bg1 ->       3   (V: 55.71%) (N:  0.05%) PV: Bg1 Be7 Be3 Rd7+
info string   Bd2 ->       3   (V: 56.78%) (N:  0.91%) PV: Bd2 Bf8 Kc4 Rf7
info string   Rd1 ->       3   (V: 57.15%) (N:  2.11%) PV: Rd1 Bf8 Rf1 Rd7+
info string   Ke4 ->       3   (V: 57.91%) (N:  0.13%) PV: Ke4 Bf8 Rd5 c4
info string   Kc4 ->       3   (V: 61.95%) (N:  0.09%) PV: Kc4 Bf8 Rd5 Rc6
info string   Rc3 ->       6   (V: 56.72%) (N:  4.21%) PV: Rc3 Bf8 Bxc5 Rd7+ Ke6
info string    b4 ->      10   (V: 55.67%) (N:  5.88%) PV: b4 c4 Kxd6 Rd7+ Kc5
info string  Kxd6 ->      24   (V: 23.68%) (N: 75.35%) PV: Kxd6 Rd7+ Kxc5 Rxd3 Bxh6 Rxb3 Bd2 Rb5+ Kd6 Rxh5 Kc7
info string  Bxh6 ->     375   (V: 64.52%) (N:  0.10%) PV: Bxh6 c4 bxc4 Bb4 Be3 Bxa5 c5 Rd7+ Kc4 Rxd3 Kxd3 Kf7 Kc4 Bd8 Kd5 a5 Kc4
info string stm White winrate 60.34%

DaghN commented 6 years ago

Hi, interesting idea! If I understand correctly, the TL;DR version is to give each root move 2 forced visits, but subtract these visits from the training count.

I think this might be a good idea. I also think, though, that this points to a larger question: how can we improve the search? Basically, if forcing 2 visits for each root move is "worth it" after 800 nodes, then this is evidence that the default Alpha Zero search is easy to improve. (And why wouldn't it be?)

Alpha-Beta search in traditional chess engines has gone through many, many tweaks and additions over the years (null move, reductions, extensions, quiescence search, and I am sure many other things). It seems self-evident that a foundation of PUCT search also can be improved upon immensely. In fact, didn't the Komodo team claim that they added 300 Elo by improving the PUCT search?

I think this is something we should put more thought into, and I encourage more thinking about and testing of your idea to get things started. Sooner or later there will be too much Elo to pick up for us to ignore this because of a "zero purity" philosophy.

DaghN commented 6 years ago

Let me add that while we should certainly respect the expertise of the Google team, it would also be naive to think that they have "figured everything out" already. Remember Deep Blue: it also had a rather primitive engine/search, which was later surpassed by far by commercial engines with new search ideas.

DaghN commented 6 years ago

Thinking some more, I see a problem that I think warrants some consideration.

Say we have 10 moves that the policy thinks are "crap", and it gives them a crap 0.8%. One of them, Rxh4, is actually strong, and 2 visits would reveal it, but the policy has not learned to identify this promising move. The other 9 moves are crap, but they DO kinda warrant 1 visit each to verify they are crap in an 800-node search.

If we now give each move two visits for free, we would indeed help the net to search this Rxh4 move more and learn to identify its promising features on its own over time. But we would also teach the network that it doesn't need to spend even 1 node on the other 9 crap moves (because the 2 free nodes already checked them out and showed they were not worth further search), even though they actually warrant 1 visit from a global "let's be sure" perspective.

So we might speed up learning to identify rare tactical shots, but we would also teach the net to not check out the average random crap move often enough. (unless we also used the same "free" 2-search in match setting).

Mardak commented 6 years ago

"we would also teach the net to not check out the average random crap move often enough"

That's where we are now, and that's MCTS working as intended. If a move is truly bad, search knows not to waste time even visiting it. If a move is only sometimes bad, the training data should increase the prior to levels that are searchable for match settings.

DaghN commented 6 years ago

Let's say we have 10 moves that together warrant 15 visits: 1 each, plus 5 more on one move that "randomly" warrants a bit more checking out (we cannot teach the policy everything, and we assume the net cannot tell without a visit).

If we give each move 2 free visits, we would teach the net it only needs to spend on average 4-5 visits on these 10 moves instead of the 10-15 they actually warrant. So it would not even give them 1 visit.

Videodr0me commented 6 years ago

I tried a lot of these schemes at the root and throughout the tree; unfortunately, in self-play they are always hugely inferior. Just try the selfplay option of LC0 to test your approach (min 1000 games). For example:

lc0-cudnn selfplay --parallelism=8 --backend=multiplexing "--backend-opts=cudnn(threads=2)" --games=10000 --visits=100 --temperature=1 --tempdecay-moves=10 player1: --your-modification=1 player2: --your-modification=0

If you just want to find tactics, then this works. You can also tweak search by making these changes progressively less invasive the further down the tree you go, but then again, this almost never translates to an improvement in strength.

Mardak commented 6 years ago

Inferior when playing against itself without the changes? That's expected, much like a match against itself where one side picks the most-visited move and the other picks moves in proportion to visits.

ASilver commented 6 years ago

Why not simply increase the PUCT value?

Mardak commented 6 years ago

I'm assuming you mean specifically increasing PUCT at the noised root. Yes, it will probably increase the likelihood of exploring a noised move when other moves look bad. Notably, PUCT scales up the prior for all moves and reduces the impact of the win rate eval. Any particular PUCT numbers you think I should run to report in #699?

ASilver commented 6 years ago

For LC0, yes, 3.1 as PUCT value. I have been running various time controls and conditions and one value has been consistent: 3.1

Ishinoshita commented 6 years ago

"Why not simply increase the PUCT value?"

It seems that would work with the 'vanilla' UCT formula (unbiased UCB term). But with the AG-like formula (AGZ/AZ, LZ and LCZ), the policy bias is injected into the UCB term in a multiplicative manner. Thus a high PUCT value cannot compensate for a very low prior.
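The multiplicative role of the prior can be made concrete with the AlphaZero-style exploration term, U = c_puct · P · sqrt(N_parent) / (1 + N_child). The sketch below uses a hypothetical helper `puct_bonus` and the Be2 and Rxh4 priors from the first log (23.83% and 0.33%); it only illustrates the formula and is not taken from the lczero source.

```cpp
#include <cassert>
#include <cmath>

// Exploration bonus in the AlphaZero-style PUCT rule:
//   U = c_puct * prior * sqrt(parent_visits) / (1 + child_visits)
double puct_bonus(double c_puct, double prior, int parent_visits, int child_visits) {
    assert(parent_visits >= 0 && child_visits >= 0);
    return c_puct * prior * std::sqrt(static_cast<double>(parent_visits)) /
           (1.0 + child_visits);
}
```

At equal visit counts, `puct_bonus(c, 0.2383, n, 0) / puct_bonus(c, 0.0033, n, 0)` is about 72 for any `c`, so raising `c_puct` alone leaves the low-prior move just as far behind the other candidates in the exploration term. A larger `c_puct` does still increase the weight of exploration relative to the eval, which is why it can help somewhat without fixing the prior itself.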

ASilver commented 6 years ago

Not alone, no. I ran a long CLOP with three settings, and it came up with the following values after 710 trials: PUCT: 3.4 FPU: 0.9 Policy Softmax: 2.2

I then ran matches against outside engines with the default settings and with these, and these showed a 63 Elo increase. Out of curiosity, I also tested them on a revised version of the WAC tactics suite: the default settings solved 109/200 and these solved 159/200. In other words, they are not only stronger in play, but also vastly better in tactics.
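Of the three tuned settings, the policy softmax value bears most directly on the low-prior problem discussed in this thread. Assuming "Policy Softmax" refers to a temperature T applied to the network's policy output (each prior raised to the power 1/T and renormalized), a value above 1 flattens the distribution. The helper name `apply_policy_temperature` and the two-move toy distribution are illustrative assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Rescales priors as p_i^(1/T) and renormalizes them to sum to 1.
// T > 1 flattens the distribution, T < 1 sharpens it, T == 1 leaves it unchanged
// up to renormalization.
std::vector<double> apply_policy_temperature(const std::vector<double>& priors,
                                             double temperature) {
    assert(temperature > 0.0);
    std::vector<double> out(priors.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < priors.size(); ++i) {
        out[i] = std::pow(priors[i], 1.0 / temperature);
        sum += out[i];
    }
    assert(sum > 0.0);
    for (double& p : out) p /= sum;
    return out;
}
```

In a toy case with priors `{0.9967, 0.0033}`, a temperature of 2.2 lifts the second move from 0.33% to roughly 7%, which is enough for the exploration term to give it visits much earlier.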