hans-ekbrand / lc0

The rewritten engine, originally for tensorflow. Now all other backends have been ported here.
GNU General Public License v3.0
3 stars 1 forks source link

Occasionally fails to actually start search after a new move. #5

Closed hans-ekbrand closed 2 years ago

hans-ekbrand commented 2 years ago

1212 23:02:33.921610 140341671880448 ../../src/mcts/search.h:282] SearchWorker::RunBlocking() entered. 1212 23:02:33.921631 140341671880448 ../../src/mcts/search.h:283] Started search thread. 1212 23:02:33.921648 140341680273152 ../../src/mcts/search.h:282] SearchWorker::RunBlocking() entered. 1212 23:02:33.921707 140341680273152 ../../src/mcts/search.h:283] Started search thread. 1212 23:02:37.461650 140344242099968 ../../src/mcts/stoppers/stoppers.cc:127] Stopping search: Ran out of time.

So nothing in 4 seconds, which indicates that something is locked/messed up, so that search cannot start.

Prior to this, there is a suspect high number of calls to AuxMaybeEnqueueNode() Called AuxMaybeEnqueueNode() 36890 times., which sounds like some node (e.g. root) is repeatedly tried to get queued, but fails to be queued, or that search->stop.load(std::memory_order_acquire) is true, which would get the AuxMaybeEnqueueNode() to fail.

Nothing indicates there is anything wrong with the helper, but I have only seen this with SF8. Might be that SF8 is the fastest helper I have used, and the speed makes something corrupt.

It is suspect that for the last two moves, no evals were made, but that is probably a problem with lc0, not the helper, after search has started, a note from AuxMaybeEnqueueNode() is expected, but that never happened here. Here is what it usually looks like:

1212 23:02:04.386083 140341680273152 ../../src/mcts/auxengine.cc:72] AuxMaybeEnqueueNode() picked node: Term:0 This:0x7fa3b074c740 Parent:0x7fa39c78b780 Index:2 Child:0x7fa39c7c2ee0 Sibling:0x7fa39c064900 WL:0.19754 N:200 N_:4 Edges:31 Bounds:-1,1 Solid:0 for the auxengine_queue.

1212 23:01:57.459413 140345124413888 ../../src/mcts/auxengine.cc:409] Summaries per move: auxengine_queue_ size at the end of search: 274 Average duration 16ms Number of evals 370 Number of added nodes 1762
1212 23:02:04.384055 140345124413888 ../../src/mcts/auxengine.cc:409] Summaries per move: auxengine_queue_ size at the end of search: 90 Average duration 20ms Number of evals 140 Number of added nodes 785
1212 23:02:08.768634 140345124413888 ../../src/mcts/auxengine.cc:409] Summaries per move: auxengine_queue_ size at the end of search: 2 Average duration 8ms Number of evals 196 Number of added nodes 643
1212 23:02:15.287350 140345124413888 ../../src/mcts/auxengine.cc:409] Summaries per move: auxengine_queue_ size at the end of search: 0 Average duration 10ms Number of evals 233 Number of added nodes 841
1212 23:02:20.697288 140345124413888 ../../src/mcts/auxengine.cc:409] Summaries per move: auxengine_queue_ size at the end of search: 1 Average duration 10ms Number of evals 213 Number of added nodes 783
1212 23:02:25.879939 140345124413888 ../../src/mcts/auxengine.cc:409] Summaries per move: auxengine_queue_ size at the end of search: 5 Average duration 21ms Number of evals 75 Number of added nodes 401
1212 23:02:33.920821 140345124413888 ../../src/mcts/auxengine.cc:409] Summaries per move: auxengine_queue_ size at the end of search: 0 Average duration 24ms Number of evals 159 Number of added nodes 1013
1212 23:02:44.497870 140345124413888 ../../src/mcts/auxengine.cc:409] Summaries per move: auxengine_queue_ size at the end of search: 0 Average duration -1ms Number of evals 0 Number of added nodes 0
1212 23:02:51.719834 140345124413888 ../../src/mcts/auxengine.cc:409] Summaries per move: auxengine_queue_ size at the end of search: 0 Average duration -1ms Number of evals 0 Number of added nodes 0

For now I have just moved the selection logic to the caller from AuxMaybeEnqueueNode(), effectively hiding the problem.

I noticed another issue in the log: When adding nodes in PreExtend...(), and the current node is terminal, it will might have zero edges. The current code does seem to handle it correctly, but makes quite a lot of fuzz about it, changed in commit: b2f5cb475617bbcb5324b209afbf68bc2d2ed92b

hans-ekbrand commented 2 years ago

I believe this is solved by the major locking review/rewrite I did recently.