glinscott / leela-chess

**MOVED TO https://github.com/LeelaChessZero/leela-chess** A chess adaptation of GCP's Leela Zero
http://lczero.org
GNU General Public License v3.0

Issues with depths greater than 32? #392

Open cn4750 opened 6 years ago

cn4750 commented 6 years ago

After increasing MAX_TREE_SIZE to enable depths greater than 29, I cannot seem to reach depth 33 no matter how long I give it. https://pastebin.com/k3g9n2iU Memory usage was growing towards 30 GB, then dropped below 4 GB at some point without causing a crash. Maybe an integer overflow? There also seems to be a massive slowdown at higher depths. Any idea why? I was told that the TensorFlow version has this problem as well.
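For reference, the change described above is essentially a one-line constant bump; the file location and formatting below are an assumption from memory of the search code, not verified against this repo:

```cpp
// Assumed to live in the UCT search header; the original default value is not
// shown here. Raising the cap lets the tree keep growing past depth ~29, at
// the cost of correspondingly more memory.
static constexpr auto MAX_TREE_SIZE = 500'000'000;
```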

cn4750 commented 6 years ago

I've done some more testing. I've disabled NNCache to rule that out and to get more stable nps for comparisons. I no longer think it is an integer overflow, since I have now gotten a completed run to 500 million total nodes using one thread by setting MAX_TREE_SIZE to that. Using more than one thread seems to cause a massive performance drop somewhere north of depth 32: GPU utilization starts at a steady 99% and then drops quickly towards zero.

Here's a graph of the nps on my GTX 960 with one thread and with six. The six-thread run isn't complete, as it is too slow to finish in a timely manner, but the trend is obvious:

haleysa commented 6 years ago

Seems like this could be an overflow of m_playouts? It's shared between threads, it's used to report the nps, and there's an ungainly hack that caps the playout limit at MAXINT/2 to "avoid overflow when multithreading", which doesn't help when using 6 threads and raising MAX_TREE_SIZE past its default.
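For context, the hack being referred to is roughly the following shape (a sketch based on the description above, not the actual source; the real code lives in the UCT search and differs in detail):

```cpp
// Sketch of the playout cap described above; not the actual leela-chess code.
#include <atomic>
#include <limits>

std::atomic<int> m_playouts{0};   // shared between worker threads, reported as nps

// The "ungainly hack": cap the requested playout limit at MAXINT/2 so that
// several threads overshooting the limit can't wrap the counter around.
int clamp_playout_limit(int requested) {
    constexpr int cap = std::numeric_limits<int>::max() / 2;
    return requested < cap ? requested : cap;
}

bool should_stop(int limit) {
    // Each worker checks this between simulations; with N threads the counter
    // can exceed the limit by up to N-1 before every worker notices.
    return m_playouts.load() >= limit;
}

int main() {
    const int limit = clamp_playout_limit(std::numeric_limits<int>::max());  // clamped to MAXINT/2
    m_playouts = 14'468'324;            // the value reported later in this thread
    return should_stop(limit) ? 1 : 0;  // far below the cap, so no stop
}
```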

cn4750 commented 6 years ago

I doubt it is an overflow, since total nodes are only 500 million and m_playouts is only 14,468,324 at the end, which is well within the limits of an integer. The same effect is seen with 2 threads as well.

Tilps commented 6 years ago

A graph like that would suggest that m_playouts has stopped incrementing, especially to drop so fast, and to drop below the one-thread number. An overflow would show negative nps (and would require somehow going past the max playouts, which with the MAXINT/2 cap would mean all the threads except the main thread running about a billion playouts before the main thread checks the threshold and aborts).

In order for the nps to continue to be reported, the search loop has to still be running, so play_simulation must be returning an invalid score. The scores are not shared between threads, so the scores themselves being valid or invalid isn't a race condition. That means the only way to get invalid scores is if the leaves being visited by the threads call create_children, have it return false, and then check has_children and find it false.

I think I've ruled out every other option, which leaves Network::get_scored_moves returning an empty netlist. That case has a nice simple return false associated with it in create_children, which acts as a terminal condition: if it ever happens the search will break, because any other thread which ends up at this node will also return false without attempting anything. Once that fails once, if it happens to be on a PV with a large enough margin to overcome virtual loss, every thread will just spin waiting for that node to be populated, which will never happen.
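A stripped-down model of that failure mode (hypothetical names mirroring the discussion above; the real search code is considerably more involved):

```cpp
// Hypothetical model of the described failure, not the actual leela-chess code.
#include <cstdio>

struct Node {
    int num_children = 0;
    bool net_returned_empty_list = false;  // stand-in for an empty netlist from the net eval

    bool has_children() const { return num_children > 0; }

    bool create_children() {
        if (net_returned_empty_list) {
            return false;        // the node stays childless forever
        }
        num_children = 2;
        return true;
    }
};

// Models play_simulation: false means "invalid score", and the caller just retries.
bool play_simulation(Node& node) {
    if (!node.has_children() && !node.create_children()) {
        return false;
    }
    return true;
}

int main() {
    Node broken;
    broken.net_returned_empty_list = true;
    // Every visit fails identically, so threads pinned to this PV node spin
    // here without ever completing a playout.
    for (int visit = 0; visit < 3; ++visit) {
        std::printf("visit %d -> %s\n", visit,
                    play_simulation(broken) ? "valid score" : "invalid score");
    }
    return 0;
}
```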

With NN cache disabled, it would appear the only way this could happen is if the current position in the board history was corrupted between checking for whether there are legal moves and asking for the net eval.

I'm having a lot of difficulty working out how that could have happened, though. My best guess is UCTSearch::get_pv, which calls bh.cur().do_move, whereas other paths doing tree walking call bh.do_move, which creates a new position rather than mutating the current one. But the bh is shallow cloned before being passed to get_pv, so the position being mutated should already be a copy.

In any case, I suggest adding a log statement to the return false in create_children after the call to Network::get_scored_moves, to see if my theory is right.
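Concretely, the suggestion amounts to something like this (a placement sketch only; the surrounding code in create_children and the logging helper are approximated from the discussion, not copied from the repo):

```cpp
// Placement sketch, not actual repo code: inside create_children,
// right after the call to Network::get_scored_moves.
auto raw_netlist = Network::get_scored_moves(/* existing arguments */);
if (raw_netlist.empty()) {
    // New diagnostic: if this ever fires, the empty-netlist theory holds.
    myprintf("create_children: empty netlist from Network::get_scored_moves\n");
    return false;
}
```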

cn4750 commented 6 years ago

m_playouts does continue to increment, just slowly with the node rate. I'll try adding the log statement where you suggest when I get a moment. Thanks for your assistance.

Tilps commented 6 years ago

The slow increment could be because the PV isn't quite strong enough to overcome every possible race-condition value of virtual loss, so sometimes one of the threads gets to escape to another path for a try or two.

Tilps commented 6 years ago

I guess I should put one other option out there - a neural net eval deadlocking in OpenCL somehow - that would have a similar effect, but seems somewhat out of our control...

cn4750 commented 6 years ago

I've added a debugging log print statement inside the if statement here, but it never prints.

I've also tested if the CPU code path is affected by this same issue; it is:

Tilps commented 6 years ago

I guess I'm not going to solve this just by code inspection then :P
I might have time to try it myself tonight, but a couple of suggestions in the meantime. I assume you are using latest master or next; there was a bugfix with locks that went in a week ago (not that I've managed to work out how that could cause this problem). Maybe add a monitor thread which checks that all the workers are not deadlocked?
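One possible shape for such a monitor thread, using only the standard library (a sketch under made-up names and thresholds, not tied to the actual UCTSearch worker code):

```cpp
// Minimal watchdog sketch for the monitor-thread idea; all names and timing
// thresholds here are invented for illustration.
#include <array>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    constexpr int num_workers = 6;
    std::array<std::atomic<long>, num_workers> progress{};
    std::atomic<bool> stop{false};

    // Workers bump their own counter; in the real search this would be one
    // tick per completed play_simulation.
    std::vector<std::thread> workers;
    for (int i = 0; i < num_workers; ++i) {
        workers.emplace_back([&, i] {
            while (!stop.load()) {
                progress[i].fetch_add(1);
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
        });
    }

    // Monitor: a counter that hasn't moved between checks means that worker
    // is likely deadlocked or spinning on an unpopulated node.
    std::array<long, num_workers> last{};
    last.fill(-1);
    for (int check = 0; check < 3; ++check) {
        std::this_thread::sleep_for(std::chrono::seconds(1));
        for (int i = 0; i < num_workers; ++i) {
            const long now = progress[i].load();
            if (now == last[i]) {
                std::printf("worker %d made no progress since the last check\n", i);
            }
            last[i] = now;
        }
    }

    stop.store(true);
    for (auto& t : workers) {
        t.join();
    }
    return 0;
}
```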

cn4750 commented 6 years ago

I am using the latest master branch.

Tilps commented 6 years ago

Trying to reproduce this issue on my machine, but I suspect I don't have enough RAM. With 6 threads and NNCache enabled I got 120M nodes done successfully, with only a smallish slowdown between the peak at depth 26 and depth 31, which is probably due to reduced NNCache hit rates. I tried 200M, but it seemed to start thrashing RAM after 140M or so, so I gave up.

jjoshua2 commented 6 years ago

Have the intel_mkl, openblas, and GPU builds all been tried? I wonder if it is openblas's fault... Maybe we can try TF as well for like 30 minutes. I only tried it for 3 and it was at depth 32, but still 7k nps.

cn4750 commented 6 years ago

I don't think the BLAS implementation matters, since I've shown the issue exists on both the GPU and CPU code paths, but if it does for some reason, all of my testing has been on Intel MKL.