Closed gooooloo closed 6 years ago
For MCTS, it doesn't use coroutines or threading. It just repeats searching and pushes unexpanded nodes into a list, until the list reaches some length. What do you think? @mokemokechicken
IMHO, due to GIL, the minigo's performance seems as fast as your coroutine implementation? However, a node could be expanded twice or more, then it will result in less search depth given same simulations_per_move?
@gooooloo Thanks for letting us know the implementation. A quick google search told me that multi-processing, unlike multi-threading, can evade the curse of GIL. Is mokemokechicken's multiprocessing not the same multiprocessing as implemented by Akababa? Have you found any data regarding their speed? In rl_loop.py of minigo, they imported multiprocessing.
@AranKomat
You are talking different things.
The "multiple process" abakaka, mokemokechicken and I talked about, was the architecture that 1 process serves as model predicting server, + multiple selfplay processes without model prediction. Why? In old implementation, every self process uses a little bit of GPU, then the number of sefplay processes are thus limited( e.g. 4?). By the new architecture, we are able to run 10+ self play processes at the same time.
The "multiple process" you mentioned, is another general trick that when you need to do multiple things parallel, do them in different processes rather then in different threads inside one process. Why? Because in Python, there is no real parallel executing threads, due to GIL. But multiple processes avoid this GIL issue. But this trick has also limiation: if you need to communicate data between different processes, it would be tricky, and effieciency also needs to be considered.
Now we have the new architecture with our "multiple process", there is still an issue: inside every self play process, when MCTS, we need to do tree search with 8 threads. GIL limits this. But we can't easily adopt this issue to "multi processes" trick you mentioned, because these 8 threads of searching are updating the same GameTree. This hit the limitation I mentioned in last paragraph.
@gooooloo
Thank you for your information!
For MCTS, it doesn't use coroutines or threading. It just repeats searching and pushes unexpanded nodes into a list, until the list reaches some length. What do you think? @mokemokechicken
Execution flow is almost same as my coroutine implementation. However in my implementation, the prediction worker doesn't wait for queue filled, because there is possibility that search threads(coroutines) are waiting for expanding nodes.
My coroutine implementation can be rewritten like their simple loop implementation. But, I could not do that in the first implementation because I could not image whole implementation...
IMHO, due to GIL, the minigo's performance seems as fast as your coroutine implementation? However, a node could be expanded twice or more,
I think so too. It will result in waste of GPU resource. Maybe they will make performance tuning after testing the basic implementation.
then it will result in less search depth given same simulations_per_move?
It seems that MCTSNode#N
is reverted( -= 1) when already expanded (in MCTSNode#incorporate_results()
),
so search depth is same as simulations_per_move(readouts
).
@mokemokechicken
the prediction worker doesn't wait for queue filled,
Oh yes, I forgot this point.
I could not do that in the first implementation because I could not image whole implementation...
Me too. I actually pretty like their simple but good idea when I first see it. But I think your implementation handles more cases. So I prefer yours ^^.
It seems that MCTSNode#N is reverted( -= 1) when already expanded (in MCTSNode#incorporate_results()), so search depth is same as simulations_per_move(readouts).
Yes it is true. I didn't read it carefully. Thanks. Yet another issue is then that search sim is wasted, result in a little bit low efficiency.
@gooooloo Thank you so much for your explanation. I now understand the issue thoroughly.
@gooooloo @mokemokechicken By the way, you clear the visit stats each time you make a move, but is it a bad idea, upon a move, to retain the visit stats of the subtree whose root is your next state? There are only 64-n possible actions, where n is the halfmove performed so far, and the actually meaningful actions learned by NN are much smaller than that. So, the explorations must pass through very small number of the immediate child node to the root, which makes retaining the visit stats of the relevant subtree save some time. Thought?
@AranKomat No I don't clear all visit state when making a move. I keep the whole visit information of selected child and all its children. I clear other children (and their children) because they won't be used in the game anymore. My codes is here
I think I have figured out Single Thread implementation as MiniGO does, but avoiding the "wasting simulation" issue as mentioned above. Codes are Here. Another benefit is I can easily do thinking time control now, which would be useful when playing with NTest.
@gooooloo That's good!
Like time control, I implemented the stop searching(and callback to report current stats) function around here.
It is useful when running as NBoard engine for good interaction(especially in hint
message).
@mokemokechicken @gooooloo I'm sorry for asking this question in this thread as my question has not much to do with the subject of the thread, but I'd like to clarify the points @gooooloo made. In Zeta36/Akababa's implementation, there are two hyperparameters: max_processes (=3) and search_threads (=16), whereas you guys have two or three with different value: prediction_queue_size (8 or 16), parallel_search_num (8) and multi_process_num (16 moke). I assume multi_process_num and max_processes refer to the same thing, but they are vastly different in quantity (3 vs 16). Is this probably because Akababa/Zeta36 has a GPU much weaker than you guys have (e.g. 1080ti of mokemokechiken)? Or did you guys make some improvement that allowed this enhancement? My code is based on Akababa's, so if the latter is the case, then I have to rewrite my self-play.py and play.py to incorporate your counterparts, which is a rather laborious task. So, this is an important thing for me to make sure. I appreciate your feedback. It is a pity that I cannot contribute to your project, as I'm trying to apply AZ to entirely different things like NLP.
@AranKomat
In @Akababa's repo:
Here is the code to trigger multiple threads for searching (GIL has limitation here) inside one game( of course one process then). It uses the search_threads
hyperparameter.
Here is the code to trigger multi processes for parallel self play executing. It use the max_processes
hyperparameter.
I assume multi_process_num and max_processes refer to the same thing, but they are vastly different in quantity (3 vs 16). Is this probably because Akababa/Zeta36 has a GPU much weaker than you guys have (e.g. 1080ti of mokemokechiken)?
Sorry I have no idea. Akababa's way uses ThreadPoolExecutor, mokemokechicken's way uses coroutine, but I don't think that will be the reason. The only possible issue I would figure out is maybe they have different CPU(not GPU) resources.
@AranKomat
In Zeta36/Akababa's implementation, there are two hyperparameters: max_processes (=3) and search_threads (=16), whereas you guys have two or three with different value: prediction_queue_size (8 or 16), parallel_search_num (8) and multi_process_num (16 moke).
My prediction_queue_size and parallel_search_num have almost same meaning (maybe you know).
I assume multi_process_num and max_processes refer to the same thing, but they are vastly different in quantity (3 vs 16). Is this probably because Akababa/Zeta36 has a GPU much weaker than you guys have (e.g. 1080ti of mokemokechiken)? Or did you guys make some improvement that allowed this enhancement? My code is based on Akababa's, so if the latter is the case, then I have to rewrite my self-play.py and play.py to incorporate your counterparts, which is a rather laborious task.
I think Akababa's implementation is efficient enough. If you use almost 100% of GPU or CPU resources, you don't have to rewrite your codes.
I think the main reason of difference between 3 and 16 is difference between thread and coroutine. ※ Of course there is possibility of difference of CPU resources, as @gooooloo pointed out.
Coroutine has one serial execution flow, the context switch is performed explicitly(by await
), it can be rewritten using loop (in this case).
My prediction worker is not await
ing when calling api#predict()
(here).
It means that whole process is blocked there.
It has to wait for finish prediction even if the prediction is executed on another process (imagine waiting HTTP response in batch programs), the CPU resource is sleeping when waiting.
Thread has multi execution flows and the context switch is performed automatically and unpredictably. Like my implementation, Akababa's prediction blocks the thread (here), but only one thread is blocked. Other threads don't have to wait for the other thread prediction result. So CPU resource is used efficiently in less processes.
Thread's context switch is a little cost, so many(ex. 100~) threads bring waste of CPU resource. But 16 threads per process are no problem, I think. Coroutine context switch cost is less than threads.
Why I like and use coroutine is that it is easy to program and debug, and it doesn't make race condition problems (usually).
@mokemokechicken Nice! Makes sense.
My prediction worker is not awaiting when calling api#predict()
This is the key point I missed. Thanks for pointing it out.
@gooooloo @mokemokechicken Thank you so much for your analysis and helpful feedback! It's great to know that the key is the CPU. Now I don't have to do benchmarking to compare the performance of his and your algorithm. Since I will use an ec2 instance, I don't think I will have a CPU bottleneck, so probably there's nothing to worry about.
I am closing the issue because it is open by me, and no actionable item here.
https://github.com/tensorflow/minigo
Maybe help here.