marcinbogdanski opened this issue 4 years ago
re: the second part, happy to have these documented on github and available for posterity :) Feel free to keep opening issues!
Your understanding is spot on.
`ModelBatcher` and `BufferedModel` are deprecated and will be deleted once I find the time to rewrite `cc/eval.cc`.
With regards to threading, inference is always performed synchronously from the point of view of the selfplay thread. Because we run Minigo selfplay at scale by playing multiple games concurrently, we don't need asynchronous inference to achieve good GPU utilization. Even on very small models, the engine runs at >95% utilization on a v100, and close to 100% for a full sized model.
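To make the shape of that loop concrete, here is a minimal sketch (my own illustration, not the actual Minigo code; `Game`, `SelectLeaves`, `RunInference`, and `ApplyResults` are hypothetical stand-ins): one selfplay thread gathers leaves from many concurrent games, issues a single blocking inference call, then distributes the results back to each game's tree.

```cpp
// Illustrative sketch only: a single selfplay thread drives many games,
// batching their leaf evaluations into one synchronous inference call.
// The GPU only idles during the (comparatively short) CPU phases before
// and after RunInference().
#include <cstddef>
#include <vector>

struct Leaf {};    // features for one leaf to evaluate
struct Output {};  // policy + value for one leaf

struct Game {
  std::vector<Leaf> SelectLeaves(int virtual_losses) {
    return std::vector<Leaf>(virtual_losses);  // placeholder tree search
  }
  void ApplyResults(const Output*, size_t) {}  // placeholder backup step
};

// Hypothetical blocking model call standing in for Session::Run.
std::vector<Output> RunInference(const std::vector<Leaf>& batch) {
  return std::vector<Output>(batch.size());
}

void SelfplayIteration(std::vector<Game>& games, int virtual_losses) {
  std::vector<Leaf> batch;
  for (auto& game : games) {  // CPU: gather leaves from every in-flight game
    auto leaves = game.SelectLeaves(virtual_losses);
    batch.insert(batch.end(), leaves.begin(), leaves.end());
  }

  auto outputs = RunInference(batch);  // GPU: one large blocking call

  size_t offset = 0;
  for (auto& game : games) {  // CPU: back the results up into each tree
    game.ApplyResults(outputs.data() + offset, virtual_losses);
    offset += virtual_losses;
  }
}
```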
It has never been the goal of the project to write the fastest tournament engine (I doubt we'll ever enter one), so as you guessed this means the engine is somewhat slower when playing a single game than something like Leela.
What is the ratio of the number of `concurrent_selfplay` executables to the number of GPUs in the system? Is it 1:1 during normal execution?
Just to confirm my understanding (sorry for the silly questions): assuming the ratio is 1:1, and considering synchronous CPU/GPU execution, after GPU compute finishes there is a small gap in GPU utilisation while the CPU does its thing to advance games and prepare the next batch. But because in Go the neural network is fairly large (GPU compute takes long) and the game logic is comparatively quick, the GPU utilisation gap is small and thus a non-issue?
Thanks!
Gentlemen, could you confirm my further analysis of `concurrent_selfplay` is correct?
`Selfplayer`: always a single instance, which:

- creates a `ShardedExecutor` with a thread pool, which will be used later
- in `Selfplayer::Run()` (run from the thread of the executable):
  - creates the `Model`s in `InitializeModels`, but does not spin up new threads (even for `BatchingModel`/`ModelBatcher`/`BufferedModel`)
  - spins up the `SelfplayThread`s
  - spins up the `OutputThread`

Then, in each `SelfplayThread::Run`:

- uses the `ShardedExecutor` created earlier

As for the parameters in `ml_perf/flags/19/selfplay.flags`:
- `selfplay_threads=3` - number of game threads; a thread is blocked on GPU inference
- `parallel_search=4` - number of threads in the `ShardedExecutor` used to execute tree searches in parallel
- `parallel_inference=2` - number of models, i.e. the maximum number of GPU inferences to run simultaneously; too high would cause OOM on the GPU?
- `concurrent_games_per_thread=32` - number of separate games per thread
- `virtual_losses=4` - number of tree leaves to evaluate per iteration

Ahh, I think I see it now. As long as `selfplay_threads` > `parallel_inference` (and there is a beefy enough CPU) there should be no "GPU gaps", because as soon as one inference finishes, another selfplay thread can "hop in" immediately as long as it has a batch ready. In the case above there is one extra selfplay thread (3 > 2) which will run on the CPU even if both GPU models are occupied.

Presumably `parallel_search=4` is driven by the number of CPU cores on the system, where 4-per-GPU seems about usual.

Does the above seem right?
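For reference, the values being discussed would look something like this in a gflags-style flag file (an illustrative sketch only; the actual contents and format of `ml_perf/flags/19/selfplay.flags` may differ and include other flags):

```
--selfplay_threads=3
--parallel_search=4
--parallel_inference=2
--concurrent_games_per_thread=32
--virtual_losses=4
```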
Your analysis of the threading parameters is correct.
Their values were all chosen to get >95% utilization on the VM that I'm using to test the MLperf benchmark: it has 48 physical cores (96 hyperthreads) running at 2GHz and 8 v100 GPUs. Optimal values will be different depending on the relative performance of your CPUs and GPUs.
`parallel_inference=2` is used to double-buffer the inference requests so that one thread can prepare the GPU commands while the other is actually executing them. On some setups, `parallel_inference=3` is also a viable choice because inference happens in three stages: transfer the feature tensor to the GPU, evaluate the model, transfer the output tensor back to the CPU.
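One plausible way to picture this (a sketch under my own assumptions, not the actual Minigo mechanism; `ModelPool` and `Model::RunInference` are hypothetical) is a small pool of model instances that selfplay threads borrow for each blocking inference call. With a pool of two, one thread's command preparation and host/device transfers can overlap another thread's GPU execution:

```cpp
// Sketch of a small pool of model instances (illustrative only). With pool
// size 2, at most two inferences are in flight at once, and their stages
// (transfer in, compute, transfer out) can overlap: the double-buffering
// effect described above.
#include <condition_variable>
#include <memory>
#include <mutex>
#include <vector>

struct Model {
  // Hypothetical stand-in for a per-instance tensorflow::Session::Run call.
  void RunInference(/* features in, policy/value out */) {}
};

class ModelPool {
 public:
  explicit ModelPool(int parallel_inference) {
    for (int i = 0; i < parallel_inference; ++i) {
      free_.push_back(std::make_unique<Model>());
    }
  }

  // Blocks until one of the model instances is free.
  std::unique_ptr<Model> Acquire() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return !free_.empty(); });
    auto model = std::move(free_.back());
    free_.pop_back();
    return model;
  }

  void Release(std::unique_ptr<Model> model) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      free_.push_back(std::move(model));
    }
    cv_.notify_one();
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::vector<std::unique_ptr<Model>> free_;
};

// Usage from a selfplay thread: at most `parallel_inference` inferences are
// ever in flight, and the remaining selfplay threads keep doing CPU work.
void EvaluateBatch(ModelPool& pool) {
  auto model = pool.Acquire();
  model->RunInference();
  pool.Release(std::move(model));
}
```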
`parallel_search=4` is set so that the tree search doesn't take too long relative to inference: the MLperf model is about 50x smaller than the full size Minigo model. In a full run, we don't need to be so careful tuning these parameters.
If you're interested in how these parameters affect performance, I recommend reading the profiling section of the MLperf docs, which describes how to get CPU traces of the selfplay code. Enable tracing as described in the doc, run selfplay for about 30 seconds, kill the process, and you should have a WTF trace that you can view.
Hi
By "double-buffer the inference", do you mean simply running multiple independent models on same GPU (with obvious memory penalty)? Or is there something more going on? Assuming it's just independent models, is there any explicit mechanism to make sure models execute non-overlapping stages of GPU pipeline (transfer, evaluate, transfer back)? I kind of expect simply running 2-3 models per GPU would sort itself out on it's own in this use case, but just want to confirm.
Also, sorry for repeating, but could you explicitly confirm it's 1x `concurrent_selfplay` per GPU? While it seems pretty obvious by now (especially since I found a hard-coded `gpu:0` somewhere), I'm still new to the code base and I don't want this important detail to be lost in translation.
I will definitely look into profiling after I manage to set up a sacrificial CUDA dev box for compilation purposes.
Thanks again for all the help, this was super useful! I think this wraps up my questions for now!
Yes, it's 1x `concurrent_selfplay` per GPU.
As for the double-buffering, well now we're getting to the interesting part. We currently create a new `tensorflow::Session` for every instance of the model. It's not sufficient to simply run multiple threads, each performing inference: what tends to happen is that the TensorFlow framework ends up executing the `Session::Run` calls in lock-step, so if you're running N threads, every `Session::Run` call starts at the same time and they all take Nx longer to complete than a single call.
This is where the `parallel_search` flag comes into play. The fact that all threads share a global thread pool forces their calls to `Session::Run` to be staggered, which causes the TensorFlow framework to pipeline their execution correctly.
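For readers unfamiliar with the class, here is a rough sketch of what a `ShardedExecutor`-style helper can look like (my own illustration, not Minigo's actual implementation; `ShardedExecutorSketch`, `Execute`, and `Enqueue` are hypothetical names): each caller splits its tree search into shards that run on a pool shared by all selfplay threads, plus one shard on the calling thread itself, so callers contend for the pool and end up staggered rather than running in lock-step.

```cpp
// Illustrative sharded executor: Execute(fn, n) runs fn(shard) for each
// shard, using a thread pool that is shared by every caller.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ShardedExecutorSketch {
 public:
  explicit ShardedExecutorSketch(int num_threads) {
    for (int i = 0; i < num_threads; ++i) {
      workers_.emplace_back([this] { WorkerLoop(); });
    }
  }

  ~ShardedExecutorSketch() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      shutdown_ = true;
    }
    cv_.notify_all();
    for (auto& t : workers_) t.join();
  }

  // Runs fn(shard) for shard in [0, num_shards), blocking until all shards
  // have finished. Assumes num_shards >= 1.
  void Execute(const std::function<void(int)>& fn, int num_shards) {
    std::mutex done_mu;
    std::condition_variable done_cv;
    int remaining = num_shards;

    // Queue all but one shard onto the shared pool.
    for (int shard = 1; shard < num_shards; ++shard) {
      Enqueue([&, shard] {
        fn(shard);
        std::lock_guard<std::mutex> lock(done_mu);
        if (--remaining == 0) done_cv.notify_one();
      });
    }

    // Run shard 0 on the calling thread itself (better cache locality).
    fn(0);

    std::unique_lock<std::mutex> lock(done_mu);
    --remaining;
    done_cv.wait(lock, [&] { return remaining == 0; });
  }

 private:
  void Enqueue(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

  void WorkerLoop() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return shutdown_ || !tasks_.empty(); });
        if (shutdown_ && tasks_.empty()) return;
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  std::vector<std::thread> workers_;
  bool shutdown_ = false;
};
```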
Here's an old trace I found that illustrates this (it's from an experiment running on TPU with slightly different flags, `--selfplay_threads=4 --parallel_search=4 --parallel_inference=3`, but you should get the idea):
Note that the `SelectLeaf` calls are scheduled more efficiently in the current `master` branch, so if you generate a trace yourself it will look a bit different.
Now, it's possible that it would be more efficient to have all model instances loaded from the same file share the same TensorFlow session but that's just one more entry on my list of things I haven't had time to try out :)
Aha! I had a sneaking suspicion that having the `ShardedExecutor` shared between selfplay threads was not an accident; now we know why!
Off topic: what do you think about a slightly alternative approach, having multiple game threads push eval requests onto an async queue and then having neural-net thread(s) pick them up to form batches and execute on the GPU? Basically implementing a multiple-producer/multiple-consumer pattern. It seems to me Minigo's approach of having multiple concurrent games per thread is a strong benefit for keeping the total number of threads low. What's your opinion on other possible pros/cons of the two approaches?
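For concreteness, this is a minimal sketch of the producer/consumer pattern being proposed (illustrative only; `EvalRequest`, `EvalQueue`, and the batching policy are assumptions, not anything from the Minigo code base):

```cpp
// Multiple-producer/multiple-consumer sketch: game threads push evaluation
// requests onto a shared queue; dedicated inference threads pop them, form
// batches, and run them on the GPU.
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <vector>

struct EvalRequest {
  // features in, plus some way to hand the result back (promise, callback...)
};

class EvalQueue {
 public:
  void Push(EvalRequest req) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push(std::move(req));
    }
    cv_.notify_one();
  }

  // Pops up to max_batch requests, blocking until at least one is available.
  std::vector<EvalRequest> PopBatch(size_t max_batch) {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return !queue_.empty(); });
    std::vector<EvalRequest> batch;
    while (!queue_.empty() && batch.size() < max_batch) {
      batch.push_back(std::move(queue_.front()));
      queue_.pop();
    }
    return batch;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<EvalRequest> queue_;
};

// Game threads call queue.Push(...) and block on their result; each inference
// thread loops on queue.PopBatch(batch_size) and runs the model on the batch.
```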
The threading model you describe is actually how Minigo selfplay used to be set up: we ran one selfplay thread for each game, and their inference requests were batched up and executed on separate inference threads. This was absolutely fine for the full sized Minigo run: we'd have maybe 8 games playing in parallel on a VM with 48 physical cores.
However, the model used for MLPerf is much smaller and we had to run significantly more selfplay threads than there were CPU cores to generate enough work for the GPU. This resulted in a large context switching overhead and reduced maximum GPU utilization.
The large number of threads and the context switching overhead also made profiling the CPU code difficult. Once we switched to the current threading model, the simpler CPU traces showed there were some surprising hotspots in the code (e.g. calling `argmax` to select which node to visit during search). Here are some functions we found directly as a result of the simpler architecture and optimized for up to 5x performance improvements:
https://github.com/tensorflow/minigo/blob/e44f4123f6611b4e6c6f01e66252625072a68198/cc/algorithm.cc#L28
https://github.com/tensorflow/minigo/blob/df51963aab4c54e73896045480aff79b8f9ded11/cc/mcts_tree.cc#L258
https://github.com/tensorflow/minigo/blob/64d5410724323dd371c46b0305a3aea7ef61fea6/cc/model/features.h#L114
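If one of those hotspots is the `argmax` mentioned above, the usual fix is to vectorize it. As a rough sketch of that general technique (my own illustration using SSE2 intrinsics, not the actual code behind the `cc/algorithm.cc` link): keep per-lane maxima and their indices, then reduce the lanes at the end.

```cpp
// Illustrative vectorized argmax over floats. Returns the index of a
// maximal element (the exact index chosen on ties is unspecified).
#include <emmintrin.h>  // SSE2
#include <cstddef>
#include <limits>

int ArgMaxSse(const float* values, size_t n) {
  __m128 best_vals = _mm_set1_ps(-std::numeric_limits<float>::infinity());
  __m128i best_idx = _mm_setzero_si128();
  __m128i idx = _mm_setr_epi32(0, 1, 2, 3);
  const __m128i step = _mm_set1_epi32(4);

  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    __m128 v = _mm_loadu_ps(values + i);
    __m128i gt = _mm_castps_si128(_mm_cmpgt_ps(v, best_vals));
    best_vals = _mm_max_ps(best_vals, v);
    // Where the new values are larger, take the new indices.
    best_idx = _mm_or_si128(_mm_and_si128(gt, idx),
                            _mm_andnot_si128(gt, best_idx));
    idx = _mm_add_epi32(idx, step);
  }

  // Reduce the four lanes, then handle any scalar tail.
  alignas(16) float lane_vals[4];
  alignas(16) int lane_idx[4];
  _mm_store_ps(lane_vals, best_vals);
  _mm_store_si128(reinterpret_cast<__m128i*>(lane_idx), best_idx);

  float best = lane_vals[0];
  int best_i = lane_idx[0];
  for (int k = 1; k < 4; ++k) {
    if (lane_vals[k] > best) { best = lane_vals[k]; best_i = lane_idx[k]; }
  }
  for (; i < n; ++i) {
    if (values[i] > best) { best = values[i]; best_i = static_cast<int>(i); }
  }
  return best_i;
}
```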
We also found that it was measurably faster to have the tree search thread call `Session::Run` directly, rather than have inference and tree search run on separate threads. This was most likely because the cache of the CPU core running the inference thread wouldn't have the tree search data in it. I only ever profiled this on TPU, so I don't know if this finding also applies to GPU TensorFlow.
> We also found that it was measurably faster to have the tree search thread call `Session::Run` directly, rather than have inference and tree search run on separate threads.
But if I understand correctly, in your current model the tree search thread offloads leaf selection to the thread pool in the `ShardedExecutor` anyway, so in a sense they do run on separate threads. Yeah, it's interesting what is actually going on and how it would work on GPUs.
Yep, optimizing a multithreaded system is hard :)
The original implementation of the `ShardedExecutor` was careful to always schedule the same games to the same thread for exactly this reason. However, that led to an imbalance of work across the threads, because there's a large variation in the number of nodes tree search visits during a game. You can see this in the trace above, where different `SelectLeaf` blocks take different amounts of time. Also note that the `SelectLeaf` block that runs on the selfplay thread is normally the fastest because it gets better cache utilization.
It turned out to be a net win for the `SelectLeaf` threads to share an atomic counter into the game array and pop the next available game to run tree search on. Since tree search for a game runs on an arbitrary thread each time, the individual `SelectLeaf` calls are slower, but the work is better distributed across the threads and so ends up taking less time.
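A minimal sketch of that scheme (my own illustration, not the actual Minigo code; `Game::SelectLeaves` and `SelectLeafWorker` are hypothetical stand-ins):

```cpp
// Shared-atomic-counter work distribution: each worker repeatedly claims the
// next unclaimed game with fetch_add, so games are spread across whichever
// threads are free rather than being pinned to a fixed thread.
#include <atomic>
#include <cstddef>
#include <vector>

struct Game {
  void SelectLeaves() { /* run tree search for this game's next batch */ }
};

// Run on each thread in the shared pool (and on the selfplay thread itself).
// Individual calls are slower due to worse cache locality, but the load is
// balanced better, so the whole pass tends to finish sooner.
void SelectLeafWorker(std::vector<Game>& games, std::atomic<size_t>& next) {
  for (;;) {
    size_t i = next.fetch_add(1);  // claim the next game
    if (i >= games.size()) break;  // every game has been claimed
    games[i].SelectLeaves();
  }
}
```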
Quick note: if you found the hard-coded `gpu:0` where I think you did, it's because we isolate the selfplay jobs on GPUs using the `CUDA_VISIBLE_DEVICES` environment variable.
@tommadams I think what you say makes sense, but I need to think more about the implications.
@amj Yeah, that's exactly what I thought. The `gpu:0` was a major clue :)
I'd caution against reading too much into what I wrote: these are specific optimizations I made for our architecture and hardware setup. Bottlenecks will vary based on model size, CPU & GPU compute speed, board size, code architecture, etc.
The most important takeaway should be: make sure it's easy to profile your code :)
Hi
I'm reading further through the source code and wanted to check that my understanding of the concurrency/parallelism implementation is correct.
Inference options:

- `FakeDualNet` - returns a fixed policy/value
- `LiteDualNet` - TFLite integer inference; why? faster CPU inference? playing against Minigo on mobile?
- `RandomDualNet` - returns a random policy/value
- `TFDualNet` - standard TensorFlow CPU/GPU inference
- `TPUDualNet` - TPU inference
- `WaitingModel` - for testing
- `ModelBatcher` - requests asynchronous mini-batch inference from a `BufferedModel`
- `BufferedModel` - runs in its own thread, does inference, possibly combining mini-batches from multiple `ModelBatcher`s
Concurrency/parallelism sources:

- `Selfplayer` manages multiple threads
- each thread plays multiple `SelfplayGame`s concurrently
- inference batches of `virtual_losses` * `concurrent_games_per_thread` positions
- asynchronous inference (`ModelBatcher`/`BufferedModel`)

Reasons for the above architecture are as follows:
Is the above correct? In particular, are there any other reasons for this setup that I'm missing?
Also, I don't seem to see any way to execute a single tree search in parallel for a tournament game (e.g. against a human champion)?
Also 2: I will possibly have more questions; what would be the preferred communication channel for things that aren't really issues? Should I keep creating GitHub issues?
Thanks again for your time