keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0
61.92k stars 19.45k forks

[Multi GPU] Is there any performance effect of using multi_gpu_model for inference mode? #9088

Closed jurukode closed 3 years ago

jurukode commented 6 years ago

Dear Keras Fellow,

I have a quick question about using multi_gpu_model. Just curious: is there any performance benefit to using multi_gpu_model in inference mode? And if so, which part can we leverage: data-batch parallelism, or the inference process itself? Thank you.

NB: My model is RNN based.

bzamecnik commented 6 years ago

According to common wisdom, recurrent models can benefit from model parallelism, whereas multi_gpu_model() offers data parallelism. For training, a significant speedup can be obtained when computation takes much longer than data transfer. In inference your computation graph doesn't run in a loop (I mean the gradient-descent loop), so you could just as well make N replicas of your model and run prediction via multiprocessing. multi_gpu_model() runs synchronously, so it might be less efficient on heterogeneous hardware (GPUs running at different speeds); multiprocessing would be asynchronous and could possibly be more efficient. That's just my guess, though. You can run an experiment and measure both variants.
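A minimal sketch of the multiprocessing variant (the function names and sharding scheme are my own, and a dummy predict_shard stands in for a real per-GPU model.predict so the structure runs anywhere):

```python
from multiprocessing import Pool

def predict_shard(args):
    gpu_id, batch = args
    # In a real setup you would pin this worker to one GPU *before*
    # importing the backend, e.g.
    #   os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    # then load a model replica and return model.predict(batch).
    # Dummy stand-in so the sketch runs anywhere: double every value.
    return [x * 2 for x in batch]

def parallel_predict(data, n_gpus):
    # Data parallelism: split the inputs into one shard per GPU and let
    # each worker process predict on its own shard asynchronously.
    shards = [data[i::n_gpus] for i in range(n_gpus)]
    with Pool(n_gpus) as pool:
        return pool.map(predict_shard, list(enumerate(shards)))

if __name__ == "__main__":
    print(parallel_predict([1, 2, 3, 4], 2))  # [[2, 6], [4, 8]]
```

Because each worker is an independent process, a slow GPU only delays its own shard rather than synchronizing the whole batch, which is the advantage over the lock-step multi_gpu_model() approach.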

Anyway, a great way to speed up both training and inference with RNNs is CuDNN. I have observed speedups of around 5-10x. The basic RNN implementations in Keras launch so many small kernels that the GPU is significantly underutilized. CuDNN performs various optimizations to run fewer, bigger matrix multiplications, which are more efficient.
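A hedged sketch of what the swap looks like in Keras 2.x (layer sizes are arbitrary; CuDNNLSTM requires a CUDA-capable GPU, supports only the default activations, and has no recurrent dropout):

```python
from keras.models import Sequential
from keras.layers import LSTM, CuDNNLSTM, Dense

def build_model(use_cudnn, timesteps=None, features=32):
    # Same layer interface either way; CuDNNLSTM dispatches to cuDNN's
    # fused RNN kernels instead of many small per-timestep ops.
    rnn = CuDNNLSTM if use_cudnn else LSTM
    model = Sequential([
        rnn(64, input_shape=(timesteps, features)),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```

The weights of the two variants are compatible, so one common pattern is to train with CuDNNLSTM on a GPU and load the weights into a plain LSTM for CPU inference.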

If performance is important, you can analyze what's happening at the CUDA level using the NVIDIA profiler (nvprof). Besides CuDNN, another opportunity for speedup is efficient transfer of data to the GPU (it should be asynchronous with respect to the computation).

For RNNs you could also possibly run each layer on a different GPU (model parallelism).
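With the TensorFlow backend this can be sketched with device scopes (a hypothetical fragment: the device strings assume two visible GPUs, and layer sizes are arbitrary):

```python
import tensorflow as tf
from keras.layers import Input, LSTM, Dense
from keras.models import Model

inputs = Input(shape=(None, 32))
# Model parallelism: each recurrent layer is pinned to its own GPU, so
# the layers can overlap across timesteps in a pipelined fashion.
with tf.device('/gpu:0'):
    x = LSTM(64, return_sequences=True)(inputs)
with tf.device('/gpu:1'):
    x = LSTM(64)(x)
outputs = Dense(1, activation='sigmoid')(x)
model = Model(inputs, outputs)
```

Note that the hidden state still has to cross devices between the two layers, so this only pays off when each layer's computation outweighs that transfer.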

In conclusion: first make sure you're utilizing a single GPU well, then try to distribute the load across multiple GPUs.