keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0
61.92k stars 19.45k forks

[Multi GPU] Is there any performance effect of using multi_gpu_model for inference mode? #9088

Closed jurukode closed 3 years ago

jurukode commented 6 years ago

Dear Keras Fellow,

I have a quick question about using multi_gpu_model. Just curious: is there any performance benefit to using multi_gpu_model in inference mode? And if so, which part can we leverage: data-batch parallelism, or the inference process itself? Thank you.

NB: My model is RNN based.

bzamecnik commented 6 years ago

According to common wisdom, recurrent models can benefit from model parallelism, whereas multi_gpu_model() offers data parallelism. For training, a significant speedup can be obtained when computation takes much longer than data transfer. In inference your computation graph doesn't run in a loop (I mean the gradient-descent loop), so you could just as well make N replicas of your model and run prediction via multiprocessing. multi_gpu_model() runs synchronously, so it might be less efficient on heterogeneous hardware (GPUs running at different speeds); multiprocessing would be asynchronous and could possibly be more efficient. That's just my guess, though. You can run an experiment and measure both variants.
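A minimal sketch of the multiprocessing variant (the function names and sharding scheme are my own, and a dummy predict_shard stands in for a real per-GPU model.predict so the structure runs anywhere):

```python
from multiprocessing import Pool

def predict_shard(args):
    gpu_id, batch = args
    # In a real setup you would pin this worker to one GPU *before*
    # importing the backend, e.g.
    #   os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    # then load a model replica and return model.predict(batch).
    # Dummy stand-in so the sketch runs anywhere: double every value.
    return [x * 2 for x in batch]

def parallel_predict(data, n_gpus):
    # Data parallelism: split the inputs into one shard per GPU and let
    # each worker process predict on its own shard asynchronously.
    shards = [data[i::n_gpus] for i in range(n_gpus)]
    with Pool(n_gpus) as pool:
        return pool.map(predict_shard, list(enumerate(shards)))

if __name__ == "__main__":
    print(parallel_predict([1, 2, 3, 4], 2))  # [[2, 6], [4, 8]]
```

Because each worker is an independent process, a slow GPU only delays its own shard rather than synchronizing the whole batch, which is the advantage over the lock-step multi_gpu_model() approach.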

Anyway, a great way to speed up both training and inference with RNNs is CuDNN. I have observed speedups of around 5-10x. The basic RNN implementations in Keras launch so many small kernels that the GPU is significantly underutilized. CuDNN performs various optimizations to run fewer, bigger matrix multiplications, which are more efficient.
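A hedged sketch of what the swap looks like in Keras 2.x (layer sizes are arbitrary; CuDNNLSTM requires a CUDA-capable GPU, supports only the default activations, and has no recurrent dropout):

```python
from keras.models import Sequential
from keras.layers import LSTM, CuDNNLSTM, Dense

def build_model(use_cudnn, timesteps=None, features=32):
    # Same layer interface either way; CuDNNLSTM dispatches to cuDNN's
    # fused RNN kernels instead of many small per-timestep ops.
    rnn = CuDNNLSTM if use_cudnn else LSTM
    model = Sequential([
        rnn(64, input_shape=(timesteps, features)),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```

The weights of the two variants are compatible, so one common pattern is to train with CuDNNLSTM on a GPU and load the weights into a plain LSTM for CPU inference.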

If performance is important, you can analyze what's happening at the CUDA level using the NVIDIA profiler (nvprof). Besides CuDNN, another opportunity for speedup is efficient transfer of data to the GPU (it should be asynchronous with respect to the computation).

For RNNs you could also possibly run each layer on a different GPU (model parallelism).
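With the TensorFlow backend this can be sketched with device scopes (a hypothetical fragment: the device strings assume two visible GPUs, and layer sizes are arbitrary):

```python
import tensorflow as tf
from keras.layers import Input, LSTM, Dense
from keras.models import Model

inputs = Input(shape=(None, 32))
# Model parallelism: each recurrent layer is pinned to its own GPU, so
# the layers can overlap across timesteps in a pipelined fashion.
with tf.device('/gpu:0'):
    x = LSTM(64, return_sequences=True)(inputs)
with tf.device('/gpu:1'):
    x = LSTM(64)(x)
outputs = Dense(1, activation='sigmoid')(x)
model = Model(inputs, outputs)
```

Note that the hidden state still has to cross devices between the two layers, so this only pays off when each layer's computation outweighs that transfer.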

In conclusion: first make sure you're utilizing a single GPU well, then try to distribute the load across multiple GPUs.