jcjohnson / neural-style

Torch implementation of neural style algorithm

Parallelize work across GPUs #410

Open bitbot256 opened 7 years ago

bitbot256 commented 7 years ago

How do I parallelize work across GPUs? I am using an 8-GPU p2.8xlarge AWS instance, and the nvidia-smi output below shows that only one or two GPUs are busy at any given point in time, even though every GPU has loaded its own layers. I have also noticed that the initial luajit GPU memory allocation (roughly 183MB per GPU) happens serially from GPU0 through GPU7, and that the initial torch load (the first time torch is started on a freshly rebooted machine) is serial as well. The latter takes about 15 minutes on an 8-GPU machine!

I am using the param "-gpu 0,1,2,3,4,5,6,7" with a -multigpu_strategy that splits my layers across all 8 GPUs. I am using cudnn (without autotune) and VGG-19 with lbfgs and lbfgs_num_correction=20, on CUDA 8 with cuDNN 6 and NVIDIA driver 375.51 under Ubuntu 16.04.
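For reference, the invocation looks roughly like this (the image filenames and the split indices in -multigpu_strategy are illustrative; the flags themselves are the ones neural-style exposes):

```
th neural_style.lua \
  -content_image content.jpg -style_image style.jpg -image_size 800 \
  -gpu 0,1,2,3,4,5,6,7 -multigpu_strategy 3,6,12,14,19,23,27 \
  -backend cudnn -optimizer lbfgs -lbfgs_num_correction 20
```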

I also tested the starry_night multi-res example script and saw the same parallelization problem.

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51                 Driver Version: 375.51                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:17.0     Off |                    0 |
| N/A   51C    P0    60W / 149W |   6587MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:00:18.0     Off |                    0 |
| N/A   46C    P0    80W / 149W |   4372MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:00:19.0     Off |                    0 |
| N/A   53C    P0    61W / 149W |   5483MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:00:1A.0     Off |                    0 |
| N/A   51C    P0    79W / 149W |   5398MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 0000:00:1B.0     Off |                    0 |
| N/A   55C    P0    66W / 149W |   5131MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 0000:00:1C.0     Off |                    0 |
| N/A   47C    P0    75W / 149W |   5278MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           Off  | 0000:00:1D.0     Off |                    0 |
| N/A   58C    P0   126W / 149W |   5172MiB / 11439MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   53C    P0    74W / 149W |   4819MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
```

```
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     36081    C   /home/XXXXXX/torch/install/bin/luajit        6581MiB  |
|    1     36081    C   /home/XXXXXX/torch/install/bin/luajit        4368MiB  |
|    2     36081    C   /home/XXXXXX/torch/install/bin/luajit        5479MiB  |
|    3     36081    C   /home/XXXXXX/torch/install/bin/luajit        5394MiB  |
|    4     36081    C   /home/XXXXXX/torch/install/bin/luajit        5127MiB  |
|    5     36081    C   /home/XXXXXX/torch/install/bin/luajit        5274MiB  |
|    6     36081    C   /home/XXXXXX/torch/install/bin/luajit        5168MiB  |
|    7     36081    C   /home/XXXXXX/torch/install/bin/luajit        4815MiB  |
+-----------------------------------------------------------------------------+
```

htoyryla commented 7 years ago

As far as I understand, neural-style does not really attempt to parallelize across GPUs. The multi-GPU support exists to overcome the memory limit of a single GPU: the layers of the model are split into chunks that are allocated to different GPUs, but because the model is sequential, the GPUs must operate in series, each processing the output of the previous one.

See the code https://github.com/jcjohnson/neural-style/blob/master/neural_style.lua#L395-L402

See also the documentation of nn.GPU https://github.com/torch/nn/blob/master/doc/simple.md#nn.GPU which states that "The intended use-case is not for model-parallelism where the models are executed in parallel on multiple devices, but for sequential models where a single GPU doesn't have enough memory"
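A minimal sketch of that chunking pattern, assuming a sequential model and the stock nn.GPU decorator (the function name and split logic here are illustrative, not the exact neural_style.lua code):

```lua
require 'nn'
require 'cunn'  -- nn.GPU needs cutorch available when wrapping CUDA modules

-- Sketch: wrap consecutive chunks of a sequential model in nn.GPU.
-- split_points[i] is the index of the last layer in chunk i; devices must
-- hold #split_points + 1 GPU ids (1-indexed, as Torch counts devices).
local function split_across_gpus(net, split_points, devices)
  local wrapped = nn.Sequential()
  local chunk = nn.Sequential()
  local d = 1
  for i = 1, #net.modules do
    chunk:add(net:get(i))
    if split_points[d] == i then
      -- nn.GPU runs this chunk's forward/backward on devices[d], but the
      -- chunks still execute one after another: chunk d+1 consumes the
      -- output of chunk d.
      wrapped:add(nn.GPU(chunk, devices[d]))
      chunk = nn.Sequential()
      d = d + 1
    end
  end
  wrapped:add(nn.GPU(chunk, devices[d]))  -- remaining layers on the last GPU
  return wrapped
end
```

Because each chunk waits for the previous chunk's output, nvidia-smi will show only one device busy at a time, which matches the utilization numbers above.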

So it does no good to spread the work over too many GPUs; that only creates extra overhead.

michaelhuang74 commented 7 years ago

For both model parallelism and data parallelism, the computation of a given layer for a given image takes place on a single GPU.

I want to reduce the latency of applying neural style to a single content image. On a single Titan Xp GPU with image_size=800, 100 iterations of neural style take about 20 seconds, i.e., one iteration takes 0.2 seconds.

Is it possible to parallelize the computation of one layer of a convolutional neural network across multiple GPUs in torch7? For example, use 2 GPUs to execute neural style for one input image: the input image would be split into two halves, each assigned to one GPU. After each layer of the CNN, the two GPUs would need to synchronize to exchange the tensor elements along the boundary.

This kind of parallelism will definitely introduce a lot of communication among the participating GPUs. I am curious about (1) whether torch7 supports this kind of parallelism, and (2) how much speedup it can achieve, i.e., how much latency we can remove when applying neural style to one input content image.
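To make the idea concrete, here is a hypothetical sketch of the spatial split with a boundary (halo) exchange, written against plain cutorch/cunn primitives since torch7 has no built-in layer-level spatial parallelism; all names and the trimming arithmetic are illustrative, and it assumes a stride-1, same-padded convolution so output height equals input height:

```lua
require 'cunn'

-- Hypothetical: run one conv layer on two GPUs by splitting the input along
-- height. conv1 and conv2 are identical copies of the layer, already placed
-- on devices 1 and 2. halo should be at least floor(kernel_height / 2) so
-- each half sees correct context at the seam. input is a CPU N x C x H x W
-- tensor.
local function forward_split(conv1, conv2, input, halo)
  local H = input:size(3)
  local mid = math.floor(H / 2)

  -- top half plus `halo` extra rows of context, on GPU 1
  cutorch.setDevice(1)
  local top = input:narrow(3, 1, mid + halo):cuda()
  local out1 = conv1:forward(top)

  -- bottom half plus `halo` extra rows, on GPU 2; cutorch kernel launches
  -- are asynchronous, so this can overlap with the work on GPU 1
  cutorch.setDevice(2)
  local bottom = input:narrow(3, mid - halo + 1, H - mid + halo):cuda()
  local out2 = conv2:forward(bottom)

  -- wait for both devices, trim the halo rows, and stitch the halves back
  -- together on GPU 1; this cross-device copy is exactly the communication
  -- cost asked about above
  cutorch.synchronizeAll()
  cutorch.setDevice(1)
  local bottom_trim = out2:narrow(3, halo + 1, out2:size(3) - halo)
  local bottom_on1 = torch.CudaTensor():resizeAs(bottom_trim):copy(bottom_trim)
  return torch.cat(out1:narrow(3, 1, out1:size(3) - halo), bottom_on1, 3)
end
```

Whether this pays off depends on how the per-layer compute time compares to the synchronize-and-copy at every layer boundary; for small images the communication can easily eat the speedup.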