steflee / mpi-caffe

mpi-caffe
http://homes.soic.indiana.edu/steflee/mpi-caffe.html

Model split and GPU memory #2

Open tschaffter opened 7 years ago

tschaffter commented 7 years ago

The mpi-caffe CIFAR10 example doesn't seem to split the AlexNet model between multiple GPUs (I didn't look in detail at examples/cifar10-mpi/cifar10_mpi_train_test.prototxt). Below is the output of Caffe's training on the CIFAR10 example, followed by the same training run with mpi-caffe. Looking at the memory used by GPU 0, it seems that the entire model (~220 MB) is hosted on GPU 0 when using mpi-caffe. Can you provide a modified version of examples/cifar10-mpi/cifar10_mpi_train_test.prototxt where the model is effectively split between three GPUs?

+------------------------------------------------------+                       
| NVIDIA-SMI 352.99     Driver Version: 352.99         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:83:00.0     Off |                    0 |
| N/A   52C    P0   146W / 149W |    269MiB / 11519MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:84:00.0     Off |                    0 |
| N/A   35C    P8    32W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:87:00.0     Off |                    0 |
| N/A   38C    P8    26W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:88:00.0     Off |                    0 |
| N/A   35C    P8    29W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     16473    C   ./build/tools/caffe                            212MiB |
+-----------------------------------------------------------------------------+

and here is the output for the mpi-caffe CIFAR10 example:

+------------------------------------------------------+                       
| NVIDIA-SMI 352.99     Driver Version: 352.99         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:83:00.0     Off |                    0 |
| N/A   49C    P0   128W / 149W |    258MiB / 11519MiB |     78%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:84:00.0     Off |                    0 |
| N/A   40C    P0    85W / 149W |    175MiB / 11519MiB |     32%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:87:00.0     Off |                  Off |
| N/A   45C    P0    71W / 149W |    176MiB / 12287MiB |     34%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:88:00.0     Off |                  Off |
| N/A   38C    P0    73W / 149W |     56MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
steflee commented 7 years ago

Please see http://homes.soic.indiana.edu/steflee/mpi-caffe.html for a full description of the cifar10-mpi example. In short, this example replicates the model on each GPU and combines the outputs.

To split a single-path model across multiple GPUs, you would use MPIBroadcast layers with communication groups containing only the source GPU (i.e. the one with the preceding layers assigned to it) and the next GPU (i.e. the one receiving the output). The MPIBroadcast output on the source GPU will need to be fed into a Silence layer.
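Roughly, the wiring would look something like the sketch below. The field names used for process assignment and for the communication group (mpi_rank and the repeated rank list on the MPIBroadcast layer) are placeholders from memory, not guaranteed syntax; please check mpi-caffe's caffe.proto and the cifar10-mpi prototxt for the actual parameter names.

# Last layer of the first model segment, assigned to process/GPU 0.
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  mpi_rank: 0                  # placeholder: pin this layer to rank 0
  convolution_param { num_output: 32 kernel_size: 5 stride: 1 }
}

# Hand conv1 off from rank 0 to rank 1 only (communication group {0, 1}).
layer {
  name: "bcast_conv1"
  type: "MPIBroadcast"
  bottom: "conv1"
  top: "conv1_r1"
  mpi_rank: 0                  # placeholder: ranks in the communication group
  mpi_rank: 1
}

# On the source GPU (rank 0) the broadcast output is unused, so silence it.
layer {
  name: "silence_conv1"
  type: "Silence"
  bottom: "conv1_r1"
  mpi_rank: 0
}

# Remaining layers consume conv1_r1 on rank 1 (GPU 1); repeat the same
# broadcast/silence pattern to hand off to GPU 2.
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1_r1"
  top: "pool1"
  mpi_rank: 1
  pooling_param { pool: MAX kernel_size: 3 stride: 2 }
}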

liuyuyuil commented 7 years ago

Hi @steflee, after MPIBroadcast, do the different processes compute in parallel? Thanks.