hclhkbu / dlbench

Benchmarking State-of-the-Art Deep Learning Software Tools
http://dlbench.comp.hkbu.edu.hk/
MIT License

Peer-to-peer for TensorFlow resnet setup? #1

Closed · yaroslavvb closed this 7 years ago

yaroslavvb commented 7 years ago

Do your resnet multi-GPU runs utilize p2p communication? I.e., on the first session.run it should print a matrix like this for all p2p routes:

I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 5 with prop
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y Y Y Y Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1:   Y Y Y Y Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2:   Y Y Y Y Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3:   Y Y Y Y Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 4:   Y Y Y Y Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 5:   Y Y Y Y Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 6:   Y Y Y Y Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 7:   Y Y Y Y Y Y Y Y
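
For reference, a minimal TF 1.x sketch (not part of dlbench; the two-GPU range is an assumption, adjust it to your machine) that touches multiple GPUs so this matrix gets logged during device initialization:

import tensorflow as tf

# Build a trivial op on each GPU so TensorFlow initializes all visible devices;
# the p2p/DMA access matrix is printed to stderr during that initialization.
ops = []
for i in range(2):  # assumed two GPUs
    with tf.device('/gpu:%d' % i):
        ops.append(tf.reduce_sum(tf.random_normal([1024, 1024])))

# log_device_placement additionally shows which device each op was assigned to.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(ops)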
shyhuai commented 7 years ago

In our experiments, there are two Tesla K80 GPUs, each of which has two GK210 GPUs, and there is no p2p communication between the two K80 cards. So the printed matrix is as follows:

I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0 1 2 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0: Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 1: Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 2: N N Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 3: N N Y Y

yaroslavvb commented 7 years ago

Thanks for the quick reply!

It seems your setup places variables on GPU:0, which means that reads from GPU:2 and GPU:3 will need two copies: GPU:0 -> CPU, then CPU -> GPU:2.

I'm assuming other frameworks keep variables on CPU and don't run into this problem. One way to make the results more comparable would be to place variables on CPU in the TF implementation.

I.e., the line which currently does

with tf.device('/gpu:%s' % device_ids[i]):

could be changed to

with tf.device(assign_to_device('/gpu:%s' % device_ids[i])):

where assign_to_device is implemented as:

import tensorflow as tf

def assign_to_device(device, ps_device="/cpu:0"):
    # Device function to pass to tf.device: Variable ops are placed on
    # ps_device, every other op stays on device.
    def _assign(op):
        node_def = op if isinstance(op, tf.NodeDef) else op.node_def
        if node_def.op == "Variable":
            return ps_device
        else:
            return device
    return _assign
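
For illustration, a hypothetical usage sketch (device_ids and the ops here are placeholders, not the benchmark's actual model code):

device_ids = [0, 1]  # assumed GPU ids, not from the benchmark config
tower_outputs = []
for i in range(len(device_ids)):
    # _assign routes ops of type "Variable" to /cpu:0 and leaves everything
    # else on this tower's GPU.
    with tf.device(assign_to_device('/gpu:%s' % device_ids[i])):
        w = tf.Variable(tf.zeros([1024, 1024]), name='w%d' % i)
        x = tf.random_normal([1024, 1024])
        tower_outputs.append(tf.matmul(x, w))

Note that on newer TF 1.x releases tf.Variable may produce a "VariableV2" op, in which case the check in _assign would need to cover that op type as well.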
shyhuai commented 7 years ago

Not all frameworks keep variables on CPU; e.g., MXNet supports a configuration that places variables on GPU. For TF, I refer to the official multi-GPU tutorial: https://www.tensorflow.org/tutorials/deep_cnn/, which stores and updates the parameters on the CPU side.
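
For context, the parameter-on-CPU pattern from that tutorial looks roughly like this (a simplified sketch, not the tutorial's exact code; the shapes and the 2-GPU loop are placeholders):

import tensorflow as tf

# Parameters live on the CPU; each tower computes its gradient on its own GPU;
# the averaged gradient is applied back to the CPU-resident variable.
with tf.device('/cpu:0'):
    w = tf.get_variable('w', shape=[1024, 1024],
                        initializer=tf.truncated_normal_initializer(stddev=0.01))

tower_grads = []
for i in range(2):  # assumed 2 GPUs
    with tf.device('/gpu:%d' % i):
        x = tf.random_normal([128, 1024])
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
        tower_grads.append(tf.gradients(loss, [w])[0])

with tf.device('/cpu:0'):
    avg_grad = tf.reduce_mean(tf.stack(tower_grads), axis=0)
    train_op = tf.train.GradientDescentOptimizer(0.1).apply_gradients([(avg_grad, w)])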

yaroslavvb commented 7 years ago

This issue was addressed by https://github.com/hclhkbu/dlbench/pull/4

shyhuai commented 7 years ago

Hi @yaroslavvb, I applied the same code to AlexNet, but it turns out to be slower on both 2 GPUs and 4 GPUs. Could you help confirm whether this is the correct way to put the parameters on the CPU? And why does ResNet parallelize so much better than AlexNet? Thank you!

yaroslavvb commented 7 years ago

AlexNet is a small network, and low performance on small networks is a known issue; the discussion is at https://github.com/tensorflow/tensorflow/issues/5516.