Closed yaroslavvb closed 7 years ago
In our experiments there are two Tesla K80 cards, each of which contains two GK210 GPUs, and there is no p2p communication between the two K80 cards. So the printed matrix is as follows:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0 1 2 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0: Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 1: Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 2: N N Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 3: N N Y Y
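For anyone who wants to check this programmatically, here is a minimal sketch that reads the peer-access rows out of such a log and lists the GPU pairs without p2p. The helper names `parse_dma_matrix` and `pairs_without_p2p` are made up for this illustration and are not part of TensorFlow; the regular expression assumes the log format shown above.

```python
import re

# Matches one matrix row such as "... gpu_device.cc:982] 0: Y Y N N".
# The "DMA: 0 1 2 3" header is skipped because "]" is not followed by digits.
ROW = re.compile(r"\]\s*(\d+):\s*((?:[YN][ \t]*)+)")


def parse_dma_matrix(log_text):
    """Return {gpu_id: [bool, ...]} built from the peer-access rows."""
    matrix = {}
    for gpu, flags in ROW.findall(log_text):
        matrix[int(gpu)] = [flag == "Y" for flag in flags.split()]
    return matrix


def pairs_without_p2p(matrix):
    """List (i, j) pairs with i < j that cannot use peer-to-peer copies."""
    return [
        (i, j)
        for i in sorted(matrix)
        for j in range(i + 1, len(matrix[i]))
        if not matrix[i][j]
    ]


LOG = """
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0 1 2 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0: Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 1: Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 2: N N Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 3: N N Y Y
"""

print(pairs_without_p2p(parse_dma_matrix(LOG)))
# -> [(0, 2), (0, 3), (1, 2), (1, 3)]
```

The four pairs printed are exactly the routes that cross from one K80 card to the other.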
Thanks for the quick reply!
It seems your setup places variables on GPU:0, which means that reads from GPU:2 and GPU:3 will need two copies: GPU:0 -> CPU, then CPU -> GPU:2 (or GPU:3).
I'm assuming other frameworks keep variables on the CPU and don't run into this problem. One way to make the results more comparable would be to place variables on the CPU in the TF implementation.
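To make the copy counts concrete, here is a rough sketch of the reasoning, not a measurement. It assumes a deliberately simplified model: zero copies when the variable already lives on the reading GPU, one copy for a p2p transfer or a host-to-GPU transfer, and two copies when the data has to be staged through the CPU. The function name `copies_for_read` is hypothetical.

```python
# Peer-access matrix from the log in this thread (True means p2p is available).
P2P = [
    [True, True, False, False],
    [True, True, False, False],
    [False, False, True, True],
    [False, False, True, True],
]


def copies_for_read(var_device, reader_gpu, p2p):
    """Copies needed for reader_gpu to read a variable (simplified model).

    var_device is either the string "cpu" or a GPU index.
    """
    if var_device == "cpu":
        return 1  # single host -> GPU copy
    if var_device == reader_gpu:
        return 0  # variable is already local
    # Direct p2p copy if available, otherwise GPU -> CPU -> GPU.
    return 1 if p2p[var_device][reader_gpu] else 2


for placement in (0, "cpu"):
    print(placement, [copies_for_read(placement, gpu, P2P) for gpu in range(4)])
# -> 0 [0, 1, 2, 2]
# -> cpu [1, 1, 1, 1]
```

Under this model, placing variables on GPU:0 is cheaper for GPU:0 and GPU:1 but costs the other two GPUs an extra copy, whereas CPU placement is uniform across all four.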
I.e., the line which does
with tf.device('/gpu:%s'%device_ids[i]):
could be changed to
with tf.device(assign_to_device('/gpu:%s'%device_ids[i])):
where assign_to_device is implemented as
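import tensorflow as tf

def assign_to_device(device, ps_device="/cpu:0"):
    """Return a device function that pins variables to ps_device."""
    def _assign(op):
        # Handle both an Operation and a raw NodeDef.
        node_def = op if isinstance(op, tf.NodeDef) else op.node_def
        if node_def.op == "Variable":
            return ps_device  # parameters live on the parameter-server device
        else:
            return device     # everything else runs on the requested device
    return _assign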
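To see what such a device function decides without needing a TensorFlow install, here is a TF-free stand-in. It is only a sketch: `assign_to_device_sketch` and `fake_op` are hypothetical names, the fake ops are plain `SimpleNamespace` objects, and the extra op type names `VariableV2` and `VarHandleOp` are an assumption about newer TensorFlow 1.x graphs that should be checked against your version (the snippet in this thread matches only `"Variable"`).

```python
from types import SimpleNamespace

# Assumption: op types treated as variables. Only "Variable" comes from the
# thread; the other two are guesses for newer TensorFlow 1.x graphs.
VARIABLE_OPS = {"Variable", "VariableV2", "VarHandleOp"}


def assign_to_device_sketch(device, ps_device="/cpu:0"):
    """Return a placement function mirroring the thread's device function."""
    def _assign(op):
        # Accept an op-like object with .node_def or a node_def-like object.
        node_def = getattr(op, "node_def", op)
        return ps_device if node_def.op in VARIABLE_OPS else device
    return _assign


def fake_op(op_type):
    """Build a stand-in for an Operation carrying only its op type."""
    return SimpleNamespace(node_def=SimpleNamespace(op=op_type))


place = assign_to_device_sketch("/gpu:2")
print(place(fake_op("Variable")))  # -> /cpu:0
print(place(fake_op("Conv2D")))    # -> /gpu:2
```

Variables end up on the CPU while compute ops stay on the requested GPU, which is the placement the multi-GPU tutorial uses.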
Not all frameworks keep variables on the CPU; e.g., MXNet supports a configuration that places variables on the GPU. For TF, I referred to the official multi-GPU tutorial, https://www.tensorflow.org/tutorials/deep_cnn/, which stores and updates the parameters on the CPU side.
This issue was addressed by https://github.com/hclhkbu/dlbench/pull/4
Hi @yaroslavvb, I applied the same code to AlexNet, but it turns out to be slower on both 2 GPUs and 4 GPUs. Could you help confirm which approach actually places the parameters on the CPU? And why does ResNet parallelize so much better than AlexNet? Thank you!
AlexNet is a small network, and low performance on small networks is a known issue; see the discussion at https://github.com/tensorflow/tensorflow/issues/5516
Do your ResNet multi-GPU runs utilize p2p communication? I.e., on the first session.run it should print a matrix like this for all p2p routes: