tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

Improve CUDA peer to peer access to support Amazon P2 instances #5789

Closed Mistobaan closed 7 years ago

Mistobaan commented 7 years ago

If you try to run TensorFlow on a machine that has more than 8 GPUs, you will receive an error or warning saying: CUDA_ERROR_TOO_MANY_PEERS.

From the NVIDIA forums it seems that this is documented behavior:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#peer-to-peer-memory-access

Peer-to-peer memory access must be enabled between two devices by calling cudaDeviceEnablePeerAccess() as illustrated in the following code sample. Each device can support a system-wide maximum of eight peer connections.

TF builds a full NxN peer-access map (16x16 in this case), which would explain the error on 16-GPU machines.
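
For reference, a minimal standalone sketch of what that full NxN peer-enable pass looks like with the raw CUDA runtime API (illustrative only, not the actual gpu_device.cc code); on a 16-GPU box each device is asked to open 15 peer connections, which blows past the 8-connection limit:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Naive all-pairs peer enabling. On a topology where every pair reports
// itself as peer-capable, each of 16 devices is asked to open 15 peer
// connections, and the enable call eventually fails with
// cudaErrorTooManyPeers (the runtime equivalent of CUDA_ERROR_TOO_MANY_PEERS).
int main() {
  int n = 0;
  cudaGetDeviceCount(&n);
  for (int from = 0; from < n; ++from) {
    cudaSetDevice(from);
    for (int to = 0; to < n; ++to) {
      if (from == to) continue;
      int can_access = 0;
      cudaDeviceCanAccessPeer(&can_access, from, to);
      if (!can_access) continue;
      cudaError_t err = cudaDeviceEnablePeerAccess(to, 0 /*flags*/);
      if (err != cudaSuccess) {
        std::printf("peer %d -> %d failed: %s\n", from, to,
                    cudaGetErrorString(err));
      }
    }
  }
  return 0;
}
```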

The challenge is figuring out which GPUs should peer with each other. We do the full NxN right now since we don't yet have a better answer about the physical topology of the devices. (E.g., how do you know that the first 8 are all physically the first die, and the second 8 are all physically the second die?) If such an API exists and we can query it reliably, that might be a better solution.

All the code is in one file: gpu_device.cc

Related issues:

prb12 commented 7 years ago

@Mistobaan

first 8 are all physically the first die, and the second 8 are all physically the second die

I'm guessing you meant bus here, not die?

@poxvoculi I'm pretty sure that our code handles 16 GPUs? (8 x K80 on two PCI buses). Could you take a look at this please?

poxvoculi commented 7 years ago

We are able to run TF on machines that have 8 K80 cards, which appear as 16 GPUs: 8 GPUs (4 cards) on each of two separate PCIe buses (each bus connected to one CPU socket). In this configuration cudaDeviceCanAccessPeer returns false for GPUs on different buses, so cudaDeviceEnablePeerAccess gets called only for GPUs on the same bus. I'm guessing that your system architecture is different, so that cudaDeviceCanAccessPeer returns true for all pairs among the 16 GPUs. In that case you're going to need to restrict the visibility of the GPUs within a process to 8, as discussed in #5362.
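
Restricting visibility can be done with CUDA_VISIBLE_DEVICES; a rough sketch of the idea (the 0-7 / 8-15 split below is just an example, and you can equally export the variable in the shell that launches each process):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
  // CUDA_VISIBLE_DEVICES is read when the CUDA driver initializes in this
  // process, so it must be set before the first CUDA call. Limiting each
  // process to 8 of the 16 GPUs keeps the peer count within the limit;
  // a second process could use "8,9,10,11,12,13,14,15".
  setenv("CUDA_VISIBLE_DEVICES", "0,1,2,3,4,5,6,7", /*overwrite=*/1);

  int n = 0;
  cudaGetDeviceCount(&n);
  std::printf("visible GPUs: %d\n", n);  // now reports at most 8 devices
  return 0;
}
```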

Mistobaan commented 7 years ago

@poxvoculi

The error was discussed on the mailing list and it occurs on the new AWS p2.16xlarge instances.

Looking at the code, there is something strange, because I suspect that in gpu_device.cc

`if (from->CanEnablePeerAccessTo(to)) {` returns true and then `auto status = from->EnablePeerAccessTo(to);` returns an error,

and that was throwing errors before this change.
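
If that is what is happening, the more tolerant pattern would look roughly like this sketch (my paraphrase using the raw CUDA runtime API, not the actual gpu_device.cc code); the "can access" query only reports feasibility, not how many of the 8 peer slots remain, so the enable call has to be allowed to fail softly:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: enable peering between one pair of devices, tolerating the
// too-many-peers failure instead of treating it as a hard error.
void MaybeEnablePeering(int from, int to) {
  int can_access = 0;
  cudaDeviceCanAccessPeer(&can_access, from, to);
  if (!can_access) return;  // the pair cannot peer at all

  cudaSetDevice(from);
  cudaError_t err = cudaDeviceEnablePeerAccess(to, 0 /*flags*/);
  if (err == cudaErrorTooManyPeers) {
    // Expected on >8-peer topologies such as p2.16xlarge: warn, don't abort.
    std::fprintf(stderr, "warning: peer %d -> %d skipped (too many peers)\n",
                 from, to);
    cudaGetLastError();  // reset the recorded error
  } else if (err != cudaSuccess && err != cudaErrorPeerAccessAlreadyEnabled) {
    std::fprintf(stderr, "error enabling peer %d -> %d: %s\n", from, to,
                 cudaGetErrorString(err));
  }
}
```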

vrv commented 7 years ago

@poxvoculi Yup, we know the driver is only returning the potential peerings, not the optimal peering.

The point of opening up this bug is to eventually figure out a better algorithm for enabling peer access. I'll mark this as contributions welcome for some enterprising developer to figure out a nice, general solution to this.

poxvoculi commented 7 years ago

It seems difficult to have a nice, fully-automatic solution. The consequence of enabling peer access is slightly faster inter-GPU memory copies, with less CPU memory interface contention. If you're using 16 GPUs that could all feasibly be peered with each other, but any one GPU can only peer with 8 of them, the choice of which ones are most useful to peer depends very much on how your model is structured, and perhaps secondarily on the underlying system topology. So a useful contribution might be some kind of startup option that allows explicitly specifying which GPUs to peer, overriding the default behavior of trying to make every feasible peering.
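
As a rough illustration of what such an option might look like, here is a sketch that enables peering only for explicitly listed pairs; the function name and the "from-to" format are purely hypothetical, nothing like this exists in TF today:

```cpp
#include <cstdio>
#include <sstream>
#include <string>
#include <cuda_runtime.h>

// Hypothetical: enable peer access only for the pairs listed in `spec`,
// instead of every feasible pair. Format: comma-separated "from-to" pairs.
void EnableExplicitPeers(const std::string& spec) {
  std::stringstream ss(spec);
  std::string pair;
  while (std::getline(ss, pair, ',')) {
    int from = -1, to = -1;
    if (std::sscanf(pair.c_str(), "%d-%d", &from, &to) != 2) continue;
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, from, to);
    if (!can_access) continue;  // requested pair is not actually peerable
    cudaSetDevice(from);
    cudaDeviceEnablePeerAccess(to, 0 /*flags*/);
  }
}

int main() {
  // Example: peer each GPU only with its partner on the same board,
  // in both directions (each enable call is one-directional).
  EnableExplicitPeers("0-1,1-0,2-3,3-2");
  return 0;
}
```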

vrv commented 7 years ago

Renamed title to reflect that this affects Amazon P2 instances specifically.

tfboyd commented 7 years ago

Closing. NCCL is the solution NVIDIA created to help navigate this issue, but it is not always the right choice. I did not do a lot of testing on the p2.16xlarge, but in testing on a variety of systems (4x Titan X, 8x K80 on GCE, 8x K80 on AWS, and 8x P100 on DGX-1) I found that the best approach for handling variable updates varies across systems and also differs based on the model. The results are here, which also contains links to details about different approaches to variable management. At this point the p2.16xlarge is not as desirable as 8x P100s, which work well with NCCL or with just the CPU as the variable device.

byronyi commented 6 years ago

I found "successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero" in the log posted in https://github.com/tensorflow/tensorflow/issues/5362#issuecomment-258298504 by @alexatknit, which seems to contradict what AWS has said here:

P2 instances also offer GPUDirect™ (peer-to-peer GPU communication) capabilities for up to 16 GPUs, so that multiple GPUs can work together within a single host.

I do not have access to P2 instances at hand; do we still have this problem with the latest TF?

byronyi commented 6 years ago

According to the QEMU-devel mailing list archive here, supporting GPUDirect P2P in a VM requires coordinated support for PEER_CLIQUE_ID from both the hypervisor and the NVIDIA driver, and based on the date of the patch it seems it has yet to be merged into QEMU.

@benbarsdell may I know if this feature is supported in current CUDA driver release?

tfboyd commented 6 years ago

I test on these systems once a month and the peering works fine. The message is unfortunately misleading. I have a test that runs VGG16 with NCCL, and it would not perform well if the sync were not GPU to GPU.

byronyi commented 6 years ago

@tfboyd Thanks for the heads up!

I'm actually quite interested in the 16-GPU setup on AWS. All-to-all peering seems unlikely, as those 16 GPUs are not connected directly through a single PCIe switch (according to Mu Li's PhD thesis).

Any reason why we don't have a single-node 16-GPU benchmark result included in the TF CNN benchmarks page?

tfboyd commented 6 years ago

Yeah, I realized I missed the mark with my comment. 16 GPUs in one server is rare, and at this point newer hardware makes the setup less interesting. I thought about it when I tested, but I just did not find it useful, as it is more of an oddity. P100s and V100s are very unlikely to ever be configured like that, and I do not think NVLink supports more than 8. I end up wrong a lot, but when I am wrong I will benchmark it. :-)

byronyi commented 6 years ago

I am not aware of any virtualization technique that exposes NVLink to the hypervisor, and if current KVM/Xen doesn't support NVLink to the same level as their PCIe support, I doubt it's possible for us to use a native NVLink interconnect on public cloud.

byronyi commented 6 years ago

@tfboyd Even with the latest P100 and V100, the problem still remains. If you take a look at the DGX-1 system bus topology (hypercube-like), it is unlikely that all-to-all peering could be achieved with uniform latency/bandwidth provisioning.

NCCL claims to handle this case optimally, but when I tested vgg16 on my local box with 4 K40m GPUs in a PCI-passthrough VM:

| Parameters | Bare Metal | VM |
| --- | --- | --- |
| `variable_update=independent` | 150 img/sec | 150 img/sec |
| `variable_update=parameter_server`, `local_parameter_device=cpu` | 150 img/sec | 150 img/sec |
| `variable_update=replicated`, `local_parameter_device=cpu`, `use_nccl=False` | 146 img/sec | 146 img/sec |
| `variable_update=parameter_server`, `local_parameter_device=gpu` | 130 img/sec (OOM) | 130 img/sec (OOM) |
| `variable_update=replicated`, `local_parameter_device=gpu`, `use_nccl=False` | 122 img/sec (OOM) | 119 img/sec (OOM) |
| `variable_update=replicated`, `local_parameter_device=cpu`, `use_nccl=True` | 94 img/sec | 107 img/sec |
| `variable_update=replicated`, `local_parameter_device=gpu`, `use_nccl=True` | 94 img/sec | 98 img/sec |

tfboyd commented 6 years ago

@byronyi Here are some numbers from AWS

On K80 (and I would assume this is also true for K40), even if peering is turned on, NCCL is often slower because the sync calls end up queued as just more work in the stream. But even on a DGX-1 the best approach (although this is changing) was to put the shared parameters on the CPU for ResNet and Inception, while for VGG16, which has many more parameters, replicated with NCCL was optimal.

You can set an NVIDIA environment variable: CUDA_DEVICE_MAX_CONNECTIONS=12. I believe the default for K80 and K40 is 8. For me this improved my VGG16 (batch size 32 per GPU) throughput on p2.8xlarge to 266 images/sec with real data and 277 with synthetic data (not that that matters for anything). It also made replicated NCCL a viable option. For ResNet and Inception it made replicated NCCL better and viable, but still not as good as the other options. Previously my best for VGG16 on AWS was ~242 images/sec with real data, so a pretty good gain. Not sure if it will help on K40s.
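
For completeness, the variable just needs to be in the process environment before the CUDA context is created, e.g. exported in the shell that launches the job; a tiny sketch of the programmatic equivalent (12 is simply the value quoted above, not a general recommendation):

```cpp
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
  // CUDA_DEVICE_MAX_CONNECTIONS is read at context creation, so it must be
  // set before the first CUDA call (or exported in the launching shell).
  setenv("CUDA_DEVICE_MAX_CONNECTIONS", "12", /*overwrite=*/1);
  cudaFree(0);  // forces context creation with the new setting
  return 0;
}
```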

If I find time I will link all of my results so someone can get value out of them. It is really hard for me to make a simple Google Sheet public from my Google employee account.

I am about to test an all-reduce solution that should work in distributed mode on AWS, which I am told may make VGG16 scale. I think I remember seeing VGG scale with MPI but not on a normal network. Just because I have not seen it does not mean it has not happened, but I am excited to try it myself.

I was aware of the DGX-1 topology, and I have seen P100s also set up with a ring topology. I am far from an expert, but with the movement toward all-reduce, I do not think there is a need for direct 1:1 communication between every pair of GPUs. The aggregation is going to go around a ring.
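
To make that concrete, here is a rough sketch (my own illustration, not TF or NCCL code) of why ring-style aggregation sidesteps the 8-peer limit: each device only ever exchanges data with its two ring neighbours, so two peer connections per GPU are enough no matter how many GPUs are in the ring:

```cpp
#include <cuda_runtime.h>

// Illustrative only: peer each device with just its ring neighbours.
// Even on a 16-GPU ring this is two connections per device, well under
// the 8-connection limit that the full NxN approach runs into.
void EnableRingPeering() {
  int n = 0;
  cudaGetDeviceCount(&n);
  if (n < 2) return;
  for (int i = 0; i < n; ++i) {
    int next = (i + 1) % n;
    int prev = (i - 1 + n) % n;
    int neighbours[2] = {next, prev};
    int count = (next == prev) ? 1 : 2;  // with n == 2 the two coincide
    cudaSetDevice(i);
    for (int k = 0; k < count; ++k) {
      int can_access = 0;
      cudaDeviceCanAccessPeer(&can_access, i, neighbours[k]);
      if (can_access) cudaDeviceEnablePeerAccess(neighbours[k], 0 /*flags*/);
    }
  }
}

int main() {
  EnableRingPeering();
  return 0;
}
```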