quiver-team / torch-quiver

PyTorch Library for Low-Latency, High-Throughput Graph Learning on GPUs.
https://torch-quiver.readthedocs.io/en/latest/
Apache License 2.0

poor scalability when using multiple gpus #99

Open Joeyzhouqihui opened 2 years ago

Joeyzhouqihui commented 2 years ago

When we use multiple GPUs to do sampling with Quiver in GPU-sampling mode (graph stored in GPU memory), we found that scalability is poor.

To be specific, we ran the example code on Reddit, and the sampling cost is about 1.11 s when using 1 GPU. We expected the sampling time with 8 GPUs to be roughly 8x lower, since all GPUs sample independently. However, with 8 GPUs the sampling cost is 0.79 s, which is much higher than we expected. In addition, with 4 GPUs the sampling cost is 0.66 s, which is lower than in the 8-GPU case.
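For reference, here is a minimal sketch of how such a per-GPU sampling benchmark might look. It assumes the `quiver.CSRTopo` / `quiver.pyg.GraphSageSampler` API shown in the repository's examples; the dataset path, batch size of 1024, fanouts `[25, 10]`, data sharding, and timing loop are illustrative, not the exact example script we ran:

```python
# Hypothetical per-GPU sampling benchmark (sketch, not the project's script).
import time

import torch
import torch.multiprocessing as mp
from torch_geometric.datasets import Reddit

import quiver


def run(rank: int, world_size: int, csr_topo, train_idx):
    torch.cuda.set_device(rank)
    # One independent sampler per process; mode='GPU' keeps the graph
    # topology in this device's GPU memory.
    sampler = quiver.pyg.GraphSageSampler(
        csr_topo, sizes=[25, 10], device=rank, mode='GPU')

    # Each process samples its own shard of the training nodes.
    shard = train_idx.chunk(world_size)[rank]
    seeds = shard.split(1024)  # mini-batches of seed nodes

    torch.cuda.synchronize(rank)
    start = time.time()
    for batch in seeds:
        sampler.sample(batch)
    torch.cuda.synchronize(rank)
    print(f'[GPU {rank}] sampling took {time.time() - start:.2f}s '
          f'for {len(seeds)} batches')


if __name__ == '__main__':
    dataset = Reddit('/data/Reddit')  # dataset path is an assumption
    data = dataset[0]
    csr_topo = quiver.CSRTopo(data.edge_index)
    train_idx = data.train_mask.nonzero(as_tuple=False).view(-1)

    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size, csr_topo, train_idx), nprocs=world_size)
```

Since each process owns its own GPU and samples an independent shard, we would expect the per-process time to stay close to the single-GPU time as the number of GPUs grows.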

Could you please give us some insight into this phenomenon? Thank you so much!

ZenoTan commented 2 years ago

We will test the sampling scalability on our machine. We suspect it could be contention on a shared resource.
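One way to help narrow this down (a sketch, not part of Quiver) is to time the device-side sampling with CUDA events and compare it against the wall-clock loop time: if the event-measured time stays flat while wall-clock time grows with the number of GPUs, the contention is likely host-side (CPU threads, CUDA launches, allocator locks); otherwise it points at a shared device-side resource such as PCIe/NVLink bandwidth. The helper below is hypothetical and assumes a per-process sampler object like the one above:

```python
import torch

def time_sample(sampler, seeds, device):
    """Time one sampler.sample(seeds) call on the GPU with CUDA events (ms).

    `sampler` is assumed to be a per-process GPU-mode sampler; this helper
    itself is illustrative, not part of the Quiver API.
    """
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    sampler.sample(seeds)
    end.record()
    torch.cuda.synchronize(device)
    return start.elapsed_time(end)
```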

eedalong commented 2 years ago

Well, this is strange; it should not happen, since the samplers are running on different devices, as you said. We will look into this problem and give you feedback ASAP.

Joeyzhouqihui commented 2 years ago

Thank you so much, and we look forward to your findings!

Joeyzhouqihui commented 2 years ago

Hi, sorry to bother you. I am wondering whether there are any profiling results or findings yet?