archenroot opened this issue 5 years ago
We did test Gunrock with graphs in the range of hundred million vertices and billions of edges. The GPU memory size is the limiting factor in terms of how big the graph can be. https://github.com/gunrock/gunrock/issues/322#issuecomment-368286550
There has been work done in our group on distributed (multi-node) graph processing, which may make its way into Gunrock eventually.
There's also research on how to handle streaming graphs, which I assume is what you mean by "graph buffering"? Aside from that, if you don't care a whole lot about performance, a graph this big can be handled using UVM (managed memory in CUDA), but it is going to be slow.
A lot of different avenues we are exploring! Multi-GPU currently can give us 32×8 or 32×16 GB of memory on the DGX-1 and DGX-2; that is the cap for now. @sgpyc can comment more on this.
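To get a feel for whether a graph fits in those memory caps, here is a rough back-of-the-envelope sketch. It assumes a plain CSR layout (row offsets, column indices, one edge value per edge) — real Gunrock apps keep additional per-vertex frontiers and labels, so treat this as a lower bound, not an exact figure.

```python
# Rough CSR memory estimate vs. the DGX aggregate memory figures above.
# Assumption: CSR = row offsets + column indices + one value per edge.

def csr_bytes(num_vertices, num_edges, index_bytes=4, value_bytes=4):
    offsets = (num_vertices + 1) * index_bytes  # one offset per vertex, plus sentinel
    columns = num_edges * index_bytes           # destination vertex per edge
    values = num_edges * value_bytes            # edge weight per edge
    return offsets + columns + values

DGX1_BYTES = 8 * 32 * 1024**3   # 8 GPUs x 32 GB
DGX2_BYTES = 16 * 32 * 1024**3  # 16 GPUs x 32 GB

# The graph size mentioned in this thread: ~1B vertices, ~1.5B edges.
V, E = 1_000_000_000, 1_500_000_000
need = csr_bytes(V, E, index_bytes=8)  # 64-bit indices for >2B-edge headroom
print(f"{need / 1024**3:.1f} GiB needed; DGX-2 aggregate is "
      f"{DGX2_BYTES / 1024**3:.0f} GiB")
```

By this estimate the 1B/1.5B graph fits comfortably in a DGX-2's aggregate memory even with 64-bit indices, though partitioning overhead (ghost vertices, per-GPU duplication) eats into that in practice.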
@neoblizz - definitely, good to hear we can talk about tens of billions of V/E on the DGX-1/DGX-2. I actually plan to prepare incremental batches, starting with 1–X billion, until it crashes.
@archenroot looking forward to your findings as well, let us know how your CPU results for those incremental batches compare against gunrock.
I will report back, of course. To start, the focus will be only on the shortest path algorithm, with incrementally increasing V/E sizes.
@neoblizz - just to confirm: I read somewhere that each vertex is stored as a 4-byte integer. So am I limited to the maximum unsigned value of 4,294,967,295 unique vertices/edges, or not?
@archenroot I believe with `--64bit-SizeT` you can go beyond that (int64).
@neoblizz - nice.
Regarding `--64bit-SizeT`: I saw in some issue discussion that each edge is stored using 4 bytes...: https://github.com/gunrock/gunrock/issues/322
The maximum graph I tried has about 1 billion vertices and several billion edges, but that was on 6 GPUs, if I remember correctly (only the master branch supports multiple GPUs at the moment). On a single GPU, the memory size is the limiting factor on how large the graph can be.
`--64bit-SizeT` will enable support for more than 2 billion edges, provided the app you run has that option turned on.
@sgpyc - What GPU types?
6 K40 GPUs
@neoblizz - regarding a multi-node cluster: I opened an issue about RAPIDS integration (which supports multi-node deployments) to support the Spark GraphX project via Gunrock (we will see what they think about it). Secondly, I discovered a project called Lux, which also claims to support multi-node graph processing: https://github.com/LuxGraph/Lux
@archenroot as John pointed out in the RAPIDS issue, we are already one of the contributors to RAPIDS and in conversation with the NVIDIA folks.
I have not looked at LuxGraph yet; if I get a chance, I'll take a look -- but it seems like a project in progress. At first glance, I don't see a lot of algorithms mapped. Legion is definitely interesting work!
@neoblizz - I see now, makes sense. Thanks for the link about Legion, looks promising.
@neoblizz - I also noticed you integrated with METIS for the multi-GPU partitioning layer. Since this is already done, I think a similar abstraction, using the very same architecture, could be used to support a multi-node cluster, right?
Something like:

```
Metis cluster partitioner instance
 |---> Metis node partitioner instance
 |-------> GPU0
 |-------> GPU1
 |-------> GPU2
 |---> Metis node partitioner instance
 |-------> GPU0
 |-------> GPU1
 |-------> GPU2
```
Any plans on this?
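The two-level hierarchy sketched above can be illustrated with a toy placement function. This is not Gunrock's or METIS's actual API — a real setup would call a graph partitioner like METIS at both levels to minimize edge cuts; plain modulo hashing here is just to show the shape of the cluster → node → GPU mapping.

```python
# Toy two-level placement: first assign each vertex to a node
# (cluster-level partitioner), then to a GPU within that node
# (node-level partitioner). Modulo hashing stands in for METIS.

def place_vertex(v, num_nodes, gpus_per_node):
    node = v % num_nodes                    # cluster-level split
    gpu = (v // num_nodes) % gpus_per_node  # node-level split
    return node, gpu

placement = {v: place_vertex(v, num_nodes=2, gpus_per_node=3)
             for v in range(6)}
print(placement)
# {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (1, 1), 4: (0, 2), 5: (1, 2)}
```

The point of the two levels is that intra-node edges can use fast peer access between GPUs, while inter-node edges must go over the network, so the outer partitioner should cut as few edges as possible.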
So it is really desirable from an API/implementation point of view to use CUDA's peer access feature within a node to access data across GPUs. DGX-2 supports this across all GPUs, for instance. However, this feature is not available between nodes.
This is regarding a PoC I would like to prepare, at the moment specifically for the shortest path algorithm. We currently have about 1 billion vertices and 1.5 billion edges in our graph DB, at almost 1 TB of DB size (we also store some properties on the vertices, so not pure IDs only). It performs quite well on CPU alone; I plan to push it to the limit, find the threshold, and test it against Gunrock on a multi-GPU system.
Q1: Have you tested Gunrock with something like 5–100 billion vertices and 10–150 billion edges? Storage for the PoC is not an issue; I plan to set up HDFS(raw)/HBase for a distributed test, or a mounted LUN with ext4.
Q2: Does Gunrock somehow handle the situation where I cannot load the full graph into GPU memory — some kind of graph buffering?
Sorry, I am new to Gunrock, so I am discovering all the possibilities at the moment. Thanks.
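For the CPU-vs-Gunrock comparison planned above, a minimal single-source shortest path baseline could look like the sketch below. It is a plain Dijkstra over a dict-based adjacency list — an illustrative stand-in, not the actual graph DB's engine; a serious baseline would use a compact CSR array layout instead of Python dicts.

```python
# Minimal CPU SSSP (Dijkstra) baseline of the kind one might time
# against Gunrock's SSSP app. Graph: {u: [(v, weight), ...]}.
import heapq

def sssp(graph, source):
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, already relaxed via a shorter path
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

g = {0: [(1, 4), (2, 1)], 2: [(1, 2)], 1: [(3, 5)]}
print(sssp(g, 0))  # {0: 0, 1: 3, 2: 1, 3: 8}
```

Timing this on the planned incremental batches (1 billion V/E and up) would give the CPU curve to set against Gunrock's multi-GPU numbers.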