gunrock / gunrock

Programmable CUDA/C++ GPU Graph Analytics
https://gunrock.github.io/gunrock/
Apache License 2.0

[question] Gunrock buffering for large graphs #431

Open archenroot opened 5 years ago

archenroot commented 5 years ago

This is regarding a POC I would like to prepare, at the moment focused on the shortest-path algorithm. We currently have about 1 billion vertices and 1.5 billion edges in our graph DB, and almost 1 TB of DB size (we also store some properties on vertices, so not pure IDs only). It performs quite well on CPU alone; I plan to push it to the limit, find the threshold, and test it against Gunrock on a multi-GPU system.

Q1: Have you tested Gunrock with something like 5-100 billion vertices and 10-150 billion edges? Storage for the POC is not an issue; I plan to set up HDFS (raw)/HBase for a distributed test, or a mounted LUN with ext4.

Q2: Does Gunrock somehow handle the situation where I cannot load the full graph into GPU memory, i.e. some kind of graph buffering?

Sorry, I am new to Gunrock, so I am still discovering all the possibilities at the moment. Thanks.

neoblizz commented 5 years ago

We have tested Gunrock with graphs in the range of hundreds of millions of vertices and billions of edges. GPU memory size is the limiting factor in terms of how big the graph can be. https://github.com/gunrock/gunrock/issues/322#issuecomment-368286550
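As a rough illustration of why device memory is the limit, here is a back-of-envelope estimate for a graph of the size described above. This is a plain CSR layout with assumed 4-byte indices and 4-byte edge weights, not Gunrock's exact allocation scheme:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Hypothetical graph close to the one described above.
    const std::uint64_t V = 1'000'000'000;   // vertices
    const std::uint64_t E = 1'500'000'000;   // edges

    // Plain CSR with 32-bit offsets/indices and 32-bit edge weights.
    const std::uint64_t bytes =
        (V + 1) * sizeof(std::uint32_t)   // row offsets
        + E * sizeof(std::uint32_t)       // column indices
        + E * sizeof(float);              // edge values (e.g. SSSP weights)

    // ~16 GB before any per-vertex frontier/label buffers,
    // already close to a single GPU's memory.
    std::printf("~%.1f GB for the bare CSR\n", bytes / 1e9);
    return 0;
}
```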

There has been work done in our group on distributed (multi-node) graph processing, which may make its way into Gunrock eventually.

There is also research on how to handle streaming graphs, which I assume is what you mean by graph buffering? Aside from that, if you don't care a whole lot about performance, a graph this big can be handled by using UVM (managed memory in CUDA), but it is going to be slow.
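For reference, a minimal sketch of the UVM idea (plain CUDA managed memory, not Gunrock's own allocators; array names and sizes here are illustrative): arrays allocated with cudaMallocManaged can exceed device memory, and pages migrate on demand, which is where the slowdown comes from.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical CSR arrays, possibly larger than a single GPU's memory.
// With managed memory the driver pages data in on demand, so kernels
// still run, just with migration overhead on each miss.
int* row_offsets = nullptr;
int* col_indices = nullptr;

cudaError_t alloc_managed_csr(std::size_t num_vertices, std::size_t num_edges) {
    cudaError_t err;
    err = cudaMallocManaged(&row_offsets, (num_vertices + 1) * sizeof(int));
    if (err != cudaSuccess) return err;
    err = cudaMallocManaged(&col_indices, num_edges * sizeof(int));
    if (err != cudaSuccess) return err;

    // Optional hint: prefer keeping the small offsets array resident on GPU 0,
    // while the (much larger) edge list is allowed to spill to host memory.
    cudaMemAdvise(row_offsets, (num_vertices + 1) * sizeof(int),
                  cudaMemAdviseSetPreferredLocation, /*device=*/0);
    return cudaSuccess;
}
```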

There are a lot of different avenues we are exploring! Multi-GPU currently can give us 32x8 or 32x16 GB of memory on the DGX-1 and DGX-2; that is the cap for now. @sgpyc can comment more on this.

archenroot commented 5 years ago

@neoblizz - definitely, on DGX-{1,2} we can talk about tens of billions of V/E, that is good to hear. I plan to prepare incremental batches, starting with 1-X billion, until it actually crashes.

neoblizz commented 5 years ago

@archenroot Looking forward to your findings as well; let us know how your CPU results for those incremental batches compare against Gunrock.

archenroot commented 5 years ago

I will come back with results, of course. The focus will be only on the shortest-path algorithm to start with, at incrementally larger V/E sizes.

archenroot commented 5 years ago

@neoblizz - just to confirm: I read somewhere that each node is stored as a 4-byte integer. So am I limited to a maximum of 4,294,967,295 (unsigned) unique vertices/edges, or not?

neoblizz commented 5 years ago

@archenroot I believe with --64bit-SizeT you can go beyond that (int64).

archenroot commented 5 years ago

@neoblizz - nice.

archenroot commented 5 years ago

64bit-SizeT

I saw in some issue discussion that an edge is stored using 4 bytes...: https://github.com/gunrock/gunrock/issues/322

sgpyc commented 5 years ago

The maximum graph I tried has about 1 billion vertices and several billion edges, but that was on 6 GPUs if I remember correctly (only the master branch supports multiple GPUs at the moment). On a single GPU, memory size is the limiting factor for how large the graph can be.

--64bit-SizeT will enable support for more than 2 billion edges, provided the app you run has that option turned on.
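To illustrate the limits being discussed, here is a sketch of the general idea (not Gunrock's actual type definitions): the size type bounds how many edges the CSR offsets can count, and the vertex type bounds how many vertices can be addressed. A 32-bit size type caps the edge count near 2-4 billion; widening it to 64 bits removes that cap at the cost of doubling the offset-array memory.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical CSR container templated on the index types, mirroring the
// VertexT/SizeT idea: VertexT bounds the number of vertices, SizeT must be
// able to hold values up to num_edges.
template <typename VertexT, typename SizeT>
struct Csr {
    std::vector<SizeT>   row_offsets;  // length: num_vertices + 1
    std::vector<VertexT> col_indices;  // length: num_edges
};

// 32-bit everywhere: at most ~4.29e9 vertices, and the edge count is
// capped by what a 32-bit SizeT can represent.
using Csr32 = Csr<std::uint32_t, std::uint32_t>;

// 64-bit SizeT (conceptually what --64bit-SizeT selects): the edge count
// may exceed 2^32, but each row offset now costs 8 bytes instead of 4.
using Csr64 = Csr<std::uint32_t, std::uint64_t>;
```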

archenroot commented 5 years ago

@sgpyc - What GPU types?

sgpyc commented 5 years ago

6 K40 GPUs

archenroot commented 5 years ago

@neoblizz - regarding a multi-node cluster: I opened an issue about RAPIDS integration (which supports multi-node deployments) to support the Spark GraphX project via Gunrock (we will see what they think about it). Secondly, I discovered a project called Lux which also claims to support multi-node graph processing: https://github.com/LuxGraph/Lux

neoblizz commented 5 years ago

@archenroot as John pointed out in the RAPIDS issue, we are already one of the contributors to RAPIDS and in conversation with the NVIDIA folks.

I have not looked at LuxGraph yet; if I get a chance, I'll take a look -- but it seems like a project in progress. From a first glance, I don't see a lot of algorithms mapped. Legion is definitely interesting work!

archenroot commented 5 years ago

@neoblizz - I see now, makes sense. Thanks for the link about Legion, it looks promising.

archenroot commented 5 years ago

@neoblizz - I also noticed you integrated with METIS for the multi-GPU layer management. Since this is done, I think a similar abstraction, using the very same architecture, could be used to support a multi-node cluster, right?

Something like:

Metis cluster partitioner instance
|---> Metis node partitioner instance
|-------> GPU0
|-------> GPU1
|-------> GPU2
|---> Metis node instance
|-------> GPU0
|-------> GPU1
|-------> GPU2

Any plans on this?

jowens commented 5 years ago

From an API/implementation point of view, it is really desirable to use CUDA's peer access feature within a node to access data across GPUs; the DGX-2 supports this across all GPUs, for instance. However, this feature is not available between nodes.
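For context, a minimal sketch of what intra-node peer access looks like in plain CUDA (device IDs are illustrative; this is not Gunrock code): once peer access is enabled, a kernel running on one GPU can dereference pointers into another GPU's memory, which is exactly the capability that has no equivalent across nodes.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Enable peer access from dev0 to dev1 if the hardware topology allows it
// (e.g. NVLink-connected GPUs on a DGX-1/DGX-2).
bool enable_peer_access(int dev0, int dev1) {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, dev0, dev1);
    if (!can_access) {
        std::printf("GPU %d cannot directly access GPU %d\n", dev0, dev1);
        return false;
    }
    cudaSetDevice(dev0);                  // the call below applies to the current device
    cudaDeviceEnablePeerAccess(dev1, 0);  // flags must be 0
    return true;                          // kernels on dev0 may now read dev1 memory
}
```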