rapidsai / cugraph

cuGraph - RAPIDS Graph Analytics Library
https://docs.rapids.ai/api/cugraph/stable/
Apache License 2.0

[QST]: How does cuGraph place the graph data In the graph coarsening phase of multi-GPU Louvain? #4638

Closed: darius513 closed this issue 3 days ago

darius513 commented 2 weeks ago

What is your question?

In the graph coarsening phase of multi-GPU Louvain in cuGraph, how does cuGraph place the graph data? Is it:

  1. Gathering the whole graph onto each GPU and, after coarsening, partitioning it across the GPUs? (In this case, at some point a single GPU would hold the entire graph, resulting in very high peak memory usage.)

  2. Each GPU holding only a portion of the graph data throughout the entire coarsening process? (This method has lower peak memory usage.)


ChuckHastings commented 2 weeks ago

The latter.

cuGraph does 2D partitioning. A hashing function takes the pair of vertex ids that defines an edge and maps that edge to a particular GPU.
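The key property of such a placement rule is that it is a pure function of the edge itself, so every GPU can compute an edge's owner independently, with no communication. A minimal host-side sketch of the idea (the hash and the function name here are hypothetical, not cuGraph's actual implementation):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// Hypothetical sketch: map an edge (src, dst) to an owning GPU by hashing
// the vertex pair. cuGraph's real hash and 2D layout are more involved;
// this only illustrates that placement is a deterministic function of the
// edge, computable anywhere without coordination.
int edge_to_gpu(std::int64_t src, std::int64_t dst, int num_gpus)
{
  std::size_t h =
    std::hash<std::int64_t>{}(src) * 31u + std::hash<std::int64_t>{}(dst);
  return static_cast<int>(h % static_cast<std::size_t>(num_gpus));
}
```

Because the mapping is deterministic, the GPU that produces a coarsened edge and the GPU that will own it always agree on the destination of the shuffle.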

When we coarsen the graph, each GPU iterates over its set of edges and creates an edge for the coarsened graph by replacing the vertex id on either end with the cluster id that it is assigned to. This is actually done in chunks called edge partitions (leaving out some details as to how) to minimize the memory footprint.

For a given edge partition, we then:

  1. Group by new edge and reduce, to consolidate duplicate edges on that GPU.
  2. Shuffle each new edge from that edge partition onto the GPU where that new edge will be assigned.
  3. Group by edge and reduce again, to consolidate edges on the receiving GPU.

We continue this process for each edge partition on each GPU. Finally, at the end, we renumber the vertices and generate the coarsened graph. At each step along the way, we delete any intermediate data structures once they are no longer needed (e.g., after we shuffle we don't need the unshuffled copy; after we group by and reduce we no longer need the original structure).

This keeps the memory footprint manageable. Peak intermediate memory usage will be about 2X of the largest edge partition.

darius513 commented 2 weeks ago

Thank you very much for your response. I'd like to confirm: Is the peak intermediate memory usage approximately twice the storage space of the edges and edge weights in each edge partition before coarsening? Does this intermediate memory include the storage space of the edges and edge weights in the edge partition before coarsening?

ChuckHastings commented 5 days ago

I think that's a fair characterization of that individual step.

I will note that there are other impacts on memory utilization.

  1. Louvain is passed a read-only view of the graph, so the original graph is always consuming memory
  2. When we perform the first coarsening we have 2 graphs in memory (the original and the first coarsened graph)
  3. Subsequent coarsenings result in 3 graphs being in memory at the same time. Once the third graph is constructed, the previous graph is deleted... but all 3 will be in memory temporarily. Generally, each coarsened graph is smaller, so we don't worry too much about this extra memory pressure. But if for some reason there were little reduction in the first several outer loops of Louvain, you could see significant memory pressure there.
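The accounting above can be written as a back-of-envelope helper (a hypothetical sketch, not anything in cuGraph; sizes are plain byte counts and CSR constant factors are ignored):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical peak-memory estimate while building coarsened graph level k
// (k >= 1; level 0 is the read-only input graph, resident throughout).
// Building level 1 means 2 graphs coexist; building level k >= 2 means the
// previous coarsened graph (level k-1) is still alive until level k is
// constructed, so 3 graphs coexist temporarily.
std::size_t peak_bytes_building_level(std::vector<std::size_t> const& graph_bytes,
                                      std::size_t k)
{
  std::size_t peak = graph_bytes[0] + graph_bytes[k];  // input + graph under construction
  if (k >= 2) peak += graph_bytes[k - 1];              // predecessor, freed only after build
  return peak;
}
```

If coarsening shrinks the graph well, graph_bytes[k] decays quickly and the sum stays dominated by the input graph; with little reduction, the three terms are all comparable.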
darius513 commented 4 days ago

Thank you very much for your response. At the end of the multi-GPU version of the coarsen_graph function, create_graph_from_edgelist is called to create a new graph from the coarsened edge list, where edgelist_srcs and edgelist_dsts (from the coarsened graph) are partitioned and copied into device vectors on each GPU. Does this mean that coarsen_graph stores a complete copy of the coarsened graph on each GPU (rather than a partial coarsened graph)?

```cpp
// 2. split the input edges to local partitions

  std::vector<rmm::device_uvector<vertex_t>> edge_partition_edgelist_srcs{};
  edge_partition_edgelist_srcs.reserve(minor_comm_size);
  for (int i = 0; i < minor_comm_size; ++i) {
    rmm::device_uvector<vertex_t> tmp_srcs(edgelist_edge_counts[i], handle.get_stream());
    thrust::copy(handle.get_thrust_policy(),
                 edgelist_srcs.begin() + edgelist_displacements[i],
                 edgelist_srcs.begin() + edgelist_displacements[i] + edgelist_edge_counts[i],
                 tmp_srcs.begin());
    edge_partition_edgelist_srcs.push_back(std::move(tmp_srcs));
  }
  edgelist_srcs.resize(0, handle.get_stream());
  edgelist_srcs.shrink_to_fit(handle.get_stream());

  std::vector<rmm::device_uvector<vertex_t>> edge_partition_edgelist_dsts{};
  edge_partition_edgelist_dsts.reserve(minor_comm_size);
  for (int i = 0; i < minor_comm_size; ++i) {
    rmm::device_uvector<vertex_t> tmp_dsts(edgelist_edge_counts[i], handle.get_stream());
    thrust::copy(handle.get_thrust_policy(),
                 edgelist_dsts.begin() + edgelist_displacements[i],
                 edgelist_dsts.begin() + edgelist_displacements[i] + edgelist_edge_counts[i],
                 tmp_dsts.begin());
    edge_partition_edgelist_dsts.push_back(std::move(tmp_dsts));
  }
  edgelist_dsts.resize(0, handle.get_stream());
  edgelist_dsts.shrink_to_fit(handle.get_stream());
```
ChuckHastings commented 4 days ago

No. Each GPU will contain a subset of the coarsened graph.

One of the details I left out above is the effect you're seeing here. Earlier I wrote:

> When we coarsen the graph, each GPU iterates over its set of edges and creates an edge for the coarsened graph by replacing the vertex id on either end with the cluster id that it is assigned to. This is actually done in chunks called edge partitions (leaving out some details as to how) to minimize the memory footprint.

You can think of the 2D partitioning as breaking the adjacency matrix into a series of tiles. But rather than a simple one tile per partition, we create multiple tiles and store multiple tiles on each partition. Without going into all of the details (https://ieeexplore.ieee.org/document/9926341 describes the technique in more detail), storing multiple tiles per partition allows us to minimize the computation and communication required. The multiple edge partitions are essentially the different tiles of the adjacency matrix that are assigned to this GPU. The edge partitions are separated so that all of the edges that have a destination on GPU X are together in one partition, and the edges that have a destination on GPU Y are together in a different partition.
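That grouping-by-destination-owner can be sketched as follows (a hypothetical host-side illustration; the names and the simple modulo ownership rule are assumptions, not cuGraph's real partitioning):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch: within one GPU's set of edges, group edges into
// edge partitions keyed by the GPU that owns the destination vertex, so a
// later shuffle can send each whole partition to exactly one peer.
std::vector<std::vector<std::pair<std::int64_t, std::int64_t>>>
group_by_dst_gpu(std::vector<std::pair<std::int64_t, std::int64_t>> const& edges,
                 int num_gpus)
{
  std::vector<std::vector<std::pair<std::int64_t, std::int64_t>>> parts(num_gpus);
  for (auto const& e : edges) {
    int owner = static_cast<int>(e.second % num_gpus);  // assumed dst -> GPU rule
    parts[owner].push_back(e);
  }
  return parts;
}
```

Each inner vector plays the role of one edge partition: everything destined for the same GPU travels together, which is what keeps the per-step working set bounded by one partition rather than the whole local edge list.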

ChuckHastings commented 4 days ago

Sorry... my answer got reordered because I referenced something I said earlier. Look above here for my answer to this question: https://github.com/rapidsai/cugraph/issues/4638#issuecomment-2341854231

darius513 commented 3 days ago

Thank you again for your reply; I am closing this issue.