rapidsai / cugraph

cuGraph - RAPIDS Graph Analytics Library
https://docs.rapids.ai/api/cugraph/stable/
Apache License 2.0
1.68k stars 300 forks source link

[FEA] Add support to handle multiple small graphs #2762

Open BradReesWork opened 2 years ago

BradReesWork commented 2 years ago

There are numerous cases where the data consists of a collection of small graphs. It would be nice if cuGraph could handle arrays of small graphs.

Data could be up to a million small graphs, each graph having 10k nodes or less

Note: Each graph could be a property graph with node/edge attributes

ChuckHastings commented 1 year ago

We should have a conversation about this, when it becomes a priority.

It seems like this can be done entirely in python. If, for example, you numbered the vertices in each of the small graphs with a 32-bit integer and numbered the graphs with a 32-bit integer, you could create vertex ids for this data that would fit in a 64-bit integer and would essentially give you all of the graphs as disconnected components in a single graph. You could do very fast shift/mask arithmetic on the original vertex ids to get a graph id and vertex id back from any vertex id in the new graph.

Enabling vertex renumbering in calls to the C API (for SG, it's required for MG and not specified) would give us an efficient internal representation and automatically translate back to these 64-bit vertex ids. This would allow all graph algorithms to work on the created graph.

Alternatively, if we have expanded cugraph_etl by then to support multi-column vertex identifiers, we could have a column that identifies the graph id and a column that identifies the vertex id within the graph. This would be a bit more efficient, especially if you have non-integer vertex ids in the original data. If we haven't expanded cugraph_etl by then, perhaps this would be the motivating reason. Perhaps we first implement in python and then work on a cugraph_etl update to move the functionality there.