rapidsai / cugraph

cuGraph - RAPIDS Graph Analytics Library
https://docs.rapids.ai/api/cugraph/stable/
Apache License 2.0
1.68k stars 300 forks source link

[ENH] class Graph API (python) design needs improvements #192

Closed seunghwak closed 4 years ago

seunghwak commented 5 years ago

Is your feature request related to a problem? Please describe. Related: https://github.com/rapidsai/cugraph/issues/187

class Graph (defined in c_graph.pyx) stores a graph and provides functions to initialize, delete, convert (between different graph formats), and query.

There are corresponding C++/CUDA level functions.

There exist some inconsistencies in naming, and some functions may surprise users. In general, surprising users is bad in API design. If a user expects a function to do A, but the function silently does B, or (B + A), it may leads to a hard-to-fix bug till the user finds that the function is doing something unexpected.

So inconsistencies. In the python level, add_edge_list and add_adj_list initialize a graph while add_transposed_adj_list performs conversion. In CUDA/C++, gdfadd..._list performs conversion but add_edge_list and add_adjlist perform initialization in python. In CUDA/C++, gdf..._list_view functions perform initialization. Calling both add_edge_list and add_adj_list can result in a single graph instance storing two different graphs (issue #187). And we have num_vertices, why no num_edges?

And the surprise part. view_edge_list, view_adj_list, and num_vertices implicitly perform conversion. This requires more thoughts as there is a design trade-off. Implicit (or automatic) conversion allows users to type less. But this can require a significant amount of computing and also more memory for large graphs. It can be quite surprising for users to see memory allocation error after long wait when they simply query the number of vertices.

Describe the solution you'd like So we may clarify the API functions for initialization, deletion, conversion, and query and also need to fix inconsistencies between the python API and C++/CUDA API.

For initialization:

For deletion: What we have looks OK to me...

For conversion:

For query:

Describe alternatives you've considered Any comments and thoughts (and new options) will be appreciated.

ericmjl commented 5 years ago

I would like to mirror the comments by @seunghwak. In addition, coming in as a NetworkX user, I was also surprised by the API design. Being surprised by the API is a red flag for me as an end-user, and a tutorial leader for NetworkX-based tutorials (I have a number coming up over the next few months).

Having seen a tweet on Twitter, I had quite a number of high hopes regarding the API. In particular, because NetworkX has been so idiomatic for learning and doing graph analytics, I thought it would be super awesome if cuGraph just outright copied the NetworkX API, and the team followed-up with PRs to NetworkX to suggest improvements to the NetworkX API (and sync it up) where the nx API might not be the best. In particular, if the API were switched over to explicitly mirror the NetworkX API, I could very easily recommend using it as the solution for scalability when I do my NetworkX tutorials at conferences. Scaling up is a common question for NetworkX users, and I have traditionally recommended iGraph with some hesitation, because the API is different. As it stands, however, cuGraph is not a solution I would be ready to recommend - the API is very different from NetworkX's API.

Graph creation, in particular, is a great example of where I was surprised when I read the cuGraph API. Reading the docs is when I realized that cuGraph is instead not a drop-in replacement for NetworkX.

In NetworkX, the following API calls are used:

import networkx as nx

G = nx.Graph()
G.add_edges_from(...)  # add a set of edges to the graph
G.add_edge(...)  # add an edge to the graph
G.add_nodes_from(...)   # add a set of nodes to the graph
G.add_node(...)  # add a node
G.nodes  # get view of nodes
G.edges  # get view of edges

In order to make cuGraph a drop-in replacement for NetworkX, at the very minimum, those same class methods should be present.

In addition, in NetworkX, "edge list" is the term used for a list of two- or three-tuples that are passed in to create edges, with nodes automatically created. "adjacency", from my memory, as a term it is reserved the adjacency matrix. This separation of vocabulary makes it much easier for an end-user to reason about what exactly we're requesting for from the graph.

I'm happy to chat further, and even help out with input on where to match the API for end-users. The underlying implementation is one for which I'm less concerned about (though important, for reproducibility/accuracy/"science etc."), but the API really matters.

The RAPIDS team has done a great job already with cuDF copying the pandas API, and CuPy also has been awesome with its explicit "copy NumPy" policy. cuGraph actually copying NetworkX's API would make this a super awesome project too!

seunghwak commented 5 years ago

Thank you very much @ericmjl for the detailed comments. This makes a lot of sense of to me, and we greatly appreciate your feedback and this will be really helpful for us to improve cuGraph!!!

ericmjl commented 5 years ago

My pleasure, @seunghwak! Please do let me know if I can help in some way. I'll be keeping an eye out here! :smile:

seunghwak commented 5 years ago

Thanks a lot, and we'll post the proposed updates in this thread before jumping to implementation, and we'll greatly appreciate any feedbacks!!!

seunghwak commented 5 years ago

We drafted an initial plan to better match the NetworkX API and summarized design issues that need further discussion. Now this discussion is moved to github, and any feedback or suggestions will be greatly appreciated.

cuGRAPH adding an API Layer to match NetworkX

Related github issues: https://github.com/rapidsai/cugraph/issues/140https://github.com/rapidsai/cugraph/issues/187 https://github.com/rapidsai/cugraph/issues/192

Table of Contents

  1. Design Objectives
  2. Major Challenges and Design Decisions
  3. Class Graph
  4. Graph Generation 4.1. R-mat graph generation
  5. Algorithms 5.1. Breadth-first-search 5.2. Jaccard 5.3. Louvain 5.4. Overlap Coefficient 5.5. Pagerank 5.6. Spectral Clustering 5.7. Single-source-shortest-path
  6. References

1. Design Objectives

2. Major Challenges and Design Decisions

cuGraph is the compute engine for graph analytics for the bigger RAPIDS project. We do not store all the auxiliary data that may not be used by cuGraph algorithms inside cuGraph objects. NetworkX stores graph, node, and edge attributes inside NetworkX objects whether they are used by NetworkX algorithms or not. We attempt to follow the NetworkX API wherever possible but we need to deviate from the NetworkX API in several places due to this difference; cuGraph is more restrictive in supporting graph, node, and edge attributes.

3. Class Graph

NetworkX & cuGraph comparison

Design Issues

API Proposal

4. Graph Generation

4.1. R-mat graph generation

NetworkX & cuGraph comparison

API Proposal

5. Algorithms

5.1. Breadth-first-search

NetworkX & cuGraph comparison

API Proposal

5.2. Jaccard

NetworkX & cuGraph comparison

Design Issue

API Proposal

5.3. Louvain

NetworkX & cuGraph comparison

Design Issues

API Proposal

5.4. Overlap Coefficient

NetworkX & cuGraph comparison

Design Issue

API Proposal

5.5. Pagerank

NetworkX & cuGraph comparison

API Proposal

5.6. Spectral Clustering

NetworkX & cuGraph comparison

Design Issues

API Proposal

5.7. Single-source-shortest-path

NetworkX & cuGraph comparison

API Proposal

6. References

[NetworkX] Aric Hagberg, Dan Schult, Pieter Swart, NetworkX Reference Release 2.3rc2.dev20190331021853, March 31st, 2019.

ericmjl commented 5 years ago

@seunghwak amazing work here! Thanks for posting back.

Looking through the design document, I have the following comments:

Multigraph support.

I think it is reasonable to keep multigraph support on a "later in the roadmap" case. The most common graph use cases are single edges between nodes, as well as distinguishing between directed and undirected.

We store only the attributes directly used by cuGraph algorithms inside cuGraph objects; cuGraph objects can store the node attribute “bipartite” and the edge attribute “weight” but other attributes are stored outside cuGraph objects.

Perhaps taking advantage of the underlying columnar data structure could be helpful? A common use case for graphs is to store the same kind of data under each key for each node, and do a similar thing for each edge. This is equivalent to keeping a node table (rather than a node list) and an edge table (rather than an edge list). Both of these can be cast as data frames; an example I use is the Divvy bike sharing dataset in this notebook on NAMS.

Graphs are initialized with edge data. How should we support the node attribute “bipartite”?

Also suggest using the aforementioned data structure?

Also thought a bit about how API access can be done later - it's common for NetworkX users to query for nodes/edges with a certain attribute value directly. These are equivalent to dataframe filtering operations.

Should we expose multiple formats (CSR, CSC, COO) to users? For example, cuGraph currently has delete_edge_list(), delete_adj_list(), and delete_transposed_adj_list(). NetworkX has only a single clear() function.

Personal opinion, but I think simplicity would be better in this case. I don't know how to articulate why, it's just a gut feeling.

We may consider supporting a list of tuples and NetworkX graph objects in the future to improve NetworkX compatibility (assuming that cuDF provides an efficient mechanism to convert a list of tuples or python dictionaries to a columnar GPU DataFrame).

That would be amazing! That's the most idiomatic way of interacting with NetworkX graphs at the moment. The next most idiomatic way is to use dataframes, which NetworkX also supports, btw.

rmat_graph(scale=10, a=0.57, b=0.19, c=0.19, d=0.05, create_using=None, seed=None)

Looks idiomatic to me, as the suffix _graph is in-line with how nx graph generators are named.

Just rename nvLouvain to louvain. Add a second API with advanced parameters.

Perhaps just adding the advanced parameters with sane defaults would be okay? Python users are generally ok with that.

Follow SKLearn for spectral clustering.

100% agree.


I guess overall, I see it as being awesome that you guys are designing against NetworkX's API!

In terms of the underlying data structure, I can see how an internal change to using cuDFs would be a great advantage for interoperability with the rest of the RAPIDS ecosystem. I guess in terms of user-facing design, ensuring that the appropriate connectors are present to make it usable with the NetworkX API would be very helpful.

My biggest thought on how to keep things simple for you all is to use a cuDF each for the node and edge tables, and then building the connectors necessary to ask for graph data. Multi-partite graphs can be represented using a column called "partition" (for the general case) and "bipartite" (for the case of being interoperable with NetworkX). (Side note: bipartite is definitely more commonly used right now.)

Thanks again for keeping us updated, @seunghwak!

seunghwak commented 5 years ago

To properly support Graph (undirected) and DiGraph (directed) in NetworkX, https://github.com/rapidsai/cugraph/issues/316 and https://github.com/rapidsai/cugraph/issues/317 need to be addressed first (and cugraph's C++ gdf_graph (or cugraph::graph following cudf's ongoing transition from gdf_column to cudf::column) needs to be updated to explicitly handle both undirected and directed graphs). Once we use different classes for undirected and directed graphs following NetworkX, https://github.com/rapidsai/cugraph/issues/318 can be easily solved.

Iroy30 commented 4 years ago

So inconsistencies. In the python level, add_edge_list and add_adj_list initialize a graph while add_transposed_adj_list performs conversion. In CUDA/C++, gdfadd..._list performs conversion but add_edge_list and add_adjlist perform initialization in python. In CUDA/C++, gdf..._list_view functions perform initialization. Calling both add_edge_list and add_adj_list can result in a single graph instance storing two different graphs (issue #187). And we have num_vertices, why no num_edges?

This has been addressed. With graph class changes, we now have a check https://github.com/rapidsai/cugraph/blob/5fc8b658c42b581ffc25a78447bb89c1f761357d/python/cugraph/structure/graph.py#L328 to raise exception to avoid "a single graph storing two different graphs". Also we have both number_of_vertices and number_of_edges graph functions.

And the surprise part. view_edge_list, view_adj_list, and num_vertices implicitly perform conversion. This requires more thoughts as there is a design trade-off. Implicit (or automatic) conversion allows users to type less. But this can require a significant amount of computing and also more memory for large graphs. It can be quite surprising for users to see memory allocation error after long wait when they simply query the number of vertices.

Number of vertices is executed completely in python. It stores node_Count so this type of conversion only happens once for a graph, and it later calls, this node_count value is retrieved. Viewing also performs implicit conversions for convenience and networkx consistency. However even these conversions would happen only once for a Graph and is user-friendly for situations where user inputs edgelist and wants to simply get/generate an adjlist. For benchmarking/timing these coo2csr conversions can be explicitly done before to avoid including these overheads in the overall performance analysis.

  • Option 1) initialize in instance creation (similar to cudf dataframe or series). e.g. G = cugraph.Graph(col_0, col_1, col_2, format) # col_0 and col_1 do not sound like good names but col_0 and col_1 can have different meanings in COO and CSR
  • Option 2) Initialize using view e.g. G = cugraph.view_edge_list(source_col, dest_col, value_col) # isn't weight_col a better name than value_col? and we should we use edge_list, adj_list, transposed_adj_list or COO, CSR, and CSC?

In NetworkX, the following API calls are used: G.add_edges_from(...) # add a set of edges to the graph G.add_edge(...) # add an edge to the graph G.add_nodes_from(...) # add a set of nodes to the graph G.add_node(...) # add a node In order to make cuGraph a drop-in replacement for NetworkX, at the very minimum, those same class methods should be present.

We have this API changed quite a bit. It now follows networkx from_pandas_edgelist. We can either use 1) G = cugraph.from_cudf_edgelist(df, source='src', destination='dest', edge_attr='weights', create_using = cugraph.Graph) 2) G = cugraph.Graph() G.from_cudf_edgelist(df, source='src', destination='dest', edge_attr='weights')

Iroy30 commented 4 years ago

3. Class Graph

NetworkX & cuGraph comparison

  • NetworkX has separate classes for different graph types: Graph for undirected graphs, DiGraph for directed graphs, MultiGraph for undirected multigraphs, and MultiDiGraph for directed multigraphs. cuGraph has a single Graph class.

Addressed. Cugraph now has separate Graphs.

  • NetworkX supports varying numbers of attributes for graphs, nodes, and edges. cuGraph supports only edge weights (we plan to add the node attribute “bipartite”).

Design Issues

  • We store only the attributes directly used by cuGraph algorithms inside cuGraph objects; cuGraph objects can store the node attribute “bipartite” and the edge attribute “weight” but other attributes are stored outside cuGraph objects.
  • How should we support graph updates? 1) Batch updates; 2) update the underlying data structure for efficient dynamic graph updates; 3) or disallow graph updates (in this case, users need to update graphs outside cuGraph and explicitly initialize again). In short term, we plan to support graph construction on initialization and batch updates only.
  • Multigraph support.
  • Graphs are initialized with edge data. How should we support the node attribute “bipartite”?

bipartite and multigraphs with more than one edge_attr is in progress.

API Proposal

  • Initialization

    • cugraph.Graph.init(incoming_graph_data=None, weight=’weight’) (following networkX.Graph.init()): incoming_graph_data is an edge list or any cuGraph graph object. NetworkX takes a list of tuples as an edge list. cuGraph should take a dataframe with multiple columns instead (two columns for sources and destinations, an optional column for edge weights). We may consider supporting a list of tuples and NetworkX graph objects in the future to improve NetworkX compatibility (assuming that cuDF provides an efficient mechanism to convert a list of tuples or python dictionaries to a columnar GPU DataFrame).
    • cuGraph.Graph.add_edges_from(ebunch_to_add) (following networkX.Graph.add_edges_from()): ebunch_to_add is a container of tuples (one tuple per edge) in NetworkX. cuGraph should take a dataframe with multiple columns instead (two columns for source and destination).
    • cuGraph.Graph.add_weighted_edges_from(ebunch_to_add, weight=’weight’) (following networkX.Graph.add_weighted_edges_from()): similar to add_edges_from(). weight=’weight’ points the additional column to provide edge weights.
    • cuGraph.from_cudf_adjacencylist(df, offset=’offset’, index=’index’, weight=None, create_using=None) (following networkx.convert_matrix.from_pandas_adjacency()): NetworkX takes an adjacency matrix. cuGraph should take an adjacency list (offsets + indices + optional edge weight).
    • cuGraph.from_cudf_edgelist(df, source=’source’, target=’target’, weight=None, create_using=None) (following networkx.convert_matrix.from_pandas_edgelist).
    • Same functions for DiGraph.

We support the cuGraph.from_cudf_edgelist() functionality as the way of adding edges, consistent with networkx's nx.from_pandas_edgelist()

  • Retrieval

    • cuGraph.to_cudf_adjacencylist (following networkx.convert_matrix.to_pandas_adjacency()): Note that this function returns an adjacency list instead of an adjacency matrix.
    • cuGraph.to_cudf_edgelist (following networkx.convert_matrix.to_pandas_edgelist()).
    • We may add cuGraph.to_pandas_adjacency() and cuGraph.to_pandas_edgelist() in the future, or we may handle this outside cuGraph (use cuDF for conversion outside cuGraph).

Retrieval is done using view functions

  • Additional functions from NetworkX

    • number_of_nodes().
    • number_of_edges().
    • NetworkX has many more functions to query graphs. cuGraph currently has only few. We need to add these functions over time in future releases.

Many functions have been added to Graph including number_of_nodes() and number_of_edges(). PR #822 adds many of the networkx graph functions.

BradReesWork commented 4 years ago

We are doing a review of where cuGraph is related to matching NetworkX. Identified discrepancies will be added as new issues