rapidsai / cugraph

cuGraph - RAPIDS Graph Analytics Library
https://docs.rapids.ai/api/cugraph/stable/
Apache License 2.0

[ENH] plot() / graphistry binding #1011

Closed lmeyerov closed 4 years ago

lmeyerov commented 4 years ago

Describe the solution you'd like

Forwards-compatible cugraph plotting bindings with graphistry:

1. in cuGraph:

G.to_graphistry().plot()

This would be built from G.__to_graphistry_nodes() and G.__to_graphistry_edges(), which return DataFrames and, based on whatever the cugraph settings are, add bindings like graphistry.bind(source='s', destination='d', edge_weight='z', ...). If cugraph ever changes those settings, it can update the plotter bindings along with them.

2. in graphistry:

a

graphistry.graph(G).plot()

b

graphistry.nodes(G).plot()
graphistry.edges(G).plot()

c

graphistry.graph(G).enrich_graph(G2).plot()
graphistry.nodes(G).enrich_nodes(G2).plot()
graphistry.edges(G).enrich_edges(G2).plot()

The graphistry side would use the cugraph helpers

d

G = graphistry.to_cugraph()
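
The helper idea from item 1 could be sketched roughly as below: derive the plotter's bind arguments from whatever column names the graph was built with, so the two never drift apart. Everything here is hypothetical. `EdgeSettings` and `graphistry_bind_kwargs` are illustrative names, not cuGraph or PyGraphistry APIs; only the `source`, `destination`, and `edge_weight` keyword names are taken from the `graphistry.bind(...)` call above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class EdgeSettings:
    """Hypothetical record of the column names a graph uses for its edge list."""
    source: str
    destination: str
    weight: Optional[str] = None  # None means the graph is unweighted


def graphistry_bind_kwargs(settings: EdgeSettings) -> Dict[str, str]:
    """Translate edge settings into keyword arguments for a bind(...) call."""
    kwargs = {"source": settings.source, "destination": settings.destination}
    if settings.weight is not None:
        # Only bind a weight column when the graph actually has one.
        kwargs["edge_weight"] = settings.weight
    return kwargs


# Example: the 's' / 'd' / 'z' names from the bind(...) call in item 1.
print(graphistry_bind_kwargs(EdgeSettings("s", "d", "z")))
```

If the graph side owned a record like this, a rename of its internal columns would only need to update the record, not every caller.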

Additional context

Basics seem simple. The cugraph API is a bit of a moving target, so establishing the core helpers on the cugraph side would be a big help. For a sense, see the code example here:

import cudf, cugraph, graphistry

#free account @ https://www.graphistry.com/get-started
graphistry.register(api=3, protocol='https', server='hub.graphistry.com', username='***', password='***')

n = 100
e = cudf.DataFrame({
    's': range(n), 
    'd':  [(x+1) % n for x in range(n)], 
    'w': cudf.Series([x/100 for x in range(n)], dtype='float32')
})
G = cugraph.from_cudf_edgelist(e, source='s', destination='d', edge_attr='w')
graphistry\
    .edges(G.view_edge_list())\
    .bind(source='src', destination='dst', edge_weight='weights')\
    .settings(url_params={'edgeInfluence': 3})\
    .plot()

=>

import cudf, cugraph, graphistry
graphistry.register(api=3, protocol='https', server='hub.graphistry.com', username='***', password='***')

n = 100
e = cudf.DataFrame({
    's': range(n), 
    'd':  [(x+1) % n for x in range(n)], 
    'w': cudf.Series([x/100 for x in range(n)], dtype='float32')
})
G = cugraph.from_cudf_edgelist(e, source='s', destination='d', edge_attr='w')
G.to_graphistry().plot()

BradReesWork commented 4 years ago

@lmeyerov

I have some concerns and issues with this feature request. The first, and major, issue is that Graphistry is closed source, and we want to only add conversion routines for open-source products.

Also, it seems like you are doing extra work. Data starts as a pseudo-property graph in cuDF, where it is rich with attributes. From there a Graph is created, which really just maintains references back to the dataframe. You then want the Graph to create a new DataFrame that sounds similar to the original.

lmeyerov commented 4 years ago

Hi @BradReesWork, thanks for the reasoned response. For the major concern, maybe not obvious, PyGraphistry is an OSS project that is increasingly used as a thick Swiss army knife for going between graph data sources <> pydata, not just our proprietary rapids-native plotter backend. Ex: cugraph's hypergraph PR comes from working off of pygraphistry's code & tests.

On the technical side for cugraph, it's currently unpredictable whether some program (user, graphistry, cugraph), across different data, will fail because one part used col "source", another "src", and some change uses "src_id".

Maybe we can break this down to exposing two thin & generic interop APIs. The result should be less heartburn for both direct users + framework writers. Then this ENH hollows out to tinier features on top of that. Happy to close this ENH and file those, lmk.

====

  1. Property graph settings introspection, and ideally, uniform create/set/get:
# already exists, includes attr cols
G.view_edge_list() -> cudf.DataFrame 

# missing; current 'G.nodes() -> Series' precludes getting node attr cols
G.view_node_list() -> cudf.DataFrame 

# implicit today: inconsistent across individual calls, & bindings change across upgrades
G.bindings() -> {
  'edges': {'source_id': [str], 'destination_id': [str], ?'weight': str, ...},
  'nodes': {'node_id': [str], ?'weight': str, ...},
  'settings': {...}
}
# Or, per-attribute getters: G._edge_source_id :: [str], G._node_id :: [str], ...
# ... and the same for any meaningful settings, like symmetrized, is_directed, and anything else cugraph sees as important...

That gets read-only use cases quite far. Eventually, it would be ideal to also have reliable dual create / update methods like graph(node_list=...), set_node_list(), clone() for enriching & interactive analytics flows, and eventually for the bindings too. I can see that being out of scope, but at least getting schema introspection would help with writing stable readers.
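
To make the "stable reader" point concrete, here is a minimal sketch of a reader that asks for the bound column names instead of hard-coding them. It assumes a dict shaped like the `G.bindings()` proposal above; `EXAMPLE_BINDINGS`, `edge_columns`, and `check_edge_frame` are hypothetical names, and the `'vertex'` node column is a placeholder, not a documented cuGraph name.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical result of a G.bindings() call, following the shape proposed above.
# 'src', 'dst', 'weights' mirror the bind(...) call in the first comment;
# 'vertex' is a placeholder node column name.
EXAMPLE_BINDINGS: Dict[str, Any] = {
    "edges": {"source_id": ["src"], "destination_id": ["dst"], "weight": "weights"},
    "nodes": {"node_id": ["vertex"]},
    "settings": {"is_directed": True},
}


def edge_columns(bindings: Dict[str, Any]) -> List[str]:
    """Return the bound edge columns in source, destination, weight order."""
    edges = bindings["edges"]
    cols = list(edges["source_id"]) + list(edges["destination_id"])
    weight = edges.get("weight")  # optional, matching the ?'weight' entry
    if weight is not None:
        cols.append(weight)
    return cols


def check_edge_frame(bindings: Dict[str, Any], available: Iterable[str]) -> None:
    """Raise a clear error if the edge frame lacks any bound column."""
    present = set(available)
    missing = [c for c in edge_columns(bindings) if c not in present]
    if missing:
        raise KeyError(f"edge list is missing bound columns: {missing}")


# A reader would call this with the column names of G.view_edge_list().
check_edge_frame(EXAMPLE_BINDINGS, ["src", "dst", "weights"])
```

With something like this, a column rename on the cugraph side shows up as one explicit error at the boundary rather than a failure deep inside user or plotter code.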

===

  2. Pluggable plotter:

Users of cugraph would benefit from seeing intermediate + final results from a plotter of choice:

cugraph.set_plotter(engine=some_plugin)
G.plot()

where every engine implements a thin interface like class { 'init': ..., 'plot': (self, data : cuGraph, *args, **kwargs) -> Any }

Implementors of an engine would benefit from the introspection methods in (1), which give a more predictable target across upgrades :)
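
A rough sketch of what such a plugin mechanism might look like, written as plain module-level functions. None of this is an existing cuGraph API: `PlotEngine`, `set_plotter`, `plot`, and `EchoEngine` are hypothetical, and a string stands in for the graph object so the example runs without a GPU.

```python
from typing import Any, Optional, Protocol


class PlotEngine(Protocol):
    """Thin interface every plotting engine would implement (hypothetical)."""

    def plot(self, data: Any, *args: Any, **kwargs: Any) -> Any: ...


_engine: Optional[PlotEngine] = None


def set_plotter(engine: PlotEngine) -> None:
    """Register the engine used by later plot() calls."""
    global _engine
    _engine = engine


def plot(data: Any, *args: Any, **kwargs: Any) -> Any:
    """Forward the graph to the registered engine; fail clearly if none is set."""
    if _engine is None:
        raise RuntimeError("no plotter registered; call set_plotter(engine=...) first")
    return _engine.plot(data, *args, **kwargs)


class EchoEngine:
    """Toy engine that only reports what it was asked to draw."""

    def plot(self, data: Any, *args: Any, **kwargs: Any) -> Any:
        return {"data": data, "options": kwargs}


set_plotter(engine=EchoEngine())
result = plot("G", layout="force")  # "G" stands in for a cuGraph Graph
```

Keeping the interface to a single `plot` method means an engine needs no knowledge of the library beyond the graph object it receives.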

BradReesWork commented 4 years ago

@lmeyerov there are still some things here that I think go beyond what cuGraph should be doing. But I am delaying fully commenting on this issue until after we have property graph support working (very soon). Closing the issue, but will re-address it in a month.