[BUG] Bad plotting with Neo4J APOC Queries

steph-lion commented 2 years ago

Describe the bug: I'm using Neo4j and its Python driver with pygraphistry to plot my results. I can't say whether this is a bug, but when I run APOC Cypher queries, the graphistry plot is not what I expected. I get what I expect with classic "MATCH" queries, but not with APOC virtual nodes and virtual relationships. I'm trying to build the clustering output manually, because Neo4j doesn't return a graph object when I run those algorithms, so I wrote this query to get something I can plot.

To Reproduce: This is my code for the query:

import pandas as pd
import graphistry
from neo4j import GraphDatabase
#graphistry.register(api=3, username='...', password='...', bolt=NEO4J_CREDS)

label_name=input()
query="call gds.wcc.stream('"+label_name+"-interactions') yield nodeId, componentId with componentId,nodeId, apoc.coll.toSet(collect(distinct componentId)) as componentList unwind componentList as component with distinct component, componentId, gds.util.asNode(nodeId) as n call apoc.create.vNode(['component'],{component:component}) yield node call apoc.create.vRelationship(n,'BELONGS_TO',{},node) yield rel return n,rel,node limit 10"
graphistry.cypher(query).plot()

Expected behavior: Something like Neo4j's own plot: [image: graph]

Actual behavior: What happened: https://hub.graphistry.com/graph/graph.html?dataset=3e1bb1d25e374bb290c4846783983eb2 Here you can see the plot. The pairs are all stuck together, and I need to move them manually to make the edge between them appear. This example has only 10 pairs, but I'm going to plot something like 750k nodes, and I can't move them manually to see the clustering results.

Browser environment (please complete the following information):

OS: Windows 10
Browser : Chrome
Version 98.0.4758.102

Graphistry GPU server environment

Where run : Hub

PyGraphistry API client environment

Where run: VS Code Python script
Version: 0.20.5
Python Version 3.10.2

Additional context: I'm sorry, but I'm learning Neo4j and pygraphistry for the first time, so I don't know how to display clustering algorithm results. Also, if I change {component:component} to {name:component}, I get a Python error:

pyarrow.lib.ArrowTypeError: ("Expected bytes, got a 'int' object", 'Conversion failed for column name with type object')

I don't know what that means, so I changed "name" back to "component".

lmeyerov commented 2 years ago

Hi @steph-lion , interesting!

I don't think we've seen virtual nodes/relationships so far, so I'm not exactly surprised they didn't work out of the box. If that's the cause, it should be a surprisingly low lift, as our neo4j_bolt_graph->dataframe conversion portion is only ~100 LOC, so I'm happy to help figure it out; see below.

Also, just to help prioritize here, feel free to ping in Slack with any more sensitive info etc

1. I bet if you inspected the resulting node/edge dataframes, we might see more. I'm curious, does this run and give the expected output, or can you share a deeper exception output?

label_name=input()
query="..."
g = graphistry.cypher(query)
print(g._nodes)
print(g._edges)

2. The underlying code is fairly thin, in case you're curious and want to explore:

how the cypher() call invokes the driver and then converts the bolt-format results to _nodes and _edges pandas/arrow data frames: https://github.com/graphistry/pygraphistry/blob/bf99c1827510e98ea15cbd745f3b6755feee3ac3/graphistry/PlotterBase.py#L1802
the actual conversion (tiny!): https://github.com/graphistry/pygraphistry/blob/53bd7a779b9efc216301fcee94b493de9184cbc2/graphistry/bolt_util.py#L59

Something like

from neo4j import GraphDatabase, Driver
driver = GraphDatabase.driver(...)

from local_copy_of_those_snippets import bolt_graph_to_edges_dataframe, bolt_graph_to_nodes_dataframe
with driver.session() as session:
    bolt_statement = session.run(query, **params)
    graph = bolt_statement.graph()
    edges = bolt_graph_to_edges_dataframe(graph)
    nodes = bolt_graph_to_nodes_dataframe(graph)

3. Alternatively, if you can help with a sample of how to populate a db for this data (and you already shared a generic query, afaict!), it will become a lot easier for us to figure out what should be happening with step 2. E.g., something we can add to https://github.com/graphistry/pygraphistry/blob/master/test/db/neo4j/add_data.sh and then play with on our side.

4. (Edit): I should add -- you can always manually go from neo4j -> dataframe -> graphistry for virtual types, though of course it'd be cooler if we can natively support :)

nodes_df = ...
edges_df = ...
g = graphistry.nodes(nodes_df, 'my_id_col').edges(edges_df, 'my_src_col', 'my_dest_col')
g.plot()
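A minimal sketch of that manual route, for illustration only. It assumes the object returned by the driver's .graph() call exposes .nodes and .relationships, and that each element offers id, labels, type, start_node and end_node and behaves like a mapping of its properties, as in the neo4j Python driver of that era. The function name graph_to_rows and the column names node_id, src and dst are made up here; whether virtual nodes and relationships come through this way has not been verified.

```python
def graph_to_rows(graph):
    """Flatten a bolt-style graph into plain node and edge rows (lists of dicts).

    Assumption: nodes and relationships are mapping-like (dict(x) yields their
    properties) and carry the attributes used below.
    """
    node_rows = [
        # Properties first, so the id/labels columns cannot be overwritten by
        # a property that happens to share the same key.
        {**dict(node), "node_id": node.id, "labels": ",".join(sorted(node.labels))}
        for node in graph.nodes
    ]
    edge_rows = [
        {**dict(rel), "src": rel.start_node.id, "dst": rel.end_node.id, "type": rel.type}
        for rel in graph.relationships
    ]
    return node_rows, edge_rows


# Hypothetical usage (needs a live database, so it is left commented out):
# with driver.session() as session:
#     node_rows, edge_rows = graph_to_rows(session.run(query).graph())
# nodes_df = pd.DataFrame(node_rows)
# edges_df = pd.DataFrame(edge_rows)
# graphistry.nodes(nodes_df, "node_id").edges(edges_df, "src", "dst").plot()
```

Keeping the conversion as plain rows makes it easy to print and inspect what came back before handing anything to pandas or graphistry.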

But of course it'd be better to natively support, so I'm curious on 1-3, or if you can get a flow working for 4 :)

lmeyerov commented 2 years ago

(For tracking: if/when we confirm it's virtual types, will file a parallel ticket for tracking an enhancement to support virtual types)

steph-lion commented 2 years ago

Hi, thanks for the answer. This is also the first GitHub issue I've opened, and it seems I picked a good one!

I'm trying what you said.

1. This is the result of the nodes and edges print with the query above and your code:
2. Honestly, I don't know how to do that right now...
3. Unfortunately I don't have the data with me, but I think the Movies dataset offered by Neo4j should work. Just write a query that creates virtual nodes and virtual relationships and returns them. I just ran this query on the Movies database and Graphistry drew it as expected; I don't know why it worked this time:

graphistry.cypher("call apoc.create.vNode(['Greeting'],{greeting:'Hello!'}) yield node with node as n1 call apoc.create.vNode(['Greeting'],{greeting:'Hi!'}) yield node with n1, node as n2 return apoc.create.vRelationship(n1,'SIMILAR_TO',{},n2) as rel,n1,n2").plot()

https://hub.graphistry.com/graph/graph.html?dataset=3f0046843aae4681a3d8b57efbeed634

4. Same as 2, I don't know what to do...

Hope this comment helps to clarify what's happening.

lmeyerov commented 2 years ago

:)

Ok this is helping a lot, thank you!

Also, I just realized I misread the issue. I think the initial main issue is more of a visual settings thing: you wanted to be zoomed in more so you could see the edges. You can bring the elements closer together and it'll autozoom:

g = (
  graphistry
    .cypher(query)
    .settings(url_params={
        "strongGravity": "true"
    })
)
g.plot()

More options at: https://hub.graphistry.com/docs/api/1/rest/url/

Of interest are also gravity and pointSize. These correspond to values in the settings UI panel, which you can play with and then bake in. For bigger graphs, you'll probably want different settings, so one option is picking them based on graph size (len(g._edges) + len(g._nodes)).
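As a rough sketch of picking settings by graph size: the helper name pick_url_params, the 10,000-element threshold and the 0.3 point size are placeholders chosen for illustration, not recommended values. Only the strongGravity and pointSize keys come from this thread.

```python
def pick_url_params(n_nodes, n_edges, large_threshold=10_000):
    """Choose url_params from graph size.

    The threshold and the point size are illustrative guesses; tune them in
    the UI panel first, then bake the values in here.
    """
    params = {"strongGravity": "true"}
    if n_nodes + n_edges > large_threshold:
        # Shrink points on large graphs so edges stay visible.
        params["pointSize"] = 0.3
    return params


# Hypothetical usage with a graph built as in the snippet above:
# g = graphistry.cypher(query)
# g.settings(url_params=pick_url_params(len(g._nodes), len(g._edges))).plot()
```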

The additional item was about pyarrow.lib.ArrowTypeError: ("Expected bytes, got a 'int' object", 'Conversion failed for column name with type object'). I'm not sure either -- Arrow expects all values under the same field name to have the same type, so when in doubt, I'd start with checking the values in ._edges / ._nodes and, worst case, converting to strings:

cleaned_nodes_df = g._nodes.assign(some_col=g._nodes['some_col'].astype(str))
g2 = g.nodes(cleaned_nodes_df)
g2.plot()

But that assumes it's a data typing issue, and I'm not sure :)
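One way to check for that, sketched under the assumption that the error really does come from a column mixing Python types (for example, string names on regular nodes and an integer component id on virtual nodes). The helper mixed_type_columns is hypothetical; it takes anything whose .items() yields (column name, values) pairs, which covers a plain dict of lists and should also cover a pandas DataFrame.

```python
def mixed_type_columns(table):
    """Return {column: sorted type names} for columns holding more than one Python type.

    None values are skipped, since a missing value alone is not a type clash.
    """
    mixed = {}
    for column, values in table.items():
        type_names = {type(v).__name__ for v in values if v is not None}
        if len(type_names) > 1:
            mixed[column] = sorted(type_names)
    return mixed


# Toy data standing in for g._nodes: "name" mixes strings with an integer.
nodes = {
    "id": [0, 1, 2],
    "name": ["Keanu Reeves", "Carrie-Anne Moss", 7],
}
print(mixed_type_columns(nodes))  # {'name': ['int', 'str']}
```

Any column this reports is a candidate for the astype(str) conversion shown above. Note that in a DataFrame, missing values usually appear as float NaN rather than None, so a column of strings with gaps may be reported as mixing float and str.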

steph-lion commented 2 years ago

Well, right now I just added the setting "strongGravity":"true" and it seems that worked; at least I can see edges now. I hope it will also work for thousands of nodes together, since I need to plot some clusters. Here is the result: https://hub.graphistry.com/graph/graph.html?dataset=508115478cb243eb93af529afce6273e

Is there any channel where I can ask "stupid" questions about graphistry? I have a lot of them and StackOverflow is not my friend in this case... Thanks for the help with this ticket!

EDIT: With 100 nodes, it's still the same problem: https://hub.graphistry.com/graph/graph.html?dataset=a4ad840515bc4b6fba7b999b99e611c3

lmeyerov commented 2 years ago

GitHub is great even for simple stuff, others can search this too, so it's a useful help to future folk :)

The Slack channel's #help is great too, and for bigger / work stuff, our ZenDesk

    .settings(url_params={
        "strongGravity": "true",
        "pointSize": 0.3
    })

Seems to do it for the bigger link, afaict. This kind of stuff gets specific to different graphs, so while we do automate a bunch of it (you'll notice relative size changes as you zoom in/out), some last-mile tweaks do help in practice, esp. for extreme cases.

lmeyerov commented 2 years ago

Closing:

main issue was just wanting pointSize to be smaller via one of the APIs
virtual nodes seem to work fine

graphistry / pygraphistry

[BUG] Bad plotting with Neo4J APOC Queries #311