neo4j / graph-data-science

Source code for the Neo4j Graph Data Science library of graph algorithms.
https://neo4j.com/docs/graph-data-science/current/

GDS - RandomWalk - Unable to load NODE #337

Open Mintactus opened 3 days ago

Mintactus commented 3 days ago

Neo4j 5.25.1, GDS 2.12, GDS Python Client 1.12

The randomWalk algo doesn't load my sourceNode, details below:

My in-memory GDS graph has been built from a pandas DataFrame using the construct method of gds, so it doesn't exist and will not exist on disk; it's intended for in-memory analysis only.

Here is the content of the in-memory graph, extracted with gds.graph.nodeProperty.stream:

                 nodeId  propertyValue nodeLabels
0   6335695024714629015       -0.00003
1    531768015437695177        0.00009
2   3558886278460545694       -0.00012
3   7960371801618416072       -0.00006
4    688712822280937494        0.00009
5   6445645390101772454        0.00000
6   4640442843099832304       -0.00006
7   6026970582286088324        0.00006
8   5356341080109221825        0.00003
9   1843909622001289035        0.00006
10  5984421542275516993       -0.00009
11  1113611838033320553       -0.00003
12  4162479979561917907        0.00003

When trying to run randomWalk:

    sourceNode = self.markov_chain_nodes['nodeId'].last()  # this outputs a signed int64
    random_walk_config = {
        'sourceNodes': [sourceNode],
        'walkLength': FUTURE_SIZE,
        'walksPerNode': 1,
        'relationshipWeightProperty': 'transition probability',
        'concurrency': 4
    }
    future = self.gds.randomWalk.stream(self.graph, **random_walk_config)

I got this error: {message: Failed to invoke procedure gds.randomWalk.stream: Caused by: org.neo4j.internal.kernel.api.exceptions.EntityNotFoundException: Unable to load NODE 4162479979561917907.}.

But the node id 4162479979561917907 clearly exists in the in-memory graph.

I read that I'm supposed to use gds.find_node_id to match the sourceNode, but this is an in-memory graph only and doesn't need to become an on-disk graph. Having to create an on-disk graph just to make it work doesn't make any sense to me.

This might also be considered as a feature request then...

Thanks for your support :)

IoannisPanagiotas commented 2 days ago

Hi @Mintactus ,

I have looked into your issue. I can confirm there is a bug in randomWalk when working with graphs that are not backed by a database. We have applied a fix, which should be out in the next GDS release, but I am not sure when that is going to be.

In the meantime, as a workaround, I would suggest the following:

Instead of running randomWalk through the GDS Python client, you can use the Neo4j Python driver and call a Cypher query directly. There are instructions at https://neo4j.com/docs/python-manual/current/ on how to do this.

The Cypher query that you need is the following, where X is the value of
sourceNode = self.markov_chain_nodes['nodeId'].last()

 CALL gds.randomWalk.stream(
  'myGraph',
  {
    sourceNodes: X,
    walkLength: 3,
    walksPerNode: 1,
    randomSeed: 42,
    concurrency: 1
  }
)
YIELD nodeIds

I believe that the execute_query method described on the page I shared should work.

This should work, as it avoids the faulty computation. Let us know if you need any help in running that query.
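
One way to substitute X without pasting the value into the query text is to pass it as a Cypher parameter. The following is only a sketch of that idea, not the maintainers' code: the helper names `random_walk_params` and `run_random_walk`, the parameter names, and the connection details are assumptions made for illustration. It assumes the `neo4j` Python driver (5.x) is installed, and it converts the id with `int()` so that a numpy int64 is sent as a plain Python integer.

```python
from typing import Any

# Same query as above, with the graph name, source node and walk length
# supplied as Cypher parameters instead of being written into the text.
RANDOM_WALK_QUERY = """
CALL gds.randomWalk.stream(
  $graphName,
  {
    sourceNodes: $sourceNodes,
    walkLength: $walkLength,
    walksPerNode: 1,
    randomSeed: 42,
    concurrency: 1
  }
)
YIELD nodeIds
RETURN nodeIds
"""


def random_walk_params(graph_name: str, source_node: Any, walk_length: int) -> dict[str, Any]:
    """Build the parameter map for RANDOM_WALK_QUERY (hypothetical helper)."""
    # int() turns a numpy int64 (or similar) into a plain Python int.
    node_id = int(source_node)
    # Node ids are signed 64-bit integers, so reject anything outside that range.
    if not -(2**63) <= node_id < 2**63:
        raise ValueError(f"node id {node_id} does not fit in a signed 64-bit integer")
    return {
        "graphName": graph_name,
        "sourceNodes": [node_id],
        "walkLength": walk_length,
    }


def run_random_walk(uri: str, auth: tuple[str, str], params: dict[str, Any]) -> list[list[int]]:
    """Run the query with the Neo4j Python driver and return one id list per walk."""
    # Imported here so the helper above can be used without the driver installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=auth) as driver:
        records, _summary, _keys = driver.execute_query(
            RANDOM_WALK_QUERY, parameters_=params
        )
        return [record["nodeIds"] for record in records]
```

A call could then look like `run_random_walk("neo4j://localhost:7687", ("neo4j", "password"), random_walk_params("myGraph", sourceNode, 3))`, where the URI and credentials are placeholders.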

FlorentinD commented 2 days ago

You can also still use the GDS client:

gds.run_cypher("""CALL gds.randomWalk.stream(
  'myGraph',
  {
    sourceNodes: X,
    walkLength: 3,
    walksPerNode: 1,
    randomSeed: 42,
    concurrency: 1
  }
)
YIELD nodeIds
""") 

Mintactus commented 1 day ago

Thank you guys,

@IoannisPanagiotas @FlorentinD

I'm glad to know I wasn't crazy. I have used GDS for a while, and on this one I couldn't explain what I was doing wrong.

Amazing support!

Mintactus commented 1 day ago

I did some deeper testing and investigation.

If I'm right, a graph created using the construct method (a graph that does not exist on disk) will use the nodeId provided in the DataFrame as the actual node id, usable as a sourceNode inside an algorithm. That seems to be right, based on the picture provided.

As suggested, I tried the above using only the Cypher statement inside the browser instead of the GDS Python Client randomWalk method, but GDS is still not able to locate the node id. So it seems the problem is not coming from the GDS Python Client, but rather from GDS itself not being able to locate a node id in a graph that does not exist on disk.

To reproduce the issue, build an in-memory graph from a DataFrame using the construct method, then try to run the randomWalk algorithm using Cypher with any sourceNode in it. It fails.

Unless I missed something in the docs, this behavior obliges the developer to:

- Export the in-memory graph into a new database (it has to be a new one; you can't use the one the gds object initiated its connection with)
- Create a new gds object linked to this new database
- Create a new native in-memory projection from this new database
- Then run the algorithm on this new projection

That is quite a large workaround, and it makes in-memory graphs drastically less exciting to use. But thanks for your support, hopefully a patched version will come out soon :)

[Attached image: nodeIdProblem]

IoannisPanagiotas commented 4 hours ago

@Mintactus

Please remove 'path' from the YIELD clause, as in the query we shared above! The bug is contained in that part, because it relies on having a Neo4j graph. The query should run normally after that.

Best.
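
If the individual steps of a walk are needed while 'path' cannot be yielded, they can be rebuilt on the client side from nodeIds alone. This is a minimal sketch, assuming each streamed row's nodeIds is the list of visited node ids in walk order; `walk_steps` is a hypothetical helper name, not part of GDS or its Python client.

```python
def walk_steps(node_ids: list[int]) -> list[tuple[int, int]]:
    """Pair each visited node with the next one to get the walk's steps."""
    # zip stops at the shorter input, so a walk of n nodes yields n - 1 steps
    # and a single-node walk yields no steps at all.
    return list(zip(node_ids, node_ids[1:]))
```

For a walk that visited nodes 1, 2 and 3 in that order, this gives the steps (1, 2) and (2, 3).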