neo4j-contrib / neo4j-graph-algorithms

Efficient Graph Algorithms for Neo4j
https://github.com/neo4j/graph-data-science/
GNU General Public License v3.0
772 stars 194 forks source link

ArrayIndexOutOfBoundsException while running shortestPath() on a large graph #659

Open aalampally opened 6 years ago

aalampally commented 6 years ago

I am getting a weird error when using shortestPath algorithms in graph-algorithms. In the graph data I have Person node with primaryEmail as once of the attributes. This contains emails of multiple domains. e.g., a@xyz.com, x@abc.com etc..

When I run the following query MATCH (source:Person{primaryEmail:'a@xyz.com'}), (target:Person{primaryEmail:'b@xyz.com'}) CALL algo.shortestPath.stream(source, target, "cost",{ nodeQuery:'MATCH(n:Person) RETURN id(n) as id', relationshipQuery:'MATCH(n:Person)-[r:connected_to]-(m:Person) RETURN id(n) as source, id(m) as target, (1.0/(r.weight+1.0)) as weight', graph:'cypher'}) YIELD nodeId, cost MATCH (other:Person) WHERE id(other) = nodeId RETURN other.name, cost

It is giving me the following error: Neo.ClientError.Procedure.ProcedureCallFailed: Failed to invoke procedure 'algo.shortestPath.stream': Caused by: java.lang.ArrayIndexOutOfBoundsException: -1073739909

Just as a tweak, I tried to limit the nodes to a particular domain of email and it started running.The updated query is: MATCH (source:Person{primaryEmail:'a@xyz.com'}), (target:Person{primaryEmail:'b@xyz.com'}) CALL algo.shortestPath.stream(source, target, "cost",{ nodeQuery:'MATCH(n:Person) where n.primaryEmail ends with @xyz.com RETURN id(n) as id', relationshipQuery:'MATCH(n:Person)-[r:connected_to]-(m:Person) RETURN id(n) as source, id(m) as target, (1.0/(r.weight+1.0)) as weight', graph:'cypher'}) YIELD nodeId, cost MATCH (other:Person) WHERE id(other) = nodeId RETURN other.name, cost

Is the error because the virtual graph which is created by filtering the relationships, creating a graph which is very big and exceeding the memory? Any help is much appreciated!!

mneedham commented 6 years ago

Can you check how many records are returned by the node and relationship projection queries? If you replace the last lines with:

Return count (*)

That should do it

On Mon, Jun 11, 2018, 17:49 Anirudh Alampally notifications@github.com wrote:

I am getting a weird error when using shortestPath algorithms in graph-algorithms. In the graph data I have Person node with primaryEmail as once of the attributes. This contains emails of multiple domains. e.g., a@xyz.com, x@abc.com etc..

When I run the following query MATCH (source:Person{primaryEmail:'a@xyz.com'}), ( target:Person{primaryEmail:'b@xyz.com'}) CALL algo.shortestPath.stream(source, target, "cost",{ nodeQuery:'MATCH(n:Person) RETURN id(n) as id', relationshipQuery:'MATCH(n:Person)-[r:connected_to]-(m:Person) RETURN id(n) as source, id(m) as target, (1.0/(r.weight+1.0)) as weight', graph:'cypher'}) YIELD nodeId, cost MATCH (other:Person) WHERE id(other) = nodeId RETURN other.name, cost

It is giving me the following error: Neo.ClientError.Procedure.ProcedureCallFailed: Failed to invoke procedure 'algo.shortestPath.stream': Caused by: java.lang.ArrayIndexOutOfBoundsException: -1073739909

Just as a tweak, I tried to limit the nodes to a particular domain of email and it started running.The updated query is: MATCH (source:Person{primaryEmail:'a@xyz.com'}), ( target:Person{primaryEmail:'b@xyz.com'}) CALL algo.shortestPath.stream(source, target, "cost",{ nodeQuery:'MATCH(n:Person) where n.primaryEmail ends with @xyz.com RETURN id(n) as id', relationshipQuery:'MATCH(n:Person)-[r:connected_to]-(m:Person) RETURN id(n) as source, id(m) as target, (1.0/(r.weight+1.0)) as weight', graph:'cypher'}) YIELD nodeId, cost MATCH (other:Person) WHERE id(other) = nodeId RETURN other.name, cost

Is the error because the virtual graph which is created by filtering the relationships, creating a graph which is very big and exceeding the memory? Any help is much appreciated!!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/neo4j-contrib/neo4j-graph-algorithms/issues/659, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAzpMqtc1g9I203SV0rDQP8rTzXyB9kks5t7p-lgaJpZM4UjCiG .

aalampally commented 6 years ago

Hey @mneedham The no.of nodes after node filtering are: 77974 and the no.of relationships after relationship filtering are: 726748.

aalampally commented 6 years ago

I am assuming that creating a virtual graph is the issue here. So, to simplify the query, I created a relationship property "sssp_distance", which is essentially (1.0/(r.weight+1.0)). So, I could now just mention the weightProperty and relationship in which I have to look for.

My updated query would look like this:

MATCH (start:Person{primaryEmail:'a@xyz.com'}), (end:Person{primaryEmail:'b@xyz.com'}) CALL algo.shortestPath.stream(start, end, "sssp_distance", {relationshipQuery: "connected_to"}) YIELD nodeId, cost MATCH (other) WHERE id(other) = nodeId RETURN other, cost

Even after this simplification, I get the following error Neo.ClientError.Procedure.ProcedureCallFailed: Failed to invoke procedure 'algo.shortestPath.stream': Caused by: java.lang.ArrayIndexOutOfBoundsException: -2147482401

So, I tried APOC's dijkstra algorithm and it was able to give the shortest path using the following query: MATCH (start:Person{primaryEmail: 'a@xyz.com'}), (end:Person {primaryEmail: 'b@xyz.com'}) CALL apoc.algo.dijkstra(start, end, "connected_to", "sssp_distance", 1.0, 1) YIELD path, weight RETURN path, weight

The main reason, I chose graph-algorithms over APOC was that, I could create a virtual graph, this way I wouldn't have to create a new relationship property sssp_distance.

Please Let me know if I still have any workaround for this

mneedham commented 6 years ago

@aalampally seeing as you got the error when using the non Cypher projection approach it sounds like the error is happening somewhere else.

Are you able to narrow the error down on a smaller dataset that you'd be able to share with us so we can reproduce it and figure out what's going on?

aalampally commented 6 years ago

I have tried the same on a smaller dummy dataset CREATE (a:Loc{name:'A'}), (b:Loc{name:'B'}), (c:Loc{name:'C'}), (d:Loc{name:'D'}), (e:Loc{name:'E'}), (f:Loc{name:'F'}), (a)-[:ROAD {cost:50}]->(b), (a)-[:ROAD {cost:50}]->(c), (a)-[:ROAD {cost:100}]->(d), (a)-[:RAIL {cost:50}]->(d), (b)-[:ROAD {cost:40}]->(d), (c)-[:ROAD {cost:40}]->(d), (c)-[:ROAD {cost:80}]->(e), (d)-[:ROAD {cost:30}]->(e), (d)-[:ROAD {cost:80}]->(f), (e)-[:ROAD {cost:40}]->(f), (e)-[:RAIL {cost:20}]->(f);

It works fine both with Cypher projection and non-Cypher projection. I am assuming that if similar data is multiplied to ~0.7 million edges and 70k nodes (data size similar to what I have), the error could be reproduced.