neo4j-contrib / neo4j-graph-algorithms

Efficient Graph Algorithms for Neo4j
https://github.com/neo4j/graph-data-science/
GNU General Public License v3.0
767 stars 196 forks source link

Max Cost param in BFS/DFS #899

Open tomasonjo opened 5 years ago

tomasonjo commented 5 years ago

I used to think that the max cost parameter stops when the sum of all relationship weights in the traversal hits the max cost value, but I found out that this is not the case. What I think it does it returns all the nodes that can be reached within the max cost (weight) specified. Basically, like finding all the nodes that have the weighted shortest path less than or equal to max cost.

What is weird to me is that BFS and DFS algos don't always return the same result.

Graph:

import nodes

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/nicola/tubemaps/master/datasets/london.stations.csv" as row 
MERGE (s:Station{id:row.id}) 
SET s += apoc.map.clean(row,['id'],[])

import rels

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/nicola/tubemaps/master/datasets/london.connections.csv" as row 
MATCH (s1:Station{id:row.station1}),(s2:Station{id:row.station2}) 
MERGE (s1)-[:CONNECTION{time:toInteger(row.time),line:row.line}]->(s2)

BFS:

MATCH (start:Station{name:'Picadilly Circus'})
CALL algo.bfs.stream('Station', 'CONNECTION', 'BOTH', id(start), 
  {maxCost:3, weightProperty:'time'})
YIELD nodeIds
UNWIND nodeIds as nodeId
RETURN algo.asNode(nodeId).name as station

Results are

station
"Picadilly Circus"
"Green Park"
"Leicester Square"
"Oxford Circus"
"Charing Cross"
"Bond Street"
"Covent Garden"
"Tottenham Court Road"
"Embankment"

DFS:

MATCH (start:Station{name:'Picadilly Circus'})
CALL algo.dfs.stream('Station', 'CONNECTION', 'BOTH', id(start), 
  {maxCost:3, weightProperty:'time'})
YIELD nodeIds
UNWIND nodeIds as nodeId
RETURN algo.asNode(nodeId).name as station

Results are

station
"Picadilly Circus"
"Charing Cross"
"Embankment"
"Oxford Circus"
"Bond Street"
"Leicester Square"
"Covent Garden"
"Green Park"

So the difference is in returning "Tottenham Court Road". While the following picture might explain why this happens:

tottenham

it is still weird because both algorithms don't stop when they first encounter the sum of all weights in traversal to be greater than maxCost, but try to find all the nodes that are less than or equal to maxCost away from starting node.

So I would assume the DFS should traverse both relationships and also return "Tottenham Court Road" as a result. Basically, the results should always be the same for BFS and DFS when using maxCost parameter?