neo4j / graph-data-science

Source code for the Neo4j Graph Data Science library of graph algorithms.
https://neo4j.com/docs/graph-data-science/current/
Other
607 stars 159 forks source link

Obtaining ids for relationships traversed in path finding algorithms #105

Open hbldh opened 3 years ago

hbldh commented 3 years ago

Hello,

First of all, thank you for GDS, it is an amazing set of tools and I am delighted by the 1.5.0 release, which I am using with the 4.2.3 release of the Neo4j server.

I have an issue where I cannot obtain enough data from the relationships traversed in the path finding algorithms though. There is a community discussion around this as well, and it seems to not have arrived at a solution.

Say that you have a graph similar to the road network examples in the GDS documentation:

CREATE (a:Location {name: 'A'}),
       (b:Location {name: 'B'}),
       (c:Location {name: 'C'}),
       (d:Location {name: 'D'}),
       (e:Location {name: 'E'}),
       (f:Location {name: 'F'}),
       (a)-[:ROAD {cost: 50}]->(b),
       (a)-[:ROAD {cost: 50}]->(c),
       (a)-[:ROAD {cost: 100}]->(d),
       (b)-[:ROAD {cost: 40}]->(d),
       (b)-[:RAIL {cost: 20}]->(d),
       (c)-[:ROAD {cost: 40}]->(d),
       (c)-[:ROAD {cost: 80}]->(e),
       (d)-[:ROAD {cost: 30}]->(e),
       (d)-[:ROAD {cost: 80}]->(f),
       (e)-[:ROAD {cost: 40}]->(f);

I have added a second relationship with a new label between B and D compared to your example.

Now, you want to apply e.g. Yen's K Shortest Paths algorithm and find paths between A and F:

MATCH (source:Location {name: 'A'}), (target:Location {name: 'F'})
CALL gds.beta.shortestPath.yens.stream({
    nodeProjection: 'Location',
    relationshipProjection: [
        "ROAD",
        "RAIL"
    ],
    relationshipProperties: "cost",
    sourceNode: id(source),
    targetNode: id(target),
    k: 3,
    relationshipWeightProperty: 'cost',
    path: true
})
YIELD index, nodeIds, costs, path
RETURN
    index,
    [nodeId IN nodeIds | gds.util.asNode(nodeId).name] AS nodeNames,
    costs,
    path
ORDER BY index

If I look at the path results of the first record I see

{
    "start": {
        "identity": 0,
        "labels": [
            "Location"
        ],
        "properties": {
            "name": "A"
        }
    },
    "end": {
        "identity": 5,
        "labels": [
            "Location"
        ],
        "properties": {
            "name": "F"
        }
    },
    "segments": [
        {
            "start": {
                "identity": 0,
                "labels": [
                    "Location"
                ],
                "properties": {
                    "name": "A"
                }
            },
            "relationship": {
                "identity": -1,
                "start": 0,
                "end": 1,
                "type": "PATH_0",
                "properties": {
                    "cost": 50.0
                }
            },
            "end": {
                "identity": 1,
                "labels": [
                    "Location"
                ],
                "properties": {
                    "name": "B"
                }
            }
        },
        [...]
    ],
    "length": 4.0
}

I can get the nodes, with ids and properties, but I have no identification to match which of the available relationships that was traversed in the segments[i].relationship specification. It creates a new PATH_0 type and does not relay any properties except the one specified in the relationshipWeightProperty. All I have to match with is the cost property (which in most cases in my usecase will be enough, even though I guess that matching needs to be done in the application by repeatedly querying the database for relationships between these nodes with the corresponding cost).

I have tried with native projections, anonymous projections and cypher projections with the same results.

Questions:

  1. Is it possible to do right now?
  2. Is it possible to implement it?
  3. If so, I might be able to make an attempt to implement it if you would accept that. That is if there isn't a good reason for creating PATH_X type relationships and returning those instead that I am unaware of?
Mats-SX commented 3 years ago

Hello @hbldh and thanks for reaching out to us, and also thank you for the kind words. We are happy to be able to provide the library.

It seems to me that what you are asking for is relationship identities. Neo4j (the database) obviously has this notion, as the property graph data model it supports mandates it. However, the in-memory GDS graphs does not maintain a mapping between the source relationships in the database and the projected relationships of the in-memory graph. For nodes, we do maintain such a mapping. This enables GDS to find what node in Neo4j was projected to GDS and write back a property to that very node (containing e.g. a page rank score or a community id). GDS does not do anything similar for relationships; whenever we write relationships to the database, they are always completely new.

In our view, the GDS in-memory graph is always a projection or abstraction of the Neo4j database graph. For example, we support deduplication of parallel relationships into an aggregated relationship that explicitly does not exist in the Neo4j database. Similarly, the gds.alpha.collapsePath() procedure can collapse a multi-step pattern into a single relationship, that similarly does not exist in Neo4j. Also, any relationship (or node, for that matter) projected with a Cypher projection is not guaranteed to exist in the database. While it is possible to write these relationships back to the database, this is considered not a primary use case. It can often be useful to keep the base graph model separate from the analytical model.

Now, it wouldn't be impossible to add a mapping layer to map also Neo4j relationship ids, but with the current design of the in-memory graph it would be very difficult to achieve, and we're not sure that's a direction we would want to go in. It would probably mean that the in-memory graph would grow in size substantially. One idea could be to project the original relationship id as a property to the relationship, which could then be used when reaching back to Neo4j in the cases where an abstract relationship has not been created (due to deduplication or collapsePath or Cypher projection or otherwise).

As a separate observation, I'm wondering why you need the exactness of the relationship id? The shortest path problem will always find the shortest (or cheapest, if weighted) path, so you can be sure that if a path passed from node x to node y then it will have chosen the relationship between x and y that has the lowest weight. If their is a tie, the choice is arbitrary. With that insight, when would you need something more precise?

In short, the answers to your questions would be:

  1. No.
  2. Maybe, but not easily.
  3. I can not guarantee that we would accept it, but we would have a look at it at least.

When you say "creating PATH_X type relationhips", what do you mean exactly?

hbldh commented 3 years ago

Thanks for the verbose answer; this was more than I had expected and hoped for!

I realize I did not emphasize one important aspect of this: I was planning on using the Yen's K Shortest Paths for the application I was writing, and as such I need to be able to discern between different paths returned. It is not merely the optimal path that I am interested in, and given that requirement I have a need to be able to match the returned segments[...].relationship documents to the relationships in the original graph, and to do this I have only the specified relationshipWeightProperty to link them with.

I can understand your reasoning with collapsePath; it makes understanding the impressive performance of the GDS algorithms easier and for all usecases except Yen's K Shortest Paths this would not be a problem. But I struggle to see how a K-shortest paths can be truly useful if not for relationship ids. If one takes making sure that the relationshipWeightProperty value is unique for one node's relationships, then a matching can be done with some additional Cypher-queries, and for me this is enough. But still, I thought that this might be a problem worth exploring.

If you say that it might be done, but not easily, I will take that as a discouragment to try. It will most probably mean that my changes will make the K-shortest path problem into a special case, and had I been reviewing that PR I would not have accepted it due to the added maintenance weight.

When you say "creating PATH_X type relationhips", what do you mean exactly?

I merely referred to the "type": "PATH_0" properties of the returned relationships. For the best path it is PATH_0, for the second best PATH_1 and so forth. Generalized to PATH_X in my description. What I meant was that I get relationships with a new type in a projected graph instead of existing relationships in the actual graph, and I wondered if that was by design or a byproduct of the current implementation.

I have nothing else to add. I have found a way to handle this and I am content. If you do not see any problem in the way it is implemented now, then we are both content. Thank you once again.

Mats-SX commented 3 years ago

@hbldh That sounds all good.

I should add that relationship ids will very likely make it into a future version of GDS. Tangentially related to your issue here, but also to validate another idea, we made a POC that would enable you to map relationship ids as relationship properties. However, we realised that even with the ids attached to the relationships in the in-memory graph, we would need to design a model for the pathfinding algorithms to use them in a good way, with fallback semantics when there are no relationship ids. We deem this a bigger project which we hope to tackle in the future.

I think you are aptly highlighting a limitation of the library as it stands, but perhaps as an answer to the question of how k-shortest paths could be useful without identifying individual relationships, I could highlight that knowing the top-k total costs of going from a to b can also be used for valuable insight. For example, if going from data center A to B with cost 6 is minimal, but the second-cheapest has cost 60, we can know something about the risk with the line being broken. Of course, this doesn't tell you exactly where to look to make some action, but it does provide you with some answer.

Anyway, I'm glad you found a workaround, and we'd be happy to receive any further issues that you may encounter in the future.

All the best Mats

gdantimi commented 2 years ago

Hi,

sorry to resuscitate this post but I ran in the same issue. Going from A to B where A and B are connected by R1 and R2 for example, it is true that the algo will find the cheapest between R1 and R2, so after we get the result we can run another query to retrieve the cheapest relationship.

But in my case for example I need to count the number of hops, but only if they are of relationship R1. I can use a workaround because I know that R2 is present only if the cost is less than X, but there might be corner cases where this is not true.

Adding custom properties to the virtual relationship would already work (like, custom properties, not really projected) but also limiting since strings are not suppoerted for vrelationships (yes, we could use numbers and map them somehow, but again it is a workaround).

I understand this is not a top issue, but it would be nice if there would be a way to activate the possibility to map the relationships, maybe through configuration (by default no mapping).

What's the status with this?

Mats-SX commented 2 years ago

The status at the moment is that we are not looking into these algorithms for the upcoming release. I will note your request and re-open this issue until we have a more conclusive answer.

The number of requests for relationship ids is increasing, which does increase the likelihood that it out-competes other feature work.