opencypher / cypher-for-gremlin

Cypher for Gremlin adds Cypher support to any Gremlin graph database.
Apache License 2.0
359 stars 48 forks source link

Unable to filter out paths containing self edges using a cypher Match query. #310

Open pushkarnagpal opened 5 years ago

pushkarnagpal commented 5 years ago

Trying the following query to get a path using a match query to filter out paths that contain self edges (node with edges to itself to avoid increased length of path:

match p = (base:party{node_id:'245793137123'})<-[trail:ownership*..]-(leaves:party) where none( rel in relationships(p) where startNode(rel) = endNode(rel) ) return count(p)

But this query is unable to filter out the data and I get the entire set.

dwitry commented 5 years ago

Hello @pushkarnagpal,

Are you sure you've provided a valid Cypher query?

It does not compile in Cypher for Gremlin and Neo4j gives error about multiple WHERE cases.

image

pushkarnagpal commented 5 years ago

Hello @pushkarnagpal,

Are you sure you've provided a valid Cypher query?

It does not compile in Cypher for Gremlin and Neo4j gives error about multiple WHERE cases.

image

Hi @dwitry I've updated the query now. (had accidentally written not instead of none in the query.

match p = (base:party{node_id:'245793137123'})<-[trail:ownership*..]-(leaves:party) where none( rel in relationships(p) where startNode(rel) = endNode(rel) ) return count(p)

Please look into the issue now. Thanks!

dwitry commented 5 years ago

Variable length path in loops is a complex case, which is tricky to implement in Gremlin to fully cover all the edge cases. I've invested a lot of time trying to properly implement it (1, 2) but the solution still not universal. Unfortunately can not provide any estimates when this will be improved.

pushkarnagpal commented 5 years ago

Variable length path in loops is a complex case, which is tricky to implement in Gremlin to fully cover all the edge cases. I've invested a lot of time trying to properly implement it (1, 2) but the solution still not universal. Unfortunately can not provide any estimates when this will be improved.

Ahh okay. I was working on a fraud detection case.and got stuck here. I saw this query conversion from the cypher in the gremlin-server.log file, and I see a .neq(' cypher.null') in the query and i assume this step is occuring at the none where clause. I think this is the root cause of the bug.

g.V().as('base')
    .hasLabel('party').has('node_id', eq('245793137123'))
    .repeat(__
                .inE('ownership')
                .as('trail')
                .aggregate('  cypher.path.edge.p')
                .outV()
        )
    .emit()
    .until(__
            .path()
            .from('base')
            .count(local)
            .is(gte(21))
        )
    .as('leaves')
    .hasLabel('party')
    .path()
    .from('base')
    .as('p')
    .where(__
            .not(__.V().hasLabel('party').outE('ownership').inV().as('  GENERATED4').where(__.select('  GENERATED4').where(eq('leaves'))))
        )
    .optional(__
        .select(all, 'trail').as('trail')
        )
    .select('p')
    .is(neq('  cypher.null'))
    .count()
    .project('count(p)')
    .by(__.identity())

Hope I could help somehow with this. Thanks!

dwitry commented 5 years ago

.neq(' cypher.null') is guard to handle cypher.null which represents null in Cypher for Gremlin.

We may call "root cause" the following snippet:

.repeat(__
            .inE('ownership')
            .as('trail')
            .aggregate('  cypher.path.edge.p')
            .outV()
    )
.emit()
.until(__
        .path()
        .from('base')
        .count(local)
        .is(gte(21))
    )

This is how variable length path <-[trail:ownership*..]- is translated. Basically, it is an imperative way to traverse relationships (21 is limit to avoid infinite loops). If there is a loop, the path will be traversed repeatedly (what happens in the query you've provided).

So we need to find a better implementation of this in Gremlin, but as I've said in the previous message this is a rather complicated task, as there are other edge cases that need to be considered, as well as performance.