Pathfinding/More Expressive Path Queries

mgberg commented 1 year ago

Why?

There are several reasons a built-in pathfinding query capability would be useful. For example:

Property paths only return the start and end nodes, but many times it would be useful to see all the intermediate nodes on the path
Property paths are expressive but there are times when more expressiveness is necessary, e.g. a filter on the intermediate nodes of a variable length path (in this case, the intermediate nodes aren't necessarily important as in the above case)
Pathfinding/shortest path is a very common graph operation.
Previous work

There are several vendors/systems that implement pathfindng/more expressive path queries, and they do so in a number of different ways. For example, some extend the SPARQL query syntax, while others take advantage of SPARQL's federation capabilities to create a special service URI that performs the pathfinding operations.

Stardog has (in my opinion) a fantastic implementation of this but with (at least) one flaw. Their implementation can do both pathfinding and more expressive path queries, and is expressed in a very intuitive and SPARQL-friendly way (again, in my opinion). Paths can be provided by either a single predicate, a property path, or a SPARQL query expression. Start and/or end nodes can either be bound or unbound. The documentation can be found here: https://docs.stardog.com/query-stardog/path-queries

The one flaw that sticks out to me is that the only length metric they support is the number of "hops" in the path. If the path is specified by a SPARQL pattern, it would be nice to be able to project out a third variable specifying the value to use for the "length" of that connection. This would enable it to optionally behave more like pathfinding function in a property graph where the length metric is an edge property. I mentioned it in a post in the Stardog Community here: https://community.stardog.com/t/specify-path-length-in-path-queries/4583

Below is an example copied from the linked documentation above- see the documentation for more examples.

There is an implicit relationship between actors based on the movies they appeared together. We can use a basic graph pattern with multiple triple patterns in the path query to extract this information:

PATHS START ?x = :Kevin_Bacon END ?y = :Robert_Redford VIA { ?movie a :Film ; :starring ?x , ?y }

Note that this query filters the intermediate nodes; only the connections via movies are desired, not TV series or other types of works.

This query might return:

x	movie	y
:Kevin_Bacon	:Apollo_13	:Gary_Sinise
:Gary_Sinise	:Captain_America	:Robert_Redford

:Kevin_Bacon	:Sleepers	:Brad_Pitt
:Brad_Pitt	:Spy_Game	:Robert_Redford

:Kevin_Bacon	:A_Few_Good_Men	:Tom_Cruise
:Tom_Cruise	:Lions_for_Lambs	:Robert_Redford

If the filter on movies is irrelevant, then a more concise version can be used:

PATHS START ?x = :Kevin_Bacon END ?y = :Robert_Redford VIA ^:starring/:starring

Considerations for backward compatibility

The primary issue with backward compatibility I can think of is that it would require extending the SPARQL result format, and likely intermediate query engine internal representations, as this requires storing and returning lists and/or objects of results, or subtables of results, as a first class citizen. If that was supported, then this would be easier. I made an issue about this a while back (#151), but there are other similar feature requests too.

VladimirAlexiev commented 1 year ago

GraphDB's implementation is described at https://graphdb.ontotext.com/documentation/10.3/graph-path-search.html. Rather than new keywords (VIA), it uses existing mechanisms:

an outer SERVICE to call path search
a nested SERVICE to specify the "path step" pattern ("step generator")

Eg the "Kevin Bacon" query looks like this:

SELECT ?edge ?index ?path
WHERE {
    VALUES (?src ?dst) {
        ( dbr:Chris_Evans_\(actor\) dbr:Chris_Hemsworth )
    }
    SERVICE <http://www.ontotext.com/path#search> {
        <urn:path> path:findPath path:allPaths ;
                   path:sourceNode ?src ;
                   path:destinationNode ?dst ;
                   path:pathIndex ?path ;
                   path:minPathLength 2 ;
                   path:startNode ?start;
                   path:resultBinding ?edge ;
                   path:endNode ?end;
                   path:resultBindingIndex ?index .
        SERVICE <urn:path> {
            ?film a dbo:Film .
            ?film dbp:starring ?start .
            ?film dbp:starring ?end .
        }
    }
}

Results are returned as RDF-star triples:

The path can be visualized
The triples are annotated with useful additional info (eg path number in case of multiple paths, and sequential index along the path).
You can project any bindings out of the "step generator".

@mgberg We'll appreciate feedback on whether you like our approach, and ideas for potential improvement. I'll ask my colleagues whether we can add a custom distance function.

mgberg commented 1 year ago

@VladimirAlexiev That approach is fine given the current SPARQL 1.1 specification, but that's just an Ontotext specific feature that takes advantage of federation to inject functionality that SPARQL doesn't have right now. Other vendors and systems also use federation in similar ways to add similar capabilities.

However, one of the reasons SPARQL is so powerful is that all triple stores implement the same query language specification (at a minimum), which is one of the reasons why it is so easy to query across disparate datasets (which may be in different triple store implementations) using SPARQL.

The point of this feature request is explicitly to make pathfinding queries a part of the SPARQL 1.2 specification (or another future version), which would likely involve adding new keywords for simplicity and expressiveness. This would make SPARQL more useful and powerful for everyone, everywhere, on all SPARQL processors compliant with that new version, all the time.

VladimirAlexiev commented 1 year ago

Can you point to and briefly describe the features in other implementations you know?

ktk commented 11 months ago

Linking to #42 which was the first mention of PATH IIRC

ktk commented 11 months ago

Not completely sure if I get the syntax but it looks like AllegroGraph provides at least some PATH like search as well https://franz.com/agraph/support/documentation/current/magic-properties.html#Paths

There are other graph algorithms here: https://franz.com/agraph/support/documentation/current/magic-properties.html#magic-prop-list

ktk commented 11 months ago

There was Cray Urika/Cray Graph Engine, at least before HP bought Cray. From what I could find it looks like this is discontinued by now. Interestingly, they had quite a lot of graph functions in it and it was integrated in SPARQL as well, even though it looks like it was quite heavily extended:

CGE provides an infrastructure for calling graph algorithms from within SPARQL queries. A graph algorithm is called via a CGE-specific SPARQL operator named INVOKE.

It is useful to note the following items:

The INVOKE operator cites the name of the graph algorithm being invoked, using an URI notation that is similar to that used for representing built-in functions in SPARQL.

Scalar arguments can be input to the graph algorithm via a parenthesized argument list.

The INVOKE clause is always preceded by a SPARQL CONSTRUCT clause, whose function in this context is to build the graph that is input to the graph algorithm. CGE provides the capability of nesting a CONSTRUCT/INVOKE clause within a SELECT/WHERE clause. This enables a subquery within a SPARQL query to select or produce a subgraph, which is used as input to the graph algorithm.

The INVOKE clause is immediately followed by a PRODUCING clause, whose function is to bind the results of the graph algorithm to specific SPARQL variables.

While RDF graphs may define many different types of subjects and objects, the CGE graph algorithms treat them all as homogeneous vertices and do not distinguish between them according to type, with the exception of functions that explicitly expect some vertices to be distinguished.

The CONSTRUCT-INVOKE-PRODUCING combination needs to be nested within a SELECT-WHERE clause.

For all CGE-specific built-in graph functions, if the query writer wants to specify a non-default value for an argument, values for the preceding arguments also need to be specified, even if default values for those arguments are to be used.

V3 documentation, chapter Combine Graph Algorithms with SPARQL
V2 documentation as PDF, bit nicer for quick browsing of what features they had. Get it here

ktk commented 11 months ago

IIRC @rvesse worked for Cray, maybe he has some experience to share?

mgberg commented 11 months ago

Not completely sure if I get the syntax but it looks like AllegroGraph provides at least some PATH like search as well https://franz.com/agraph/support/documentation/current/magic-properties.html#Paths

There are other graph algorithms here: https://franz.com/agraph/support/documentation/current/magic-properties.html#magic-prop-list

A quick comment about AllegroGraph's implementation- I've used it, it works. The part that I don't like (as far as an extension to the SPARQL spec is concerned) is that you first have to define a generator, which is the triple pattern used to determine the next set of "neighbors" in the path, as an RDF resource, store it in the triple store, and reference it when using a pathfinding magic property. In other words, to do path queries, you either need to have write access to the database to define whatever generator you need, or hope that one matching your query has already been defined.

One of the reasons I used the Stardog path query syntax as an example is because you provide your predicate or pattern in the query (the VIA clause) and therefore it is open to anyone to use as they wish. Also, that implementation is a magic property instead of a core language feature.

Admittedly the syntax was a bit odd to me when I first saw it, but once I learned how it worked it made a lot of sense and I was very excited by the capability it provided.

rvesse commented 11 months ago

IIRC @rvesse worked for Cray, maybe he has some experience to share?

Yeah it was a giant syntactic kludge to work within the confines of a SPARQL parser. The only path like algorithms we had were simply for calculating the length of the path between two nodes in the graph, we didn't do anything with returning intermediate nodes.

We never actually supported SPARQL 1.1 property paths fully (see CGE Property Path Support) we effectively just flattened simple paths, i.e. fixed length paths, into their equivalent BGPs, and we optionally provided emulation of more complex paths by rewriting them into fixed length expansions of the path, i.e. exploring a path up to N steps where N was configurable. None of our customers ever wanted/needed full property path support.

This issue probably has some relation to #6 because you're effectively wanting to call a "function" that will return multiple values for each input row where that variable length list of values denotes the discovered path

afs commented 11 months ago

+1 to connections to #6.

mgberg commented 9 months ago

As an additional note- this kind of capability would enable a lot of use cases that SPARQL doesn't handle very well involving "recursive" graph traversal operations that are often solved by magic properties or postprocessing the output of a CONSTRUCT query, such as:

In the context of W3C SKOS, finding the first broader concept with a particular attribute
In the context of W3C PROV, finding the most recent Activity that an Entity went through that was performed by a certain person or that resulted in a particular type of generated data
In the context of QUDT, determining the set of applicable units for a quantity kind

w3c / sparql-dev

Pathfinding/More Expressive Path Queries #191

Why?

Previous work

Considerations for backward compatibility