w3c / sparql-dev

SPARQL dev Community Group
https://w3c.github.io/sparql-dev/
Other
123 stars 19 forks source link

Pathfinding/More Expressive Path Queries #191

Open mgberg opened 1 year ago

mgberg commented 1 year ago

Why?

There are several reasons a built-in pathfinding query capability would be useful. For example:

There are several vendors/systems that implement pathfindng/more expressive path queries, and they do so in a number of different ways. For example, some extend the SPARQL query syntax, while others take advantage of SPARQL's federation capabilities to create a special service URI that performs the pathfinding operations.

Stardog has (in my opinion) a fantastic implementation of this but with (at least) one flaw. Their implementation can do both pathfinding and more expressive path queries, and is expressed in a very intuitive and SPARQL-friendly way (again, in my opinion). Paths can be provided by either a single predicate, a property path, or a SPARQL query expression. Start and/or end nodes can either be bound or unbound. The documentation can be found here: https://docs.stardog.com/query-stardog/path-queries

The one flaw that sticks out to me is that the only length metric they support is the number of "hops" in the path. If the path is specified by a SPARQL pattern, it would be nice to be able to project out a third variable specifying the value to use for the "length" of that connection. This would enable it to optionally behave more like pathfinding function in a property graph where the length metric is an edge property. I mentioned it in a post in the Stardog Community here: https://community.stardog.com/t/specify-path-length-in-path-queries/4583

Below is an example copied from the linked documentation above- see the documentation for more examples.

There is an implicit relationship between actors based on the movies they appeared together. We can use a basic graph pattern with multiple triple patterns in the path query to extract this information:

PATHS START ?x = :Kevin_Bacon END ?y = :Robert_Redford VIA { ?movie a :Film ; :starring ?x , ?y }

Note that this query filters the intermediate nodes; only the connections via movies are desired, not TV series or other types of works.

This query might return:

x movie y
:Kevin_Bacon :Apollo_13 :Gary_Sinise
:Gary_Sinise :Captain_America :Robert_Redford
:Kevin_Bacon :Sleepers :Brad_Pitt
:Brad_Pitt :Spy_Game :Robert_Redford
:Kevin_Bacon :A_Few_Good_Men :Tom_Cruise
:Tom_Cruise :Lions_for_Lambs :Robert_Redford

If the filter on movies is irrelevant, then a more concise version can be used:

PATHS START ?x = :Kevin_Bacon END ?y = :Robert_Redford VIA ^:starring/:starring

Considerations for backward compatibility

The primary issue with backward compatibility I can think of is that it would require extending the SPARQL result format, and likely intermediate query engine internal representations, as this requires storing and returning lists and/or objects of results, or subtables of results, as a first class citizen. If that was supported, then this would be easier. I made an issue about this a while back (#151), but there are other similar feature requests too.

VladimirAlexiev commented 1 year ago

GraphDB's implementation is described at https://graphdb.ontotext.com/documentation/10.3/graph-path-search.html. Rather than new keywords (VIA), it uses existing mechanisms:

Eg the "Kevin Bacon" query looks like this:

SELECT ?edge ?index ?path
WHERE {
    VALUES (?src ?dst) {
        ( dbr:Chris_Evans_\(actor\) dbr:Chris_Hemsworth )
    }
    SERVICE <http://www.ontotext.com/path#search> {
        <urn:path> path:findPath path:allPaths ;
                   path:sourceNode ?src ;
                   path:destinationNode ?dst ;
                   path:pathIndex ?path ;
                   path:minPathLength 2 ;
                   path:startNode ?start;
                   path:resultBinding ?edge ;
                   path:endNode ?end;
                   path:resultBindingIndex ?index .
        SERVICE <urn:path> {
            ?film a dbo:Film .
            ?film dbp:starring ?start .
            ?film dbp:starring ?end .
        }
    }
}

Results are returned as RDF-star triples:

@mgberg We'll appreciate feedback on whether you like our approach, and ideas for potential improvement. I'll ask my colleagues whether we can add a custom distance function.

mgberg commented 1 year ago

@VladimirAlexiev That approach is fine given the current SPARQL 1.1 specification, but that's just an Ontotext specific feature that takes advantage of federation to inject functionality that SPARQL doesn't have right now. Other vendors and systems also use federation in similar ways to add similar capabilities.

However, one of the reasons SPARQL is so powerful is that all triple stores implement the same query language specification (at a minimum), which is one of the reasons why it is so easy to query across disparate datasets (which may be in different triple store implementations) using SPARQL.

The point of this feature request is explicitly to make pathfinding queries a part of the SPARQL 1.2 specification (or another future version), which would likely involve adding new keywords for simplicity and expressiveness. This would make SPARQL more useful and powerful for everyone, everywhere, on all SPARQL processors compliant with that new version, all the time.

VladimirAlexiev commented 1 year ago

Can you point to and briefly describe the features in other implementations you know?

ktk commented 11 months ago

Linking to #42 which was the first mention of PATH IIRC

ktk commented 11 months ago

Not completely sure if I get the syntax but it looks like AllegroGraph provides at least some PATH like search as well https://franz.com/agraph/support/documentation/current/magic-properties.html#Paths

There are other graph algorithms here: https://franz.com/agraph/support/documentation/current/magic-properties.html#magic-prop-list

ktk commented 11 months ago

There was Cray Urika/Cray Graph Engine, at least before HP bought Cray. From what I could find it looks like this is discontinued by now. Interestingly, they had quite a lot of graph functions in it and it was integrated in SPARQL as well, even though it looks like it was quite heavily extended:

CGE provides an infrastructure for calling graph algorithms from within SPARQL queries. A graph algorithm is called via a CGE-specific SPARQL operator named INVOKE.

It is useful to note the following items:

  • The INVOKE operator cites the name of the graph algorithm being invoked, using an URI notation that is similar to that used for representing built-in functions in SPARQL.
  • Scalar arguments can be input to the graph algorithm via a parenthesized argument list.
  • The INVOKE clause is always preceded by a SPARQL CONSTRUCT clause, whose function in this context is to build the graph that is input to the graph algorithm. CGE provides the capability of nesting a CONSTRUCT/INVOKE clause within a SELECT/WHERE clause. This enables a subquery within a SPARQL query to select or produce a subgraph, which is used as input to the graph algorithm.
  • The INVOKE clause is immediately followed by a PRODUCING clause, whose function is to bind the results of the graph algorithm to specific SPARQL variables.
  • While RDF graphs may define many different types of subjects and objects, the CGE graph algorithms treat them all as homogeneous vertices and do not distinguish between them according to type, with the exception of functions that explicitly expect some vertices to be distinguished.
  • The CONSTRUCT-INVOKE-PRODUCING combination needs to be nested within a SELECT-WHERE clause.
  • For all CGE-specific built-in graph functions, if the query writer wants to specify a non-default value for an argument, values for the preceding arguments also need to be specified, even if default values for those arguments are to be used.
ktk commented 11 months ago

IIRC @rvesse worked for Cray, maybe he has some experience to share?

mgberg commented 11 months ago

Not completely sure if I get the syntax but it looks like AllegroGraph provides at least some PATH like search as well https://franz.com/agraph/support/documentation/current/magic-properties.html#Paths

There are other graph algorithms here: https://franz.com/agraph/support/documentation/current/magic-properties.html#magic-prop-list

A quick comment about AllegroGraph's implementation- I've used it, it works. The part that I don't like (as far as an extension to the SPARQL spec is concerned) is that you first have to define a generator, which is the triple pattern used to determine the next set of "neighbors" in the path, as an RDF resource, store it in the triple store, and reference it when using a pathfinding magic property. In other words, to do path queries, you either need to have write access to the database to define whatever generator you need, or hope that one matching your query has already been defined.

One of the reasons I used the Stardog path query syntax as an example is because you provide your predicate or pattern in the query (the VIA clause) and therefore it is open to anyone to use as they wish. Also, that implementation is a magic property instead of a core language feature.

Admittedly the syntax was a bit odd to me when I first saw it, but once I learned how it worked it made a lot of sense and I was very excited by the capability it provided.

rvesse commented 11 months ago

IIRC @rvesse worked for Cray, maybe he has some experience to share?

Yeah it was a giant syntactic kludge to work within the confines of a SPARQL parser. The only path like algorithms we had were simply for calculating the length of the path between two nodes in the graph, we didn't do anything with returning intermediate nodes.

We never actually supported SPARQL 1.1 property paths fully (see CGE Property Path Support) we effectively just flattened simple paths, i.e. fixed length paths, into their equivalent BGPs, and we optionally provided emulation of more complex paths by rewriting them into fixed length expansions of the path, i.e. exploring a path up to N steps where N was configurable. None of our customers ever wanted/needed full property path support.

This issue probably has some relation to #6 because you're effectively wanting to call a "function" that will return multiple values for each input row where that variable length list of values denotes the discovered path

afs commented 11 months ago

+1 to connections to #6.

mgberg commented 9 months ago

As an additional note- this kind of capability would enable a lot of use cases that SPARQL doesn't handle very well involving "recursive" graph traversal operations that are often solved by magic properties or postprocessing the output of a CONSTRUCT query, such as: