Blank node behavior when using SPARQL

hsolbrig commented 3 years ago

At the moment, different ShEx implementations exhibit different behaviors when crossing BNodes in SPARQL.

PyShEx has three options:

1) Throw an error when attempting to submit a SPARQL query with a BNode subject or object
2) Assume that the SPARQL endpoint maintains persistent BNodes (which may cause a hang or timeout if it does not)
3) Take advantage of the GraphDB-specific solution

ShEx.js implements only option 2.

I'm not sure about other implementations.

Do we want to specify a consistent behavior across interpreters? If so, what should that be?

gkellogg commented 3 years ago

I've always voiced the opinion that it should be illegal to use a blank node as a ShEx starting point: in RDF, there is no expectation that a blank node used in a serialization will be maintained within a datastore. This is the use case skolem IDs were created for, although I'm not a great fan of those, either.

Better to use a query to identify a starting node, where the query would result in the desired node.

ericprud commented 3 years ago

I think that's a separate issue though; this is about how you practically re-visit a bnode you got in response to a previous query. This is an issue for remote faceted browsing, ShEx validation, and anyone else iteratively querying a SPARQL endpoint.

ericprud commented 3 years ago

I'm currently adding both arrival path and disambiguation code in the ShEx.js SPARQL interface. This allows it to:

remember how it got to any bnode
distinguish all of the visited bnodes from each other.

Wikidata (augmented) example:

wd:Q313093 <P999> _:a .
_:a
  # works
  <P2860> _:a ; # apparently, a bare blank node stands for unknown value
  # advisors
  <P184> wd:Q123 , _:1e_____ , _:xe_____ , _:ye_____ , _:1cd__2g , _:1cd__2f , _:1cdef2g , _:1cdef2f .

# advisors (mostly bnodes to exercise disambiguator)
wd:Q123                                                         <P735> "a" , "b" .
_:1e_____ <P000> wd:Qe                                        ; <P735> "abc" .
_:xe_____ <P000> wd:Qe                                        ; <P735> "abc" .
_:ye_____ <P000> wd:Qe                                        ; <P735> "abc" .
_:1cd__2g <P000> wd:Qc , wd:Qd                 ; <P001> wd:Qg ; <P735> "abc" .
_:1cd__2f <P000> wd:Qc , wd:Qd                 ; <P001> wd:Qf ; <P735> "abc" .
_:1cdef2g <P000> wd:Qc , wd:Qd , wd:Qe , wd:Qf ; <P001> wd:Qg ; <P735> "abc" .
_:1cdef2f <P000> wd:Qc , wd:Qd , wd:Qe , wd:Qf ; <P001> wd:Qf ; <P735> "abc" .

The data structure (JSON liberalized to include RDF terms) that identifies e.g. _:1cd__2g is

{ start: wd:Q313093, path: [
  {p:<P999>}, # no ambiguity
  {p:<P184>, unique: {
     <P000>: [wd:Qc, wd:Qd],
     <P001>: [wd:Qg]
    }
  }
] }

which allows you to select for _:1cd__2g ?p ?o like:

SELECT ?1 ?p ?o WHERE {
  wd:Q313093 <P999> ?0 . # no ambiguity
  ?0 <P184> ?1 .
  ?1 <P000> wd:Qc , wd:Qd . ?1 <P001> wd:Qg . # disambiguate
  FILTER NOT EXISTS {?1 <P000> ?2 FILTER (?2 NOT IN (wd:Qc, wd:Qd)) }
  ?1 ?p ?o
}
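
As a rough illustration of how a path record like the one above could be turned into such a query, here is a minimal Python sketch. It is not the ShEx.js code; `build_query`, the dict layout, and the `?x…` helper variables are all hypothetical names chosen for this example. Unlike the hand-written query, it emits a closing FILTER NOT EXISTS for every predicate listed under `unique`, which is an assumption about what "unique" should mean.

```python
def build_query(start, path):
    """Build a SELECT that re-finds the node at the end of `path`.

    `start` is an RDF term as a string; `path` is a non-empty list of
    steps shaped like {"p": "<P184>", "unique": {"<P000>": ["wd:Qc"]}}.
    (Hypothetical layout mirroring the liberalized JSON in this thread.)
    """
    if not path:
        raise ValueError("path must contain at least one step")
    lines = []
    subject = start
    for i, step in enumerate(path):
        var = f"?{i}"
        # Follow the arc that led to this node.
        lines.append(f"  {subject} {step['p']} {var} .")
        for j, (pred, values) in enumerate(step.get("unique", {}).items()):
            # Require every recorded value ...
            lines.append(f"  {var} {pred} {' , '.join(values)} .")
            # ... and (assumption) forbid any other value for that predicate.
            extra = f"?x{i}_{j}"
            lines.append(
                f"  FILTER NOT EXISTS {{ {var} {pred} {extra} "
                f"FILTER ({extra} NOT IN ({', '.join(values)})) }}"
            )
        subject = var
    lines.append(f"  {subject} ?p ?o")
    return f"SELECT {subject} ?p ?o WHERE {{\n" + "\n".join(lines) + "\n}"


EXAMPLE_PATH = [
    {"p": "<P999>"},  # no ambiguity
    {"p": "<P184>", "unique": {"<P000>": ["wd:Qc", "wd:Qd"],
                               "<P001>": ["wd:Qg"]}},
]
EXAMPLE_QUERY = build_query("wd:Q313093", EXAMPLE_PATH)
```

Printing `EXAMPLE_QUERY` gives a query of the same shape as the one above; a real request would still need the `wd:` prefix and a base IRI declared.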

_:1e_____, _:xe_____, and _:ye_____ are provably interchangeable, so the data structure for the first needs to indicate that it's serving for three:

{ start: wd:Q313093, path: [
  {p:<P999>}, # no ambiguity
  {p:<P184>, unique: {
     <P000>: [wd:Qe]
    }, proxies: [ _:xe_____ , _:ye_____ ]
  }
] }

and _:xe_____ and _:ye_____ simply execute the query for _:1e_____.

I haven't tested for corefs, which would be another way to disambiguate AND might prove that 1e, xe and ye aren't all interchangeable, but we'd have to do those tests only if the schema included inverse arcs in the right places.
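
One way to picture the proxy bookkeeping is to group sibling bnodes by their outgoing arcs and let the first member of each group stand in for the rest. The Python sketch below does only that; `group_interchangeable` and the dict layout are hypothetical, and it deliberately ignores corefs (incoming arcs), so it can over-report interchangeability in exactly the case mentioned above.

```python
from collections import defaultdict


def group_interchangeable(arcs_by_bnode):
    """Map one representative bnode to the list of bnodes it proxies for.

    `arcs_by_bnode` maps a bnode label to {predicate: set of objects}.
    Two bnodes are grouped when their outgoing arcs are identical;
    incoming arcs are not examined (assumption for this sketch).
    """
    groups = defaultdict(list)
    for bnode, arcs in arcs_by_bnode.items():
        signature = frozenset((p, frozenset(objs)) for p, objs in arcs.items())
        groups[signature].append(bnode)
    result = {}
    for members in groups.values():
        members.sort()  # deterministic choice of representative
        result[members[0]] = members[1:]
    return result


# A subset of the advisors from the augmented Wikidata example.
ADVISORS = {
    "_:1e_____": {"<P000>": {"wd:Qe"}, "<P735>": {'"abc"'}},
    "_:xe_____": {"<P000>": {"wd:Qe"}, "<P735>": {'"abc"'}},
    "_:ye_____": {"<P000>": {"wd:Qe"}, "<P735>": {'"abc"'}},
    "_:1cd__2g": {"<P000>": {"wd:Qc", "wd:Qd"}, "<P001>": {"wd:Qg"},
                  "<P735>": {'"abc"'}},
    "_:1cd__2f": {"<P000>": {"wd:Qc", "wd:Qd"}, "<P001>": {"wd:Qf"},
                  "<P735>": {'"abc"'}},
}
PROXIES = group_interchangeable(ADVISORS)
```

Here `PROXIES["_:1e_____"]` lists `_:xe_____` and `_:ye_____`, while the two `1cd` nodes each end up alone because their `<P001>` values differ.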

shexSpec / shex

Blank node behavior when using SPARQL #109