w3c / shacl

SHACL Community Group (Post-REC activities)

Make sh:prefixes optional in SPARQL queries #59

Open HolgerKnublauch opened 4 months ago

HolgerKnublauch commented 4 months ago

Basically every user of SHACL-SPARQL ever has run into this limitation: Not only is the syntax of sh:prefixes complex to use, the expectation is that the namespace prefixes from the shapes graph apply automatically.

When this was designed, the WG decided not to rely on automatic definition of prefixes because prefixes are not persisted in the RDF graph data model; they are just a temporary feature during parsing. This is despite the fact that many APIs such as Jena do persist the prefixes together with the graph object. But yeah, the problem remains that when SHACL-SPARQL shapes are moved around, the prefixes may get lost.

A solution here may be to automatically prepend all prefixes from the shapes graph when no sh:prefixes triple exists. We need to come up with some creative solution for how to define what these prefixes must be. But it's a huge obstacle, so we cannot just leave the current solution in place. At least I believe we can define a list of default prefixes such as rdf:, owl: and sh: that should always be present.
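As a rough sketch of the fallback idea (not proposed spec text), the following Python shows a hypothetical helper, `build_query`, that prepends PREFIX declarations to a sh:select string. It uses the explicitly declared prefixes when the constraint has them and otherwise falls back to whatever prefix map the implementation happens to hold for the shapes graph. The function name, the dictionaries, and the ex: namespace are assumptions made for illustration.

```python
from typing import Dict, Optional


def build_query(select: str,
                declared: Optional[Dict[str, str]],
                graph_prefixes: Dict[str, str]) -> str:
    """Prepend PREFIX lines to a SPARQL string.

    declared: prefixes reached via sh:prefixes, or None when the
        constraint has no sh:prefixes triple at all.
    graph_prefixes: hypothetical fallback, e.g. the prefix map an API
        keeps alongside the shapes graph object.
    """
    # Explicit declarations win outright; the fallback is used only
    # when nothing was declared.
    prefixes = declared if declared is not None else graph_prefixes
    header = "".join(
        f"PREFIX {prefix}: <{ns}>\n" for prefix, ns in sorted(prefixes.items())
    )
    return header + select


# Illustrative fallback map (assumed, not taken from any real graph).
fallback = {
    "ex": "http://example.org/ns#",
    "sh": "http://www.w3.org/ns/shacl#",
}
select = "SELECT $this WHERE { $this a ex:Person }"

# No sh:prefixes triple: the fallback map is prepended.
query_with_fallback = build_query(select, None, fallback)

# Explicit declarations present: the fallback map is ignored.
query_with_declared = build_query(
    select, {"ex": "http://example.org/other#"}, fallback
)
```

The open question from the paragraph above remains: the sketch takes `graph_prefixes` as given and says nothing about where a conforming implementation would obtain it.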

VladimirAlexiev commented 3 months ago

:+1:

Repo namespaces are not 100% "reliable" because if there's a conflicting prefix, the repo can remember only one of the namespaces. (For example, if you load file1.ttl that defines foo: and the repo remembers its namespace, then when you load file2.ttl with the same prefix but a different namespace, the repo won't remember the second one.)

But if there are such conflicts, the user can override with sh:prefixes.
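To make the conflict concrete, here is a small sketch of a store that remembers only the first namespace it sees for a prefix, with an explicit per-shape override standing in for sh:prefixes. The class `FirstWinsPrefixStore` and the namespaces are hypothetical; real repositories may behave differently.

```python
class FirstWinsPrefixStore:
    """Toy model of a repo that keeps the first namespace per prefix."""

    def __init__(self):
        self._prefixes = {}

    def load(self, declarations):
        # setdefault keeps an existing binding, so later conflicting
        # declarations for the same prefix are silently dropped.
        for prefix, namespace in declarations.items():
            self._prefixes.setdefault(prefix, namespace)

    def resolve(self, prefix, overrides=None):
        # An explicit override (standing in for sh:prefixes) takes
        # precedence over whatever the store remembered.
        if overrides and prefix in overrides:
            return overrides[prefix]
        return self._prefixes[prefix]


store = FirstWinsPrefixStore()
store.load({"foo": "http://example.org/one#"})  # file1.ttl
store.load({"foo": "http://example.org/two#"})  # file2.ttl, not remembered

remembered = store.resolve("foo")
overridden = store.resolve("foo", {"foo": "http://example.org/two#"})
```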

TallTed commented 3 months ago

I do not believe we can, nor should, define such a list. There is no reason why SHACL users should be forbidden from using owl: elsewhere (and perhaps owl-ns: for http://www.w3.org/2002/07/owl#).

This "problem" is not specific to SHACL, and should not be addressed by SHACL-related specs. If anywhere, this "problem" should be addressed in the context of SPARQL, as there would then need to be a "SPARQL prefixes registry" or similar.

At present, there is no IANA nor W3C "prefix registry", which would be necessary to prevent collisions. The closest thing of which I'm aware is the prefix.cc lookup service — which does not prevent collisions!

> This is despite the fact that many APIs such as Jena do persist the prefixes together with the graph object.

Those APIs are acting outside of all prefixed-name specifications of which I am aware, which specify that the prefixes be declared in the same document in which they are used. I certainly hope that Jena and any other software that acts this way treats prefix declarations they encounter as overriding their persisted prefixes in the context of the live declaration.

How does Jena handle a Turtle document that includes multiple declarations of the same prefix? Which declaration does Jena persist, and for how long? Does Jena persist the first declaration it encounters for any given prefix, overriding later declarations in the same Turtle document? Or does Jena persist the "last declaration from the first document that contains a declaration for that prefix"?

Note that users have the option of setting any prefix they like in a given document, and may use the same prefix with multiple expansions in a single Turtle document, among other places. This is explicitly permitted by the Turtle spec: each declaration is active for the prefixed names following that declaration and preceding any other declaration of the same prefix.

rdf: could be used just as well for http://example.org/rough-data-format# as for http://www.w3.org/1999/02/22-rdf-syntax-ns# or https://cacax.fun/.

sh: could be http://shell.example# and/or http://example.sh# as well as http://www.w3.org/ns/shacl# or http://purl.org/skos-history/.
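The scoping rule can be illustrated with a deliberately tiny line-based expander. This is only a sketch: it is not a Turtle parser, it handles just `@prefix` lines and simple prefixed names, and the function `expand` and the sample document are made up for the example.

```python
import re

# Matches a whole "@prefix p: <ns> ." line.
PREFIX_RE = re.compile(r"@prefix\s+(\w*):\s*<([^>]*)>\s*\.")
# Matches simple prefixed names such as rdf:type.
PNAME_RE = re.compile(r"\b(\w+):(\w+)")


def expand(turtle_lines):
    """Expand prefixed names using the declaration active at that point."""
    active = {}
    expanded = []
    for line in turtle_lines:
        declaration = PREFIX_RE.match(line.strip())
        if declaration:
            # A later declaration of the same prefix replaces the earlier
            # one, but only for names that follow it.
            active[declaration.group(1)] = declaration.group(2)
            continue
        for prefix, local in PNAME_RE.findall(line):
            expanded.append(active[prefix] + local)
    return expanded


doc = [
    "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .",
    "rdf:type a rdf:Property .",
    "@prefix rdf: <http://example.org/rough-data-format#> .",
    "rdf:type a rdf:Property .",
]
iris = expand(doc)
```

Here the first `rdf:type` expands against the W3C namespace and the second against the example.org one, even though both appear in the same document.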

> At least I believe we can define a list of default prefixes such as rdf:, owl: and sh: that should always be present.

Such a requirement would conflict with the rest of the universe of RDF specifications, even if nowhere else — but I believe it would conflict with many other specs as well.

HolgerKnublauch commented 3 months ago

Whatever solution we want to implement here, SOMETHING has to be done. The current SHACL syntax with sh:prefixes has not stood the test of time, and basically every user has problems with it. This is an obstacle to wider adoption of SHACL-SPARQL. We can find many reasons not to do something, yet we should have an open discussion with pragmatism as one of the main drivers.

VladimirAlexiev commented 3 months ago
HolgerKnublauch commented 3 months ago

Yes, or they can put explicit PREFIX declarations into the sh:select string. The defaults would only serve as fallback.

TallTed commented 3 months ago

@VladimirAlexiev — Whether "[remembering] a prefix the first time they see it" is "a great usability feature" depends greatly on whether that first-use is optimal for the users of that deployment. Indeed, given that, for instance, Turtle mandates that the latest declaration of a given prefix cover any given occurrence within the instance data, "Only the first occurrence is remembered" seems like a great usability negative. At the least, there should be some way to tell such an "auto-remembering" system that this declaration should replace a previously remembered declaration and/or to forget all previously remembered declarations.

Indeed, unless there's some point of user confirmation, "the SPARQL editor can auto-insert such prefix when used in a query" is likely to lead to confusing results that may not match results on any other SPARQL processor, including other deployments of the same software with the same loaded data, just because user queries were run in a different order following data load (i.e., Query_1 on Server_1 uses prefix sh: with one namespace, while Query_2 on Server_1 uses prefix sh: with a different namespace, and these two queries are run in reverse order on Server_2).

Where do you learn what prefix declarations were used for a given query?

This seems like a pure landmine to me.

fwiw, Virtuoso has a table of prefix/namespace registrations (visible, for example, on the DBpedia instance). These can be set for use in exports to media types that support prefixed names (such as N3 or Turtle), and to serve as a fallback for a SPARQL query that doesn't include one or more declarations for prefixed names found in that query. Declarations that occur in a query override those in the table when the same prefix is found in both places. There are rules in SPARQL and Turtle (among other places) that govern handling of duplicate prefixes that are declared with different namespaces within the same query or document.
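The precedence described here (declarations in the query first, the stored table only for prefixes the query leaves undeclared) could look roughly like the sketch below. It is not Virtuoso's implementation; `add_missing_prefixes`, the regexes, and the table contents are assumptions for illustration, and the regexes ignore string literals and other SPARQL syntax a real processor must handle.

```python
import re

DECL_RE = re.compile(r"PREFIX\s+(\w*):\s*<[^>]*>", re.IGNORECASE)
IRI_RE = re.compile(r"<[^>]*>")
USED_RE = re.compile(r"(\w+):\w+")


def add_missing_prefixes(query, table):
    """Prepend table entries only for prefixes used but not declared."""
    declared = set(DECL_RE.findall(query))
    # Drop full IRIs first so their contents are not mistaken for
    # prefixed names.
    used = set(USED_RE.findall(IRI_RE.sub("", query)))
    missing = sorted(p for p in used - declared if p in table)
    header = "".join(f"PREFIX {p}: <{table[p]}>\n" for p in missing)
    return header + query


# Hypothetical stored table; ex: is also declared in the query below.
table = {
    "dbo": "http://dbpedia.org/ontology/",
    "ex": "http://example.org/table#",
}
query = (
    "PREFIX ex: <http://example.org/query#>\n"
    "SELECT ?s WHERE { ?s a dbo:Person ; ex:knows ?o }"
)
completed = add_missing_prefixes(query, table)
```

Only `dbo:` is taken from the table; the query's own `ex:` declaration is left in charge. Note that nothing in the returned string tells the user which prefixes were filled in, which is the reporting gap raised in this thread.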

I know that these stored namespaces can lead to user confusion because they have. This is another downside to the optimistic SPARQL interaction that lacks any useful way to report errors or communicate other things (such as predefined namespace prefixes that it applied to execution of a given query) to the user outside of HTTP headers (which many users never see). We provide the facility to predefine declarations because of user demand; I have hopes that this feature will be improved over time to decrease such user confusion.