w3c / sparql-dev

SPARQL dev Community Group
https://w3c.github.io/sparql-dev/

Ability to use default PREFIX values #70

Open tfrancart opened 5 years ago

tfrancart commented 5 years ago

Why?

I am tired of writing / copy-pasting every prefix each time I need to write a query. The prefix mechanism makes the SPARQL learning curve steeper.

Previous work

Proposed solution

Considerations for backward compatibility

Queries without prefixes would now be valid, while they are currently not. E.g.

SELECT * WHERE { ?x a skos:Concept }
rubensworks commented 5 years ago

It may also be useful to expose all of these preconfigured prefixes in the SPARQL service description.
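
For illustration, a minimal sketch of what such a service description fragment could look like, assuming (purely as an example, since no vocabulary has been agreed) that the description reuses SHACL's sh:declare / sh:prefix / sh:namespace terms alongside the Service Description vocabulary:

@prefix sd:  <http://www.w3.org/ns/sparql-service-description#> .
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Hypothetical: the endpoint advertises two preconfigured prefixes
<http://example.org/sparql> a sd:Service ;
    sd:endpoint <http://example.org/sparql> ;
    sh:declare [ sh:prefix "skos" ;
                 sh:namespace "http://www.w3.org/2004/02/skos/core#"^^xsd:anyURI ] ,
               [ sh:prefix "dct" ;
                 sh:namespace "http://purl.org/dc/terms/"^^xsd:anyURI ] .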

ktk commented 5 years ago

I agree with the idea, but we have to be careful with the implementation. Prefix.cc has a very polluted namespace; apparently some bots added crap there for a while. I also do not see any way to get rid of those entries; there is no reporting mechanism from what I can see.

I once defined my own prefixes for exactly that reason; they are based on the RDFa initial context and extend it via a kind of semantic-versioning approach (see the GitHub repo for it). It is a 1:1 mapping, which is not the original idea of prefixes.

cygri commented 5 years ago
  1. The rdf: and xsd: namespaces are baked into the mechanics of SPARQL, and are needed in most queries. Not predefining them is just bullying the user.
  2. I agree that endpoints should be free to apply an implicit prefix mapping that can be overridden explicitly using PREFIX declarations in the query. A few implementations do that already on the SPARQL protocol level (e.g., Virtuoso), and it is very common in SPARQL clients that allow users to enter queries. I agree it would be good if the implicit prefixes were discoverable through the service description.
  3. Instead of starting every query with BASE and PREFIX declarations (the “prologue” as it's called in the grammar), there could be a keyword that refers to an external prologue:
    PROLOGUE <myproject.ttl>
    SELECT ... { ... }

    The prologue file could just be the BASE and PREFIX declarations at the beginning of a Turtle file; read it up to the first triple and then stop (a sketch of such a file follows after this list). It could be a local file or an HTTP URL. One useful pattern would be:

    PROLOGUE <data.ttl>
    SELECT *
    FROM <data.ttl>
    WHERE ...

    This would use the prefixes and base IRI of the data file, removing the necessity to copy them over. Or use this one—2000 prefixes ready to go:

    PROLOGUE <http://prefix.cc/popular/all.file.ttl>

    Or if that's too messy, take just the ones you want:

    PROLOGUE <http://prefix.cc/rdf,rdfs,xsd,owl,skos.file.ttl>

    Obviously one should still be able to use PREFIX in the query itself to add more prefixes or to override those from the external prologue.

  4. I don't think there is a need for a central list/repository baked into SPARQL. With this prologue mechanism, endpoint operators and query authors can use project-specific lists of mappings, or can use a comprehensive list from a repositories such as prefix.cc or LOV if they want. A hardcoded central list would just add an unnecessary gatekeeper.
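
To make item 3 concrete, a prologue file such as <myproject.ttl> could be nothing more than the head of an ordinary Turtle file (the base IRI and prefixes below are just illustrative):

# myproject.ttl -- a PROLOGUE-aware processor would read up to the first triple, then stop
BASE <http://example.org/myproject/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX ex:   <http://example.org/myproject/vocab#>

ex:someConcept a skos:Concept .   # data follows; only the declarations above would be imported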

@ktk The way to deal with spam at prefix.cc is to downvote it—stuff with a bad vote ratio gets removed automatically. There never was a bot attack—the spam is hand-inserted and it's just one link a week or so on average. To the best of my knowledge, no one ever managed to hijack a popular prefix. Everything has rel=nofollow anyway so the spammer gains nothing.

gkellogg commented 5 years ago

I think the idea of a default context for things like PREFIX definitions is likely to hit other parts of the ecosystem. JSON-LD does not pull in a default context, as it's easy to specify one using the @context notation; RDFa did, as not having one was a common source of errors. Note that, as a best practice, JSON-LD is creating a Recommended Context, based on the RDFa initial context, with some RDFa-specific indexes removed. CSVW also defines prefixes based on the RDFa context.

jindrichmynarz commented 5 years ago

Here we would be trading self-contained queries for convenience. Many SPARQL editors can auto-complete namespace prefixes, alleviating the usability pain. I think this is more of a tooling issue than a SPARQL specification issue.

kasei commented 5 years ago

@jindrichmynarz Agreed this might not require spec changes.

It seems clear that some deployments (e.g. Wikidata) have an interest in doing this server-side, though. Given that, I think we should at least be looking at best practices for how to communicate to a client which prefix definitions a server is using. That might take the form of an agreed-upon vocabulary to use in the service description (and corresponding implementations in multiple endpoints).

Even if we just wanted this to be a client-side tooling issue (without pre-defined namespaces), it might be valuable for endpoints to provide clients with auto-completion information for domain-specific namespaces that might not appear in a repository such as prefix.cc.

As an implementor of both endpoints and a SPARQL query editor, I think both sides of this approach (prefixes in SD and client-side support for auto-completion of prefix declarations) would be really useful!
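
As a purely hypothetical sketch (again, no vocabulary for this has been agreed), if a service description reused SHACL's sh:declare along the lines of the fragment sketched above, a query editor could pull auto-completion data out of the SD graph with something like:

PREFIX sd: <http://www.w3.org/ns/sparql-service-description#>
PREFIX sh: <http://www.w3.org/ns/shacl#>

# Run over the service description document returned when dereferencing the endpoint URL
SELECT ?prefix ?namespace
WHERE {
  ?service a sd:Service ;
           sh:declare [ sh:prefix ?prefix ; sh:namespace ?namespace ] .
}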

cygri commented 5 years ago

@jindrichmynarz More tool support would be great, but for a tool to help with prefix management, it first needs to know about the prefixes. So how is a tool supposed to learn about the prefixes?

Look them up in prefix.cc? Sure, but that doesn't contain private and project-specific prefixes, which outnumber the prefixes of well-known vocabularies in most queries (at least in industry).

Some tools allow configuration of custom prefix mappings for autocomplete, but in what format? A JSON-LD context? An RDF file using the VANN vocabulary? A file with SPARQL/Turtle-style PREFIX statements? A clone of the prefix.cc API on a different/private URL? Some proprietary format?

The SPARQL server usually already knows the prefixes, so why can't it communicate them to the SPARQL client?

What happens in practice is that the SPARQL server operator sticks the prefixes manually into prefix.cc, and the client fetches them from there. So now SPARQL protocol server and SPARQL protocol client trade prefix information not through their shared protocol, but through an out-of-protocol single point of failure with a proprietary API. And I am paying for the bandwidth!

jindrichmynarz commented 5 years ago

I'm all for defining a way to include pre-defined namespace prefixes in the SPARQL 1.1 Service Description.

TallTed commented 5 years ago

I think that server-supplied namespace definitions ought to be a required part of the query output (something like "you used fred: without a prefix declaration; we used <http://fred.example/#> which we got from http://prefix.cc/fred") -- which would currently be problematic in many response structures, just like deliveries of incomplete results (for whatever reason they're incomplete).

I have related concerns about client-tool-supplied pre-defined namespace definitions that are not clearly and blatantly visible to the user.

The reason is simple: "default" or "generally preferred" expansions of given prefixes have changed over time and will continue to change, and there is no way to ensure that the expansion used when I ran a query yesterday is still in effect for today's execution of the "same" query (which is not the same query, once the namespace associated with that prefix changes). Similarly, there is no way to know whether someone (let's presume a new user) intends their xsd: to expand to http://example.com/xray-standard-doc/# (commonly used by the team they just joined) or http://www.w3.org/2001/XMLSchema# (commonly used by most of the world), unless they explicitly declare that intention.

Just as Turtle documents are properly considered malformed if they omit declarations of any prefixes used therein, SPARQL queries should only be considered well-formed and self-contained if they include declarations for all prefixes used therein. Reliance on any external item -- whether server-supplied definitions, client-app-supplied definitions, or a file full of declarations retrieved by dereferencing an in-query URI -- renders that query no longer self-contained.

(Client tools that handle this for the user, upon user opt-in, with proper inclusion of the declaration in the final, visible SPARQL query, are entirely permissible. This would include browser-based query submission forms and/or server pre-processors that flag undeclared prefixes and interact with the user to confirm their intended meaning.)

dbooth-boston commented 5 years ago

Reading prefixes from a file would be nice, because that is under the control of the developer doing the queries. But I think it would be unwise to leave the defaults to the SPARQL server or prefix.cc, because those could change if a different SPARQL server is swapped in, or if prefix popularities change, as @TallTed pointed out.

gkellogg commented 5 years ago

I typically create cached versions of popular JSON-LD contexts, which helps performance quite a bit and helps to mollify the people running the servers. For example, schema.org hosts a context, and it's quite a burden if the context is downloaded anew every time a JSON-LD file is processed. Of course, there's HTTP caching, but it isn't always honored.

We've also considered alternative URL schemes (such as hash links) that allow data to be either accessed from multiple locations or provided out-of-band.

cygri commented 5 years ago

@TallTed

Just as Turtle documents are properly considered malformed if they omit declarations of any prefixes used therein, SPARQL queries should only be considered well-formed and self-contained if they include declarations for all prefixes used therein. Reliance on any external item […] renders that query no longer self-contained.

Among the many standard RDF syntaxes, Turtle (and its subsets and derivatives) is the only one that requires documents to be self-contained. RDF/XML has external entities. RDFa has the initial context, which processors are permitted to obtain by resolving its URL. JSON-LD has @context. Among RDF-related languages with specialised syntax, not being self-contained is also the norm: OWL/XML has <Imports>. SHACLC has IMPORTS. ShExC has IMPORT. All these languages benefit from the ability to factor out parts of the content into external files. I don't see what's so special about SPARQL that it cannot benefit from such an ability.

Obviously, whenever one refers to something via URL, there is a risk that what is being returned by the URL changes in undesired ways. I believe that problem recently had its 30th birthday. URL publishers commit to a certain level of stability. Consumers evaluate the trustworthiness of such commitments. If it's not good enough, they don't need to use the URL. The use of PROLOGUE is optional in a query.

TallTed commented 5 years ago

@cygri - I believe all of your examples have the dependent (partial) document declaring the external (partial) document whose content is part of the "complete" document made by cobbling the partials together. The processor of a given file is not left to its own devices to just fill in whatever it likes.

The current situation, which amounts to "handling of undeclared namespaces is undefined", is problematic, not least because there's no standard way to learn how those undeclared namespaces were expanded by the query servicer.

I'm OK (albeit not thrilled) IFF such external input to SPARQL is to be done via URI ... perhaps with a new PREFIXIMPORT or IMPORT or even PROLOGUE prologue entry, where the result of dereferencing such a URI must be BASE and/or PREFIX entries, which are then handled as if they were part of the SPARQL query, with the latest-occurring PREFIX or BASE having primacy (so the new IMPORT should be mandated to precede BASE and/or PREFIX entries within the query). Query input forms could have an input field specifically for such a URI, which services could pre-populate. Result URIs could include &prefiximport= so it's clear that such was auto-populated (if it's decided that such is allowed).

(This reminds me of an issue which I had thought was addressed via SPARQL errata, but upon checking just now, apparently not. 19.5 IRI References says "A prefix declared with the PREFIX keyword may not be re-declared in the same query." This is not mentioned in the basic Grammar, neither in the notes nor in the Prologue EBNF area. If enforced, it means that a SPARQL processor cannot simply prepend its predefined PREFIX list to a query; nor can a PREFIXIMPORT precede normal PREFIX declarations that partially override that import -- because multiple declarations of, e.g., myprefix1: are a syntax violation.)
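
For illustration, under that proposal the import-then-override pattern might look like this (the keyword, URL, and prefixes are made up; this is not valid SPARQL today):

# Hypothetical syntax: the dereferenced declarations are treated as if written here,
# and the later local PREFIX takes precedence over any imported ex: mapping.
IMPORT <https://example.org/project-prefixes.ttl>
PREFIX ex: <http://example.org/local-override#>

SELECT * WHERE { ?s a ex:Thing }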

cygri commented 5 years ago

@TallTed Yes, that's how PROLOGUE could work, and the ability to override imported prefixes locally would indeed be important. (And I agree that the prohibition of redeclaring a prefix is not optimal. Another deviation from Turtle for no benefit. And kind of odd as the base IRI can be redeclared. The prohibition could be lifted in SPARQL 1.2 without breaking compatibility, as it would become more permissive.)

Regarding the other case, where the query does not specify the prefix mapping at all but relies on the server to fill it in implicitly: you proposed that the server should include its prefix mapping in the query response, so that the client can check whether its expected mappings were used. Can you explain what advantage this has over making the prefix mapping discoverable via the SPARQL Protocol (for example by describing it in the Service Description document, as @rubensworks, @kasei and @jindrichmynarz proposed)? That way, clients can not only check whether the server's prefix mapping matches expectations, but can also retrieve the mapping ahead of time to help with autocomplete etc.

afs commented 5 years ago

@TallTed "handling of undeclared namespaces is undefined" -- What text are you taking that meaning from?

It's an error in 4.1.1.1 because the prefixed name can't be processed.

pchampin commented 5 years ago

@cygri

Obviously, whenever one refers to something via URL, there is a risk that what is being returned by the URL changes in undesired ways.

For the record, the JSON-LD WG has considered adding metadata to context references (w3c/json-ld-syntax#108), but due to lack of time, this won't land in JSON-LD 1.1.

This kind of mechanism helps, at least, to detect that something has changed, and possibly (with protocols such as Memento, RFC 7089, for example) to ensure that the original data is retrieved.

It could be useful to plan ahead for a corresponding feature in SPARQL, e.g. to allow for

PROLOGUE <file.ttl> (hash: sha256-a1b2c3d4)
lisp commented 5 years ago

if one considers a PROLOGUE form, is there reason not to consider a general inclusion mechanism rather than limiting it to prefix definitions?

JervenBolleman commented 5 years ago

We should also look at the SHACL prefix logic as an inspiration.

Perhaps like this.

PREFIXES <https://sparql.uniprot.org/sparql>
SELECT ?protein
WHERE {
     ?protein a up:Protein
}

Would execute

SELECT ?prefix ?namespace
WHERE 
{
  [] <http://www.w3.org/ns/shacl#prefix> ?prefix ;
    <http://www.w3.org/ns/shacl#namespace> ?namespace .
}

at that endpoint and use the definitions inside the query.

With an option for PREFIXES LOCAL to make it even easier.

lisp commented 5 years ago

does

Would execute ... at that endpoint and use the definitions inside the query.

intend that the processor reiterate a probe to that endpoint for each query it processes?

JervenBolleman commented 5 years ago

@lisp my first thought is that it would follow HTTP cache headers for re-probing.

VladimirAlexiev commented 1 month ago

I think a SPARQL database (backend) should not use implicit prefixes, but SPARQL query editors should auto-insert prefixes.

Note that many databases (including GraphDB) automatically add newly encountered prefixes to their namespaces.

afs commented 1 month ago

Jena 5.1.0 adds an optional "prefixes" endpoint for datasets, in two forms: "read", to look up a prefix or URI or to retrieve all prefixes, and "write", which provides modification of the set of prefixes. The use case is graph browser or editor support.

https://jena.apache.org/documentation/fuseki2/prefixes-service