w3c / sparql-dev

SPARQL dev Community Group
https://w3c.github.io/sparql-dev/

Entity-based Construct Queries #128

Open Aklakan opened 3 years ago

Aklakan commented 3 years ago

(Apologies for re-opening #127 as a fresh issue. In my attempts to clarify the initial proposal I got lost in technical details and corner cases, and my feeling is that by now I have turned it into an incomprehensible mess from which it is no longer possible to judge whether the core idea is of interest to the community.)

Why?

Consider this analogy: OWL as a description logic language is entity centric: A class expression intensionally describes a set of entities satisfying a given set of constraints. This is akin to a SPARQL SELECT query with a single variable. However, in both cases it is not possible to specify in the same query a corresponding RDF graph fragment for these entities.

In contrast, SPARQL construct queries are triple-centric. Yet, while it is possible to specify the RDF graphs to create from retrieval, it is not possible to specify in a standard way to which entities the triples belong.

As a consequence, so far it is not possible to have a SPARQL query that semantically describes a set of 'objects' - i.e. a 'thing with an id' together with an RDF graph fragment that describes it.

SPARQL is not a graph traversal language and this proposal is not about making it one, but having a standard way to designate entities together with their graph fragment would foremost provide a direct connection point for other path-based languages / approaches such as LDPath or LDFlex.

As an example, consider this use case: "From the SPARQL endpoint of scholarlydata retrieve the first 100 publications together with all authors ordered by the name of the first author".

At present, to the best of my knowledge, the query would have to look like this:

PREFIX  eg:   <http://www.example.org/>
PREFIX  rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX  dct: <http://purl.org/dc/terms/>
PREFIX  rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX  conf: <https://w3id.org/scholarlydata/ontology/conference-ontology.owl#>
PREFIX  bibo: <http://purl.org/ontology/bibo/>
PREFIX  foaf: <http://xmlns.com/foaf/0.1/>

CONSTRUCT {
  ?pub
    rdfs:label ?label ;
    dct:creator ?content ;
    eg:sortKey ?firstAuthorName .

  ?content foaf:name ?name .
} {
SELECT DISTINCT  ?pub ?label ?list ?firstAuthorName ?content ?name
WHERE
  { { SELECT  ?pub (MIN(str(?firstAuthorName)) AS ?sortKey_1)
      WHERE
        { ?pub  rdf:type         conf:InProceedings ;
                rdfs:label       ?label ;
                bibo:authorList  ?list .
          ?list (conf:hasFirstItem/conf:hasContent)/foaf:name ?firstAuthorName .
          ?list conf:hasItem/conf:hasContent ?content .
          ?content  foaf:name  ?name
        }
      GROUP BY ?pub
      ORDER BY ASC(MIN(str(?firstAuthorName)))
      OFFSET  50
      LIMIT   100
    }
    ?pub  rdf:type         conf:InProceedings ;
          rdfs:label       ?label ;
          bibo:authorList  ?list .
    ?list (conf:hasFirstItem/conf:hasContent)/foaf:name ?firstAuthorName .
    ?list conf:hasItem/conf:hasContent ?content .
    ?content  foaf:name  ?name
  }
ORDER BY ASC(?sortKey_1) ?pub
}

with the response

ns1:iswc-2019-demo-550  ns4:sortKey "Ahmad Sakor" .
ns1:iswc-2019-doctoral-419  rdfs:label  "Fine-grained Entity Type Inference in RDF Knowledge Graphs" ;
    ns2:creator ns3:a-b-m-moniruzzaman ;
    ns4:sortKey "A B M Moniruzzaman" .
ns1:iswc-2019-poster-479    rdfs:label  "An Overview of the TBFY Knowledge Graph for Public Procurement" ;
    ns2:creator ns3:philip-turk ,
        ns3:oscar-corcho ,
        ns3:dumitru-roman ,
        ns3:elena-simperl ,
        ns3:ahmet-soylu ,
        ns3:chris-taggart ,
        ns3:ian-makgill ,
        ns3:till-c-lech ;
    ns4:sortKey "Ahmet Soylu" .
ns1:iswc-2019-research-208  rdfs:label  "Incorporating Literals into Knowledge Graph Embeddings" ;
    ns2:creator ns3:mohammad-asif-khan ,

The response is now just a bunch of triples. Let's assume an application should display an HTML template <span>{{resource.label}}</span>: which resource should it match? The query response does not tell it where to start. Of course the application could just pick any resource with a label, but what if authors and publications both have them? So we would add a special type to the CONSTRUCT query so that the application can pick it up. Now an application that should just display the title of something passed to it needs to be aware of ontology metadata. (And of course, the application by the consortium member uses the same approach with a different class.)

Previous work

Custom solutions involving query transformations, use of vocabularies to annotate resources in the construct template, and post processing of SPARQL query responses.

Proposed solution

The proposal comprises three things:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX conf: <https://w3id.org/scholarlydata/ontology/conference-ontology.owl#>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX eg: <http://www.example.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

ENTITY ?pub
CONSTRUCT {
  ?pub
    rdfs:label ?label ;
    dct:creator ?content ;
    eg:sortKey ?firstAuthorName .

  ?content foaf:name ?name .
}
WHERE {
  ?pub
    a conf:InProceedings ;
    rdfs:label ?label ;
    bibo:authorList ?list .

  ?list
    conf:hasFirstItem/conf:hasContent/foaf:name ?firstAuthorName ;
    conf:hasItem/conf:hasContent ?content .

  ?content foaf:name ?name .
}
PARTITION BY ?pub
ORDER PARTITIONS BY ASC(MIN(?firstAuthorName))
LIMIT 100
OFFSET 50

The response for this query is a sequence of partitions in the order specified in the query. The partitions could be represented as named graphs whose IRIs are randomly generated at query execution and are thus most unlikely to clash with any data in the payload. A to-be-standardized property attached to this named graph IRI could be used to state which entities within that graph were declared to act as starting points. The specification could guarantee that partitions are always exposed as consecutive quads. A change in the named graph IRI thus marks the end of a partition. As a result, path-based approaches could directly 'connect' to the designated entities and traverse the data in the partition's graph fragments.

<urn:sparql-partition:58ryRBAb4Wh92LLn-4TLwyTTVaNvMaCBzA4aLUvLlk4=-0> {
    <urn:sparql-partition:58ryRBAb4Wh92LLn-4TLwyTTVaNvMaCBzA4aLUvLlk4=-0>
            <http://NEEDS_STANDARDIZATION/hasEntity>  <https://w3id.org/scholarlydata/inproceedings/eswc2009/paper/181> .
    <https://w3id.org/scholarlydata/person/philippe-cudre-mauroux>
            <http://xmlns.com/foaf/0.1/name>  "Philippe Cudre Mauroux" ;
            <http://xmlns.com/foaf/0.1/name>  "Philippe Cudré-Mauroux" ;
            <http://xmlns.com/foaf/0.1/name>  "Philippe Cudre-Mauroux" .
    <https://w3id.org/scholarlydata/person/sebastian-michel>
            <http://xmlns.com/foaf/0.1/name>  "Sebastian Michel" .
    <https://w3id.org/scholarlydata/inproceedings/eswc2009/paper/181>
            <http://purl.org/dc/terms/creator>  <https://w3id.org/scholarlydata/person/sebastian-michel> ;
            <http://purl.org/dc/terms/creator>  <https://w3id.org/scholarlydata/person/adriana-budura> ;
            <http://purl.org/dc/terms/creator>  <https://w3id.org/scholarlydata/person/philippe-cudre-mauroux> ;
            <http://www.example.org/sortKey>  "Adriana Budura" ;
            <http://purl.org/dc/terms/creator>  <https://w3id.org/scholarlydata/person/karl-aberer> ;
            <http://www.w3.org/2000/01/rdf-schema#label>  "Neighborhood - based Tag Prediction" .
    <https://w3id.org/scholarlydata/person/adriana-budura>
            <http://xmlns.com/foaf/0.1/name>  "Adriana Budura" .
    <https://w3id.org/scholarlydata/person/karl-aberer>
            <http://xmlns.com/foaf/0.1/name>  "Karl Aberer" .
}

...

<urn:sparql-partition:58ryRBAb4Wh92LLn-4TLwyTTVaNvMaCBzA4aLUvLlk4=-99> {
    <urn:sparql-partition:58ryRBAb4Wh92LLn-4TLwyTTVaNvMaCBzA4aLUvLlk4=-99>
            <http://NEEDS_STANDARDIZATION/hasEntity>  <https://w3id.org/scholarlydata/inproceedings/iswc2002/proceedings/paper-28> .
    <https://w3id.org/scholarlydata/person/gerhard-friedrich>
            <http://xmlns.com/foaf/0.1/name>  "Gerhard Friedrich" .
    <https://w3id.org/scholarlydata/inproceedings/iswc2002/proceedings/paper-28>
            <http://purl.org/dc/terms/creator>  <https://w3id.org/scholarlydata/person/markus-zanker> ;
            <http://purl.org/dc/terms/creator>  <https://w3id.org/scholarlydata/person/alexander-felfernig> ;
            <http://purl.org/dc/terms/creator>  <https://w3id.org/scholarlydata/person/gerhard-friedrich> ;
            <http://www.example.org/sortKey>  "Alexander Felfernig" ;
            <http://purl.org/dc/terms/creator>  <https://w3id.org/scholarlydata/person/dietmar-jannach> ;
            <http://www.w3.org/2000/01/rdf-schema#label>  "Semantic Configuration Web Services in the CAWICOMS Project" .
    <https://w3id.org/scholarlydata/person/alexander-felfernig>
            <http://xmlns.com/foaf/0.1/name>  "Alexander Felfernig" .
    <https://w3id.org/scholarlydata/person/markus-zanker>
            <http://xmlns.com/foaf/0.1/name>  "Markus Zanker" .
    <https://w3id.org/scholarlydata/person/dietmar-jannach>
            <http://xmlns.com/foaf/0.1/name>  "Dietmar Jannach" .
}
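A consumer could pick up such a response roughly as follows. This is only a minimal Python sketch under the assumptions stated in the text: quads arrive as plain (subject, predicate, object, graph) tuples, all quads of one partition are consecutive, and the placeholder property http://NEEDS_STANDARDIZATION/hasEntity marks the designated entities. The function name iter_partitions is made up for illustration.

```python
from itertools import groupby

# Placeholder property taken from the example response; not a standardized IRI.
HAS_ENTITY = "http://NEEDS_STANDARDIZATION/hasEntity"


def iter_partitions(quads):
    """Yield (entities, triples) for each partition in a stream of quads.

    Each quad is a (subject, predicate, object, graph) tuple. The sketch
    assumes that the quads of one partition are consecutive, so a change
    in the graph IRI marks the end of a partition.
    """
    for graph, group in groupby(quads, key=lambda q: q[3]):
        entities, triples = [], []
        for s, p, o, _ in group:
            if s == graph and p == HAS_ENTITY:
                # Statement about the partition itself: a designated entity.
                entities.append(o)
            else:
                triples.append((s, p, o))
        yield entities, triples
```

Because the partitions are consumed in stream order, the ordering requested in the query is preserved without any client-side sorting, while the triples inside each partition remain an unordered graph fragment.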

Considerations for backward compatibility

None

edgardmarx commented 3 years ago

I will go a bit further and simplify a bit by removing the partitions and ENTITY and leaving a "graph" keyword after CONSTRUCT. E.g.

CONSTRUCT GRAPH {
  ?pub
    rdfs:label ?label ;
    dct:creator ?content ;
    eg:sortKey ?firstAuthorName .

  ?content foaf:name ?name .
}
WHERE {
  ?pub
    a conf:InProceedings ;
    rdfs:label ?label ;
    bibo:authorList ?list .

  ?list
    conf:hasFirstItem/conf:hasContent/foaf:name ?firstAuthorName ;
    conf:hasItem/conf:hasContent ?content .

  ?content foaf:name ?name .
}
ORDER BY ASC(MIN(?firstAuthorName))
LIMIT 100
OFFSET 50

Notice that this allows you to build any graph in the construct, not just entities. Further, you are specifying that the construct is a graph instead of triple centric. Finally, you do not need to specify the "start" of the graph (partition) as you can extract the same subgraphs starting by any of the variables in the construct.

Aklakan commented 3 years ago

Finally, you do not need to specify the "start" of the graph (partition) as you can extract the same subgraphs starting by any of the variables in the construct.

Hi @edgardmarx,

I am afraid your suggestion misses the point. (Maybe your example is incomplete?)

My proposal contains two aspects:

Conceptually, these aspects are independent, but they could be conflated to make things simpler. Instead of specifying by which variables to partition, the partition variables could implicitly be set to the single entity variable:

ENTITY ?y
CONSTRUCT { ?y a :Publication }
WHERE {
  ?y a :BibliographicResource ;
     :firstAuthorName ?fn
}
ORDER ENTITIES BY ASC(MIN(?fn)) 
OFFSET 50
LIMIT  100

This would translate to

CONSTRUCT { ?entity  a :Publication }
{
  { SELECT DISTINCT  ?entity {

    # Inner select to have slicing (i.e. limit/offset) work on the level of the entity keys
    { SELECT  ?entity (MIN(?fn) AS ?sortKey_1) {
        ?entity a :BibliographicResource ; :firstAuthorName ?fn
      } GROUP BY ?entity ORDER BY ASC(MIN(?fn)) OFFSET  50 LIMIT  100 }

    # Outer select to match the attributes
    ?entity a :BibliographicResource ; :firstAuthorName ?fn
  } ORDER BY ASC(?sortKey_1) ?entity }
}
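To make the intended semantics of this rewrite concrete, here is a minimal Python sketch that applies the same two stages to an in-memory list of solution bindings (dicts). The function name slice_entities and the variable names passed to it are hypothetical; the sketch only mirrors the query above and is not an implementation of any SPARQL engine.

```python
def slice_entities(bindings, entity_var, sort_var, offset, limit):
    """Order and slice on the level of entities, then return all their bindings."""
    # Stage 1 (inner select): GROUP BY entity with MIN(sort_var) as sort key.
    sort_keys = {}
    for b in bindings:
        entity, key = b[entity_var], b[sort_var]
        if entity not in sort_keys or key < sort_keys[entity]:
            sort_keys[entity] = key

    # ORDER BY ASC(MIN(sort_var)), then OFFSET / LIMIT on the entity keys.
    ordered = sorted(sort_keys, key=lambda e: (sort_keys[e], e))
    page = ordered[offset:offset + limit]

    # Stage 2 (outer select): join back to fetch every binding of the
    # selected entities, keeping the entity order from stage 1.
    rank = {entity: i for i, entity in enumerate(page)}
    rows = [b for b in bindings if b[entity_var] in rank]
    rows.sort(key=lambda b: rank[b[entity_var]])  # stable sort
    return rows
```

The point of the two stages is that OFFSET and LIMIT count entities, not bindings: an entity with many bindings still occupies exactly one slot on the page.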

If there was no slicing and ordering, the inner select could be omitted and in this example the query would become

SELECT DISTINCT ?entity {
  ?entity a :BibliographicResource ; :firstAuthorName ?fn
} ORDER BY ?entity

namedgraph commented 3 years ago

@Aklakan I don't understand what's wrong with your current-SPARQL example with sub-SELECT? I've used this pattern many times and it worked fine.

What you say about <span>{{resource.label}}</span> is not an RDF problem, it's a presentation problem. The software layer has to be RDF-aware, not RDF has to be tailored to allow "entities" -- if it's not triples, it's not RDF. We have an RDF-aware presentation layer that is implemented in XSLT and works perfectly fine: https://github.com/AtomGraph/LinkedDataHub/tree/master/src/main/webapp/static/com/atomgraph/linkeddatahub/xsl

resource.label is a resource-level expression, and you for some reason want to apply it to a whole graph. If you first use an expression to select the resource, then the problem goes away. E.g. something like:


graph.resources.filter(r.label).forEach(function() {
    ...
    <span>{{resource.label}}</span>
    ...
}, this);

edgardmarx commented 3 years ago

Dear @namedgraph, thanks for your interest. The problem that @Aklakan is trying to overcome is a bit more complex. See my explanation in https://github.com/w3c/sparql-12/issues/127#issuecomment-715479847 . Further, there is no guarantee that a query working in one triple store will work in another because that's totally dependent on the order that the triples were indexed. The problem is not related to the serialization format.

@Aklakan, you are totally right, my example was missing the grouping variable, for which I would suggest using SPARQL's own syntax. Unless I am missing something, I think your example could be written as follows in my suggested syntax:

CONSTRUCT GRAPH {
  ?pub
    rdfs:label ?label ;
    dct:creator ?content ;
    eg:sortKey ?firstAuthorName .

  ?content foaf:name ?name .
}
WHERE {
  ?pub
    a conf:InProceedings ;
    rdfs:label ?label ;
    bibo:authorList ?list .

  ?list
    conf:hasFirstItem/conf:hasContent/foaf:name ?firstAuthorName ;
    conf:hasItem/conf:hasContent ?content .

  ?content foaf:name ?name .
}
GROUP BY ?pub
ORDER BY ASC(MIN(?firstAuthorName))
LIMIT 100
OFFSET 50

namedgraph commented 3 years ago

Sub-SELECT specifies the ordering? It needs ORDER BY to provide a stable ordering. I'm not sure why @Aklakan needed nested SELECTs though. The pattern that works for us is roughly:

CONSTRUCT
{
    ?resource ?property ?value .
}
{
    {
        SELECT *
        {
            ?resource rdfs:label ?label # get labelled resources
        }
        ORDER BY ?label
        OFFSET 0
        LIMIT 20
    }
    ?resource ?property ?value . # get the rest of their triples
}

If you turn it into a DESCRIBE, it becomes even shorter.

There's no way getting around the fact the RDF graph result you get is an unordered set of triples. So your presentation layer has to do a secondary sort regardless of the ORDER BY.

edgardmarx commented 3 years ago

Hey @namedgraph,

Thanks again for engaging in the discussion, you got the idea.

"There's no way getting around the fact the RDF graph result you get is an unordered set of triples"

Your example is already complicated with one single triple pattern, imagine building one with two or three linked by different variables.

That's the issue @Aklakan thinks should be simplified in the SPARQL Construct syntax, and I totally agree.

namedgraph commented 3 years ago

My view is that SPARQL 1.2 should prioritize cases/features that are currently not even possible. This is not one of them.

Aklakan commented 3 years ago

There's no way getting around the fact the RDF graph result you get is an unordered set of triples.

Hi all, the fundamental question is whether SPARQL should have support to make it easier to work on the entity level. I'd clearly like to see quality of life improvements in this regard. You said yourself you used the pattern as well, which means you also wrote these query transformations in your client (because that's what everyone needs to do before you can do graph.resources.filter(r.label)...). [edit:] In my example in the initial post the named graphs that correspond to the partitions have an ordering but the triples within the named graphs are unordered.

My view is that SPARQL 1.2 should prioritize cases/features that are currently not even possible. This is not one of them.

Your comment just popped up - so yes, there is a fundamental difference in the view - for me a minor version increase does not necessarily have to provide great new features but could provide some quality of life improvements.

edgardmarx commented 3 years ago

Dear @Aklakan and @namedgraph, I see this issue as similar to #39 below.

https://github.com/w3c/sparql-12/issues/39

In short: Could you do it using simple SPARQL? Yes. Does it improve the syntax? Definitely.

Aklakan commented 3 years ago

@edgardmarx Hm, I'd say Point 3 in #39 recognizes the problem:

  3. RDF has just triples; how to delineate "circumscribe" a business object is non-trivial.

Not sure what you mean by

'Could you do it using simple SPARQL? Yes'

What is 'it'?:

#39 is about outsourcing the problem of building entities to other means via a DESCRIBE ?x AS some:procedure.

The core idea of my proposal is about having a native mechanism in SPARQL to build entities - based on aggregation of solution bindings to conceptually (RDF term, RDF Graph) pairs exposed using named graphs in a specified order.

So it might be possible to transform a SHACL or ShEx shape to an entity-based SPARQL query that yields the set of resources and corresponding triples that match the shape specification.

edgardmarx commented 3 years ago

@Aklakan I just meant to say that you could overcome the problem of the issue https://github.com/w3c/sparql-12/issues/39 with a Select query.

Aklakan commented 3 years ago

What you say about {{resource.label}} is not an RDF problem, it's a presentation problem. [..] graph.resources.filter(r.label).forEach(function() { [..]

It's not a presentation problem but a set theoretic problem:

graph.resources.filter(r.label) is nothing else than SELECT DISTINCT ?s WHERE { ?s rdfs:label [] } on the RDF graph that was supplied to your application by perhaps another SPARQL query. Instead of having a single sparql query on the supplier side that specifies the set of (RDF term, graph fragment) pairs for the consumer to operate on, right now we need a second specification of the set of resources on the consumer side that has to be in sync with the supplier. If in the example data the authors and publications both had labels then the consumer needs to repeat the pattern for selecting specifically the publications - something that the supplier knew all along - but the supplier cannot communicate that in a standard way. That's the core of the problem.

[edit: I am assuming an architecture where the view is 'dumb' - it just works on the resources supplied to it; e.g. validation of the involved data is an orthogonal concern]
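A small Python sketch of the situation described above, with made-up data: when both authors and publications carry labels, selecting "resources with a label" on the consumer side returns both kinds, so the consumer has to repeat the supplier's selection pattern. The helper name subjects_with and the ex: identifiers are illustrative only.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
DCT_CREATOR = "http://purl.org/dc/terms/creator"

# Hypothetical CONSTRUCT result as plain (subject, predicate, object) tuples.
triples = [
    ("ex:paper1", RDFS_LABEL, "A paper"),
    ("ex:paper1", DCT_CREATOR, "ex:alice"),
    ("ex:alice", RDFS_LABEL, "Alice"),
]


def subjects_with(triples, predicate):
    """Consumer-side equivalent of SELECT DISTINCT ?s WHERE { ?s <predicate> [] }."""
    return sorted({s for s, p, _ in triples if p == predicate})


# Both the publication and the author match; the graph alone does not say
# which of them the supplier meant as the entity to display.
candidates = subjects_with(triples, RDFS_LABEL)
```

Narrowing candidates down to publications requires a second pattern on the consumer side (here, for instance, "has a dct:creator"), which is exactly the duplicated knowledge the comment refers to.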

namedgraph commented 3 years ago

So why don't you build your UI on the SELECT result table?

You can get a graph (CONSTRUCT/DESCRIBE result) from a projection (SELECT result), but not the other way around. I'm sure you know this.

It seems that you want to have your cake (graph) and eat it (treat it as a table) too. Since you cannot do that, you need a secondary projection in the UI layer.

VladimirAlexiev commented 3 years ago

also related to #48

afs commented 3 years ago

Related: #86 -- CONSTRUCT DISTINCT and REDUCED
Related: #31 -- CONSTRUCT GRAPH

TallTed commented 3 years ago

Related: #33 -- SELECT ... FROM CONSTRUCT ...

Aklakan commented 3 years ago

Related: #33 -- SELECT ... FROM CONSTRUCT ...

#33 is about rewriting SPARQL SELECT queries over views, similar to SPARQL-to-SQL rewriting. The difference is that in #33 a set of CONSTRUCT queries takes the role of the view definitions (for which there is R2RML in the SQL world).

This is an orthogonal feature to those related to somehow shaping data objects. A data object is ID + state. In RDF this translates to a resource plus a graph fragment. General objects have behavior in addition.

justin2004 commented 3 years ago

a corresponding quad-based result format

http://www.scholarlydata.org/sparql/ might not support TriG and N-Quads, but if it did you could construct quads:

CONSTRUCT {
graph ?pub {
  ?pub
    rdfs:label ?label ;
    dct:creator ?content ;
    eg:sortKey ?firstAuthorName .

  ?content foaf:name ?name .
  }
}
...

You can do that with Jena.

@Aklakan does that help?

afs commented 3 years ago

See also : #31

Aklakan commented 3 years ago

Hi @justin2004, the 'graph name which includes a subject with the same IRI' pattern does indeed help with parts of the raised issue and I am using it quite a lot nowadays.

Does this pattern actually have any established name?

The limitation is that named graphs can only be IRIs. While literals cannot appear as subjects anyway, blank nodes still require non-portable workarounds to craft ad-hoc IRIs from the bnode labels.

In principle Construct Graph in #31 could have a generalized RDF flavor, but that would be quite incompatible with existing syntaxes and tooling I suppose.

The PARTITION BY extension would still be useful to have limit / offset work on a level that's suitable for entities out-of-the-box.

I guess for most cases the proposed ENTITY keyword could indeed mostly be covered by Construct { Graph ?x { ?x ... } }. An application can also parse this pattern out from the construct template in case it wants to inject additional triple patterns for fetching information related to ?x.

VladimirAlexiev commented 3 years ago

@Aklakan (and @edgardmarx ) this is very much needed.

But don't we also need to figure out how to do this for any number of nested levels? Examples:

@namedgraph can you show these 2 in SPARQL 1.1?

edgardmarx commented 3 years ago

@VladimirAlexiev @Aklakan ,

If you think about it, what we are proposing is exactly the mechanism behind GraphQL but using SPARQL, with many advantages. SPARQL is a full GRAPH query language while GraphQL is a CSS way of querying something :-). Thus, I would say, you can bound the level by what is expressed in the query itself; of course you can benefit from having/using RDF*.

TallTed commented 3 years ago

@VladimirAlexiev -- Your examples will require moderately complex SPARQL queries. The first should be doable without a subquery, unless you're federating things because, for instance, the alma mater records are in a different RDF store than the papers & their authorship. The second may require 2 or 3 subqueries, depending on where the data is stored.

SPARQL is solved from inside-out (sometimes referred to confusingly as bottom-up, which may be incorrectly interpreted as starting at the bottom of the lexical query, rather than starting at the deepest subquery), so that's where you put the crux of each query (i.e., the graph pattern that delivers 100+ papers along with (optionally) an ORDER BY clause and the LIMIT 100 clause).

That subquery should be able to also return the first author of those 100 papers, with the addition of another line of graph pattern.

For your first example, if the alma mater records are in the same graph, you just need to add an OPTIONAL pattern for the first author's alma mater, presuming that its presence in the graph means it can be shared. If you also need to test for that share-ability, there would obviously be a bit more complexity -- but probably only a bit.

For the second example, it's more complicated (as you know), but it should be doable with pure SPARQL. It can get easier if you're working with a multi-model, multi-language SPARQL processor such as Virtuoso's hybrid of SQL & SPARQL.

(ObDisclaimer: OpenLink Software produces Virtuoso, both FOSS and Commercial Editions, and employs me.)

Aklakan commented 2 years ago

Does this pattern actually have any established name?

graph-per-entity sounds simple enough :)

But don't we also need to figure out how to do this for any number of nested levels

Hm good question - so far I was thinking that it might be sufficient only having this PARTITION BY feature on the top level while on inner levels correlated joins (#100) could be used - in fact #100 shows an example SQL syntax (postgresql) where the PARTITION BY keyword is used to number items within a group - I am not yet too familiar with the semantics of that feature of postgresql but it seems there is some overlap that might serve as inspiration for a SPARQL-based solution.

namedgraph commented 2 years ago

I'm struggling to follow all the proposals, but we have one use case that could be related. Using sub-queries for pagination and then automatically wrapping them into graph queries like this

DESCRIBE ?thing
{
    SELECT ?thing ?smth
    {
        ?thing ?p ?smth .
        ...
    }
    ORDER BY ?smth
    LIMIT 20
}

we had to impose a restriction that the sub-queries must return one resource (in this case ?thing) per result binding. Otherwise the pagination in the final graph (DESCRIBE results) will break (in this case return fewer than 20 resources).
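The effect can be reproduced on plain rows. The following Python sketch (hypothetical data and function names) contrasts slicing on result rows with slicing on distinct resources: with more than one binding per resource, a row-level LIMIT yields fewer resources than requested.

```python
def page_rows(rows, limit, offset=0):
    """Row-level slicing, as LIMIT/OFFSET on the sub-query does."""
    return rows[offset:offset + limit]


def page_resources(rows, key, limit, offset=0):
    """Resource-level slicing: pick distinct resources first (in order of
    first appearance), then keep every row that belongs to them."""
    keys = list(dict.fromkeys(r[key] for r in rows))[offset:offset + limit]
    wanted = set(keys)
    return [r for r in rows if r[key] in wanted]


# Resource "a" has two bindings, so it fills a page of two rows on its own.
rows = [
    {"thing": "a", "smth": 1},
    {"thing": "a", "smth": 2},
    {"thing": "b", "smth": 3},
    {"thing": "c", "smth": 4},
]

by_row = {r["thing"] for r in page_rows(rows, limit=2)}
by_resource = {r["thing"] for r in page_resources(rows, "thing", limit=2)}
```

Here by_row contains a single resource although two were requested, while by_resource contains two.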

VladimirAlexiev commented 2 years ago

@namedgraph I think what you describe is closely related to the matter described in #100.

If an entity's RDF description needs to include a bunch of sub-entities, each with its own order and limit, then I don't believe this can be done with plain SPARQL 1.1 inside-out evaluation.