w3c / rdf-ucr

https://w3c.github.io/rdf-ucr/

RDF-Star: Some biological database use cases #19

Open JervenBolleman opened 1 year ago

JervenBolleman commented 1 year ago

See https://github.com/w3c/rdf-ucr/wiki/RDF-star-for-explanation-and-provenance-in-biological-data for a clean version of this use case

** Contact information

** Brief Description of your use case:

In UniProt we want to refer to triples to explain or "attribute" why they were added to the UniProtKB graphs. These triples are always asserted, and we might have multiple explanations/attributions or none at all. The explanations and attributions are themselves complicated resources named by an IRI.

At this moment we use RDF reification with consistent IRIs for each triple.

<Q14739#SIP9E6E0C5B850FBF4F> up:fullName "3-beta-hydroxysterol Delta (14)-reductase" .
<#_9297DC3B72792B1A_up.fullName_D4E77F494F58CEA9> rdf:type rdf:Statement ;
  rdf:subject <Q14739#SIP9E6E0C5B850FBF4F> ;
  rdf:predicate up:fullName ;
  rdf:object "3-beta-hydroxysterol Delta (14)-reductase" ;
  up:attribution <Q14739#attribution-XX> .

<Q14739#attribution-XX> up:manual true ;
  up:evidence ECO:0000303 ;
  up:source citation:16784888 .

This syntax is inconvenient and also hard to optimize in general. This is important when the RDF graphs are 100+ billion triples in size.

The example above is evidence to support why a certain protein is described with a specific name.

As the data is extremely large, we cannot afford to maintain mappings that depend on the order of visitation inside a single file to derive a temporary IRI (e.g. in RDF/XML the rdf:ID uniqueness constraint is violated, and it is expensive to check for in UniProt when using it for reification quads). In other words, the identity function for deriving an id for a triple should be stateless and safe to invoke multiple times; we should not be forced to gather all triples that use a triple reference into one co-located set.
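One way to read the stateless-identity requirement is as a pure function from a triple's terms to an IRI. The following is a minimal Python sketch, not UniProt's actual scheme: the `urn:example:triple:` prefix, the choice of SHA-256, the simplified one-line serialization, and the expanded IRIs in the usage lines are all assumptions made for illustration.

```python
import hashlib


def triple_id(subject: str, predicate: str, obj: str) -> str:
    """Derive an identifier for a triple from its terms alone.

    Terms are expected to be already serialized, e.g. '<http://...>' for an
    IRI or '"text"' for a literal. No counter or lookup table is involved,
    so the function can be called from any file, in any order, any number
    of times, and always yields the same result for the same triple.
    """
    # Assumption: a simplified N-Triples-like line stands in for a real
    # canonical serialization.
    line = f"{subject} {predicate} {obj} .\n"
    digest = hashlib.sha256(line.encode("utf-8")).hexdigest()
    return f"urn:example:triple:{digest}"


s = "<http://purl.uniprot.org/uniprot/Q14739#SIP9E6E0C5B850FBF4F>"
p = "<http://purl.uniprot.org/core/fullName>"
o = '"3-beta-hydroxysterol Delta (14)-reductase"'
first = triple_id(s, p, o)
second = triple_id(s, p, o)  # same input, same identifier, no shared state
```

Because the identifier depends only on the three terms, two processes that encounter the same triple in different files would mint the same IRI without coordinating.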

Our use cases for un-asserted triples are extremely rare and would preferably be described explicitly as "inversions" of the normal case, or as explicit non-membership of a class, e.g. something like the following:

uniprot:P1 owl:disjointWith <things_named_X> .
<<uniprot:P1 owl:disjointWith <things_named_X> >> rdfs:comment "P1 does not have an Xthingy so should not be called an X" . 

For other databases we might want to do things like:

ex:1 ex:likes ex:2 .
<< ex:1 ex:likes ex:2 >> ex:confidence ex:high .

and then use the "star" syntax for quickly selecting the triples we have a high confidence for.

*** What you want to be able to do:

Talk about why triples are added to the dataset and how confident our users should be in trusting them.

*** What is the role of RDF-star quoted triples in your use case:

Quoted triples (or content-identified triples) would replace the use case for RDF reification by allowing a more convenient and clearer way to talk about "edges" in an RDF graph.

*** Why it is hard or impossible to do what you want to do without quoted triples:

Reification is not only a lot of typing to get right; it is also difficult for SPARQL engines to optimize in the general case.

*** How you want quoted triples to behave in your use case: (For example, do you want the precise syntax of subjects, predicates, and objects in quoted triples to be important?)

They must be transparent for owl reasoning. UniProt is re-used and re-mixed in many different end user databases. In these they might use different identifiers and map them with owl:sameAs. e.g. often http://identifiers.org/uniprot/X owl:sameAs http://purl.uniprot.org/uniprot/X. Given that sameAs relation all queries should be able to use either of these identifiers and get the "same" result.

SELECT * WHERE { << <http://purl.uniprot.org/uniprot/X> ?p ?o >> }

and

SELECT * WHERE { << <http://identifiers.org/uniprot/X> ?p ?o >> }

must return the same results in an owl:sameAs aware setting.
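To make the expectation concrete, here is a small Python sketch of owl:sameAs-aware matching over quoted triples. It is only an illustration under stated assumptions: the in-memory `annotations` dictionary, the helper names `canonicalizer` and `annotations_for`, and the `ex:` terms are hypothetical, and a real engine would handle this through reasoning or query rewriting rather than like this.

```python
def canonicalizer(same_as_pairs):
    """Return a function mapping an IRI to one representative of its
    owl:sameAs equivalence class (a small union-find)."""
    parent = {}

    def find(iri):
        parent.setdefault(iri, iri)
        while parent[iri] != iri:
            parent[iri] = parent[parent[iri]]  # path halving
            iri = parent[iri]
        return iri

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the lexicographically smaller IRI as the representative
            # so the outcome does not depend on the order of the pairs.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a
    return find


def annotations_for(subject, annotations, canon):
    """All (quoted triple, annotation) pairs whose quoted subject is the
    same resource as `subject` under the sameAs mapping."""
    wanted = canon(subject)
    return sorted(
        (triple, note)
        for triple, notes in annotations.items()
        for note in notes
        if canon(triple[0]) == wanted
    )


same_as = [("http://identifiers.org/uniprot/X", "http://purl.uniprot.org/uniprot/X")]
canon = canonicalizer(same_as)

# Hypothetical data: one quoted triple carrying one annotation.
annotations = {
    ("http://purl.uniprot.org/uniprot/X", "ex:likes", "ex:2"): [("ex:confidence", "ex:high")],
}

via_purl = annotations_for("http://purl.uniprot.org/uniprot/X", annotations, canon)
via_identifiers = annotations_for("http://identifiers.org/uniprot/X", annotations, canon)
```

Both lookups resolve the subject to the same representative before comparing, which is the "transparent" behaviour asked for: the identifier used in the query does not change the result.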

TallTed commented 1 year ago

If I understand you correctly, the closing two example queries should actually be —

SELECT * WHERE { << <http://purl.uniprot.org/uniprot/X> ?p ?o >> ?p1 ?o1 }

— and —

SELECT * WHERE { << <http://identifiers.org/uniprot/X> ?p ?o >> ?p1 ?o1 }

pfps commented 1 year ago

Sorry for the delay in getting this use case organized.

I don't fully understand the part of your use case where you say: "As the data is extremely large we can not afford to maintain mappings that depend on order of visitation inside a single file to derive an temporary IRI." Is this a problem because you don't know the IRI when you query and thus need to do a complex SPARQL query to find the reification and from there the attribution triple?

I have started creating a wiki page for your use case, that will eventually contain a clean description of the use case when we finish teasing out all its aspects. Please take a look at https://github.com/w3c/rdf-ucr/wiki/RDF-star-for-explanation-and-provenance-in-biological-data

lisp commented 1 year ago

what can the pattern implicit in the initial example represent which is beyond a pattern which relies on named graphs? for example:

<Q14739#SIP9E6E0C5B850FBF4F> up:fullName "3-beta-hydroxysterol Delta (14)-reductase" .

<Q14739#attribution-XX> {
  <Q14739#SIP9E6E0C5B850FBF4F> up:fullName "3-beta-hydroxysterol Delta (14)-reductase" .
  <Q14739#attribution-XX> up:manual true ;
  up:evidence ECO:0000303 ;
  up:source citation:16784888 .
  # where, should it be necessary to abstract over the predicate, one could include
  <Q14739#attribution-XX> <Q14739#predicate> up:fullName .
}

this offers the advantage, that it could compactly annotate related triples.

JervenBolleman commented 1 year ago

We would end up with multiple graph membership patterns in practice, and those would not be easy to query for either. For example, one use case we have is loading a number of UniProt releases into different named graphs. At that point it becomes really hard to query for an "attribution" that we put into release 1 and is not present in release 2.
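For reference, the result being asked for is a plain set difference once each release is viewed as a set of (triple, attribution) pairs. A minimal Python sketch under that assumption follows; the in-memory sets and the `attribution-YY` IRI are hypothetical, and the sketch says nothing about how a SPARQL engine would evaluate the equivalent query.

```python
def dropped_attributions(old_release, new_release):
    """Return the (triple, attribution) pairs present in old_release only."""
    return sorted(old_release - new_release)


name_triple = (
    "<Q14739#SIP9E6E0C5B850FBF4F>",
    "up:fullName",
    '"3-beta-hydroxysterol Delta (14)-reductase"',
)

release_1 = {
    (name_triple, "<Q14739#attribution-XX>"),
    (name_triple, "<Q14739#attribution-YY>"),  # hypothetical second attribution
}
release_2 = {
    (name_triple, "<Q14739#attribution-XX>"),
}

dropped = dropped_attributions(release_1, release_2)
```

The comparison only works this directly because the triple itself is the key; with per-release graph names or per-file reification IRIs, the two sides would first have to be aligned.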

lisp commented 1 year ago

... it becomes really hard to query for an "attribution" that we put into release 1 and is not present in release 2.

if you are actually distinguishing quads, rather than triples, and your statement ids incorporate four terms rather than just three, why should it be difficult to distinguish the attributions?

JervenBolleman commented 1 year ago

@lisp the query becomes something like this when using named graphs, which is not really nicer than the reification we have now ;)

SELECT *
WHERE {
  GRAPH release:1 {
    ?p up:fullName "3-beta-hydroxysterol Delta (14)-reductase" .
    ?attr up:manual true ;
          up:evidence ECO:0000303 ;
          up:source citation:16784888 .
  }
  GRAPH ?attr {
    ?p up:fullName "3-beta-hydroxysterol Delta (14)-reductase" .
    ?attr up:manual true ;
          up:evidence ECO:0000303 ;
          up:source citation:16784888 .
  }
}

This selects the attributions of a certain kind that were in a specific release.

niklasl commented 1 year ago

For named graphs to be used in this case, wouldn't the attribution facts still be in the "release" graph? But indeed, the "quoted triple" could be expressed as a singleton graph (or indeed, multiple triples if that's preferred).

So this would be in one specific uniprot release:

graph release:1 {
    <Q14739#SIP9E6E0C5B850FBF4F> up:fullName "3-beta-hydroxysterol Delta (14)-reductase" .
    # This annotates the above triple, which is quoted in the uniquely named graph (further down):
    <urn:tdb:2014:urn:md5:8e08a975b841666a8ff0b7e42e73275a> up:attribution <Q14739#attribution-XX> .
    <Q14739#attribution-XX> up:manual true ;
        up:evidence ECO:0000303 ;
        up:source citation:16784888 .
}

And this would be the same in all releases where the triple is annotated (or merely quoted if it is talked about but not asserted in some specific release):

# The triple, quoted by keeping it in an "existential" singleton named graph,
# uniquely named by a checksum of its NQuads representation:
graph <urn:tdb:2014:urn:md5:8e08a975b841666a8ff0b7e42e73275a> {
  <Q14739#SIP9E6E0C5B850FBF4F> up:fullName "3-beta-hydroxysterol Delta (14)-reductase" .
}

Selecting attributions of a certain kind in a release would then be:

select * {
    graph ?triple {
        ?prot up:fullName "3-beta-hydroxysterol Delta (14)-reductase" .
    }
    graph release:1 {

        # This is the only thing ensuring that ?triple isn't bound to any
        # other named graphs (release) where it is asserted:
        ?triple up:attribution ?attr .

        # The attribution criteria:
        ?attr up:manual true ;
          up:evidence ECO:0000303 ;
          up:source citation:16784888 .

        # Needed if the triple must also be asserted in this release:
        ?prot up:fullName "3-beta-hydroxysterol Delta (14)-reductase" .
    }
}

Of course, this is a "hack", with non-standardized, manual requirements:

  1. You must carefully mint predictable IRIs unique to the existential graph content (in contrast to skolemization). This is "easily" standardized if desirable (since implementations of quoted triples probably need something similar internally anyway). For more than singleton graphs though, we need full RDF C14N. And blank nodes would require consistent skolemization over the dataset.
  2. You need to add lots of named graphs, here, one per annotated triple. (At least it is idempotent to describe such existential graphs more than once, since they would only be re-asserted "types" and not be distinct "tokens".)
  3. For this to work for quoted but not ever asserted triples, these existential graphs MUST NOT be in the union graph.
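Points 2 and 3 can be sketched with a toy in-memory dataset in Python. This is an illustration only, under loud assumptions: the dataset is a plain dictionary from graph name to a set of triples, the checksum is taken over a simplified one-line serialization rather than a real N-Quads canonical form, and the function names and `ex:` terms are hypothetical. Only the `urn:tdb:2014:urn:md5:` prefix is taken from the example above.

```python
import hashlib

QUOTE_PREFIX = "urn:tdb:2014:urn:md5:"  # naming scheme from the example above


def quote_graph_name(s, p, o):
    """Mint a predictable graph name from the triple's content."""
    # Assumption: simplified serialization, not real N-Quads canonicalization.
    line = f"{s} {p} {o} .\n"
    return QUOTE_PREFIX + hashlib.md5(line.encode("utf-8")).hexdigest()


def quote(dataset, s, p, o):
    """Store the triple in its singleton graph and return the graph name.

    Calling this again for the same triple changes nothing (point 2).
    """
    name = quote_graph_name(s, p, o)
    dataset.setdefault(name, set()).add((s, p, o))
    return name


def union_graph(dataset):
    """Union of all named graphs except the existential ones (point 3)."""
    merged = set()
    for name, triples in dataset.items():
        if not name.startswith(QUOTE_PREFIX):
            merged |= triples
    return merged


dataset = {"release:1": set()}
triple = ("<ex:1>", "<ex:likes>", "<ex:2>")

name_a = quote(dataset, *triple)
name_b = quote(dataset, *triple)  # re-describing the same triple is a no-op

# Annotate the quoted triple inside the release graph without asserting it.
dataset["release:1"].add((f"<{name_a}>", "<ex:confidence>", "<ex:high>"))
asserted = union_graph(dataset)
```

Here the quoted triple is talked about in `release:1` but does not appear in the union graph, which is the behaviour point 3 requires for triples that are quoted but never asserted.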

These can be quite challenging (and some could prove a no-go depending on backend), so this pattern is reasonably not good enough as a recommended practice. I do, however, wonder how much further quoted triples need to go beyond supporting it. (See https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023May/0063.html for more on this idea.)

A careful reading of https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3199260, p. 6-7 is also advisable when considering this pattern and possible semantics thereof. (For one, it speculates on a regime where the above could entail <urn:tdb:2014:urn:md5:8e08a975b841666a8ff0b7e42e73275a> a rdf:Statement...)

See also rdf-concepts#46.