ropensci / rdflib

:package: High level wrapper around the redland package for common rdf applications
https://docs.ropensci.org/rdflib

Named Graphs #23

Closed pdatascience closed 6 years ago

pdatascience commented 6 years ago

Since it is sometimes necessary to track the provenance of a triple using named graphs (see TriG, https://en.wikipedia.org/wiki/TriG_(syntax)), I was wondering whether there is a way to express them via rdflib.

Example:

 :G1 { :Monica ex:name "Monica Murphy" .      
       :Monica ex:homepage <http://www.monicamurphy.org> .
       :Monica ex:email <mailto:monica@monicamurphy.org> .
       :Monica ex:hasSkill ex:Management }

One way to express this would be to store the objects internally as quads {context, subject, predicate, object}. In [RDF4R](https://github.com/pdatascience/rdf4r/tree/master/R) I use a less flexible approach: context is a serialization parameter, and everything inside a ResourceDescriptionFramework object gets serialized in the given context.

Edit: Just realized that the example is wrong (I had googled some working draft).

cboettig commented 6 years ago

@pdatascience Good question. How objects are stored internally is determined by the redland library, which has some pretty strong opinions on the subject. Please read: http://librdf.org/notes/contexts.html

In particular, recall that redland has its own internal storage model, which can be serialized in whatever form you like (e.g. nquads; in fact, you've probably noticed that's my default print format for the rdf objects in this package). However, the backend storage models all use something much more performant than just serializing RDF as nquads (e.g. Virtuoso, Postgres, SQLite, BDB, or in-memory storage). Not all backend storage mechanisms can support contexts, so using contexts places an extra burden on users, who have to be careful about this. Even for those backends that do support it, as you can see in the redland documentation, contexts have to be enabled explicitly.

So yeah, contexts are possible from the low-level API, but they add complexity here around backend compatibility.

Can you say a bit more about the use case you have in mind?

pdatascience commented 6 years ago

The use case is as follows. We RDF-ize taxonomic treatments or taxonomic articles published by Pensoft and Plazi. In the very simple case you might have a taxonomic article serialized like

 :d7219741-31da-4e99-9257-02afe41dd3b4  {
 :d7219741-31da-4e99-9257-02afe41dd3b4   rdf:type   fabio:JournalArticle ;
     skos:prefLabel   "10.3897/zookeys.716.21150" ;
     dc:title   "A new species of the carpenter bee genus Xylocopa from the Sarawat Mountains in southwestern Saudi Arabia (Hymenoptera, Apidae)"@en ;
     prism:doi   "10.3897/zookeys.716.21150" .
# and so on
}

But then this article might say something about Apidae, such as that there is a new occurrence record somewhere. This information will then get RDF-ized and put in a named graph. I am just reusing the ID of the article because it is essential to know who made this claim and, if need be, to filter it out later on.

pdatascience commented 6 years ago

Another use case would be nano-publications http://nanopub.org

cboettig commented 6 years ago

@pdatascience ah, but this is no problem, and can easily be done with triples in the usual way, using URIs for subjects and objects. E.g. consider the library example on the JSON-LD playground. This doesn't need a fourth position for the name of the entire graph; you just want subject and object names (er, identifiers).

cboettig commented 6 years ago

Just to follow up on this.

I think there are some arguable edge cases for named graphs, but I think they are rarely necessary and easy to abuse. For instance, in your example, I think you'd rather want:

 :d7219741-31da-4e99-9257-02afe41dd3b4  
         rdf:type   fabio:JournalArticle ;
     skos:prefLabel   "10.3897/zookeys.716.21150" ;
     dc:title   "A new species of the carpenter bee genus Xylocopa from the Sarawat Mountains in southwestern Saudi Arabia (Hymenoptera, Apidae)"@en ;
     prism:doi   "10.3897/zookeys.716.21150" .
# and so on

I really feel that the nanopub spec is another example where graph names are doing more harm than good. There's absolutely no reason to name the graphs in those cases either, and it replaces a very nice data model with a semantically very convoluted one.

For instance, it is natural to think of a nanopub (or any other pub) as having properties like authors, date created, etc., but an object of type Nanopublication does not. Instead, it has a graph-valued property "publicationInfo", which doesn't appear to be a type of anything, and which has these properties. Adding this additional layer of grouping doesn't require the use of a graph name at all; we could just do the following (using the JSON-LD serialization of RDF, which I find more intuitive and practical than Turtle):

{   
      "@type": "nanopub:Nanopublication",
      ...
      "hasPublicationInfo": {
        "@id": "NanoPub_1_Pubinfo",
        "pav:authoredBy": ["http://www.researcherid.com/rid/B-6035-2012", "http://www.researcherid.com/rid/B-5927-2012"],
    ... 
}

i.e. you can get the block of "publicationInfo" just using the subject URI, "NanoPub_1_Pubinfo", rather than a graph URI.

The same is not true of the provenance block, because it refers to the nanopub URI directly, i.e. we need a quad rather than a triple to say: <provenance_graph> <Assertion> <wasDerivedFrom> <some_object_reference>. But again, this only arises from making the semantics needlessly convoluted. It would make way more sense to put provenance as a node-valued property of the relevant assertion, rather than as a graph-valued property of the nanopub, e.g. like so:

  "nanopub:hasAssertion": {
        "@id": "NanoPub_1_Assertion",
        "@type": "sio:statistical-association",
        "sio:has-measurement-value": {
          "@id": "Association_1_p_value",
          "@type": "sio:probability-value",
          "sio:has-value": 0.0000656211037469712
        },
        "hasProvenance": {
          "opm:wasDerivedFrom": "http://rdf.biosemantics.org/vocabularies/text_mining/gene_disease_concept_profiles_1980_2010"
        }
}

I highly recommend looking at examples in things like schema.org/CreativeWork and schema.org/Dataset; you can get a lot of mileage out of referring to deeply nested / complexType objects by ids, without the named graph concept. I also highly recommend JSON-LD and the JSON-LD playground for fiddling with these concepts: there's great developer-friendly tooling around JSON, and the design of JSON-LD with @id, @type, and @context is super clever. I think you will find it more intuitive than Turtle or the other serializations.

pdatascience commented 6 years ago

@cboettig This is a response to my example. Nano-pub comment will follow.

Certainly, there is a way to model provenance with pre RDF 1.1 (no named graphs) but I don't quite understand your suggestion. It is probably my bad for not giving the example fully.

  1. I have the article info as stated above.

But then

  2. I have some triples that have nothing to do with the article. I am making this up now for the example, as the actual model in the ontology is a little bit tricky. Look here if you're interested (https://github.com/darwin-sw/dsw).
 :d7219741-31da-4e99-9257-02afe41dd3b4 {
bioimages:966 a dsw:IndividualOrganism ;
 dsw:hasOccurrence bioimages:em2447#occ .
}

Then I put these triples inside a named graph whose name happens to correspond to the ID of the article (but doesn't have to), so that I can trace the provenance of these statements back to the article.

In a nutshell, I put all of the triples that I extract from any given article (including those whose subject or object is not the article URI) in a separate named graph, to keep track of what came from where.

This also makes removing all the triples that are stated in an article trivial - just use DROP GRAPH.
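To make the DROP GRAPH point concrete, here is a minimal sketch in plain Python (it is not rdflib or redland code). It assumes an in-memory store of quads; the `Quad` type, the `drop_graph` helper, and the `:some-other-article` graph name are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple, Set


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: str  # name of the graph (context) the triple belongs to


def drop_graph(store: Set[Quad], graph: str) -> Set[Quad]:
    """Return a copy of the store without the quads in the given named graph."""
    return {quad for quad in store if quad.graph != graph}


# Graph names: the first reuses the article ID from the example above,
# the second is a made-up placeholder for some other article.
ARTICLE = ":d7219741-31da-4e99-9257-02afe41dd3b4"
OTHER_ARTICLE = ":some-other-article"  # hypothetical

store = {
    Quad("bioimages:966", "rdf:type", "dsw:IndividualOrganism", ARTICLE),
    Quad("bioimages:966", "dsw:hasOccurrence", "bioimages:em2447#occ", ARTICLE),
    Quad(":organism2", "rdf:type", "dsw:IndividualOrganism", OTHER_ARTICLE),
}

# Everything extracted from the first article disappears in one step,
# including triples that never mention the article URI themselves.
remaining = drop_graph(store, ARTICLE)
for quad in sorted(remaining):
    print(quad)
```

Because the graph name travels with every triple, removal needs no knowledge of which subjects or objects the article happened to talk about.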

pdatascience commented 6 years ago

Now to the nanopub example. I agree that provenance can be tracked by creating an assertion object. But Pensoft wants to use the nanopub.org schema in the future, so I need to support it in RDF4R.

Back to my immediate case, though: If I make everything that an article states an assertion -- I believe ontologists call this process reification -- in my opinion, I would be adding a layer of semantic complexity instead of syntactic complexity. I would need to define an assertion class and its properties. Do I need different types of assertions about different things, or do I add a generic assertion class that is a way to express any triple (e.g. hasSubject, hasObject, hasPredicate)?
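For comparison, a rough sketch of the generic reification option in plain Python. It uses the standard RDF reification vocabulary (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`) in place of the hasSubject/hasPredicate/hasObject properties mentioned above; the `reify` helper, the `_:assertion1` identifier, and the `ex:makesAssertion` property are hypothetical and exist only for this illustration.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def reify(statement_id: str, triple: Triple, source: str) -> List[Triple]:
    """Describe one triple as an assertion node and link it to its source."""
    subject, predicate, obj = triple
    return [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
        # Hypothetical property linking the article to the assertion.
        (source, "ex:makesAssertion", statement_id),
    ]


original = ("bioimages:966", "rdf:type", "dsw:IndividualOrganism")
article = ":d7219741-31da-4e99-9257-02afe41dd3b4"

reified = reify("_:assertion1", original, article)
for triple in reified:
    print(triple)
```

One domain triple becomes five triples, and the original triple no longer appears directly in the data, which is the added semantic layer described above.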

The system is already in production and uses named graphs. Certainly, this can change in the future, but for now it is impossible to change due to time constraints. We will be presenting the updated version of the system at TDWG and I can only work on it till July (this is when my PhD ends).

You say that the redland model supports named graphs. This means I could write something for my use-case in the fork of rdflib and you could decide whether to pull it into rdflib. Maybe a fourth optional parameter to rdf_add or something similar. As redland is a little overwhelming, I would be really thankful if you point me in the right direction of how to do a named graph Hello World in redland.

pdatascience commented 6 years ago

P.S.: Just thought of this: what happens to inference if triples are not triples but assertions, so that the subject, predicate, and object are all objects of some assertion instance?

PPS: Also edited my initial post because I realized that my first example was wrong as I had copy-pasted from some draft paper. Maybe this leads to some confusion about what I mean.

cboettig commented 6 years ago

Thanks for the replies, it's interesting to see your use cases. I think it would help to look at a less trivial example of what you're trying to do, but it sounds like you are using graphs in place of classes, e.g. you put the article in one named graph and the triples describing the bioimage in another named graph? I think in every case your named graphs should be classes. To associate some information with a publication, you want an identifier for the thing of class "publication", not for a graph that just happens to contain only the paper.

So taking a richer example from the Darwin SW example you linked:

 r <- rdf_parse("https://raw.githubusercontent.com/darwin-sw/dsw/master/examples/dsw-example1.rdf")

Sure, it's more complex, but it doesn't contain any named graphs. These are all normal triples, with no use of context. Take a look at the JSON-LD representation and the RDF tabular representation for this same file here: http://tinyurl.com/yba2g3l4

Re nanopub, yeah, I get that you're locked into the existing schema, though I think there are other well-established schemas you might consider too, like schema.org, which is used by both the major search engines and scientific repository organizations like DataCite. I'm afraid I didn't follow your questions about assertions -- I'm not proposing using assertions; that's coming straight from the nanopub spec: every nanopub has three parts: pub info, assertion, provenance (as three named graphs). My example was taken directly from the nanopub example here: http://nanopub.org/wordpress/?page_id=57, showing that the same statements could be represented more easily without named graphs.

pdatascience commented 6 years ago

@cboettig I also thank you for taking the time to look at my use cases.

I should have used another example -- preferably real-world -- from the very beginning instead of trying to write something ad-hoc for the sake of simplicity. Let us for now also forget the Nanopubs -- I brought them up as an established standard that does use named graphs but I do not have an informed opinion on how they compare to JSON-LD based systems, schema.org, or others.

I do not use graphs instead of classes. Indeed, I consider a graph to be a collection of nodes and edges. In the RDF model, a graph is a set of triples. I use RDF contexts in the sense of RDF 1.1. A context (graph name) is a resource identifier, just like the resource identifiers you use for subjects, predicates, and (non-literal) objects. Thus, contexts are a way of partitioning the default graph formed by all RDF statements in your triple store into named graphs.

I consider a class to be something that you define in an ontology via owl:Class (see for example the ontology that I wrote for biodiversity publishing https://github.com/pensoft/OpenBiodiv/blob/master/ontology/openbiodiv-ontology.ttl). Naturally, classes are also represented as identifiers but semantically have a different meaning than instances of that class (think of Russell's class theory). Namely the relationship between a class and an instance is often encoded as

<instance> rdf:type <class>

In the universe of discourse of OpenBiodiv, articles are represented as instances of fabio:JournalArticle http://www.sparontologies.net/ontologies/fabio. I have already shown how to model articles before. However, our universe of discourse contains concepts that have nothing to do with publishing. For example, one of the latest BDJ articles https://bdj.pensoft.net/article/22175/instance/3801129/ states there is an occurrence of Gnophomyia acheron Alexander, 1950 in the Kivach Nature Reserve in Karelia. You can use Darwin-SW to express such statements as RDF irrespective of whether or not they come from a taxonomic article. In particular, you would write this:

<organism1> a dsw:Organism ;
dsw:hasIdentification <identification1> ;
dsw:hasOccurrence <occurrence1> .
<occurrence1> dsw:occurrenceOf <event1>.
<event1> dsw:locatedAt <location1>.
<location1> dwc:country "Russia";
  dwc:stateProvince "Karelia" .
<identification1> dsw:toTaxon <taxon1>.
<taxon1> a openbiodiv:TaxonomicConcept ;
  openbiodiv:hasTaxonomicConceptLabel <tcl1>.
<tcl1> a openbiodiv:TaxonomicConceptLabel ;
  dwc:genus "Gnophomyia" ;
  dwc:specificEpithet "acheron" .

This is still somewhat of a stripped-down example but there is no ad-hoc "cheating for the sake of simplicity" in it.

Now here is what we want to express in English:

  1. We want to express the article metadata: DOI, authors, etc.
  2. We want to express the occurrence information.
  3. We want to express the provenance, i.e. we want to be able to trace a particular piece of occurrence information back to a particular article.

Modeling-wise, we do this in the following way:

  1. We express the article metadata as triples according to SPAR (FaBiO).
  2. We express the occurrence information as triples according to OpenBiodiv-O and Darwin-SW.
  3. We put both of these RDF-izations in the same named graph.
  4. For the name of the graph we reuse the identifier that we had used to denote the article ID.
<http://doi.org/10.3897/BDJ.6.e22175>
{
  <http://doi.org/10.3897/BDJ.6.e22175> a fabio:JournalArticle ;
    prism:doi "10.3897/BDJ.6.e22175" ;
    dc:creator <person1> .

  <organism1> a dsw:Organism ;
    dsw:hasIdentification <identification1> ;
    dsw:hasOccurrence <occurrence1> .
  <occurrence1> dsw:occurrenceOf <event1>.
  <event1> dsw:locatedAt <location1>.
  <location1> dwc:country "Russia";
     dwc:stateProvince "Karelia" .
  <identification1> dsw:toTaxon <taxon1>.
  <taxon1> a openbiodiv:TaxonomicConcept ;
    openbiodiv:hasTaxonomicConceptLabel <tcl1>.
  <tcl1> a openbiodiv:TaxonomicConceptLabel ;
    dwc:genus "Gnophomyia" ;
    dwc:specificEpithet "acheron" .
}

What happens mathematically here is that, on top of the binary relations defined between <http://doi.org/10.3897/BDJ.6.e22175> and <person1>, the literal node with the DOI, etc., there is an n-ary relationship between <http://doi.org/10.3897/BDJ.6.e22175> and all the subjects, predicates, and objects in the named subgraph, or -- I prefer to look at it this way -- a relationship between <http://doi.org/10.3897/BDJ.6.e22175> and the set of triples in the subgraph. There is no way to define a relationship between a node and a triple in the original RDF model, as only binary relationships are allowed.

That is not to say that there is no way to express the semantics of the above in "just triples." As I stated before, one can reify the organism occurrence information as assertions.

<http://doi.org/10.3897/BDJ.6.e22175> :makesAssertion <assertion1>.
<assertion1> ## But what happens here??

However, you would need to have a separate model (you cannot reuse OpenBiodiv-O or Darwin-SW) to model the species or occurrence information as assertions. One of the ideas behind named graphs is that you can reuse all of the domain ontologies and add provenance information (where did the facts come from?) without reifying the statements made in said domain ontologies as assertions. This is achieved by giving a context (fourth argument) to every triple. Thus I can take the triple

<tcl1> dwc:genus "Gnophomyia" .

and ask for its context, which is http://doi.org/10.3897/BDJ.6.e22175. In my model this also happens to be an instance of fabio:JournalArticle (not class!), which has some properties that let me locate its metadata.
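The "ask a triple for its context" step can be sketched in plain Python as a lookup from triples to graph names. This is only an illustration over an in-memory list of quads, not rdflib or redland code; `index_contexts` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]
Quad = Tuple[str, str, str, str]  # subject, predicate, object, graph name


def index_contexts(quads: Iterable[Quad]) -> Dict[Triple, Set[str]]:
    """Map every triple to the set of named graphs in which it is stated."""
    index: Dict[Triple, Set[str]] = defaultdict(set)
    for subject, predicate, obj, graph in quads:
        index[(subject, predicate, obj)].add(graph)
    return index


ARTICLE = "http://doi.org/10.3897/BDJ.6.e22175"

# A few statements from the example above, each tagged with its graph name.
quads = [
    ("<tcl1>", "rdf:type", "openbiodiv:TaxonomicConceptLabel", ARTICLE),
    ("<tcl1>", "dwc:genus", '"Gnophomyia"', ARTICLE),
    ("<tcl1>", "dwc:specificEpithet", '"acheron"', ARTICLE),
]

contexts = index_contexts(quads)
print(contexts[("<tcl1>", "dwc:genus", '"Gnophomyia"')])
```

The domain triples stay exactly as Darwin-SW and OpenBiodiv-O define them; provenance lives only in the fourth position.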

One can argue about whether this is the only way or the best way to do the provenance tracking, but I do not see a logical flaw or an error here. It is possible to track provenance like this, and leading graph databases such as GraphDB make extensive use of contexts. I do not particularly like extending the RDF model ad infinitum -- Rod Page is now talking of hex-stores (apparently a 1950's idea predating the Internet), etc. On the other hand, I do see value in models more expressive than just triples -- for example, Neo4J has properties on edges, something that in RDF has to be achieved via reification of property nodes. I try to be as practical as possible in these matters, and I have found named graphs to work excellently for me and to actually reduce complexity compared to reification.

Let me know if I've made my use-case clear so that I can be more qualified to contribute to the discussion of whether or not it is a good thing to have this functionality in rdflib :)

and sorry for the long post

pdatascience commented 6 years ago

Dear Carl,

I don't know if you managed to look at my somewhat verbose post.

Regardless, I think the issue is now superfluous/solved, as I have figured out how to integrate rdflib fully into RDF4R and have named graph support. The way to do it is to wrap an rdflib object inside an RDF4R object and during serialization to Turtle, use rdf4r to generate the opening graph name, the opening {, the closing }, and dump the rdflib output in-between. This is somewhat of a hack, but should work ok. I might need to do a little more processing and move any prefix statements that rdflib dumps outside of the named graph, but this should also be doable via some text-processing.
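A rough sketch of that text-processing step, written in plain Python rather than R and not taken from RDF4R. It assumes every prefix or base directive in the Turtle dump sits on its own line; `wrap_in_named_graph` is a hypothetical helper name.

```python
def wrap_in_named_graph(turtle: str, graph_iri: str) -> str:
    """Wrap Turtle text in a TriG graph block, keeping directives outside it."""
    directives = ("@prefix", "@base", "prefix ", "base ")
    prefixes, body = [], []
    for line in turtle.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith(directives):
            prefixes.append(stripped)  # directives must stay outside the braces
        elif stripped:
            body.append("  " + line.rstrip())  # indent statements inside the block
    lines = prefixes + ["<" + graph_iri + "> {"] + body + ["}"]
    return "\n".join(lines) + "\n"


turtle_dump = """@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
<tcl1> dwc:genus "Gnophomyia" ;
  dwc:specificEpithet "acheron" .
"""

print(wrap_in_named_graph(turtle_dump, "http://doi.org/10.3897/BDJ.6.e22175"))
```

A line-based approach like this would break on directives that share a line with statements, so a real implementation may need a proper Turtle tokenizer.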

Anyway, unless you have further questions for me, I suggest you close this rdflib issue. I would also like to let you know that I have moved RDF4R to a [new location](http://github.com/vsenderov/rdf4r) under my real name. I consider RDF4R stable, and I will be releasing version 1.0 and looking for ways to publish it.

It will be a library that I plan to keep working on as I do have a lot of ideas on how to improve it and extend it, but I thought releasing now might be OK. I will also start looking at the ROpenSci onboarding process and will be very happy if it makes it there. Either way, thanks a lot for the ideas and the collaboration!

Best, Viktor

cboettig commented 6 years ago

Thanks for the updates!