tdwg / tnc

Taxonomic Names and Concepts Interest Group

TCS is just fine as it is #7

Closed rdmpage closed 4 years ago

rdmpage commented 6 years ago

I propose a different, hopefully complementary, approach to this discussion. I am going to argue that the existing TCS (i.e., the TDWG LSID version https://github.com/tdwg/ontology/tree/master/ontology/voc) is absolutely fine and exactly what we need. If you disagree, I'm going to ask you to prove me wrong. I think the way to find out whether TCS works is to test it explicitly. This will require some messing about with RDF and SPARQL (gack), but I think this way we can make some progress.

Idea

I propose the following:

  1. We have a triple store that has the TCS loaded up, together with some real examples.
  2. We test the assertion that TCS does everything we need by writing SPARQL queries.
  3. If we can write a sensible query, then TCS passes that test.
  4. If we can't write a query, or if we could write a better query if TCS was modified, then we have a list of things we need to change/add.

I know SPARQL is hideous at times, but it is powerful, and it enables us to have explicit tests that we can discuss. Note that the core of SPARQL is matching paths in graphs, so if we have a model of the things we care about and their relationships, and we can draw paths between them ("connect the dots"), then we can convert that into SPARQL.
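To make the "connect the dots" idea concrete, here is a minimal Python sketch of graph pattern matching over a handful of triples. It is a toy, not a SPARQL engine: the `match` function, the `TRIPLES` set, and the abbreviated prefixed names (e.g. `tn:TaxonName` standing in for the full IRI) are assumptions made for illustration. The patterns mirror the "Is this a name?" query used in this thread.

```python
# Toy basic-graph-pattern matcher; an illustration, not a SPARQL engine.
# Prefixed names such as "tn:TaxonName" abbreviate the full IRIs (assumption).
TRIPLES = {
    ("urn:lsid:ipni.org:names:77160201-1:1.2.1.2", "tn:nameComplete", "Begonia elachista"),
    ("urn:lsid:ipni.org:names:77160201-1:1.2.1.2", "rdf:type", "tn:TaxonName"),
    ("tn:TaxonName", "rdfs:label", "Taxon Name"),
}


def match(patterns, triples, binding=None):
    """Yield each variable binding that satisfies every pattern in turn."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # A variable is bound now, or must agree with its earlier binding.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # Every position matched, so continue along the path.
            yield from match(rest, triples, candidate)


# A path through the graph: string -> thing -> type -> label.
QUERY = [
    ("?thing", "tn:nameComplete", "Begonia elachista"),
    ("?thing", "rdf:type", "?type"),
    ("?type", "rdfs:label", "?label"),
]
RESULTS = list(match(QUERY, TRIPLES))
```

Each pattern shares a variable with the previous one, which is what turns three separate lookups into a single path through the graph.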

SPARQL

OK, so I've set up a SPARQL interface here: https://funny-leather.glitch.me. It is connected to a triple store that I will add data to. If anyone has data they want added, or wants to work on some queries, just let me know.

Examples

We have a bunch of use cases from @deepreef and @jgerbracht. So what I have in mind is taking these one by one and, if they are in scope, coming up with a query to address each use case. To start with I've loaded an example name Begonia elachista Moonlight & Tebbitt urn:lsid:ipni.org:names:77160201-1 into the triple store.

Names Test 0: Is this a name?

The first test is whether a string is a scientific name. Not on the use case list, but a starting point. So, is Begonia elachista a name? The SPARQL query is:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX tn: <http://rs.tdwg.org/ontology/voc/TaxonName#>

SELECT * 
WHERE { 
  ?thing tn:nameComplete "Begonia elachista" .
  ?thing rdf:type ?type . 
  ?type rdfs:label ?label .
} 

You can run the query here and see that the answer is "yes":

thing type label
1 urn:lsid:ipni.org:names:77160201-1:1.2.1.2 http://rs.tdwg.org/ontology/voc/TaxonName#TaxonName Taxon Name

"Begonia elachista" has type http://rs.tdwg.org/ontology/voc/TaxonName#TaxonName and via the TCS vocabulary we can get an English language string saying that type is a "Taxon Name". +1 for TCS.

Names Test 1: In what publication was a scientific name first established?

This is the first question on @deepreef 's list. The query is:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX tn: <http://rs.tdwg.org/ontology/voc/TaxonName#>
PREFIX tcom: <http://rs.tdwg.org/ontology/voc/Common#>

SELECT * 
WHERE { 
  ?thing tn:nameComplete "Begonia elachista" .
  ?thing tcom:publishedIn ?publishedIn . 
} 

Try it

And the answer is:

thing publishedIn
urn:lsid:ipni.org:names:77160201-1:1.2.1.2 Eur. J. Taxon. 281: 5. 2017 [17 Feb 2017] [epublished]

Now ideally we'd have a DOI for this publication (it is https://doi.org/10.5852/ejt.2017.281 ) but IPNI often doesn't know the DOI, and even if it does, it doesn't include it in the RDF. But there is a term in TCS that we could use, so if we added the DOI to the triple store we would get the DOI. So +1 to TCS.

Note that if we want something more granular than an article-level link (the 5 in "Eur. J. Taxon. 281: 5. 2017 [17 Feb 2017] [epublished]" points to a specific page), then we will need additional terms (such as the W3C annotation terms).

Names Test 6: Where is the type specimen for a scientific name?

I'd argue that this query is partly outside the scope of TCS as once we link to a specimen then it's up to whatever vocabulary describes the specimen to give us that information. However, with some databases we can get close:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX tn: <http://rs.tdwg.org/ontology/voc/TaxonName#>
PREFIX tcom: <http://rs.tdwg.org/ontology/voc/Common#>

SELECT * 
WHERE { 
  ?thing tn:nameComplete "Begonia elachista" .
  ?thing tn:typifiedBy ?typifiedBy . 
  ?typifiedBy tn:typeSpecimen ?typeSpecimen .
}

Try it

thing typifiedBy typeSpecimen
urn:lsid:ipni.org:names:77160201-1:1.2.1.2 b0 Moonlight & Daza 318, MOL
urn:lsid:ipni.org:names:77160201-1:1.2.1.2 b1 Moonlight & Daza 318, E
urn:lsid:ipni.org:names:77160201-1:1.2.1.2 b2 Moonlight & Daza 318, MO
urn:lsid:ipni.org:names:77160201-1:1.2.1.2 b3 Moonlight & Daza 318, USM

IPNI stores names for type specimens, not links :(. In an ideal world these types would be linked to something, such as GBIF or a URL to the natural history collection's web database. Looking at GBIF's records for this species (GBIF: 9451157) we can get links for some of these types, e.g. "Moonlight & Daza 318, E" is occurrence https://www.gbif.org/occurrence/1305189344 (see also http://data.rbge.org.uk/herb/E00785221 ). So to fully answer this question we need to have the occurrence information in RDF, and create a link between IPNI and the specimen identifiers. But this is not a limitation of TCS, so +1.

Summary

These are just some preliminary notes, and the examples are pretty trivial, but I think this approach might provide a way forward. If we can explicitly state what it is we want to do, and have some examples using "real" data (which may be from an existing provider, or we may have to create some data sets), then I think we will be able to define more clearly what it is we're after, and whether my claim that the existing TCS is all we need is, indeed, correct.

baskaufs commented 6 years ago

Here is a dataset that might be useful to play with:

Agricultural Research Council: Catalogue of Afrotropical Bees. http://doi.org/10.15468/u9ezbh Accessed via http://www.gbif.org/dataset/da38f103-4410-43d1-b716-ea6b1b92bbac on 2016-10-26

It includes many possible pieces that could be connected using the existing TCS model. It is one of the datasets I played around with and described in this blog post in the section called "Taxon core with Occurrence, TypesAndSpecimen, Distribution, Reference, and Description extensions: Catalogue of Afrotropical Bees". There were a few additional comments about the dataset in the following post.

In my messing around, I used some of the TCS properties in my graph model (described in the post). The triples (as RDF/Turtle) can be downloaded here, but since my purpose was to use as many DwC terms as possible rather than to fully implement TCS, the dataset should probably be re-mapped to more fully embody the TCS graph model. (The mapping files that I used are here but probably won't make sense to anyone who hasn't already messed with Guid-O-Matic.) I don't have time to try re-mapping it myself right now, but if this line of inquiry continues for long enough, I might be able to work on it in a few weeks.

Oh, ho! I see that I put an example record here!

hlapp commented 6 years ago

@rdmpage 👍 to your proposed approach. It brings us firmly back to first defining concretely what the competency questions are (such as in the form of queries and expected results), and then determining the ontology that can satisfy them.

I'm also a firm believer in Occam's Razor, so IMHO the ontology we should be looking for is the simplest one that satisfies the competency questions, not a more elaborate one, whether driven by philosophy or moral objectives.

rdmpage commented 6 years ago

Names Test 3: Is a scientific name a homonym (either within a Code or across Codes)?

Here we test one of the classical "hemihomonyms", that is, a name which occurs in two Codes. Agathis montana is both the name of a wasp and the name of a tree. So, a simple query would be to see how many Codes have a given name:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX tn: <http://rs.tdwg.org/ontology/voc/TaxonName#>
PREFIX tcom: <http://rs.tdwg.org/ontology/voc/Common#>

SELECT *
WHERE { 
  ?thing tn:nameComplete "Agathis montana" .
  ?thing tn:nomenclaturalCode ?code .
} 

Try it

giving:

thing code
urn:lsid:ipni.org:names:92693-1:1.1.2.1.1.1.2.1.1.1 http://rs.tdwg.org/ontology/voc/TaxonName#botanical
urn:lsid:organismnames.com:name:1407520 http://rs.tdwg.org/ontology/voc/TaxonName#ICZN
urn:lsid:organismnames.com:name:1953681 http://rs.tdwg.org/ontology/voc/TaxonName#ICZN

Note that we have two zoological names because ION (the source of the names) has two records for Agathis montana (the same problem bedevils IPNI). So, we need to be a little cleverer:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX tn: <http://rs.tdwg.org/ontology/voc/TaxonName#>
PREFIX tcom: <http://rs.tdwg.org/ontology/voc/Common#>

SELECT (COUNT(DISTINCT ?code) AS ?count)
WHERE { 
  ?thing tn:nameComplete "Agathis montana" .
  ?thing tn:nomenclaturalCode ?code .
} 

Try it

This query asks how many distinct Codes contain Agathis montana, and the answer is:

row count
1 2

So, two codes have Agathis montana so it is a cross-Code homonym. TCS +1

Testing for homonyms within a Code is going to get a little messy given the number of duplicates some data sources contain, so we might want to test using publications, taxon authorship, or, in an ideal world, type specimens.
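As a rough sketch of the counting logic, the result rows above can be grouped in plain Python. The function names and the shortened code labels (`tn:botanical`, `tn:ICZN`) are assumptions for illustration. The second function only flags candidates: several records for the same name under one Code may be true homonyms or simply duplicates, which is exactly the ambiguity noted above.

```python
from collections import defaultdict

# Rows copied from the query result above: (name, record identifier, Code).
ROWS = [
    ("Agathis montana", "urn:lsid:ipni.org:names:92693-1:1.1.2.1.1.1.2.1.1.1", "tn:botanical"),
    ("Agathis montana", "urn:lsid:organismnames.com:name:1407520", "tn:ICZN"),
    ("Agathis montana", "urn:lsid:organismnames.com:name:1953681", "tn:ICZN"),
]


def codes_for(name, rows):
    """Distinct Codes under which a name string occurs (like COUNT(DISTINCT ?code))."""
    return {code for row_name, _record, code in rows if row_name == name}


def within_code_candidates(rows):
    """(name, Code) pairs with more than one record: homonyms or duplicates."""
    records = defaultdict(set)
    for name, record, code in rows:
        records[(name, code)].add(record)
    return {key for key, ids in records.items() if len(ids) > 1}


CROSS_CODE = len(codes_for("Agathis montana", ROWS)) > 1
CANDIDATES = within_code_candidates(ROWS)
```

Telling duplicates from genuine within-Code homonyms would need extra evidence per record, such as the publication, authorship, or type specimen.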

rdmpage commented 6 years ago

Names Test 5: What objective (Code-governed) synonyms exist for a scientific name?

One way to tackle this is if the name database has basionym relationships. IPNI and IndexFungorum do (although their coverage is probably not complete).

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX tn: <http://rs.tdwg.org/ontology/voc/TaxonName#>
PREFIX tcom: <http://rs.tdwg.org/ontology/voc/Common#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT *
WHERE { 
  ?thing tn:nameComplete "Agathis montana" .
  ?thing owl:versionInfo ?versionInfo .
  BIND(IRI(REPLACE( STR(?thing),CONCAT(":", ?versionInfo),"" )) AS ?iri). 
  {
    ?name tn:hasBasionym ?iri .
    ?name tn:nameComplete ?nameComplete .
  }
} 

Try it

This query is a mess because IPNI's RDF is, in a word, buggered. They use a version identifier for the name, which makes cross linking within the data almost impossible. A great example of what happens when you design outputs without thinking about users (sigh). So we have to mess about with the name id to get the query to work. The query also works only in one direction (i.e., what names have the query name as their basionym). We'd want to go in the other direction as well (what are the names linked to the basionym of the query name), but IPNI's RDF prevents this. IndexFungorum is probably OK for this sort of query. ION is clueless about basionyms, so zoologists miss out.
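For anyone cleaning the identifiers before loading them, here is a minimal Python sketch of the same fix-up outside SPARQL. The function names and the `HAS_BASIONYM` mapping are hypothetical. Unlike SPARQL's `REPLACE`, which treats its pattern as a regular expression, this version only strips the version when it is a literal suffix. With unversioned identifiers, the lookup can run in both directions.

```python
def strip_version(lsid: str, version_info: str) -> str:
    """Drop a trailing ':<version>' from a versioned LSID, if present."""
    suffix = ":" + version_info
    if version_info and lsid.endswith(suffix):
        return lsid[: -len(suffix)]
    return lsid


def objective_synonyms(name_id, has_basionym):
    """Names sharing a basionym with name_id, including the basionym itself."""
    # If name_id has no recorded basionym, treat it as the basionym.
    root = has_basionym.get(name_id, name_id)
    group = {root} | {n for n, b in has_basionym.items() if b == root}
    return group - {name_id}


AGATHIS = strip_version(
    "urn:lsid:ipni.org:names:92693-1:1.1.2.1.1.1.2.1.1.1", "1.1.2.1.1.1.2.1.1.1"
)
SALISBURYODENDRON = strip_version("urn:lsid:ipni.org:names:77076253-1:1.2", "1.2")

# name -> basionym, using unversioned identifiers (hypothetical in-memory mapping).
HAS_BASIONYM = {SALISBURYODENDRON: AGATHIS}

FORWARD = objective_synonyms(AGATHIS, HAS_BASIONYM)
REVERSE = objective_synonyms(SALISBURYODENDRON, HAS_BASIONYM)
```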

Here's the result:

thing versionInfo iri name nameComplete
urn:lsid:ipni.org:names:92693-1:1.1.2.1.1.1.2.1.1.1 1.1.2.1.1.1.2.1.1.1 urn:lsid:ipni.org:names:92693-1 urn:lsid:ipni.org:names:77076253-1:1.2 Salisburyodendron montanum

So, Salisburyodendron montanum is an objective synonym of Agathis montana. TCS +1, IPNI -1.

rdmpage commented 6 years ago

Taxonomy Test 6: How do the circumscriptions of the same scientific name by two different authorities compare to each other?

This one is for @nfranz, taken from Fig. 1 of https://doi.org/10.1093/sysbio/syw023 where we have two taxon concepts both named Microcebus murinus.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX tn: <http://rs.tdwg.org/ontology/voc/TaxonName#>
PREFIX tcom: <http://rs.tdwg.org/ontology/voc/Common#>
PREFIX tc: <http://rs.tdwg.org/ontology/voc/TaxonConcept#>

SELECT *
WHERE { 
  VALUES ?namestring { "Microcebus murinus" }
  ?concept1 tc:nameString ?namestring .
  ?concept1 tc:accordingToString ?accordingto1 .

  ?concept2 tc:nameString ?namestring .
  ?concept2 tc:accordingToString ?accordingto2 .

  ?relationship tc:fromTaxon ?concept1 .
  ?relationship tc:toTaxon ?concept2 .
  ?relationship tc:relationshipCategory ?relationship_type .

  FILTER(?concept1 != ?concept2)
} 

Try it

This gives this result:

namestring concept1 accordingto1 concept2 accordingto2 relationship relationship_type
Microcebus murinus http://kg-fuseki.sloppy.zone/tc/1993_Microcebus_murinus MSW2 http://kg-fuseki.sloppy.zone/tc/2005_Microcebus_murinus MSW3 http://kg-fuseki.sloppy.zone/tc/1993-2005 http://rs.tdwg.org/ontology/voc/TaxonConcept#Includes
Microcebus murinus http://kg-fuseki.sloppy.zone/tc/2005_Microcebus_murinus MSW3 http://kg-fuseki.sloppy.zone/tc/1993_Microcebus_murinus MSW2 http://kg-fuseki.sloppy.zone/tc/2005-1993 http://rs.tdwg.org/ontology/voc/TaxonConcept#IsIncludedIn

So the 1993 concept of Microcebus murinus is a larger taxon than the 2005 concept of Microcebus murinus. So TCS +1. Note that we could also express these relationships using the RCC5 terms in http://openbiodiv.net/
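To illustrate how RCC-5 terms relate to circumscriptions, here is a small Python sketch that treats each concept as a set of members and derives the relationship between two concepts. The member identifiers are invented for illustration and are not taken from the paper. The symbols are also an assumption: `==` congruent, `<` is included in, `>` includes, `><` overlaps, `!` disjoint.

```python
def rcc5(a: set, b: set) -> str:
    """Return the RCC-5 relationship of circumscription a to circumscription b."""
    if a == b:
        return "=="   # congruent
    if a < b:
        return "<"    # a is properly included in b
    if a > b:
        return ">"    # a properly includes b
    if a & b:
        return "><"   # partial overlap
    return "!"        # disjoint


# Hypothetical members; the real circumscriptions are not listed in this thread.
MSW2_1993 = {"member_a", "member_b", "member_c"}
MSW3_2005 = {"member_a"}

FORWARD = rcc5(MSW2_1993, MSW3_2005)
BACKWARD = rcc5(MSW3_2005, MSW2_1993)
```

Under these assumed members, `>` corresponds to the "Includes" row in the result above and `<` to "IsIncludedIn".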

ghwhitbread commented 6 years ago

Very nice Rod. But these are competency questions for an information system designed to look and behave much like TCS. Systems like APNI, AFD, IPNI, ITIS, CoL+, etc. … the TDWG ontology, TCS itself. We’ve had SPARQL services running off tn:views over APNI/APC and AFD for the past 8 years (currently disabled for system migration, sorry) with almost zero interest. Unusable by most clients, shunned by aggregators. Maybe it was just a sign of the times, and it's yet to have its day. I'm still hopeful. For RDF at least, the power of Linked Open Data to simply implement complex services - like Taxon Name resolution, and for queries across datasets for example - has been well demonstrated. Though for TCS, we now use a local NSL model.

Like most contributors to this discussion we are custodians/developers of existing infrastructure and the question as to how we might model the domain is by now already well determined (for this current iteration). We have offered TCS and the TDWG ontology for export for many years but clients generally need to do what we do with these data and that is just not possible using any of these standards. Delivery is always a compromise. Loss of information, a high barrier for understanding, the lack of adequate semantics, inappropriate generalisations, the “name”, “taxon”, “taxon concept” align/argument ... all contribute to a very poor standing on the reusability index.

Reusability, Interchange, knowing that the data delivered will be reasonably well understood, and represented correctly when it shows up elsewhere. These are the competency questions we are looking for now. A vocabulary for names and classifications, enabling lossless interchange of data (import, export) and good support for their discovery and extract.

At one level, between systems, for users like @rdmpage, a TCS+2 will very likely be the go. But when we deliver data it more often goes to support the taxonomic process, or into local lookup services, reused in controlled vocabularies, for checklist maintenance - into systems that work with the names of taxa. I would like to think that both use cases are possible with a TCS2 modelled as an application profile/ontology over a basic TDWG Names and Trees vocabulary.

rdmpage commented 6 years ago

Thanks Greg, I think there are two things here.

From my perspective the failure of previous attempts rests on several things: the expectation that users would use multiple SPARQL endpoints, the poor quality of the RDF (most of it not linked in any meaningful sense, just an RDF serialisation of data silos), the lack of rich content that people actually want (e.g., the absence of the literature), etc. I would argue that if we create properly linked data we can build rich clients on top of a centralised SPARQL server. My GBIF challenge entry is a proof of concept ( https://ozymandias-demo.herokuapp.com ), and this is built on the LSID TCS vocabulary (supplemented by a vocabulary from @frmichel that handles things TCS makes awkward to do) and http://schema.org

What isn't clear to me is whether the previous failures are due to:

  1. limitations or complexity of TCS (is TCS comprehensible, does it do what we want?)
  2. limitations in the available data (we have lots of RDF for names, most of it problematic and weakly connected, if at all)
  3. insufficient interest in the problem TCS was meant to solve (people have created massive, heavily used databases without TCS; maybe it's not actually needed?)

You write:

Reusability, Interchange, knowing that the data delivered will be reasonably well understood, and represented correctly when it shows up elsewhere. These are the competency questions we are looking for now. A vocabulary for names and classifications, enabling lossless interchange of data (import, export) and good support for their discovery and extract.

For the sake of argument I'm asserting that if we used TCS and had properly described and linked data, we could do all this. Note that I'm not saying that I necessarily believe this, I'm simply asking whether it's possible. In other words, if we have good tools and documentation based on TCS can we achieve the goals you outline?

mdoering commented 6 years ago

Can we agree to refer to the ratified standard which is an XML Schema as TCS and to the TCS ideas ported to RDF as the TDWG Ontology? I find this confusing.

rdmpage commented 6 years ago

@mdoering Apologies for any confusion, my interest in this topic dates from the LSID discussions of 2005 onwards so I’ve never paid the XML Schema any attention, focussing instead on RDF. If using "TDWG Ontology" helps avoid confusion I’ll happily use that.

nfranz commented 6 years ago

Re: https://github.com/tdwg/tnc/issues/7#issuecomment-426608213. Excellent, looks great. Thanks, @rdmpage

P.S.: In this paper https://www.researchgate.net/publication/252228152_Perspectives_Towards_a_language_for_mapping_relationships_among_taxonomic_concepts, page 9, Table 3, I listed a number of terms that in my view should mostly/somehow find their way into an updated TCS, because they are useful. For instance, with TCS2 we should be able to express, in the case of "splitting", that {2005.TCL1 + 2005.TCL2 + 2005.TCL3} == 1993.TCL4, where (e.g.) the taxonomic name Microcebus murinus may participate in both TCL1 and TCL4.
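The splitting statement can be checked mechanically if each concept is treated as a set of members. The Python sketch below uses invented member identifiers (only the concept labels come from the comment above), and `is_split_of` is a hypothetical helper, not a term from the paper.

```python
def is_split_of(parts, whole):
    """True if the parts are pairwise disjoint and together equal the whole."""
    union = set().union(*parts)
    # If any member appeared in two parts, the sizes would not add up.
    pairwise_disjoint = sum(len(p) for p in parts) == len(union)
    return pairwise_disjoint and union == whole


# Hypothetical circumscriptions for illustration only.
TCL1_2005 = {"m1", "m2"}
TCL2_2005 = {"m3"}
TCL3_2005 = {"m4", "m5"}
TCL4_1993 = {"m1", "m2", "m3", "m4", "m5"}

SPLIT_HOLDS = is_split_of([TCL1_2005, TCL2_2005, TCL3_2005], TCL4_1993)
```

The name attached to each concept plays no part in the check, which is why Microcebus murinus can label both TCL1 and TCL4 without contradiction.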

rdmpage commented 6 years ago

@nfranz I confess my initial reaction to Table 3 is a concern that adding more and more terms to describe relationships risks making things more complicated than they need to be. For example, if a relationship can be derived by a query then does there also need to be a term for that relationship? That said, the “plus” and “minus” terms could be used to describe relationships between classifications in terms of tree edit operations, which strikes me as more economical than listing mappings. So, maybe having more terms will help support alternative mechanisms for describing taxonomic change.

nfranz commented 6 years ago

@rdmpage - thanks. A counterpoint here would be that all these terms are spatial, and hence compatible with and informative for spatial logic reasoning. Some are shortcuts for convenient human use, yes, and not representing them would be fine for reasoning purposes.

Another way of saying this: the terms give someone an opportunity to "creatively" assert regions of congruence between classifications where such instances of congruence may not be very obvious. To paraphrase an example: "Take away one concept in classification 1 from this parent and add it to that parent, and then you have congruence otherwise with classification 2". Maximizing opportunities to express congruence (RCC-5: ==), in turn, allows reasoning approaches to be maximally "greedy" in terms of deducing other spatial relationships between classifications through transitivity rules. In that context, it helps to have a more diverse relationship vocabulary.
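A small Python sketch may help show why congruence is so useful for transitive deduction. It encodes only the RCC-5 compositions whose outcome is a single relationship; every other combination is left undetermined (`None`) rather than enumerating the full composition table. The symbols (`==`, `<`, `>`, `><`, `!`) and the function name are assumptions for illustration.

```python
# Compositions with a single certain outcome: given A r1 B and B r2 C, deduce A ? C.
# "==" congruent, "<" is included in, ">" includes, "><" overlaps, "!" disjoint.
CERTAIN = {
    ("<", "<"): "<",   # part of a part is a part
    (">", ">"): ">",   # includes something that includes C
    ("<", "!"): "!",   # A inside B, B disjoint from C
    ("!", ">"): "!",   # A disjoint from B, B includes C
}


def compose(r_ab, r_bc):
    """Deduce the A-to-C relationship, or None when this sketch cannot decide."""
    if r_ab == "==":
        return r_bc    # congruence passes the other relationship through unchanged
    if r_bc == "==":
        return r_ab
    return CERTAIN.get((r_ab, r_bc))


VIA_CONGRUENCE = compose("==", "><")
UNDETERMINED = compose("><", "><")
```

Congruence always yields a definite answer, whereas two overlaps yield nothing certain, so each additional `==` that can be asserted lets more relationships be deduced.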