tdwg / dwc

Darwin Core standard for sharing of information about biological diversity.
https://dwc.tdwg.org
Creative Commons Attribution 4.0 International

Description of a taxonomic entity in RDF #359

Open tfrancart opened 3 years ago

tfrancart commented 3 years ago

This extract from https://dwc.tdwg.org/rdf/ §2.7.4 leaves me skeptical:

The consensus embodied in the TDWG Taxon Concept Transfer Schema (TCS) standard is that identification instances refer to taxon concept instances. Therefore it would be a best practice to describe taxonomic entities in RDF as taxon concepts sensu TCS. However, because the TCS standard is an XML schema, it is not directly translatable to RDF. It is considered to be out of the scope of this document to specify how taxon concepts should be rendered as RDF. Nevertheless, Darwin Core does define many convenience terms listed under the dwc:Taxon class that can be used as properties of dwc:Identification instances (Section 3.5).

It might be argued that these convenience terms would more appropriately be properties of a dwc:Taxon instance. However, the object properties necessary to relate dwc:Taxon instances to name entities, references, parent taxa, and child taxa do not exist and the exact relationship between taxonomic entities such as taxon concepts, protonyms, taxon name uses, etc. has not been established using RDF. So the creation of functional dwc:Taxon instances described using RDF is not possible at the present time. Therefore this document establishes the convention that convenience terms for taxonomic entities should be properties of dwc:Identification. The task of describing taxonomic entities using RDF must be an effort outside of Darwin Core.

If I understand properly what is written here:

  1. No object properties exist to describe Taxon
  2. Therefore the (string) properties to describe Taxon as defined in DWC, should not be used to describe Taxon when using an RDF serialization, but instead be used to describe an Identification

I don't understand the logical entailment between the two propositions above. Why would the absence of defined object properties to describe Taxon prevent the use of string properties to describe a Taxon?

Besides, as DWC properties do not define a range, why couldn't I use them as object properties?

I don't understand why, as a user of an RDF serialization, I should be forced to make different property-class associations from those made by a user of an XML-based serialization.

See related discussion about domain/range specification of properties here : https://github.com/tdwg/dwc/issues/357

Sorry if these remarks are obvious or out of scope here; I come from the RDF/OWL world, am pretty new to DWC, and am trying to work out how best to use DWC in RDF. Please also direct me to other channels if this is not the best place to raise these topics.

baskaufs commented 3 years ago

I think the underlying answer to these questions has to do with the standards process as it exists in TDWG. As an RDF user, you can use Darwin Core properties to say anything you want. But if you don't use them the same way as others, nobody will understand what you mean. It would be the job of TDWG to define how those properties should be used in a stable way that makes sense to most possible users rather than just one person. As I've said in other responses, using Darwin Core terms in RDF was an afterthought. So when we wrote the RDF guide, a question in our mind was "can we describe a way to use these terms that is not likely to change in the future?"

Once a normative change to a standard is adopted, the Maintenance Group is required to assess all future changes to determine if they will disrupt the stability of the standard. If terms are required to be used in a certain way and the Maintenance Group changes that, things will break. (This is the "stability" requirement discussed in section 3.1 of the Vocabulary Maintenance Specification.) So standards maintainers are reluctant to adopt changes to a standard that are not likely to be stable. In the case of the Taxon class terms, at the time the RDF Guide was created, there had not been enough work done on modeling taxa/taxonConcepts/TNUs for there to be a consensus on how they should be described in RDF. So it seemed best to not attempt to prescribe how Darwin Core terms should be used in that way, given that those terms were really designed with tabular data users in mind. If we had tried to hack together a way to use those terms in RDF, it probably would not have been stable.

Since the RDF Guide was written, a task group has been formed to create a robust model for taxa/taxonConcepts/TNUs. You can find their work here and here. If you are interested in this modeling work, the task group is open to anyone to participate.

The idea of "convenience" properties is described in detail in section 2.7 of the RDF Guide, so I won't go into it here. But the main point is that certain sets of properties in Darwin Core are not intended to be used to create descriptions of resources. Rather, they are intended as aids for searching.

For example, imagine that I have a table with a description of 1000 insects. In each row, I provide a value of "Insecta" for dwc:class and "Arthropoda" for dwc:phylum. Is my intention really to describe the relationship between the class Insecta and the phylum Arthropoda 1000 times? That's silly; one person only needs to do that once. The reason I include those values is so that when someone is searching records in GBIF or some other aggregator, they can easily search for insect or arthropod records.
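To make the "aid for searching" point concrete, here is a minimal sketch in Python. The catalog numbers, the extra mollusc row, and the `search` helper are all hypothetical; the only thing taken from the example above is that every insect row repeats "Insecta" and "Arthropoda". The repeated values say nothing new about how Insecta relates to Arthropoda; they just let a simple filter find the rows.

```python
# Hypothetical tabular records: every insect row repeats the same
# higher-taxonomy strings, as in the 1000-insect example.
records = [
    {"catalogNumber": f"INS-{i:04d}", "class": "Insecta", "phylum": "Arthropoda"}
    for i in range(1, 1001)
]
# One non-insect row (also hypothetical) so the filter has something to exclude.
records.append(
    {"catalogNumber": "MOL-0001", "class": "Gastropoda", "phylum": "Mollusca"}
)


def search(rows, criteria):
    """Return the rows whose values match every key/value pair in criteria."""
    return [
        row for row in rows
        if all(row.get(term) == value for term, value in criteria.items())
    ]


insects = search(records, {"class": "Insecta"})
arthropods = search(records, {"phylum": "Arthropoda"})
print(len(insects), len(arthropods))
```

An aggregator's real search is of course far more elaborate; the point is only that the repeated strings act as search keys on each record, not as 1000 separate descriptions of a taxon.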

So indicating that the string-valued Taxon terms should be used with an Identification instance is a hack to get around the fact that we don't yet have a system for robustly defining taxa in RDF. If we provide a bunch of literal values for taxon-related convenience terms, what we are really doing is saying, "I'd like to link to a permanent IRI of a taxon that's described well in RDF, but since I can't because it doesn't exist yet, here are a bunch of search terms that you could use to find it in the future if someday it is created." The term dwciri:toTaxon was provided in the RDF Guide to make this linking possible at some point in the future.
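The convention described above could look roughly like the following sketch, which uses plain Python and no RDF library. The `example.org` IRIs and all function names are hypothetical, and the N-Triples output is simplified (literal escaping only handles backslashes and double quotes). The namespace IRIs are the published `dwc:` and `dwciri:` ones. The string-valued convenience terms hang off the Identification, and the `dwciri:toTaxon` link is added only if a taxon IRI is ever available.

```python
DWC = "http://rs.tdwg.org/dwc/terms/"
DWCIRI = "http://rs.tdwg.org/dwc/iri/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def iri(value):
    """Wrap an IRI in angle brackets, N-Triples style."""
    return f"<{value}>"


def literal(value):
    """Quote a plain string literal (simplified escaping, for illustration)."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def identification_triples(identification_iri, convenience_terms, taxon_iri=None):
    """Describe an Identification with literal-valued convenience terms.

    If taxon_iri is given, also link the Identification to that taxon
    through dwciri:toTaxon.
    """
    subject = iri(identification_iri)
    triples = [(subject, iri(RDF_TYPE), iri(DWC + "Identification"))]
    for term, value in convenience_terms.items():
        triples.append((subject, iri(DWC + term), literal(value)))
    if taxon_iri is not None:
        triples.append((subject, iri(DWCIRI + "toTaxon"), iri(taxon_iri)))
    return triples


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Hypothetical identification with search-oriented literals only.
triples = identification_triples(
    "http://example.org/identification/1",
    {"scientificName": "Apis mellifera", "class": "Insecta", "phylum": "Arthropoda"},
)
print(to_ntriples(triples))
```

Passing a `taxon_iri` later adds the object-valued link without touching the literals, which matches the idea that the literals are placeholders until a well-described taxon resource exists.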

tfrancart commented 3 years ago

Thank you Steve for the detailed explanation.

My use case involves trying to map a description of Taxon and taxon names to DWC, in RDF (https://www.sandre.eaufrance.fr/urn.php?urn=urn:sandre:dictionnaire:APT:FRA:::ressource:2.1:::pdf, pages 45 and 46 for UML diagrams). I understand from your explanation that this is simply outside the scope of Darwin Core in RDF, correct? (As you said, I could always do it, but I would not be conformant to DWC, and this is not what I want.)

But if I were to try to map the same model to DWC in XML, it would be in scope of DWC, correct? (The XML guide at https://dwc.tdwg.org/xml/ shows examples of dwc:Taxon XML elements with dwc:scientificName, family, order, class, genus, etc.)

So the same set of terms, when used in different serializations, has different usage rules?

Some have tried to embed DWC in JSON-LD in their webpages to describe Taxon, like https://inpn.mnhn.fr/espece/cd_nom/20704 (look at the source starting at line 195; I think the JSON-LD is incorrect and mixes DWC with schema.org, but that's not the point here). This is not consistent with using DWC in RDF, correct?

If my understanding is correct (thanks to your explanations!) then that's a major pitfall in using DWC.

baskaufs commented 3 years ago

I think that it would be great for you to bring your use case to the Taxon Names and Concepts task group (@nielsklazenga is the convener). The kind of modeling you are trying to do is similar to what they want to enable with the development of that standard, and I believe that a robust RDF model is more likely to come from that group than Darwin Core. I think it has not yet been determined exactly how the new TNC standard would be used together with Darwin Core, but I think your use case wouldn't really need Darwin Core if you had the new TNC standard.

I'm familiar with the attempt to use schema.org and JSON-LD that you mentioned. My feeling is that their approach is more of a "quick and dirty" Linked Data approach (to make data available easily to clients that "understand" schema.org) rather than a robust Semantic Web approach that depends on careful modeling. The places in their JSON-LD where I see them actually using dwc: namespace terms are situations where there are literal values for non-xID terms. As far as I know, that's a valid use of those terms according to the DwC RDF guide. The other relationships they describe use terms in the schema: namespace and I'm not sure how strictly they are defined. I suppose they are using them correctly since I think they are on the team that developed those terms. I don't think there are any problems with mixing Darwin Core terms with those from other namespaces like schema.org.

nielsklazenga commented 3 years ago

@tfrancart, I think you are right that there is no reason why you could not use dwc:Taxon with RDF; it is just not endorsed by the Darwin Core RDF Guide (or TDWG), for the reasons mentioned in the Guide and by @baskaufs above.

We have just had a Task Group approved to produce a new version of TCS, which will be written up as a vocabulary standard just like Darwin Core. Our repository should be up in another week or so. We'll be most happy to have your input on how to link everything up.

In the meantime, this publication by Viktor Senderov and others might be useful, as might our earlier discussions in the TNC Repository.