tdwg / dwc-qa

Public question and answer site for discussions about Darwin Core
Apache License 2.0

Should WoRMS LSID be the value of dwc:taxonID or dwc:scientificNameID in Occurrence core/extension? #203

Open ymgan opened 11 months ago

ymgan commented 11 months ago

Hi,

I have been on this quest for a while now because our project team is tasked to align OBIS quality checks with the Core Tests and Assertions from TDWG BDQ TG2. I talked to folks from GBIF Norway, GBIF Helpdesk, OBIS Secretariat, WoRMS, TDWG BDQ TG2 and a couple of GBIF/OBIS nodes specifically about this question, but the answers I got are all different. It is very frustrating to me when there is no consensus between the opinions. Hence I am opening this issue, summarizing what I understood and it would be great if we could find a consensus and solution together.

edit: This issue is talking specifically about the usage of WoRMS LSID in Occurrence

As some of the discussions took place through Slack, email or in person, I could not link all of the information in this GitHub issue. Please correct me if I am wrong in any sense.

Definitions

dwc:taxonID

An identifier for the set of dwc:Taxon information. May be a global unique identifier or an identifier specific to the data set.

dwc:Taxon

A group of organisms (sensu http://purl.obolibrary.org/obo/OBI_0100026) considered by taxonomists to form a homogeneous unit.

dwc:scientificNameID

An identifier for the nomenclatural (not taxonomic) details of a scientific name.

Why does OBIS recommend using the dwc:scientificNameID field for WoRMS LSID?

From the response I received from WoRMS helpdesk (it was a thread migrated from OBIS Slack), there are a couple of reasons:

  1. It is the easiest to be understood by data providers and managers who need to populate the field.
  2. The argument that the information received by WoRMS really is a scientificName. WoRMS LSID is 1-1 linked to the name provided regardless of whether the taxon is accepted or not. When WoRMS LSID is provided as scientificNameID, people can find their way to WoRMS where taxonomic status, current accepted name and other taxon information is documented.
  3. OBIS hopes to mitigate the burden of keeping track of names, taxonomic status and synonyms (which may change over time) by recommending the use of WoRMS LSID in dwc:scientificNameID (having WoRMS track those changes) and leaving out any Taxon related field. -- this is from this comment
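As a rough illustration of the recommendation above, the sketch below builds one Occurrence row that carries a WoRMS LSID in dwc:scientificNameID and leaves out Taxon-specific fields. WoRMS LSIDs follow the pattern `urn:lsid:marinespecies.org:taxname:<AphiaID>`; the AphiaID `123456`, the occurrenceID and the helper name `is_worms_lsid` are placeholders invented for this example.

```python
import re

# WoRMS LSIDs have the form urn:lsid:marinespecies.org:taxname:<AphiaID>.
WORMS_LSID = re.compile(r"^urn:lsid:marinespecies\.org:taxname:(\d+)$")


def is_worms_lsid(value: str) -> bool:
    """Return True if value looks like a WoRMS LSID (format check only)."""
    return WORMS_LSID.match(value) is not None


# Hypothetical Occurrence record; the identifier values are placeholders.
occurrence = {
    "occurrenceID": "example:occ:1",
    "basisOfRecord": "HumanObservation",
    "scientificName": "Mytilus edulis",
    "scientificNameID": "urn:lsid:marinespecies.org:taxname:123456",
}
```

The check only validates the shape of the string; it does not confirm that the identifier exists in WoRMS.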

Can WoRMS LSID be used for dwc:taxonID?

Opinion from WoRMS

The response I received from the WoRMS helpdesk (via email) is that taxonID is an identifier for a taxon concept, not a name, and that WoRMS does not have such a concept. They also remarked that the marine community links observations to names, not to concepts.

This made me wonder if there is a confusion between dwc:taxonID and dwc:taxonConceptID?

Implementation concern from GBIF

GBIF Helpdesk once responded that it may not be a good idea to have WoRMS LSIDs as taxonIDs because they are not stable. TaxonIDs in the GBIF context should be identical between versions of the dataset, and they could potentially change if they come from unstable LSIDs.

The stability concern, I believe, refers to the fact that WoRMS does not have stable identifiers for taxon concepts.

Please see more in the comments:

Why should WoRMS LSID be used for dwc:taxonID?

WoRMS is not an authoritative source of information on nomenclatural acts

This is perhaps the biggest argument I received for why WoRMS LSID should not be used for the dwc:scientificNameID field. @chicoreus mentioned in the comment that the definition of dwc:scientificNameID explicitly points at an authoritative source of information on nomenclatural acts, i.e. nomenclators. Since WoRMS is not an authoritative source of information on nomenclatural acts, it is not appropriate to use dwc:scientificNameID for WoRMS LSID. @mdoering also mentioned the concern in this comment.

dwc:taxonID is an identifier without a particular meaning to the instance of the Taxon class

Following @chicoreus comment which aligns well with the Darwin Core definition for dwc:taxonID:

It is an identifier for the package of information associated with a Taxon class, without linking a particular meaning (name string, nomenclatural act, taxon concept, taxon concept including classification) to the instance of the Taxon class. The dwc:taxonID serves as the identifier for the set of information in the terms in a dwc:Taxon instance, without applying additional semantics to the dwc:Taxon instance.

My perspectives as a data manager for both GBIF and OBIS node

Difference in interpretation leads to difficulty in collaboration

It is VERY difficult for me as a data manager for both a GBIF and an OBIS node when interpretations differ on whether WoRMS LSID should be populated under dwc:taxonID or dwc:scientificNameID. For example, I had this conversation when I attended a workshop organized by Nansen Legacy and GBIF Norway. GBIF Norway thinks that WoRMS LSID should be populated under dwc:taxonID, but OBIS and WoRMS insisted that it should be populated under dwc:scientificNameID for the reasons stated above. Furthermore, dwc:scientificNameID is a mandatory field for OBIS. I appreciate that @pieterprovoost was pragmatic and mentioned that he will look for a solution, such as parsing dwc:taxonID in OBIS data processing. The data could be interpreted better if there were a consensus here.
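The pragmatic fallback mentioned above (also looking at dwc:taxonID when processing a record) could be sketched roughly as follows. This is not OBIS's actual processing code; the function name `find_aphia_id`, the field order and the sample identifiers are assumptions made for illustration.

```python
import re
from typing import Optional

WORMS_LSID = re.compile(r"^urn:lsid:marinespecies\.org:taxname:(\d+)$")

# Fields checked in order: scientificNameID first, then taxonID as a fallback.
CANDIDATE_FIELDS = ("scientificNameID", "taxonID")


def find_aphia_id(record: dict) -> Optional[int]:
    """Return the AphiaID from the first field holding a WoRMS LSID, else None."""
    for field_name in CANDIDATE_FIELDS:
        value = (record.get(field_name) or "").strip()
        match = WORMS_LSID.match(value)
        if match:
            return int(match.group(1))
    return None
```

With a fallback like this, a dataset prepared for GBIF with the LSID in taxonID would still be matched, at the cost of the aggregator guessing what the publisher meant.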

Implications for the future

The new unified data model

I really hope we could find a consensus now rather than having this carry over to the new data model (see screenshot below, taken today, 2023-07-20).

Screenshot 2023-07-20 at 15 17 53

I am aware that the model is still in an immature state. Based on my email conversation with WoRMS, the same issue seems to persist: to WoRMS, it makes sense to add observedScientificNameID to the ReportedAbundance table.

My questions

Can we reach a consensus on whether WoRMS LSID should be used for dwc:taxonID or dwc:scientificNameID?

Right now the standard seems to suggest that dwc:taxonID should be used for WoRMS LSID, but the implementation side seems to suggest otherwise. So what exactly should a data manager like me do? This is so frustrating!

Is there anything unclear about the usage of dwc:taxonID, dwc:taxonConceptID or dwc:scientificNameID that should be improved in Darwin Core documentation?

If so, what is it? What leads to different interpretations between different people/organizations? If we could identify that, a term change request should perhaps be submitted.

Thank you

Thank you everyone who talked to me and helped me in understanding this in any way! I hope I summarized the issue well. I definitely am not the most tactful person, apology if I stepped on your ego. Please correct me if I said anything wrong!

mdoering commented 11 months ago

Basically I agree that WoRMS LSIDs are name identifiers and thus belong in dwc:scientificNameID, not taxonID or taxonConceptID.

DwC unfortunately still is inconsistent in its taxon/name/usage identifier documentation. dwc:Taxon IS NOT used as a taxon concept - at least in all checklist dwc archives I have seen and in all code I know that works with them. Contrary to the Taxon class definition it is used for name usages, i.e. taxa or synonyms. That is why there is also dwc:parentNameUsageID, dwc:originalNameUsageID and dwc:acceptedNameUsageID - all of which point to the dwc:taxonID, not dwc:scientificNameID or taxonConceptID.

dwc:taxonID is the primary key for the dwc:Taxon class, just as dwc:occurrenceID is for dwc:Occurrence. As the Taxon term name comes from the very early days of Darwin Core it was retained, although something like NameUsage and nameUsageID would have been more appropriate.
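To make the primary key / foreign key relationship concrete, here is a minimal sketch of a checklist in which dwc:taxonID keys each row and dwc:parentNameUsageID and dwc:acceptedNameUsageID point back to it. The identifiers `t1` to `t3`, the names and the helper `accepted_usage` are made up for illustration.

```python
# Hypothetical checklist rows keyed by taxonID (all identifiers are made up).
checklist = {
    "t1": {"scientificName": "Aus", "taxonomicStatus": "accepted",
           "parentNameUsageID": None, "acceptedNameUsageID": None},
    "t2": {"scientificName": "Aus bus", "taxonomicStatus": "accepted",
           "parentNameUsageID": "t1", "acceptedNameUsageID": None},
    "t3": {"scientificName": "Xus bus", "taxonomicStatus": "synonym",
           "parentNameUsageID": None, "acceptedNameUsageID": "t2"},
}


def accepted_usage(taxon_id: str) -> dict:
    """Follow acceptedNameUsageID (a foreign key to taxonID) for synonyms."""
    row = checklist[taxon_id]
    target = row["acceptedNameUsageID"]
    return checklist[target] if target else row
```

Note that the synonym `t3` has its own taxonID, which is the sense in which each row is a name usage and not a taxon concept.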

When it comes to occurrence datasets though, you do not want to use taxonID or scientificNameID as a primary key of a checklist that uses other terms as foreign keys. All you want is to refer to an external definition for the name OR taxon concept. For these, GBIF recommends using scientificNameID or taxonConceptID respectively (even though GBIF still does not make much use of those, but that will change at some point).

@tucotuco I think the definition of dwc:Taxon class term really needs to be changed to align with the other taxonomic ID terms.

tucotuco commented 11 months ago

@tucotuco I think the definition of dwc:Taxon class term really needs to be changed to align with the other taxonomic ID terms.

Proposals for term changes are always welcome via the term change issue template.

bart-v commented 11 months ago

According to their definitions, there are elements of dwc:taxonID, dwc:taxonConceptID and dwc:scientificNameID attached to a WoRMS LSID. So, it's really a gray zone... But naturally it feels like dwc:scientificNameID is the preferred option & thus, I agree with @mdoering

So let's just standardize on dwc:scientificNameID

Both DwC and WoRMS might evolve over time, so we might need to update this.

ymgan commented 11 months ago

Thank you very much @mdoering and @bart-v !! I appreciate your comments on this.

dwc:Taxon IS NOT used as a taxon concept - at least in all checklist dwc archives I have seen and in all code I know that works with them. Contrary to the Taxon class definition it is used for name usages, i.e. taxa or synonyms.

When it comes to occurrence datasets though, you do not want to use taxonID or scientificNameID as a primary key of a checklist that uses other terms as foreign keys.

Thanks @mdoering !! These are not obvious to me at all! How would you suggest to update the definition of dwc:Taxon class?

On the other hand, is there a reason why scientificNameID is so restrictive (nomenclators only) ? I asked because I am thinking to propose a term change request for scientificNameID to broaden its scope to include identifiers for scientific name that are not from a nomenclator (e.g. WoRMS LSID). Does this sound sensible to you?

bart-v commented 11 months ago

... identifier for the nomenclatural ... details of a scientific name does not mean it can only contain identifiers from nomenclators. The definition just wants to say that it's not an identifier for anything taxonomy-related. So, I see no issue here.

baskaufs commented 11 months ago

I would like to get some input from the TCS Task Group on this subject (ping @nielsklazenga). When we wrote the DwC RDF Guide, it was understood that clarity was needed in defining what exactly a taxon was. The result was the normative content in Section 2.7.4 of the RDF Guide, which basically put off creating a robust definition of a taxon until TCS was revised as a current TDWG standard. The terms organized under the dwc:Taxon class were considered "convenience terms" that could be used in lieu of a robust model for taxa.

When the cleanup of dwc: and dwctype: classes took place in 2014, dwc:Taxon was not rigorously defined. Similarly, the property dwciri:toTaxon was created as a way to link out to some "taxonomic entity" to be more clearly defined at a later time:

The task of describing taxonomic entities using RDF will have to be an effort outside of Darwin Core. This guide does establish the object property dwciri:toTaxon for use in relating a Darwin Core identification instance to a taxonomic entity as defined elsewhere. (Section 2.7.4 of the RDF Guide)

In both of these situations, I think there was an assumption that the TCS revision would clear up the definitions.

It is my understanding that the TCS task group (upon whom the task defining a "taxonomic entity as defined elsewhere" has fallen) is in the final stages of work. I am assuming that they will provide more clarity about the relationships among "taxa", "taxon name usages", and names. I think it would be best to learn more about their conclusions before proposing changes to Darwin Core terms (which I am assuming will ultimately be adjusted as necessary to be consistent with their model).

nielsklazenga commented 11 months ago

It depends on whether the entries in WoRMS are meant to be taxa or names. For Catalogue of Life entries, I would use dwc:taxonID, while for entries in Zoobank and IPNI, I would use dwc:scientificNameID. I suspect WoRMS will be more like the Catalogue of Life.

The two often get confounded though. Plants of the World Online uses the IPNI identifiers, but there they are dwc:taxonIDs and I reckon also in Catalogue of Life the URI stays the same if the definition of the taxon is changed but the name is not. But that is a matter of when to assign new identifiers, not what the identifiers stand for.

ben-norton commented 11 months ago

@nielsklazenga Can you provide a reference that describes the difference between taxon and names within this context?

dbloom commented 11 months ago

@nielsklazenga I would like to see a reference/source, too. That would be most helpful.

In addition, I just wish to lend my support and interest to this conversation. As a node manager who is publishing data to multiple aggregators, this issue of the use of the WoRMS LSID is of particular interest to my work and I would certainly appreciate some clarification to guide me and others to whom I provide training.

nielsklazenga commented 11 months ago

For taxon, see dwc:Taxon; for taxon name, see dwc:scientificName and/or dwc:vernacularName. Darwin Core does not have a class for taxon names and dumps properties of both taxa and names in the Taxon class – which is why the Taxon class is not suitable for use with RDF (Darwin Core RDF Guide) – but TCS and the TDWG Ontology both have both TaxonConcept (which is equivalent to the Darwin Core Taxon) and TaxonName classes.

To put it as simply as it really is, if a taxon is a skos:Concept (which it is, cf. Senderov & al., 2018), a taxon name is a skosxl:Label. It is not more complicated than that.

We hope to bring TCS 2 to public review around the time of TDWG 2023. In the meantime, the Catalogue of Life Data Package (CoLDP) is really good and, if you use the schema with the Taxon table, completely TCS compliant. In the CoLDP schemas, Taxon is the tcs:TaxonConcept, while NameUsage is the dwc:Taxon.

Going back to Occurrence data, which I missed yesterday was what you were talking about, and assuming that WoRMS is taxonomic data and not Occurrence data, the only appropriate field for the WoRMS LSID would actually be dwc:taxonConceptID. dwc:taxonID is an internal identifier, so should (in my opinion at least) only be used in Taxon Core datasets, while dwc:taxonConceptID references an external taxonomy, like WoRMS. This is assuming, however, that all Identifications use the WoRMS concept. Depending on the dataset, this can be a pretty big assumption to make. For a museum or herbarium collection dataset, for example, if the Identification itself does not already reference a concept (e.g. 'Mytilus edulis sec. WoRMS 2004'), then for a maintainer of the dataset to add a dwc:taxonConceptID to an Identification amounts to making a new Identification and blaming somebody else for it.

ghwhitbread commented 11 months ago

A few observations (and an alternative take), Assuming this issue is about the publication of OBIS using Darwin Core (not Occurrence core/extension):

nielsklazenga commented 11 months ago

I agree with everything Greg says. I still say that dwc:taxonConceptID is the answer to the question which field is the most appropriate to put the WoRMS LSID in, but that does not mean it is appropriate to put the WoRMS LSID in primary Occurrence data at all. As the WoRMS LSID most likely reflects nominal concepts (cf. Franz & Peet, 2009), adding it as either scientificNameID or taxonConceptID is probably harmless, but it is also pointless and I would not consider it good practice.

ymgan commented 11 months ago

Thank you SO MUCH everyone who commented above!! I felt that there are some misunderstandings. Please allow me to clarify a couple of things:

What happens to a record in WoRMS and its LSID when there is a name change?

Screenshot 2023-08-03 at 14 58 15

WoRMS = World Register of Marine Species (https://www.marinespecies.org/). When there is a name change, WoRMS creates a new record for the new name and the two records point to each other (please see figure above). The same applies to their children (if there are any). Hence, the taxonomic status changes (accepted/unaccepted) but the LSID associated with each name remains the same. In other words, WoRMS LSID has a 1-1 relationship with each name.
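The behaviour described above can be sketched as a small data structure. This is only a model of the description in this comment, not the WoRMS schema or API; the class `NameRecord`, its field names, the function `change_name` and the LSID numbers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NameRecord:
    lsid: str                              # never changes once assigned
    name: str                              # the name this LSID is tied to, 1-1
    status: str = "accepted"               # "accepted" or "unaccepted"
    accepted_lsid: Optional[str] = None    # set on an unaccepted record
    synonym_lsids: List[str] = field(default_factory=list)


def change_name(registry: Dict[str, NameRecord], old_lsid: str,
                new_lsid: str, new_name: str) -> None:
    """Create a record for the new name and cross-link it with the old one."""
    old = registry[old_lsid]
    old.status = "unaccepted"
    old.accepted_lsid = new_lsid
    registry[new_lsid] = NameRecord(new_lsid, new_name,
                                    synonym_lsids=[old_lsid])


# Hypothetical example: "Aus bus" is replaced by "Xus bus".
OLD = "urn:lsid:marinespecies.org:taxname:1"
NEW = "urn:lsid:marinespecies.org:taxname:2"
registry = {OLD: NameRecord(OLD, "Aus bus")}
change_name(registry, OLD, NEW, "Xus bus")
```

After the change, the old LSID still resolves to the old name; only its status and its link to the accepted record differ.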

@nielsklazenga does this answer your question below?

It depends on whether the entries in WoRMS are meant to be taxa or names.

Inconsistency between definition and application for dwc:Taxon

I hope I illustrate @mdoering comment correctly in the figure below. I believe this is the reason why people are confused about how taxonID should be used and why we are looking for a dwc:Taxon definition.

I thought the purpose of having data standard is so that everyone can use it in the same way, but I felt that this is not the case. If we data publishers are having difficulties with this, it will be even more challenging for end users who use the data we published.

Screenshot 2023-08-03 at 14 26 26

I still say that dwc:taxonConceptID is the answer to the question which field is the most appropriate to put the WoRMS LSID in, but that does not mean it is appropriate to put the WoRMS LSID in primary Occurrence data at all.

@nielsklazenga I am sorry, I do not understand what this sentence means. It is the most appropriate but not appropriate?

I am asking for the right field to use for WoRMS LSID for Occurrence core/extension

Thank you @ghwhitbread ! My bad for not clarifying at the beginning. OBIS is an aggregator like GBIF which harvests Darwin Core Archives published via IPT. GBIF uses the GBIF taxonomic backbone and OBIS uses WoRMS as its taxonomic backbone. I am NOT talking about species checklists, which OBIS does not deal with. I am talking about the Occurrence core/extension of the primary observation data that I publish, in which I would like to use WoRMS LSID as an external identifier for the scientificName of an Occurrence record.

Screenshot 2023-08-03 at 15 00 07

I hope this is clear! Thanks again!

nielsklazenga commented 11 months ago

Hi @ymgan, the fact that there is a one-to-one relationship between the WoRMS LSID and names does not make the LSID an identifier for a Name, but an identifier for a Nominal Concept (Franz & Peet, 2009), basically a taxon for which you do not know the definition exactly. At any given point in time an entry (with an accepted name) in WoRMS is a Relational Concept, i.e. you can infer the definition from the context (its siblings), but over a longer period of time it is a Nominal Concept, because the context may have changed. This is not so much about what type of object an entry is, but all about how identifiers are managed, which is probably the biggest hurdle to making taxonomy really work. There was a question about this in the Catalogue of Life repo (CatalogueOfLife/general/98) just yesterday. This is an issue for WoRMS (and a lot of other systems out there, including my own) though, not OBIS, or occurrence data.

I am not sure if this answers your next question, but I like the way GBIF does it. In GBIF, dwc:scientificName is used for the Identification, while dwc:kingdom, dwc:phylum, dwc:class, dwc:order, dwc:family, dwc:genus and dwc:specificEpithet are used for the Taxon in their taxonomy to which the Identification is matched. In this scenario the WoRMS LSID goes into dwc:taxonConceptID. I think it is important to keep this strict distinction between the provided data and the inferred data. The pedant in me would not call the latter occurrence data.

bart-v commented 10 months ago

FYI: WoRMS puts much effort into keeping the identifiers stable: we manage them manually, so we will never create a new identifier for an existing name/concept/taxon in WoRMS. If it happens anyway, we keep the oldest identifier and mark the duplicate as ReplacedBy.
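For a data consumer, the ReplacedBy rule above means a stored identifier may need to be followed to the one that was kept. A minimal sketch, assuming the replacements are available as a plain mapping from duplicate to kept identifier (how WoRMS actually exposes this is not stated here; the function name `resolve` and the mapping are hypothetical):

```python
def resolve(identifier: str, replaced_by: dict) -> str:
    """Follow ReplacedBy links until reaching an identifier that was kept."""
    seen = set()
    while identifier in replaced_by:
        if identifier in seen:
            raise ValueError("cycle in ReplacedBy links")
        seen.add(identifier)
        identifier = replaced_by[identifier]
    return identifier
```

An identifier that was never replaced is returned unchanged, so the function can be applied to every record without checking first.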

stanblum commented 10 months ago

Rich Pyle has given two talks on the subject at TDWG meetings.

If you'd like to read about it instead, the first formal publication of the ideas from a biodiversity informatics perspective is probably Berendsohn, Walter G. 1995. The Concept of Potential Taxa in Databases. Taxon 44(2). https://www.researchgate.net/publication/247816280_The_Concept_of_Potential_Taxa_in_Databases

On Tue, Aug 1, 2023 at 11:09 AM Ben Norton @.***> wrote:

@nielsklazenga Can you provide a reference that describes the difference between taxon and names within this context?


mdoering commented 10 months ago

the fact that there is a one-to-one relationship between the WoRMS LSID and names does not make the LSID an identifier for a Name, but an identifier for a Nominal Concept (Franz & Peet, 2009), basically a taxon for which you do not know the definition exactly. At any given point in time an entry (with an accepted name) in WoRMS is a Relational Concept, i.e. you can infer the definition from the context (its siblings), but over a longer period of time it is a Nominal Concept, because the context may have changed. This is not so much about what type of object an entry is, but all about how identifiers are managed, which is probably the biggest hurdle to making taxonomy really work.

I would argue the opposite. How can a WoRMS LSID refer to a Taxon if it links to a synonym? How can a taxon identifier change its content from an accepted name to a synonym? If the taxonomic concept changes, but the identifier does not, it is a very bad taxon identifier. On the other hand, if the identifier stays with the name over time, no matter where in the hierarchy it is placed or whether it is accepted, then it is a pretty good name identifier and belongs in dwc:scientificNameID.

The vast majority of our taxonomic systems work with name-based identifiers even though this is not clearly stated. This is true for GBIF, ITIS TSNs, Catalogue of Life, WoRMS, WFO, Fauna Europaea and many more. Avibase, Dyntaxa and iNaturalist are exceptions.

ghwhitbread commented 10 months ago

+1 for scientificNameID (again) … and whether or not it is supposed to be taxon or name (or both), OBIS still uses WoRMS as the source of nomenclatural details for a name.

Jegelewicz commented 10 months ago

OBIS still uses WORMS as the source of nomenclatural details for a name.

Proving that if a big enough user community is using a term some way, then that is how it should be used? I have to say this entire thread is nothing but confusing to me and I think it is because we have poor community-wide agreement on these terms and their meaning.

However, if I were publishing anything to OBIS, I would put the WoRMS LSID in dwc:scientificNameID because if I don't, I might be surprised to find that I am not publishing to OBIS.

ghwhitbread commented 10 months ago

Proving that if a big enough user community is using a term some way, then that is how it should be used?

Identifier | http://rs.tdwg.org/dwc/terms/scientificNameID
Definition | An identifier for the nomenclatural (not taxonomic) details of a scientific name.