tdwg / dwc-qa

Public question and answer site for discussions about Darwin Core
Apache License 2.0

Field to record the name under which a taxon was originally recorded in an unprocessed/not QC-ed dataset #109

Open Daphnisd opened 7 years ago

Daphnisd commented 7 years ago

When a dataset is submitted to us, the taxon names often include spelling variations. We standardize these names to the valid names from species registers. It's my understanding that the field scientificName should contain a valid taxon name. The question is which DwC field is best suited to store the pre-QC taxon name, which may, for example, contain spelling errors, combine two different taxa (when the scientist thought the taxon could be either of the two), or be a common name for the taxon. Especially when the data come from digitizing the records of retired scientists, storing the original name under which the occurrence was recorded seems important to ensure traceability.

I thought maybe DwC:identification (https://github.com/tdwg/dwc-qa/issues/108) could serve for this, but I don't find this term in the occurrence core.

debpaul commented 6 years ago

Hi @Daphnisd - see http://rs.tdwg.org/dwc/terms/Identification

dagendresen commented 6 years ago

The identification history is organized as a DwCArchive extension [1] (when publishing with the GBIF IPT). Note that some of the dwc:Identification terms you may use are also available in the DwCArchive Occurrence core, e.g. dwc:identificationRemarks [3] (available in both [1] and [2]). When you have an identification history, the DwCArchive extension lets you record a one-to-many relationship to the Occurrence record.

[1] http://rs.gbif.org/extension/dwc/identification.xml
[2] http://rs.gbif.org/core/dwc_occurrence_2015-07-02.xml
[3] http://rs.tdwg.org/dwc/terms/identificationRemarks
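To make the one-to-many relationship concrete, here is a minimal sketch in Python. The rows, the values, and the use of occurrenceID as the linking key are assumptions for illustration only; they are not taken from a real archive or from the IPT itself.

```python
from collections import defaultdict

# Hypothetical core row: one line per occurrence (values are illustrative).
occurrence_core = [
    {"occurrenceID": "occ-1", "scientificName": "Gadus"},
]

# Hypothetical extension rows: several identifications may point
# at the same occurrence through the shared occurrenceID.
identification_ext = [
    {"occurrenceID": "occ-1", "scientificName": "Gadus?",
     "identificationRemarks": "name as originally recorded"},
    {"occurrenceID": "occ-1", "scientificName": "Gadus",
     "identificationRemarks": "name assigned during quality control"},
]


def group_identifications(ext_rows):
    """Group extension rows by the occurrence they belong to."""
    grouped = defaultdict(list)
    for row in ext_rows:
        grouped[row["occurrenceID"]].append(row)
    return dict(grouped)


history = group_identifications(identification_ext)
for occurrence in occurrence_core:
    rows = history.get(occurrence["occurrenceID"], [])
    print(occurrence["occurrenceID"], "has", len(rows), "identification(s)")
```

Each core record can therefore carry any number of identification rows without the core table itself needing extra columns.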

tucotuco commented 6 years ago

The Identification History extension is a good solution if you have more information than just the scientific name that was used, because the extension allows you to also record who made the identification ("determination" for botanists), when, with what references and so on - for multiple determinations. With or without the extension, you can (should) also list any names that were used in the field previousIdentifications [4]. If there is more than one such name, separate the names with "|" in that field. Though common names can also go in the previousIdentifications field, the field vernacularName [5] is meant to capture that information.

[4] http://rs.tdwg.org/dwc/terms/index.htm#previousIdentifications
[5] http://rs.tdwg.org/dwc/terms/index.htm#vernacularName
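As a rough illustration of the "|" convention for previousIdentifications, the Python sketch below joins and splits a list of names. The helper names are hypothetical, and the spaces around the separator are an assumption made for readability; the comment above only specifies "|".

```python
def join_previous_identifications(names, separator=" | "):
    """Join names into one previousIdentifications value.

    Blank entries are dropped and surrounding whitespace is trimmed.
    The padded separator is an assumption, not a rule from this thread.
    """
    cleaned = [name.strip() for name in names if name and name.strip()]
    return separator.join(cleaned)


def split_previous_identifications(value):
    """Split a previousIdentifications value back into individual names."""
    return [part.strip() for part in value.split("|") if part.strip()]


joined = join_previous_identifications(["Gadus", " Gadus morhua ", ""])
print(joined)
print(split_previous_identifications(joined))
```

Note that this only works if the names themselves never contain "|".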

Daphnisd commented 6 years ago

Thank you for the answers. However, I fear I may have confused matters by bringing up the identification term. I’ll try to elaborate in more detail. My question is the following:

As an OBIS node we receive datasets from different providers, data digitization efforts, and so on, and we process these data into DwC. It is my understanding that the term "scientificName" should contain a valid scientific name (meaning it has been published somewhere) and should therefore not contain spelling errors or other information that is not strictly part of the scientific name. Old OBIS guidelines stated that scientificName should contain whatever name was originally provided. We then assign an identifier from WoRMS through which OBIS can get the taxon-related information (see https://www.iode.org/index.php?option=com_oe&task=viewDocumentRecord&docID=9174). I want to comply with the DwC definition and populate scientificName only with a valid scientific name (meaning a name that was published and thus occurs in WoRMS; see the current guidelines at http://iobis.org/manual/darwincore/), but the question remains what to do when we receive spelling mistakes or names like the following:

Gadus?
cf. Gadus
Aphanizomenon and Oscillatoria
Cladocera/Ostracoda
Clauso/Cteno/Paracalanus
Sponges
Red algae
Cat shark
Common gull

We have good procedures for deciding which valid scientific name and WoRMS-ID to assign, as you can see in the guidelines above. They involve asking the provider what they meant and asking them to correct the name at their end, although in many cases this is not possible, e.g. because the specimens are gone and what is recorded is all that is available. However, even if we interpret something as a spelling variation and assign the correct intended name in 99% of the cases, it seems important to keep the original name so that one can figure out what went wrong in the remaining 1%. In some cases, assigning the scientific name also means losing some detail. Therefore, we would like to include the name that was provided to us in a dedicated field. As we use IPT to get the data to OBIS, this field needs to be in the occurrence core / extension because e.g. the identification extension is not compatible with Event Core. This dedicated field would/could be filled out for each record of each dataset. In many cases it would hold the same information as scientificName, but in the examples above it would not.

Some terms were suggested above. previousIdentifications: this seems the most promising term to me, but holding common names is probably not its intended use. Also, we did not examine the specimen (which is usually gone as well) and did not re-identify anything, so calling the name as we received it a "previous identification" seems strange, if not wrong.

vernacularName: I guess we could use it when the names provided are vernacular. However, the term does not convey that this is the original name under which the observed specimen was recorded in the dataset, nor that the scientificName is an (expert) interpretation based on it.

identificationRemarks: we could use this in each dataset and state something like "specimen originally recorded as: ...", but this doesn't seem ideal, especially as there may be actual remarks related to the specimen.

I would think this is an issue that all data managers working with ecological data face. How do others deal with it? I’m thinking we would need a term like “ScientificNameOriginallyRecordedAs” or “OriginallyRecordedAs”. The reason I first thought of Identification is that a generic term called "Identified" (meaning identifiedAs) would serve nicely too. Another option someone suggested is "scientificNameVerbatim".

ansell commented 6 years ago

I have so far used verbatimScientificName, matching the convention used by some other terms already in Darwin Core Terms.
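A minimal Python sketch of that approach, keeping the name exactly as received next to the interpreted one. The helper functions are hypothetical, the example values are illustrative, and verbatimScientificName is used here only as the working name mentioned above, not as a confirmed Darwin Core term.

```python
def build_record(verbatim_name, matched_name):
    """Build one occurrence record that preserves the original name.

    verbatim_name is stored untouched (no trimming, no case changes);
    matched_name is the valid name assigned during QC, or None if no
    match was possible.
    """
    return {
        "verbatimScientificName": verbatim_name,
        "scientificName": matched_name or "",
    }


def was_reinterpreted(record):
    """True when the assigned name differs from the name as received."""
    return record["verbatimScientificName"] != record["scientificName"]


record = build_record("Gadus?", "Gadus")
print(record)
print(was_reinterpreted(record))
```

Filtering on was_reinterpreted would give the subset of records worth a second look when checking the interpretation step.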