[traits.build workflow] Add field for specimen identifiers

ehwenk commented 1 month ago

Add field(s) to map in specimen identifiers, such as for trait data linked to herbarium vouchers or trait data where the same specimen/individual is measured in multiple datasets.

We need to consult with ALA / GBIF to ensure we include the field(s) most widely used across global biodiversity databases. We may need two fields: a generic one for the case of "same individual measured in different datasets" and a second, more formal one for herbarium vouchers.

ehwenk commented 1 month ago

One of our immediate aims is to add column(s) to the traits.build database structure that allow trait observations to be linked to herbarium records, or to each other when a dataset collector has a unique record number that links trait observations across multiple datasets.

We want to be fully compliant with the DwC standard, but also minimise the number of additional fields we add to traits.build, especially as these fields will be blank for the majority of datasets.

Looking through DwC, it seems there are two distinct types of "identifiers" that probably need to be added:

A record number for casual links between observations. These are record numbers that link across datasets but aren’t GUIDs. We’d most likely use dwc:recordNumber, defined as “An identifier given to the dwc:Occurrence at the time it was recorded. Often serves as a link between field notes and a dwc:Occurrence record, such as a specimen collector's number.”
An identifier that links to ALL herbarium vouchers, GBIF, etc. This will be either dwc:occurrenceID or dwc:catalogNumber. I think occurrenceID should already incorporate codes for the specific herbarium/collection, while catalogNumber would require that we also have columns for herbarium/institution (https://dwc.tdwg.org/list/#dwc_institutionCode) and maybe other details. On the other hand, within ALA, while the occurrenceID is part of the URL, it isn’t actually reported on the page.

dwc:occurrenceID (An identifier for the dwc:Occurrence (as opposed to a particular digital record of the dwc:Occurrence). In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the dwc:occurrenceID globally unique.)
dwc:catalogNumber (An identifier (preferably unique) for the record within the data set or collection.)
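To make the "construct one from a combination of identifiers" part of the occurrenceID definition concrete, here is a minimal Python sketch. The function name, the `urn:catalog:` prefix, the colon-separated layout and the example codes are all assumptions for illustration; nothing here has been agreed for traits.build.

```python
from typing import Optional


def fallback_occurrence_id(
    institution_code: str,
    catalog_number: str,
    collection_code: Optional[str] = None,
) -> str:
    """Build a stand-in occurrenceID from other identifiers in the record.

    Assumption: the "urn:catalog:" prefix and colon-separated layout are one
    possible convention, not a format this thread has settled on.
    """
    if not institution_code.strip() or not catalog_number.strip():
        raise ValueError("need both an institution code and a catalog number")
    parts = [institution_code, collection_code, catalog_number]
    # Drop a missing collection code and strip stray whitespace before joining.
    cleaned = [p.strip() for p in parts if p and p.strip()]
    return "urn:catalog:" + ":".join(cleaned)


# Hypothetical herbarium voucher; these codes are invented for illustration.
example_id = fallback_occurrence_id("HERB", "123456", collection_code="Plants")
print(example_id)
```

A constructed value like this is only as unique as its parts, so a persistent GUID supplied by the herbarium or aggregator would still be preferable where one exists.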

I don't think the two identifier categories can be merged without diverging from the DwC meaning of each.

As an example, see this record in ALA and in GBIF:

https://biocache.ala.org.au/occurrences/60455440-c777-43d9-9cc0-19354cbc8403

https://www.gbif.org/occurrence/2430993462

The AusTraits team set out with the goal of changing traits.build as little as possible, but before we do this I think we should consider whether there are any other “occurrence” metadata fields we should add at the same time. At the moment we don’t explicitly include the concept of an “occurrence” in the traits.build structure. It is implicit via observationID and an observation's geographic location (latitude/longitude): to observe an organism in a location, on a date, it must have occurred there.

A few relevant references:

Nelson G, Sweeney P, Gilbert E (2018) Use of globally unique identifiers (GUIDs) to link herbarium specimen records to physical specimens. Applications in Plant Sciences 6, e1027. doi:10.1002/aps3.1027.

Folk RA, Siniscalchi CM (2021) Biodiversity at the global scale: the synthesis continues. American Journal of Botany 108, 912–924. doi:10.1002/ajb2.1694.

ehwenk commented 2 weeks ago

Further research suggests dwc:institutionCode will also be required to uniquely link to observations/collections in the ALA, GBIF, and other collections. For instance, for the Australian Museum, the catalog number does not include the institution code.
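A small Python sketch of why the catalog number alone may not be enough. The records, institution codes and catalog numbers below are invented for illustration; the point is only that pairing dwc:institutionCode with dwc:catalogNumber removes a collision that the catalog number alone cannot.

```python
from collections import defaultdict

# Hypothetical records: two institutions that happen to reuse a catalog number.
records = [
    {"institutionCode": "INST_A", "catalogNumber": "12345", "taxon": "Species one"},
    {"institutionCode": "INST_B", "catalogNumber": "12345", "taxon": "Species two"},
    {"institutionCode": "INST_A", "catalogNumber": "67890", "taxon": "Species three"},
]

by_catalog = defaultdict(list)
by_pair = defaultdict(list)
for rec in records:
    # Index each record by catalog number alone and by the composite key.
    by_catalog[rec["catalogNumber"]].append(rec)
    by_pair[(rec["institutionCode"], rec["catalogNumber"])].append(rec)

# Keys that point at more than one record cannot be used as a unique link.
ambiguous_catalog = sorted(k for k, v in by_catalog.items() if len(v) > 1)
ambiguous_pair = sorted(k for k, v in by_pair.items() if len(v) > 1)

print(ambiguous_catalog)  # catalog numbers shared by more than one record
print(ambiguous_pair)     # (institution, catalog) pairs shared by more than one record
```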

ehwenk commented 1 week ago

Darwin Core also has a field, dwc:associatedSequences, which allows one to link to one or more identifiers for genetic sequence information. This is a new Darwin Core addition as part of their MaterialEntity class.

ehwenk commented 1 week ago

Further thoughts with @dfalster:

Identifiers will be in a separate relational table, linked back to the traits table via observation_id.
The identifiers table will be in long format, with columns observation_id, identifier, identifier_value and identifier_comments.
All identifiers used will come from a controlled vocabulary (specified in schema), but this can include many of the various identifiers used by other biodiversity databases, genomics aggregators, etc.
All identifiers will be terms that are formally defined, generally in Darwin Core, but perhaps on occasion in other vocabularies.
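A rough Python sketch of how such a long-format table could look and be joined back to traits. The four column names come from the list above; the vocabulary contents, the example rows, the trait columns and the helper function names are assumptions for illustration only, and traits.build itself may well implement this differently.

```python
from collections import defaultdict

# Assumed controlled vocabulary; in practice this would be specified in the schema.
ALLOWED_IDENTIFIERS = {
    "dwc:recordNumber",
    "dwc:occurrenceID",
    "dwc:catalogNumber",
    "dwc:institutionCode",
    "dwc:associatedSequences",
}

# Hypothetical rows of the traits table.
traits = [
    {"observation_id": "001", "trait_name": "leaf_area", "value": "12.3"},
    {"observation_id": "002", "trait_name": "leaf_area", "value": "8.1"},
]

# Long format: one row per (observation, identifier term).
identifiers = [
    {"observation_id": "001", "identifier": "dwc:institutionCode",
     "identifier_value": "HERB", "identifier_comments": ""},
    {"observation_id": "001", "identifier": "dwc:catalogNumber",
     "identifier_value": "123456", "identifier_comments": "voucher sheet"},
]


def validate_identifiers(rows):
    """Raise if any row uses a term outside the controlled vocabulary."""
    unknown = sorted({r["identifier"] for r in rows} - ALLOWED_IDENTIFIERS)
    if unknown:
        raise ValueError(f"identifiers not in controlled vocabulary: {unknown}")


def identifiers_by_observation(rows):
    """Group values by observation_id, then by identifier term.

    Values are kept in lists because a term may occur more than once
    for the same observation.
    """
    grouped = defaultdict(dict)
    for r in rows:
        grouped[r["observation_id"]].setdefault(r["identifier"], []).append(
            r["identifier_value"]
        )
    return dict(grouped)


validate_identifiers(identifiers)
lookup = identifiers_by_observation(identifiers)

# Left join: every trait row is kept, with an empty dict when nothing is linked.
linked = [dict(t, identifiers=lookup.get(t["observation_id"], {})) for t in traits]
for row in linked:
    print(row["observation_id"], row["identifiers"])
```

Supporting a new identifier then only means adding a term to the vocabulary, with no new columns in either table.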

This will be easy to implement and has the advantages that:

We won't be adding many columns to the traits table.
We can implement this enhancement without worrying that we have missed a necessary identifier, or not yet consulted every traits.build user or biodiversity portal manager, and will therefore have to change the traits.build structure repeatedly.

traitecoevo / traits.build

[traits.build workflow] Add field for specimen identifiers #167