tdwg / dwc

Darwin Core standard for sharing of information about biological diversity.
https://dwc.tdwg.org
Creative Commons Attribution 4.0 International

New term - verbatimIdentification #181

Closed: Daphnisd closed this issue 3 years ago

Daphnisd commented 6 years ago

New term

Proposed new attributes of the term:

From Daphnis de Pooter (@Daphnisd): a term is needed to record how a taxon was originally recorded in an unprocessed dataset. Motivation: https://github.com/tdwg/dwc-qa/issues/109

mdoering commented 6 years ago

I would suggest reviewing all "verbatim" terms and coming up with a general strategy. In theory all terms can be interpreted, and there is a need to deal with both verbatim and interpreted/cleaned values. GBIF and ALA, for example, have a long list of interpreted terms. We decided against new terms, though, and instead use the same term in a different context.

I can't find the issue now, but it was also suggested to have a new rowType that could indicate verbatim values.

cgendreau commented 6 years ago

@mdoering it overlaps with discussion in https://github.com/gbif/occurrence/issues/24

mdoering commented 6 years ago

Thanks @cgendreau, that's exactly what I was looking for!

ansell commented 6 years ago

A new convention is my preference so far, per the discussion in https://github.com/gbif/occurrence/issues/24

claudenozeres commented 6 years ago

I agree with Daphnis that a new term and/or strategy is needed to make this more explicit. In the past, the practice for marine datasets was to publish a valid name interpreted from the original, despite the recommendation to submit the original (because it gets too messy, WoRMS can't always suggest matches for obvious names). The original (verbatim) information is then lost. Currently this is a challenge for me with specimen label names (although I imagine it is similar for observation records). So the matter of original names sometimes gets mixed in with issues of identification. What I need is to record the verbatim name. Displaying a valid scientificName comes after, because, as it appears in the ALA/GBIF discussion, this can be open to interpretation. Having verbatim as part of a history extension (rather than core) does seem fragile to me.

mdoering commented 6 years ago

I would argue DwC should not have any specific verbatim terms but rather recommend other ways of dealing with data provenance. Often we also have longer lineages with multiple steps that alter the content, so a single verbatim term is difficult to apply. For example, W3C offers a rather complete PROV Ontology, although we should probably look for something far simpler.

qgroom commented 6 years ago

I tend to agree with Markus. Essentially every field could have a verbatim term, and it would be better if we could chain versions of an observation together. My only doubt is that Darwin Core has to be kept reasonably simple, otherwise people will not use it. Therefore, I'm OK with maintaining some verbatim fields as long as there is a gold-standard way to handle these data.

mdoering commented 6 years ago

Well, we could also create a verbatim term for each dwc term. There is not really a restriction on the number of terms, just increased complexity. But if verbatim terms always have a prefix "verbatim", it's not adding much to the confusion. It might even help, because we could get the existing ones out of the way when presenting terms.

qgroom commented 6 years ago

As Markus points out, the problem is provenance. As data associated with an observation/specimen get amended, the chain of provenance is lost if you only have one verbatim field. If I understand Markus correctly, you could have no verbatim fields because every field would be verbatim, and you would link versions together to determine provenance.
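
As a rough illustration of what linking versions together could look like, here is a minimal Python sketch. The `RecordVersion` class, its fields, and the `history` helper are hypothetical names made up for this example; they are not part of Darwin Core or of any existing tool, and the sample values are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecordVersion:
    """One version of an occurrence record (hypothetical structure)."""
    terms: dict                                 # Darwin Core term name -> value
    agent: str                                  # who produced this version
    previous: Optional["RecordVersion"] = None  # the version this one was derived from


def history(version: RecordVersion, term: str) -> list:
    """Return the values of `term`, from the original version up to `version`."""
    values = []
    current = version
    while current is not None:
        values.append(current.terms.get(term))
        current = current.previous
    return list(reversed(values))


# Every version holds plain Darwin Core terms; no verbatim* term is needed.
original = RecordVersion({"scientificName": "Harpactira sp."}, agent="collector")
cleaned = RecordVersion({"scientificName": "Harpactira"}, agent="aggregator", previous=original)

print(history(cleaned, "scientificName"))
```

In this sketch the "verbatim" value is simply whatever the first version in the chain holds, so the same lookup works for any term and for any number of intermediate edits.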

debpaul commented 6 years ago

From a data-mining standpoint, we need verbatim data to do things like automatically find matching references between a dwc record and an old publication in BHL. If, for example, the verbatim locality and verbatim taxon name are not shared, then it will be much more difficult for computer algorithms to make the connections between the two datasets. I'm not sure it matters what we call it as long as it's clear that it's the "original text" in this case, as found in or on the label / field notebook / ledger. So it seems you are all saying we could / ought to use Identification History (and other such extensions) to share this type of information? What about verbatimCountry? This comes up all the time.

Daphnisd commented 6 years ago

I don't think using a separate extension for this is an option for us, as it would not be compatible with event core in IPT.

ansell commented 6 years ago

Adding a single "verbatim" extension to a Darwin Core Archive isn't going to satisfy every use case if provenance over time is required, but those use cases also won't be satisfied by a single verbatimScientificName field.

In a possibly more serious provenance case, the ALA created an issue for itself, GBIF, and the community a number of years ago with its choice to overwrite the original occurrenceID obtained from scientists with an internal opaque GUID when sending this data to GBIF, but still shows the original occurrenceID on ALA websites/downloads and stores that in the ALA datasets. I have been told by the person who made that decision that it should/must not be fixed (for various reasons). However, without a standard way to express the verbatimOccurrenceID, I also can't provide any workarounds to enable the original data to pass through unhindered.

Having a standardised way of providing one or more verbatim or historical Darwin Core Archive extension files would allow users to optionally read what the original author provided, or read what other evolutions of the record contained. The current GBIF-only convention only allows for a single verbatim extension based on a static file name, which won't work for historical contexts where you want to track evolution of a dataset over time. Having an accepted convention that uses metadata rather than file names, whether it is based on the (overly complex) W3C PROV vocabulary, or another system, is essential to me for providing a workaround for the ALA occurrenceID mistake in future, which will (likely already has) hit some users just as badly as rewrites of scientificName to use the taxonomy or merged taxonomies which are currently accepted by a particular organisation.

I don't agree that we should add more verbatim terms to Darwin Core Terms solely to satisfy existing systems that aren't designed for a "verbatim extensions" model that we haven't developed yet. However, given the verbatim prefix already exists in Darwin Core Terms, it wouldn't be creating a new convention, just continuing an old convention, to create verbatimOccurrenceID and/or verbatimScientificName.

mdoering commented 6 years ago

If the old convention is continued, how bad would it be to create a verbatim term for every term in Darwin Core? At least we would have a consistent model then.

peterdesmet commented 6 years ago

Couldn't this be done with a dwcverbatim: namespace?
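
To make the namespace idea concrete, here is a minimal Python sketch. `DWC_NS` is the existing Darwin Core terms namespace; `DWCVERBATIM_NS` and the `to_iris` helper are assumptions invented for this example (no such namespace is defined anywhere), and the values are illustrative only.

```python
DWC_NS = "http://rs.tdwg.org/dwc/terms/"            # existing Darwin Core terms namespace
DWCVERBATIM_NS = "http://example.org/dwcverbatim/"  # hypothetical; no such namespace is defined


def to_iris(interpreted: dict, verbatim: dict) -> dict:
    """Key each value by a full term IRI, keeping verbatim values in their own namespace."""
    iris = {DWC_NS + term: value for term, value in interpreted.items()}
    iris.update({DWCVERBATIM_NS + term: value for term, value in verbatim.items()})
    return iris


# The same local name (scientificName) is reused; only the namespace differs.
row = to_iris(
    interpreted={"scientificName": "Harpactira"},
    verbatim={"scientificName": "Harpactira sp."},
)
for iri, value in row.items():
    print(iri, "=", value)
```

With this approach no new term names are needed: every existing term gets a verbatim counterpart by changing the prefix alone.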

baskaufs commented 5 years ago

I have suggested an approach for recording verbatim information involving the W3C SKOS-XL standard in the issue tdwg/tag#22. The actual process of getting from a provided verbatim string to full metadata associated with SKOS-XL instances is fleshed out more in my comment on TNC Issue 24.

ianengelbrecht commented 5 years ago

Could I suggest that a strategy for verbatim terms be created as a separate GitHub issue? Returning to the request for dwc:verbatimScientificName itself, this would be useful. The documentation for dwc:scientificName says 'This term should not contain identification qualifications, which should instead be supplied in the IdentificationQualifier term' (although the examples do include a case with an identification qualifier). The BDQ TG2 tests and assertions include TG2-VALIDATION_POLYNOMIAL_NOTSTANDARD, which as currently defined will return NOT_COMPLIANT for any dwc:scientificName values that include a qualifier. We should also be able to represent identifications such as 'Harpactira sp.' in our datasets, and we also have the case of informal names for undescribed species, such as Harpactira sp. 'blue', manuscript names, etc.

ianengelbrecht commented 5 years ago

> In a possibly more serious provenance case, the ALA created an issue for itself, GBIF, and the community a number of years ago with its choice to overwrite the original occurrenceID obtained from scientists with an internal opaque GUID when sending this data to GBIF, but still shows the original occurrenceID on ALA websites/downloads and stores that in the ALA datasets. I have been told by the person who made that decision that it should/must not be fixed (for various reasons). However, without a standard way to express the verbatimOccurrenceID, I also can't provide any workarounds to enable the original data to pass through unhindered.

@ansell it seems that the practice of creating or overwriting GUIDs is a pervasive problem, probably resulting from a misunderstanding of the purpose of GUIDs in the first place. IMO overwriting dwc:occurrenceID is a misapplication of the standard. Should we modify the standard to cope with its misapplication? Not a route I would advocate for.

ianengelbrecht commented 4 years ago

I see there is a verbatimScientificName field, and an accompanying verbatimScientificNameAuthorship field, in a dataset I just downloaded from GBIF.

tucotuco commented 4 years ago

> I see there is a verbatimScientificName field, and an accompanying verbatimScientificNameAuthorship field, in a dataset I just downloaded from GBIF.

Those must be the dwc:scientificName and dwc:scientificNameAuthorship data from the originally published source.

I am reviewing all existing Darwin Core issues to try to move them forward or abandon them as the Vocabulary Maintenance Specification demands. This particular issue had a lot of activity, and in the meantime the community has apparently arrived at practical solutions.

I would like to establish if there is still demand for a new term dwc:verbatimScientificName. If there is, someone please follow the process and provide evidence of demand from at least two independent parties and a term definition following the template provided in Guidelines for contributing.

Observation: I think this term would be best organized in the Identification class and have a name that explicitly makes the role of the name apparent, such as "verbatimIdentification".

qgroom commented 4 years ago

> Observation: I think this term would be best organized in the Identification class and have a name that explicitly makes the role of the name apparent, such as "verbatimIdentification".

I agree. This issue was part of the inspiration for the discussion on verbatim data we wrote in the publication below. We concluded that versioning was a much better approach.

Quentin Groom, Mathias Dillen, Helen Hardy, Sarah Phillips, Luc Willemse, Zhengzhe Wu, Improved standardization of transcribed digital specimen data, Database, Volume 2019, 2019, baz129, https://doi.org/10.1093/database/baz129

nielsklazenga commented 3 years ago

> Observation: I think this term would be best organized in the Identification class and have a name that explicitly makes the role of the name apparent, such as "verbatimIdentification".

https://github.com/tdwg/tnc/issues/24#issuecomment-445459825

dimus commented 3 years ago

dwc:verbatimScientificName to me imposes only one constraint: that this name-string was used to point to a biological OTU or specimen. Users would be free to interpret the string according to their needs. In my case, this field would be parsed and classified into different categories by https://gitlab.com/gogna/gnparser and would be perfect for a wide variety of tasks.

tucotuco commented 3 years ago

I have changed the title of the issue and prepended a templated term change request to the original comment so as not to have to make a separate issue and relate it to the discussion in this one. Help is needed to know what the equivalent XPATH is in ABCD, if any.

nielsklazenga commented 3 years ago

@tucotuco, there is no equivalent for this term in ABCD 2.06.

tucotuco commented 3 years ago

Thank you @nielsklazenga. Term definition updated and ready to be prepared for public comment.

afuchs1 commented 3 years ago

The Australasian Herbarium Information Systems Committee (HISCOM) endorses the addition of this term to Darwin Core, but proposes to add to the usage notes that verbatimIdentification is best used in addition to scientificName (and identificationQualifier etc.), not instead of it.
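
A minimal Python sketch of how a publisher might check for this "in addition to, not instead of" usage. The `missing_interpreted_terms` helper and the `INTERPRETED_TERMS` tuple are hypothetical names for this example, the choice of which terms to require is an assumption, and the record values are illustrative only.

```python
# Terms expected to accompany a verbatim identification (an assumption for this sketch).
INTERPRETED_TERMS = ("scientificName",)


def missing_interpreted_terms(record: dict) -> list:
    """List interpreted terms left empty although verbatimIdentification is filled in."""
    if not record.get("verbatimIdentification"):
        return []
    return [term for term in INTERPRETED_TERMS if not record.get(term)]


# The verbatim string is kept as written; the interpreted terms sit next to it.
complete = {
    "verbatimIdentification": "Quercus aff. agrifolia",
    "scientificName": "Quercus agrifolia",
    "identificationQualifier": "aff. agrifolia",
}
verbatim_only = {"verbatimIdentification": "Quercus aff. agrifolia"}

print(missing_interpreted_terms(complete))
print(missing_interpreted_terms(verbatim_only))
```

The check treats identificationQualifier as optional, since many identifications carry no qualifier at all.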

tucotuco commented 3 years ago

@afuchs1 That seems a perfectly reasonable amendment to me. If there is no conflicting view, I will add it to the final usage comment. In the meantime, I have put a link to your suggestion in the usage section of the first comment.

hollyel commented 3 years ago

This term will be useful to the paleo collections community for expressing original IDs and the full extent of our knowledge despite nomenclatural uncertainty (e.g., "Genus sp. nov. 1" as illustrated by one of the existing examples). At least with our current systems, this kind of uncertainty and complexity can lead to unexpected results when our data go to aggregators and get matched to taxonomic backbones. - Holly Little, Erica Krimmel (@ekrimmel), and Talia Karim (@tkarim) (on behalf of the Paleo Data Working Group)

EstebanMH-SiB commented 3 years ago

We endorse this proposal on behalf of @SiBColombia

tucotuco commented 3 years ago

Done.