rdmpage / material-examined

Linking specimen codes to identifiers

urn:catalog:CAS:TYPE:1652 not found #4

Closed jhpoelen closed 1 year ago

jhpoelen commented 1 year ago

@rdmpage Thanks again for sharing your Material Examined prototype openly (for previous discussion see https://discourse.gbif.org/t/finding-a-gbif-occurrence-from-a-specimen-code/3852 ).

I am trying to use your tool in the context of https://beehind.org . Here, a type specimen CASTYPE1652 is used as an example. As you probably know, the gbif id associated with a record in the gbif index is handy to have around (e.g., for bionomia integration). I was trying to develop a versioned translation table that associates the lingo that collections use (e.g., catalog numbers, collection code, institution code, occurrenceID) with the key into the GBIF data-verse (i.e., gbifID or gbif's own occurrence id).

I tried to shove "CASTYPE1652" into your webtool using:

https://material-examined.herokuapp.com/?q=CASTYPE1652

but alas, for some reason no samples were found.

I figured I am probably misusing your tool, so I was hoping you can help me understand why my query for the type specimen didn't produce any references into the GBIF data-verse.


jhpoelen commented 1 year ago

fyi @seltmann @daniel-mietchen

rdmpage commented 1 year ago

Hi @jhpoelen (and @seltmann @Daniel-Mietchen),

I've fixed this for CASTYPE1652. Material Examined assumes by default that [A-Z]+[0-9]+ can be parsed as dwc:institutionCode followed by dwc:catalogNumber, which doesn't apply here, as the whole string CASTYPE1652 is in GBIF as dwc:catalogNumber.
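For readers following along, the default rule and its exception might be sketched roughly like this. This is only an illustration in Python, not the code from the Material Examined repository; the names `parse_specimen_code` and `WHOLE_CODE_PREFIXES` are hypothetical.

```python
import re

# Hypothetical exception list: prefixes where the whole code is stored
# as dwc:catalogNumber rather than being split into two fields.
WHOLE_CODE_PREFIXES = {"CASTYPE"}

# Default assumption described above: letters followed by digits.
CODE_PATTERN = re.compile(r"^([A-Z]+)([0-9]+)$")


def parse_specimen_code(code):
    """Map a specimen code to Darwin Core fields, or None if it does not match."""
    code = code.strip()
    match = CODE_PATTERN.match(code)
    if match is None:
        return None
    prefix, number = match.groups()
    if prefix in WHOLE_CODE_PREFIXES:
        # Exception: the entire string is the catalog number.
        return {"catalogNumber": code}
    # Default: letters are the institution code, digits the catalog number.
    return {"institutionCode": prefix, "catalogNumber": number}


print(parse_specimen_code("CASTYPE1652"))
print(parse_specimen_code("MCZ12345"))
```

Each new citation habit needs another entry in the exception list, which is the maintenance burden described in the next paragraph.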

This is the fundamental problem any tool such as Material Examined faces. In the absence of citable persistent identifiers we are mapping inconsistent citation practices to inconsistent data storage practices (sigh).

rdmpage commented 1 year ago

Fixed in https://github.com/rdmpage/material-examined/commit/fa169772c589cb364b971ea44b1eabe886bd26ad

jhpoelen commented 1 year ago

Thanks for humoring me and putting a specific rule in your code to help resolve the 17k (or so) CASTYPE catalog numbers to their associated gbif ids.

> This is the fundamental problem any tool such as Material Examined faces. In the absence of citable persistent identifiers we are mapping inconsistent citation practices to inconsistent data storage practices (sigh).

Today, I made a list of over 2 billion relations between gbif occurrence ids and their associated occurrenceId, institution code, collection code and catalog number using methods described in https://discourse.gbif.org/t/type-specimen-castype1652-found-via-filtered-query-https-doi-org-10-15468-dl-xf6ahb-but-not-in-open-access-gbif-data-product-https-doi-org-10-15468-dl-pk3trq/3884 (see also https://github.com/beehind/beehind.github.io/issues/5). Incidentally, I found that our CASTYPE1652 wasn't included in that list either. It must be Friday ; )

Here are the first 10 rows (header included) of the more than 2 billion relations between gbifID and interpreted occurrenceID etc. as provided via https://doi.org/10.15468/dl.pk3trq [2]

gbifID occurrenceID institutionCode collectionCode catalogNumber
2997162320 3399442 CEPEC CEPEC CEPEC00109669
2997162309 2733085 CEPEC CEPEC CEPEC00000818
2997162317 2733086 CEPEC CEPEC CEPEC00000888
2997162313 3399443 CEPEC CEPEC CEPEC00109744
2997162306 2733087 CEPEC CEPEC CEPEC00000889
2997162316 3399440 CEPEC CEPEC CEPEC00109605
2997162324 2733088 CEPEC CEPEC CEPEC00000890
2997162308 3399441 CEPEC CEPEC CEPEC00109615
2997162303 2733089 CEPEC CEPEC CEPEC00000891

with gbif ids having html landing pages (at least at time of writing) available at:

https://gbif.org/occurrence/2997162320 [1]

(see screenshot below)

where each of the rows can be cited using the associated content identifier (the hash of the lookup table) combined with some cursor (coordinate) in the cited resource.

e.g., line 2 in content with hash [some hash] would be described by shorthand line:hash://sha256/[some hash]/!L2
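A minimal sketch of how such a line reference could be built, assuming the lookup table is available as bytes and that the shorthand is formed exactly as written above. The function names `line_citation` and `cited_line` are hypothetical, and the two-line sample is only a stand-in for the full table.

```python
import hashlib


def line_citation(content: bytes, line_number: int) -> str:
    """Build a line reference from the sha256 of the content and a 1-based line number."""
    digest = hashlib.sha256(content).hexdigest()
    # Shorthand form copied from the comment above; the exact syntax is an assumption.
    return f"line:hash://sha256/{digest}/!L{line_number}"


def cited_line(content: bytes, line_number: int) -> str:
    """Return the line that such a reference points at (1-based)."""
    return content.decode("utf-8").splitlines()[line_number - 1]


# Stand-in for the lookup table: header plus the first row shown above.
sample = (
    b"gbifID\toccurrenceID\tinstitutionCode\tcollectionCode\tcatalogNumber\n"
    b"2997162320\t3399442\tCEPEC\tCEPEC\tCEPEC00109669\n"
)
print(line_citation(sample, 2))
print(cited_line(sample, 2))
```

The reference only resolves for readers who can obtain the exact bytes that produced the hash, so any change to the table yields a different identifier.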

Wouldn't that create the citable "persistent" identifier that you are looking for?

This, and your handy regexes, would at least narrow the candidates down from a couple of billion to a perhaps more manageable number of citable candidate gbif id - specimen associations.
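As a rough illustration of that narrowing step, here is a sketch that scans a table with the columns shown above and keeps the rows whose catalogNumber fully matches a regular expression. It assumes the table is tab-separated; `find_candidates` and the two-row sample are hypothetical.

```python
import csv
import io
import re


def find_candidates(table_text, pattern):
    """Return (line number, gbifID) pairs whose catalogNumber fully matches pattern."""
    regex = re.compile(pattern)
    reader = csv.DictReader(
        io.StringIO(table_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    candidates = []
    # The header is line 1, so the first data row is line 2.
    for line_number, row in enumerate(reader, start=2):
        if regex.fullmatch(row.get("catalogNumber") or ""):
            candidates.append((line_number, row["gbifID"]))
    return candidates


# Stand-in for the lookup table: header plus the first two rows shown above.
SAMPLE = (
    "gbifID\toccurrenceID\tinstitutionCode\tcollectionCode\tcatalogNumber\n"
    "2997162320\t3399442\tCEPEC\tCEPEC\tCEPEC00109669\n"
    "2997162309\t2733085\tCEPEC\tCEPEC\tCEPEC00000818\n"
)
print(find_candidates(SAMPLE, r"CEPEC0*818"))
print(find_candidates(SAMPLE, r"CASTYPE1652"))
```

The returned line numbers can double as the cursor in the line reference described above.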

References

[1] Amorim A M A, Aguiar C I, Pessoa C (2023). CEPEC herbarium - Centro de Pesquisas do Cacau - Herbário Virtual REFLORA. Version 1.276. Instituto de Pesquisas Jardim Botanico do Rio de Janeiro. Occurrence dataset https://doi.org/10.15468/vg8rjh accessed via GBIF.org on 2023-03-24. https://www.gbif.org/occurrence/2997162320

[2] GBIF.org (01 March 2023) GBIF Occurrence Download https://doi.org/10.15468/dl.pk3trq

[image: screenshot of the GBIF occurrence landing page]

rdmpage commented 1 year ago

@jhpoelen In a word, "no".

jhpoelen commented 1 year ago

@rdmpage Thanks for taking the time to answer my closed question.

dshorthouse commented 1 year ago

As an aside, gbifID was created to accommodate numerous instances where there either isn't an occurrenceID or it's unhelpfully integer-based (guaranteed not to be unique across all datasets in GBIF) as in the table above. It is computed based on membership of an occurrence record in a particular dataset having a datasetKey. In other words, if an organization elects to republish exactly the same occurrence data via a different dataset vehicle (i.e. datasetKey will be different but all other data precisely the same), new gbifIDs will be issued and previous gbifIDs will produce a deprecated "fragment" when retrieved. While no gbifID will be reused, they can be revived from the same dataset from a cache at GBIF's end if the publisher revives their once used occurrenceIDs in a subsequent republication of the same datasetKey.

I've of course got opinions on all this, but none will move the needle any closer to universal stability and resolution. We've been thrashing this issue for more than fifteen years.