Closed jhpoelen closed 1 year ago
fyi @seltmann @daniel-mietchen
Hi @jhpoelen (and @seltmann @Daniel-Mietchen),
I've fixed this for CASTYPE1652. Material Examined assumes by default that [A-Z]+[0-9]+ can be parsed as dwc:institutionCode dwc:catalogNumber, which doesn't apply here as CASTYPE1652 is in GBIF as dwc:catalogNumber.
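A minimal sketch of the kind of rule described above, in Python. This is not the actual Material Examined code: the function name `parse_specimen_code` and the exception set `WHOLE_CODE_PREFIXES` are hypothetical, and the only behaviour taken from the thread is the default split of [A-Z]+[0-9]+ and the CASTYPE special case.

```python
import re

# Hypothetical exception list (assumption): prefixes that belong to the
# catalog number itself rather than naming an institution.
WHOLE_CODE_PREFIXES = {"CASTYPE"}

# Default pattern from the thread: letters followed by digits.
SPECIMEN_CODE = re.compile(r"([A-Z]+)([0-9]+)")


def parse_specimen_code(code):
    """Map a specimen code to Darwin Core terms.

    Returns None when the code does not match the letters+digits pattern.
    """
    match = SPECIMEN_CODE.fullmatch(code)
    if match is None:
        return None
    prefix, number = match.groups()
    if prefix in WHOLE_CODE_PREFIXES:
        # Special case: the whole string is the catalog number.
        return {"dwc:catalogNumber": code}
    # Default assumption: the prefix is the institution code.
    return {"dwc:institutionCode": prefix, "dwc:catalogNumber": number}


print(parse_specimen_code("CASTYPE1652"))
print(parse_specimen_code("ABC123"))
```

Keeping the exceptions in a separate set makes each special case visible as data, which matters when every collection cites its specimens a little differently.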
This is the fundamental problem any tool such as Material Examined faces. In the absence of citable persistent identifiers we are mapping inconsistent citation practices to inconsistent data storage practices (sigh).
Thanks for humoring me and putting a specific rule in your code to help resolve the 17k (or so) CASTYPE catalog numbers to their associated GBIF ids.
> This is the fundamental problem any tool such as Material Examined faces. In the absence of citable persistent identifiers we are mapping inconsistent citation practices to inconsistent data storage practices (sigh).
Today, I made a list of over 2 billion relations between gbif occurrence ids and their associated occurrenceId, institution code, collection code and catalog number using methods described in https://discourse.gbif.org/t/type-specimen-castype1652-found-via-filtered-query-https-doi-org-10-15468-dl-xf6ahb-but-not-in-open-access-gbif-data-product-https-doi-org-10-15468-dl-pk3trq/3884 (see also https://github.com/beehind/beehind.github.io/issues/5). Incidentally, I found that our CASTYPE1652 wasn't included in that list either. It must be Friday ; )
Here are the first 10 rows (header included) of the > 2 billion relations between gbifID and interpreted occurrenceID etc. as provided via https://doi.org/10.15468/dl.pk3trq [2]
gbifID | occurrenceID | institutionCode | collectionCode | catalogNumber |
---|---|---|---|---|
2997162320 | 3399442 | CEPEC | CEPEC | CEPEC00109669 |
2997162309 | 2733085 | CEPEC | CEPEC | CEPEC00000818 |
2997162317 | 2733086 | CEPEC | CEPEC | CEPEC00000888 |
2997162313 | 3399443 | CEPEC | CEPEC | CEPEC00109744 |
2997162306 | 2733087 | CEPEC | CEPEC | CEPEC00000889 |
2997162316 | 3399440 | CEPEC | CEPEC | CEPEC00109605 |
2997162324 | 2733088 | CEPEC | CEPEC | CEPEC00000890 |
2997162308 | 3399441 | CEPEC | CEPEC | CEPEC00109615 |
2997162303 | 2733089 | CEPEC | CEPEC | CEPEC00000891 |
with gbif ids having html landing pages (at least at time of writing) available at:
https://gbif.org/occurrence/2997162320 [1]
(see screenshot below)
where each of the rows can be cited using the associated content identifier (the hash of the lookup table) combined with some cursor (coordinate) in the cited resource.
e.g., line 2 in content with hash [some hash] would be described by shorthand line:hash://sha256/[some hash]/!L2
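To make the shorthand concrete, here is a rough Python sketch that builds such a line reference and resolves it again. The function names `line_reference` and `resolve_line` are made up for illustration, and the two-line tab-separated sample is an assumption about how the lookup table might be serialized; only the `line:hash://sha256/[some hash]/!L2` shape comes from the description above.

```python
import hashlib

PREFIX = "line:hash://sha256/"


def line_reference(content: bytes, line_number: int) -> str:
    """Build a reference to a 1-based line in content identified by its hash."""
    digest = hashlib.sha256(content).hexdigest()
    return f"{PREFIX}{digest}/!L{line_number}"


def resolve_line(content: bytes, reference: str) -> str:
    """Return the cited line, after checking the content matches the hash."""
    if not reference.startswith(PREFIX):
        raise ValueError("not a line:hash://sha256/ reference")
    digest, _, cursor = reference[len(PREFIX):].partition("/!L")
    if hashlib.sha256(content).hexdigest() != digest:
        raise ValueError("content does not match the cited hash")
    return content.decode("utf-8").splitlines()[int(cursor) - 1]


# Assumed serialization: header line plus the first data row, tab-separated.
table = (
    b"gbifID\toccurrenceID\tinstitutionCode\tcollectionCode\tcatalogNumber\n"
    b"2997162320\t3399442\tCEPEC\tCEPEC\tCEPEC00109669\n"
)
ref = line_reference(table, 2)
print(ref)
print(resolve_line(table, ref))
```

Because the hash is computed from the bytes of the table, any change to the table yields a different reference, so a citation can never silently point at altered content.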
Wouldn't that create the citable "persistent" identifier that you are looking for?
This, and your handy regexes, would at least narrow the candidates down from a couple of billion to a perhaps more manageable number of citable candidate gbif id - specimen associations.
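As an illustration of that narrowing step, the following Python sketch filters a lookup table for a specimen code. The function `find_candidates`, the tab-separated layout, and the two matching rules are assumptions made for this example; the sample rows are copied from the table above.

```python
import csv
import io

# Two sample rows from the table above, in an assumed tab-separated layout.
SAMPLE = """gbifID\toccurrenceID\tinstitutionCode\tcollectionCode\tcatalogNumber
2997162320\t3399442\tCEPEC\tCEPEC\tCEPEC00109669
2997162309\t2733085\tCEPEC\tCEPEC\tCEPEC00000818
"""


def find_candidates(table_text, specimen_code):
    """Return gbifIDs whose record could correspond to the cited code.

    Two hypothetical matching rules: the code equals the catalog number,
    or it equals institution code and catalog number joined together.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    hits = []
    for row in reader:
        candidates = (
            row["catalogNumber"],
            row["institutionCode"] + row["catalogNumber"],
        )
        if specimen_code in candidates:
            hits.append(row["gbifID"])
    return hits


print(find_candidates(SAMPLE, "CEPEC00109669"))
print(find_candidates(SAMPLE, "CASTYPE1652"))
```

For the full table of over 2 billion rows the same loop would need to stream from a file or a database index rather than an in-memory string, but the matching logic would stay the same.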
[1] Amorim A M A, Aguiar C I, Pessoa C (2023). CEPEC herbarium - Centro de Pesquisas do Cacau - Herbário Virtual REFLORA. Version 1.276. Instituto de Pesquisas Jardim Botanico do Rio de Janeiro. Occurrence dataset https://doi.org/10.15468/vg8rjh accessed via GBIF.org on 2023-03-24. https://www.gbif.org/occurrence/2997162320
[2] GBIF.org (01 March 2023) GBIF Occurrence Download https://doi.org/10.15468/dl.pk3trq
@jhpoelen In a word, "no".
@rdmpage Thanks for taking the time to answer my closed question.
As an aside, gbifID was created to accommodate numerous instances where there either isn't an occurrenceID or it's unhelpfully integer-based (guaranteed not to be unique across all datasets in GBIF) as in the table above. It is computed based on membership of an occurrence record in a particular dataset having a datasetKey. In other words, if an organization elects to republish exactly the same occurrence data via a different dataset vehicle (i.e. datasetKey will be different but all other data precisely the same), new gbifIDs will be issued and previous gbifIDs will produce a deprecated "fragment" when retrieved. While no gbifID will be reused, they can be revived from the same dataset from a cache at GBIF's end if the publisher revives their once-used occurrenceIDs in a subsequent republication of the same datasetKey.
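A toy Python model of the behaviour described above may help. It is emphatically not GBIF's implementation: the class `ToyRegistry`, its counter-based ids, and the choice of (datasetKey, occurrenceID) as the lookup key are assumptions that only mirror the three stated properties (ids depend on the dataset, ids are never reused, and a returning record gets its old id back).

```python
import itertools


class ToyRegistry:
    """Toy model only: issues ids per (datasetKey, occurrenceID) pair."""

    def __init__(self):
        self._next_id = itertools.count(1)
        # Never cleared, so a record that reappears gets its old id back.
        self._issued = {}

    def gbif_id(self, dataset_key, occurrence_id):
        key = (dataset_key, occurrence_id)
        if key not in self._issued:
            # Unseen pair: issue a fresh id that has not been used before.
            self._issued[key] = next(self._next_id)
        return self._issued[key]


registry = ToyRegistry()
first = registry.gbif_id("dataset-a", "3399442")
republished = registry.gbif_id("dataset-b", "3399442")  # same data, new dataset
revived = registry.gbif_id("dataset-a", "3399442")      # same dataset again

print(first, republished, revived)
```

In this model the same occurrenceID republished under a different datasetKey receives a different id, which is why a bare integer occurrenceID cannot serve as a stable cross-dataset identifier.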
I've of course got opinions on all this, but none will move the needle any closer to universal stability and resolution. We've been thrashing this issue for more than fifteen years.
@rdmpage Thanks again for sharing your materials examined prototype openly (for previous discussion see https://discourse.gbif.org/t/finding-a-gbif-occurrence-from-a-specimen-code/3852 ).
I am trying to use your tool in the context of https://beehind.org . Here, a type specimen CASTYPE1652 is used as an example. As you probably know, the gbif id associated with the gbif index is handy to have around (e.g., bionomia integration), so I was trying to develop a versioned translation table that associates the lingo that collections use (e.g., catalog numbers, collection code, institution code, occurrenceID) with the key into the GBIF data-verse (i.e., gbifID or gbif's own occurrence id).
I tried to shove "CASTYPE1652" into your webtool using:
https://material-examined.herokuapp.com/?q=CASTYPE1652
but alas, for some reason no samples were found.
I figured I am probably misusing your tool, so I was hoping you can help me understand why my query for the type specimen didn't produce any references into the GBIF data-verse.