Closed whitfarnum closed 1 year ago
Hi @whitfarnum, good to hear from you, and thanks for your detailed examples of ways to make Nomer a little more useful for your work.
Just curious - how do you typically use Nomer?
@jhpoelen I use it as an aid to our inventory and curation process:
1) Go through the cabinets, drawers, and trays one by one and write down all the names that appear. Also write down the location in the collection, the number of specimens, and the number of unit trays.
2) Create a list of species names from the inventory and feed it into Nomer.
3) Put the Nomer results into a Python program that updates the names if no conflicts about the current name exist.
4) Deal manually with the names that had no results or multiple results.
5) Update higher taxonomy for the species.
6) Print unit tray labels and curate the group.
I am currently working on Scarabaeidae, and Catalogue of Life has the best online resource for that, so I am only searching Catalogue of Life. I am looking at the Catalogue of Life API to fill in higher taxonomy, but it feels like it is still under development.
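Steps 3 and 4 of the workflow above could be sketched roughly as follows. This is only an illustration, not the actual program used here: the column names `providedName` and `alignedName` are assumptions about the export, and the "Examplus" names are made-up placeholders.

```python
import csv
import io
from collections import defaultdict

# Assumed column names; adjust them to match the actual Nomer export.
PROVIDED = "providedName"
ALIGNED = "alignedName"


def triage(rows):
    """Sort provided names into: safe to update, no match, multiple matches."""
    matches = defaultdict(set)
    for row in rows:
        bucket = matches[row[PROVIDED]]  # registers the name even without a match
        aligned = (row.get(ALIGNED) or "").strip()
        if aligned:
            bucket.add(aligned)

    updates, unmatched, ambiguous = {}, [], {}
    for name, aligned in matches.items():
        if len(aligned) == 1:
            updates[name] = next(iter(aligned))  # exactly one candidate: update
        elif not aligned:
            unmatched.append(name)               # no result: review manually
        else:
            ambiguous[name] = sorted(aligned)    # conflicting results: review manually
    return updates, unmatched, ambiguous


# Tiny in-memory sample; the "Examplus" rows are placeholders, not real taxa.
SAMPLE = (
    "providedName\talignedName\n"
    "Anisoplia baetica\tAnisoplia (Anisoplia) baetica\n"
    "Examplus unmatched\t\n"
    "Examplus ambiguus\tExamplus primus\n"
    "Examplus ambiguus\tExamplus secundus\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
updates, unmatched, ambiguous = triage(rows)
```

Only names with exactly one aligned candidate are updated automatically; everything else is set aside for manual review, which mirrors step 4.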
@whitfarnum Thanks for detailing your workflow. Hoping to do what I can to add the additional taxon ranks as separate fields sooner rather than later. Thanks for being patient.
@whitfarnum I've expanded the alignment schema to include subfamily, tribe and subtribe as you suggested. In running your example Aegialia arenaria, I found that expected values appeared when matching against ITIS / NCBI. For report see https://github.com/globalbioticinteractions/name-alignment-template/actions/runs/5682620836 and attached . For example snippet from report, see below.
Can you confirm?
alignedFamilyName | alignedFamilyId | alignedSubfamilyName | alignedSubfamilyId | alignedTribeName | alignedTribeId | alignedSubtribeName | alignedSubtribeId | alignedGenusName | alignedGenusId | alignedSubgenusName | alignedSubgenusId | alignedSpeciesName | alignedSpeciesId |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Scarabaeidae | ITIS:114493 | Aphodiinae | ITIS:678499 | Aegialiini | ITIS:926254 |  |  | Aegialia | ITIS:926301 | Aegialia (Aegialia) | ITIS:926677 | Aegialia arenaria | ITIS:926708 |
Scarabaeidae | NCBI:7055 | Aphodiinae | NCBI:166306 |  |  |  |  | Aegialia | NCBI:206942 |  |  | Aegialia arenaria | NCBI:206943 |
@jhpoelen I will try this out when I finish the group I am currently inventorying. Probably next week.
@whitfarnum great! Looking forward to hearing your notes.
@jhpoelen I ran a test, and the higher taxonomy with subfamily and tribe is exactly what I want. I encountered a bug where the alignedExternalId from Catalogue of Life gives a 404 error. It happened for the first two URLs I spot-checked, but it did not repeat on the other ones. A list of the results is below. The frequency is low, but getting the wrong external ID should be looked at.
Adoretosoma elegans Nomer: alignedExternalId: https://www.catalogueoflife.org/data/taxon/64SCW gives a 404 error
Search catalog of life for Adoretosoma elegans url: https://www.catalogueoflife.org/data/taxon/9JQSG
Anisoplia baetica Nomer: alignedExternalId: https://www.catalogueoflife.org/data/taxon/7RCD4 gives a 404 error catalog of life: https://www.catalogueoflife.org/data/taxon/9JRX8
Paracotalpa ursina Works Nomer: alignedExternalId: https://www.catalogueoflife.org/data/taxon/75NC5
Lagochile trigona Works Nomer: alignedExternalId: https://www.catalogueoflife.org/data/taxon/6NV7R
Strigoderma arboricola Works Nomer: alignedExternalId: https://www.catalogueoflife.org/data/taxon/5322V
Chlorota aulica Works Nomer: alignedExternalId: https://www.catalogueoflife.org/data/taxon/5XZ78 globi_Scarab_Rutelinae.csv
Adoretus semperi Works https://www.catalogueoflife.org/data/taxon/8VR2D globi_Scarab_Rutelinae.csv
I have some numbers. I submitted ~650 names. 550 of the names matched in Catalogue of Life. I then tested the URL associated with each name. ~13% of the names gave a 404 error. The results file is attached.
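A URL check like the one described above could be sketched as follows. This is a minimal illustration rather than the script that produced these numbers; it assumes the portal answers a plain GET with a real 404 status for unknown taxon ids, and the function names are hypothetical.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 for a taxon id that no longer resolves
    except urllib.error.URLError:
        return None      # DNS failure, timeout, refused connection, ...


def find_broken(urls, get_status=http_status):
    """Return the URLs whose status is 404, preserving input order."""
    return [url for url in urls if get_status(url) == 404]


# Offline demonstration with a stub that replays two results reported in this thread.
REPORTED = {
    "https://www.catalogueoflife.org/data/taxon/64SCW": 404,
    "https://www.catalogueoflife.org/data/taxon/75NC5": 200,
}
broken = find_broken(list(REPORTED), get_status=REPORTED.get)
```

Passing `get_status` as a parameter keeps the check testable without network access; calling `find_broken(urls)` with the default performs the real requests.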
@whitfarnum thanks for sharing your detailed notes.
As I might have told you, Nomer uses a versioned copy of Catalogue of Life (i.e., accessed on 2022-09-09T20:05:22.601Z at https://download.catalogueoflife.org/col/latest_coldp.zip with signature hash://sha256/9ac28297a996e02f6026c40d24e67f59f7f39d495bb45759ebc4adb475d51f63 or hash://md5/ce89c200aab5be1b439647c1ac72813f as part of [1]). In this versioned (and signed) copy, I was able to locate the taxon ids that Catalogue of Life appears to have forgotten in their more recent version.
Adoretosoma elegans Nomer: alignedExternalId: https://www.catalogueoflife.org/data/taxon/64SCW gives a 404 error catalog of life (accessed on 2023-07-28?) - https://www.catalogueoflife.org/data/taxon/9JQSG
Anisoplia baetica Nomer: alignedExternalId: https://www.catalogueoflife.org/data/taxon/7RCD4 gives a 404 error catalog of life (accessed on 2023-07-28?): https://www.catalogueoflife.org/data/taxon/9JRX8
preston cat \
  --remote https://linker.bio/,https://zenodo.org/record/8125362/files/ \
  'zip:hash://md5/ce89c200aab5be1b439647c1ac72813f!/NameUsage.tsv' \
  | mlr --tsvlite filter '${col:ID} == "64SCW" || ${col:ID} == "9JQSG" || ${col:ID} == "7RCD4" || ${col:ID} == "9JRX8"' \
  > suspicious-taxa-with-404s-or-replacement-id.tsv
See results attached and below in markdown table.
Note that, in the versioned copy of Catalogue of Life, the taxon ids 64SCW and 7RCD4 were found, but the ones you found via the current Catalogue of Life web interface (9JQSG, 9JRX8) do not appear in the older versioned copy that is part of the Nomer Corpus of Taxonomic Resources.
This suggests that Catalogue of Life "forgot" or stopped using / redirecting at least 2 previously issued taxon ids.
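For readers without mlr, the same lookup can be sketched in Python. This is an illustrative equivalent of the filter step only, assuming a tab-separated NameUsage file with a `col:ID` column; the two-column sample stands in for the real file, and the function names are hypothetical.

```python
import csv
import io

WANTED = {"64SCW", "9JQSG", "7RCD4", "9JRX8"}


def select_ids(tsv_lines, wanted_ids, id_column="col:ID"):
    """Return the rows whose id column is one of wanted_ids."""
    reader = csv.DictReader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row[id_column] in wanted_ids]


def missing_ids(rows, wanted_ids, id_column="col:ID"):
    """Return the wanted ids that do not occur in the selected rows."""
    return set(wanted_ids) - {row[id_column] for row in rows}


# Trimmed stand-in for the 2022 NameUsage.tsv (only two of its columns).
SAMPLE_2022 = (
    "col:ID\tcol:scientificName\n"
    "64SCW\tAdoretosoma elegans\n"
    "7RCD4\tAnisoplia (Anisoplia) baetica\n"
)

found = select_ids(io.StringIO(SAMPLE_2022), WANTED)
absent = missing_ids(found, WANTED)
```

Running the same two functions against a newer release makes it easy to see which ids disappeared and which ones are new.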
suspicious-taxa-with-404s-or-replacement-id.tsv.txt
col:ID | col:sourceID | col:parentID | col:basionymID | col:status | col:scientificName | col:authorship | col:rank | col:notho | col:uninomial | col:genericName | col:infragenericEpithet | col:specificEpithet | col:infraspecificEpithet | col:cultivarEpithet | col:namePhrase | col:nameReferenceID | col:publishedInYear | col:publishedInPage | col:publishedInPageLink | col:code | col:nameStatus | col:accordingToID | col:accordingToPage | col:accordingToPageLink | col:referenceID | col:scrutinizer | col:scrutinizerID | col:scrutinizerDate | col:extinct | col:temporalRangeStart | col:temporalRangeEnd | col:environment | col:species | col:section | col:subgenus | col:genus | col:subtribe | col:tribe | col:subfamily | col:family | col:superfamily | col:suborder | col:order | col:subclass | col:class | col:subphylum | col:phylum | col:kingdom | col:sequenceIndex | col:branchLength | col:link | col:nameRemarks | col:remarks |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
64SCW | 1027 | PCW | accepted | Adoretosoma elegans | Blanchard, 1850 | species | Adoretosoma | elegans | b09376b7-bdbc-4867-9aaa-96b53201d60b | 234 | zoological | b09376b7-bdbc-4867-9aaa-96b53201d60b | false | ||||||||||||||||||||||||||||||||||||||||
7RCD4 | 1027 | 7Q75V | accepted | Anisoplia (Anisoplia) baetica | Erichson, 1848 | species | Anisoplia | Anisoplia | baetica | da86979e-91c4-44ed-a8cf-ee2eef2c9f94 | 636 | zoological | da86979e-91c4-44ed-a8cf-ee2eef2c9f94 | false |
[1] Poelen, Jorrit H. (2023). Nomer Corpus of Taxonomic Resources hash://sha256/0e9bc57bc082b58a2c7a509bb73362b258ec8ddfc6664898e25c639786413fda hash://md5/91dd844e787ffae8f0a2bbb8c1f29192 (0.16) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.8125362
Now, I wonder if the Catalogue of Life team can share what happened to taxon ids 64SCW and 7RCD4 after 2022-09-09T20:05:22.601Z. Also, I wonder what their policy is on retiring taxonomic identifiers, and whether they intend their col:ID to be used as "stable" taxonomic identifiers for external use.
This seems especially relevant because GBIF is moving to adopt the Catalogue of Life as their backbone taxonomy.
from https://www.gbif.org/publisher/f4ce3c03-7b38-445e-86e6-5f6b04b649d4 -
Description In June 2001 the Species 2000 and ITIS organisations, that had previously worked separately, decided to work together to create the Catalogue of Life, now estimated at 1.9 million species (Chapman, 2009). The two organisations remain separate and different in structure. However, by working together in creating a common product, the partnership has enabled them to reduce duplication of effort, make better use of resources, and to accelerate production. The combined Annual Checklist has become well established as a cited reference used for data compilation and comparison. For instance, it is used as the principal taxonomic index in the GBIF and EoL data portals and recognised by the CBD.
@mdoering @gdower Can you please help us understand the policy on taxonomic ids issued by Catalogue of Life?
Note, by the way, that an identical query to https://github.com/globalbioticinteractions/nomer/issues/159#issuecomment-1655947728 applied to the current version of Catalogue of Life yielded only the "newer" taxonomic identifiers, and the older ones were not to be found. I am probably looking in the wrong location, so I'd appreciate any insight that you may have on the internals of Catalogue of Life.
preston cat \
  'zip:hash://sha256/ed955cdf758cab5e0bc21e8aecefac31166d4ae67f6954fd44785f6a537144bd!/NameUsage.tsv' \
  | mlr --tsvlite filter '${col:ID} == "64SCW" || ${col:ID} == "9JQSG" || ${col:ID} == "7RCD4" || ${col:ID} == "9JRX8"' \
  > 2023-07-28-col-suspicious-taxa-with-404s-or-replacement-id.tsv.txt
2023-07-28-col-suspicious-taxa-with-404s-or-replacement-id.tsv.txt
col:ID | col:alternativeID | col:nameAlternativeID | col:sourceID | col:parentID | col:basionymID | col:status | col:scientificName | col:authorship | col:rank | col:notho | col:uninomial | col:genericName | col:infragenericEpithet | col:specificEpithet | col:infraspecificEpithet | col:cultivarEpithet | col:namePhrase | col:nameReferenceID | col:publishedInYear | col:publishedInPage | col:publishedInPageLink | col:code | col:nameStatus | col:accordingToID | col:accordingToPage | col:accordingToPageLink | col:referenceID | col:scrutinizer | col:scrutinizerID | col:scrutinizerDate | col:extinct | col:temporalRangeStart | col:temporalRangeEnd | col:environment | col:species | col:section | col:subgenus | col:genus | col:subtribe | col:tribe | col:subfamily | col:family | col:superfamily | col:suborder | col:order | col:subclass | col:class | col:subphylum | col:phylum | col:kingdom | col:sequenceIndex | col:branchLength | col:link | col:nameRemarks | col:remarks |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9JQSG | 1027 | 9JGRG | accepted | Adoretosoma elegans | Blanchard, 1851 | species | Adoretosoma | elegans | 924ec3e7-852c-4541-beff-4a70f7b4d225 | 234 | zoological | 924ec3e7-852c-4541-beff-4a70f7b4d225 | false | ||||||||||||||||||||||||||||||||||||||||||
9JRX8 | 1027 | 7Q75V | accepted | Anisoplia (Anisoplia) baetica | Erichson, 1847 | species | Anisoplia | Anisoplia | baetica | bb3e86bd-6d21-4852-92b6-9e892279ec27 | 636 | zoological | bb3e86bd-6d21-4852-92b6-9e892279ec27 | false |
As far as I can tell, the original issue, adding subfamily, tribe and subtribe has been resolved.
Transferring another issue to a more suitable location - the catalogue of life tracker.
I've transferred the Catalogue of Life taxonomic id question to https://github.com/CatalogueOfLife/general/issues/98 .
The author string changed for Adoretosoma elegans, which results in a new ID being minted:
https://api.checklistbank.org/dataset/COL2022/taxon/64SCW
https://api.checklistbank.org/dataset/COL2023/taxon/9JQSG
The annual checklist 2022 was released in August 2022, a month before @jhpoelen harvested it. @whitfarnum, if you can point to the API like with the link above which uses the COL2022 alias as the dataset_id, the annual checklists won't ever be deleted and although ChecklistBank is still under development, the taxon endpoint has been stable for at least 3 years. Unfortunately, there's no way to load the COL2022 annual checklist into the catalogueoflife.org portal to link to it there.
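Rewriting the portal links into API links of the form shown above could look like this. It is a minimal sketch: the URL pattern is copied from the links in this thread, the function name is hypothetical, and it assumes every portal link ends in `/data/taxon/<id>`.

```python
from urllib.parse import urlparse

PORTAL_PREFIX = "/data/taxon/"
API_TEMPLATE = "https://api.checklistbank.org/dataset/{dataset}/taxon/{taxon_id}"


def to_checklistbank_url(portal_url, dataset="COL2022"):
    """Turn a catalogueoflife.org taxon link into a ChecklistBank API link."""
    path = urlparse(portal_url).path
    if not path.startswith(PORTAL_PREFIX):
        raise ValueError(f"not a portal taxon link: {portal_url}")
    taxon_id = path[len(PORTAL_PREFIX):].strip("/")
    if not taxon_id:
        raise ValueError(f"no taxon id in: {portal_url}")
    return API_TEMPLATE.format(dataset=dataset, taxon_id=taxon_id)


api_url = to_checklistbank_url("https://www.catalogueoflife.org/data/taxon/64SCW")
```

The `dataset` argument defaults to the COL2022 alias because that is the annual checklist discussed here; another alias can be passed for a different release.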
You may be interested in:
@whitfarnum, I think @jhpoelen harvested the 2022 COL Annual Checklist, so you should have no 404 errors if you use the API links with COL2022 as the dataset_id. Otherwise, if it were the September release, you could potentially expect IDs to be broken for these names in COL2022, as shown by this diff of the Scarabs dataset between the Aug 2022 and Sept 2022 releases:
https://www.checklistbank.org/dataset/1027/diff?attempts=92..93
200: https://api.checklistbank.org/dataset/COL2022/taxon/3SGCB
404: https://api.checklistbank.org/dataset/9840/taxon/3SGCB
new url: 200: https://api.checklistbank.org/dataset/9840/taxon/9G4RN
But that should still be far fewer 404 errors than when linking to the current COL release. I wouldn't link to monthly COL releases because eventually they will get deleted.
@whitfarnum Your careful observations led to an exchange with one of the Catalogue of Life developers @ https://github.com/CatalogueOfLife/general/issues/98#issuecomment-1677075688 . Curious to hear your thoughts on this.
@jhpoelen I am using the Catalogue of Life data to curate our scarabs. Because I am using a single trusted source, I can automate a lot of the name update procedures using the export results from Nomer. Here are some proposed features that I think will make data updating smoother.
In the higher taxonomy, the data jumps from alignedFamilyName down to alignedGenusName. This ignores subfamily, tribe, and subtribe. Those data are available in the alignedPath and alignedPathNames fields. Having them already parsed out like family and genus would save me a lot of manual copy and paste. Example:
Biota | Animalia | Arthropoda | Insecta | Coleoptera | Scarabaeoidea | Scarabaeidae | Aegialiinae | Aegialiini | Aegialia | Aegialia arenaria
unranked | kingdom | phylum | class | order | superfamily | family | subfamily | tribe | genus | species
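Until such fields exist, the two pipe-separated strings in the example can be paired up with a few lines of Python. This is a sketch under the assumption that one field holds the names and the other the ranks, in the same order, as in the example above; `parse_path` is a hypothetical helper, not part of Nomer.

```python
def parse_path(names, ranks, separator="|"):
    """Pair each rank with the name at the same position in the path."""
    name_list = [part.strip() for part in names.split(separator)]
    rank_list = [part.strip() for part in ranks.split(separator)]
    if len(name_list) != len(rank_list):
        raise ValueError("names and ranks have different lengths")
    # Skip positions where either the rank or the name is empty.
    return {rank: name for rank, name in zip(rank_list, name_list) if rank and name}


NAMES = ("Biota | Animalia | Arthropoda | Insecta | Coleoptera | Scarabaeoidea | "
         "Scarabaeidae | Aegialiinae | Aegialiini | Aegialia | Aegialia arenaria")
RANKS = ("unranked | kingdom | phylum | class | order | superfamily | "
         "family | subfamily | tribe | genus | species")

taxa = parse_path(NAMES, RANKS)
subfamily = taxa.get("subfamily")  # "Aegialiinae"
tribe = taxa.get("tribe")          # "Aegialiini"
subtribe = taxa.get("subtribe")    # None: this path has no subtribe
```

Using `dict.get` means a rank that is absent from a path, such as subtribe here, simply comes back as `None` instead of raising an error.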
For Catalogue of Life, at least, they have the author attached to higher taxonomy, and if it could be included in the export, that would save a lot of copy and paste on the back end. Here is a sample higher taxon link from Catalogue of Life: https://www.catalogueoflife.org/data/taxon/8RYS7