globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0

Better taxonomic resolution between family and genus on export by including SubFamily, Tribe, and Subtribe #159

Closed whitfarnum closed 1 year ago

whitfarnum commented 1 year ago

@jhpoelen I am using the catalog of life data to curate our scarabs. Because I am using a single trusted source, I can automate a lot of the name update procedures using the export results from Nomer. Here are some proposed features I think will make data updating smoother.

In the higher taxonomy, the data jumps from alignedFamilyName down to alignedGenusName. This ignores SubFamily, Tribe, and Subtribe. Those data are available in the alignedPath and alignedPathNames fields. Having them already parsed out like family and genus would save me a lot of manual copy and paste. Example:

Biota | Animalia | Arthropoda | Insecta | Coleoptera | Scarabaeoidea | Scarabaeidae | Aegialiinae | Aegialiini | Aegialia | Aegialia arenaria

unranked | kingdom | phylum | class | order | superfamily | family | subfamily | tribe | genus | species
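For illustration, the two pipe-delimited strings in the example above can be paired up by position to recover the intermediate ranks. This is only a minimal Python sketch: it assumes the names and the rank labels arrive as two parallel strings, and the function name `ranks_from_path` is hypothetical, not part of Nomer.

```python
def ranks_from_path(path: str, path_ranks: str, sep: str = "|") -> dict[str, str]:
    """Map each rank label to the taxon name found at the same position."""
    names = [part.strip() for part in path.split(sep)]
    ranks = [part.strip() for part in path_ranks.split(sep)]
    if len(names) != len(ranks):
        raise ValueError(f"{len(names)} names but {len(ranks)} rank labels")
    # Skip positions where either the rank label or the name is empty.
    return {rank: name for rank, name in zip(ranks, names) if rank and name}


# The two strings quoted in the example above.
path = ("Biota | Animalia | Arthropoda | Insecta | Coleoptera | Scarabaeoidea | "
        "Scarabaeidae | Aegialiinae | Aegialiini | Aegialia | Aegialia arenaria")
path_ranks = ("unranked | kingdom | phylum | class | order | superfamily | "
              "family | subfamily | tribe | genus | species")

lineage = ranks_from_path(path, path_ranks)
print(lineage["subfamily"], lineage["tribe"])
```

A rank that is absent from the path, such as subtribe here, is simply missing from the returned dictionary, so `lineage.get("subtribe", "")` is a convenient way to fill a column.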

For catalog of life, at least, they have the author attached to higher taxonomy, and if it could be included in the export that would save a lot of copy and paste on the back end. Here is a sample higher taxa link from catalog of life: https://www.catalogueoflife.org/data/taxon/8RYS7

jhpoelen commented 1 year ago

Hi @whitfarnum, good to hear from you, and thanks for your detailed examples of ways to help make Nomer a little more useful for your work.

Just curious - how do you typically use Nomer?

whitfarnum commented 1 year ago

@jhpoelen I use it as an aid to our inventory and curation process:

1) Go through the cabinets, drawers, and trays one by one and write down all the names that appear. Also write down the location in the collection, how many specimens, and how many unit trays.
2) Create a list of species names from the inventory and feed it into nomer.
3) Put the nomer results into a Python program that updates the names if no conflicts about the current name exist.
4) Deal manually with the names that had no results or had multiple results.
5) Update higher taxonomy for the species.
6) Print unit tray labels and curate the group.
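Steps 3 and 4 above could be sketched roughly as follows. This is not the actual program from the workflow; it is a guess at the triage logic. It assumes each Nomer result row is a dictionary with `providedName` and `alignedName` keys (assumed column names), and it treats "no conflict" as "exactly one distinct aligned name". The sample rows are invented placeholders, not real taxa or real results.

```python
from collections import defaultdict


def triage(rows):
    """Split provided names into automatic updates and names for manual review."""
    candidates = defaultdict(set)
    for row in rows:
        # Register every provided name, even when nothing matched.
        aligned_names = candidates[row["providedName"]]
        aligned = (row.get("alignedName") or "").strip()
        if aligned:
            aligned_names.add(aligned)

    updates, manual = {}, []
    for provided, aligned_names in candidates.items():
        if len(aligned_names) == 1:
            updates[provided] = next(iter(aligned_names))
        else:
            # Zero matches or several competing matches need a human decision.
            manual.append(provided)
    return updates, manual


# Invented placeholder rows, for illustration only.
rows = [
    {"providedName": "Exampleus primus", "alignedName": "Exampleus novus"},
    {"providedName": "Exampleus dubius", "alignedName": "Exampleus dubius"},
    {"providedName": "Exampleus dubius", "alignedName": "Alterus dubius"},
    {"providedName": "Exampleus ignotus", "alignedName": ""},
]
updates, manual = triage(rows)
print(updates)
print(manual)
```

Rows read with `csv.DictReader(handle, delimiter="\t")` from a tab-separated results file would fit this shape, provided the header uses the assumed column names.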

I am currently working on Scarabaeidae and catalog of life has the best online resource for that so I am only searching catalog of life. I am looking at the catalog of life API to fill in higher taxonomy but it feels like it is still under development.

jhpoelen commented 1 year ago

@whitfarnum Thanks for detailing your workflow. Hoping to do what I can to add the additional taxon ranks as separate fields sooner rather than later. Thanks for being patient.

jhpoelen commented 1 year ago

@whitfarnum I've expanded the alignment schema to include subfamily, tribe and subtribe as you suggested. In running your example Aegialia arenaria, I found that expected values appeared when matching against ITIS / NCBI. For the report, see https://github.com/globalbioticinteractions/name-alignment-template/actions/runs/5682620836 and the attached file. For an example snippet from the report, see below.

Can you confirm?

| alignedFamilyName | alignedFamilyId | alignedSubfamilyName | alignedSubfamilyId | alignedTribeName | alignedTribeId | alignedSubtribeName | alignedSubtribeId | alignedGenusName | alignedGenusId | alignedSubgenusName | alignedSubgenusId | alignedSpeciesName | alignedSpeciesId |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Scarabaeidae | ITIS:114493 | Aphodiinae | ITIS:678499 | Aegialiini | ITIS:926254 | | | Aegialia | ITIS:926301 | Aegialia (Aegialia) | ITIS:926677 | Aegialia arenaria | ITIS:926708 |
| Scarabaeidae | NCBI:7055 | Aphodiinae | NCBI:166306 | | | | | Aegialia | NCBI:206942 | | | Aegialia arenaria | NCBI:206943 |

alignment-report.zip

whitfarnum commented 1 year ago

@jhpoelen I will try this out when I finish the group I am currently inventorying. Probably next week.

jhpoelen commented 1 year ago

@whitfarnum great! Looking forward to hearing your notes.

whitfarnum commented 1 year ago

@jhpoelen I ran a test, and the higher taxonomy with subfamily and tribe is exactly what I want. I encountered a bug where the alignedExternalId from catalog of life gives a 404 error. It happened for the first two URLs I spot checked, but it did not repeat on the other ones. A list of the results is below. It is low frequency, but getting the wrong externalID should be looked at.

Adoretosoma elegans: 404 error
Nomer alignedExternalId: https://www.catalogueoflife.org/data/taxon/64SCW
Search of catalog of life for Adoretosoma elegans: https://www.catalogueoflife.org/data/taxon/9JQSG

Anisoplia baetica: 404 error
Nomer alignedExternalId: https://www.catalogueoflife.org/data/taxon/7RCD4
catalog of life: https://www.catalogueoflife.org/data/taxon/9JRX8

Paracotalpa ursina: works
Nomer alignedExternalId: https://www.catalogueoflife.org/data/taxon/75NC5

Lagochile trigona: works
Nomer alignedExternalId: https://www.catalogueoflife.org/data/taxon/6NV7R

Strigoderma arboricola: works
Nomer alignedExternalId: https://www.catalogueoflife.org/data/taxon/5322V

Chlorota aulica: works
Nomer alignedExternalId: https://www.catalogueoflife.org/data/taxon/5XZ78

Adoretus semperi: works
https://www.catalogueoflife.org/data/taxon/8VR2D

globi_Scarab_Rutelinae.csv

I have some numbers. I submitted ~650 names, and 550 of them matched in catalog of life. I then tested the URL associated with each name. ~13% of the names gave a 404 error. The results file is attached.
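A check like the one described above could be scripted along these lines. This is a hedged sketch, not the script that produced the numbers in this comment: it assumes the server answers HEAD requests and reports missing taxon pages with an HTTP 404 status, and the helper names `http_status` and `find_broken` are hypothetical. The demo uses stubbed statuses that mirror two of the spot checks listed above, so it runs without network access.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code that a HEAD request to `url` yields."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises on 4xx/5xx responses; the code is carried on the error.
        return error.code


def find_broken(urls: Iterable[str],
                status_of: Callable[[str], int] = http_status) -> list[str]:
    """Return the URLs whose status is 404, preserving input order."""
    return [url for url in urls if status_of(url) == 404]


# Stubbed statuses mirroring two spot checks reported above (no network needed).
reported = {
    "https://www.catalogueoflife.org/data/taxon/64SCW": 404,
    "https://www.catalogueoflife.org/data/taxon/75NC5": 200,
}
broken = find_broken(reported, status_of=reported.__getitem__)
print(f"{len(broken)} of {len(reported)} URLs returned 404")
```

Calling `find_broken(urls)` without `status_of` performs live requests; connection failures other than HTTP errors are not caught here and would surface as exceptions.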

jhpoelen commented 1 year ago

@whitfarnum thanks for sharing your detailed notes.

As I might have told you, Nomer uses a versioned copy of Catalogue of Life (i.e., accessed on 2022-09-09T20:05:22.601Z at https://download.catalogueoflife.org/col/latest_coldp.zip with signature hash://sha256/9ac28297a996e02f6026c40d24e67f59f7f39d495bb45759ebc4adb475d51f63 or hash://md5/ce89c200aab5be1b439647c1ac72813f as part of [1]). In this versioned (and signed) copy, I was able to locate the taxon ids that Catalogue of Life appears to have forgotten in their more recent version.

Adoretosoma elegans
Nomer alignedExternalId: https://www.catalogueoflife.org/data/taxon/64SCW gives a 404 error
catalog of life (accessed on 2023-07-28?): https://www.catalogueoflife.org/data/taxon/9JQSG

Anisoplia baetica
Nomer alignedExternalId: https://www.catalogueoflife.org/data/taxon/7RCD4 gives a 404 error
catalog of life (accessed on 2023-07-28?): https://www.catalogueoflife.org/data/taxon/9JRX8

preston cat\
 --remote https://linker.bio/,https://zenodo.org/record/8125362/files/\
 'zip:hash://md5/ce89c200aab5be1b439647c1ac72813f!/NameUsage.tsv'\
 | mlr --tsvlite filter '${col:ID} == "64SCW" || ${col:ID} == "9JQSG" || ${col:ID} == "7RCD4" || ${col:ID} == "9JRX8"'\
 > suspicious-taxa-with-404s-or-replacement-id.tsv 

See results attached and below in markdown table.

Note that, in the versioned copy of Catalogue of Life, the taxon ids 64SCW and 7RCD4 were found, but the ones you found via the current catalogue of life web interface (9JQSG, 9JRX8) do not appear in the older versioned copy that is part of the Nomer Corpus of Taxonomic Resources.

This suggests that Catalogue of Life "forgot" or stopped using / redirecting at least 2 previously issued taxon ids.
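The same comparison can be made programmatically once two snapshots are loaded. The following is a small Python sketch under stated assumptions: rows are dictionaries keyed by the `col:` column names, names are matched on `col:scientificName` alone (which ignores homonyms and authorship changes), and the function name `changed_ids` is hypothetical. The rows are abbreviated to two columns, with values quoted in this thread.

```python
def changed_ids(old_rows, new_rows,
                name_field="col:scientificName", id_field="col:ID"):
    """Return {name: (old_id, new_id)} for names whose id differs between snapshots."""
    old_ids = {row[name_field]: row[id_field] for row in old_rows}
    new_ids = {row[name_field]: row[id_field] for row in new_rows}
    shared = sorted(old_ids.keys() & new_ids.keys())
    return {
        name: (old_ids[name], new_ids[name])
        for name in shared
        if old_ids[name] != new_ids[name]
    }


# Abbreviated rows: ids and names as quoted in this thread.
snapshot_2022 = [
    {"col:ID": "64SCW", "col:scientificName": "Adoretosoma elegans"},
    {"col:ID": "7RCD4", "col:scientificName": "Anisoplia (Anisoplia) baetica"},
]
snapshot_2023 = [
    {"col:ID": "9JQSG", "col:scientificName": "Adoretosoma elegans"},
    {"col:ID": "9JRX8", "col:scientificName": "Anisoplia (Anisoplia) baetica"},
]

for name, (old_id, new_id) in changed_ids(snapshot_2022, snapshot_2023).items():
    print(f"{name}: {old_id} -> {new_id}")
```

Full snapshots could be read with `csv.DictReader(handle, delimiter="\t")`; for a file the size of NameUsage.tsv, holding two dictionaries in memory may or may not be practical.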

suspicious-taxa-with-404s-or-replacement-id.tsv.txt

col:ID col:sourceID col:parentID col:basionymID col:status col:scientificName col:authorship col:rank col:notho col:uninomial col:genericName col:infragenericEpithet col:specificEpithet col:infraspecificEpithet col:cultivarEpithet col:namePhrase col:nameReferenceID col:publishedInYear col:publishedInPage col:publishedInPageLink col:code col:nameStatus col:accordingToID col:accordingToPage col:accordingToPageLink col:referenceID col:scrutinizer col:scrutinizerID col:scrutinizerDate col:extinct col:temporalRangeStart col:temporalRangeEnd col:environment col:species col:section col:subgenus col:genus col:subtribe col:tribe col:subfamily col:family col:superfamily col:suborder col:order col:subclass col:class col:subphylum col:phylum col:kingdom col:sequenceIndex col:branchLength col:link col:nameRemarks col:remarks
64SCW 1027 PCW accepted Adoretosoma elegans Blanchard, 1850 species Adoretosoma elegans b09376b7-bdbc-4867-9aaa-96b53201d60b 234 zoological b09376b7-bdbc-4867-9aaa-96b53201d60b false
7RCD4 1027 7Q75V accepted Anisoplia (Anisoplia) baetica Erichson, 1848 species Anisoplia Anisoplia baetica da86979e-91c4-44ed-a8cf-ee2eef2c9f94 636 zoological da86979e-91c4-44ed-a8cf-ee2eef2c9f94 false

[1] Poelen, Jorrit H. (2023). Nomer Corpus of Taxonomic Resources hash://sha256/0e9bc57bc082b58a2c7a509bb73362b258ec8ddfc6664898e25c639786413fda hash://md5/91dd844e787ffae8f0a2bbb8c1f29192 (0.16) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.8125362

jhpoelen commented 1 year ago

Now, I wonder if the Catalogue of Life team can share what happened to taxon ids 64SCW and 7RCD4 after 2022-09-09T20:05:22.601Z. Also, I wonder what their policy is on retiring taxonomic identifiers, and whether they intend to have their col:ID be used as "stable" taxonomic identifiers for external use.

This seems especially relevant because GBIF is moving to adopt the Catalogue of Life as their backbone taxonomy.

from https://www.gbif.org/publisher/f4ce3c03-7b38-445e-86e6-5f6b04b649d4 -

Description In June 2001 the Species 2000 and ITIS organisations, that had previously worked separately, decided to work together to create the Catalogue of Life, now estimated at 1.9 million species (Chapman, 2009). The two organisations remain separate and different in structure. However, by working together in creating a common product, the partnership has enabled them to reduce duplication of effort, make better use of resources, and to accelerate production. The combined Annual Checklist has become well established as a cited reference used for data compilation and comparison. For instance, it is used as the principal taxonomic index in the GBIF and EoL data portals and recognised by the CBD.

@mdoering @gdower Can you please help us understand the policy on taxonomic ids issued by Catalogue of Life?

Screenshot from 2023-07-28 11-23-39

jhpoelen commented 1 year ago

Note, by the way, that an identical query to the one in https://github.com/globalbioticinteractions/nomer/issues/159#issuecomment-1655947728, applied to the current version of Catalogue of Life, yielded only the "newer" taxonomic identifiers; the older ones were not to be found. I am probably looking in the wrong location, so I'd appreciate any insight that you may have on the internals of Catalogue of Life.

preston cat\
 'zip:hash://sha256/ed955cdf758cab5e0bc21e8aecefac31166d4ae67f6954fd44785f6a537144bd!/NameUsage.tsv'\
 | mlr --tsvlite filter '${col:ID} == "64SCW" || ${col:ID} == "9JQSG" || ${col:ID} == "7RCD4" || ${col:ID} == "9JRX8"'\
 > 2023-07-28-col-suspicious-taxa-with-404s-or-replacement-id.tsv.txt 

2023-07-28-col-suspicious-taxa-with-404s-or-replacement-id.tsv.txt

col:ID col:alternativeID col:nameAlternativeID col:sourceID col:parentID col:basionymID col:status col:scientificName col:authorship col:rank col:notho col:uninomial col:genericName col:infragenericEpithet col:specificEpithet col:infraspecificEpithet col:cultivarEpithet col:namePhrase col:nameReferenceID col:publishedInYear col:publishedInPage col:publishedInPageLink col:code col:nameStatus col:accordingToID col:accordingToPage col:accordingToPageLink col:referenceID col:scrutinizer col:scrutinizerID col:scrutinizerDate col:extinct col:temporalRangeStart col:temporalRangeEnd col:environment col:species col:section col:subgenus col:genus col:subtribe col:tribe col:subfamily col:family col:superfamily col:suborder col:order col:subclass col:class col:subphylum col:phylum col:kingdom col:sequenceIndex col:branchLength col:link col:nameRemarks col:remarks
9JQSG 1027 9JGRG accepted Adoretosoma elegans Blanchard, 1851 species Adoretosoma elegans 924ec3e7-852c-4541-beff-4a70f7b4d225 234 zoological 924ec3e7-852c-4541-beff-4a70f7b4d225 false
9JRX8 1027 7Q75V accepted Anisoplia (Anisoplia) baetica Erichson, 1847 species Anisoplia Anisoplia baetica bb3e86bd-6d21-4852-92b6-9e892279ec27 636 zoological bb3e86bd-6d21-4852-92b6-9e892279ec27 false

jhpoelen commented 1 year ago

As far as I can tell, the original issue, adding subfamily, tribe and subtribe, has been resolved.

Transferring another issue to a more suitable location - the catalogue of life tracker.

jhpoelen commented 1 year ago

I've transferred the Catalogue of Life taxonomic id question to https://github.com/CatalogueOfLife/general/issues/98 .

gdower commented 1 year ago

The author string changed for Adoretosoma elegans, which results in a new ID being minted:

https://api.checklistbank.org/dataset/COL2022/taxon/64SCW
https://api.checklistbank.org/dataset/COL2023/taxon/9JQSG

The annual checklist 2022 was released in August 2022, a month before @jhpoelen harvested it. @whitfarnum, if you point to the API, as in the first link above, which uses the COL2022 alias as the dataset_id, the links should keep working: the annual checklists won't ever be deleted, and although ChecklistBank is still under development, the taxon endpoint has been stable for at least 3 years. Unfortunately, there's no way to load the COL2022 annual checklist into the catalogueoflife.org portal to link to it there.
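As a sketch of that suggestion, a portal link can be rewritten into a ChecklistBank API link of the shape shown above. The URL pattern is taken from the links in this comment; the function name `archival_taxon_url` and the choice of COL2022 as the default are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Pattern taken from the API links quoted in this thread.
API_TEMPLATE = "https://api.checklistbank.org/dataset/{dataset}/taxon/{taxon_id}"


def archival_taxon_url(portal_url: str, dataset: str = "COL2022") -> str:
    """Rewrite a catalogueoflife.org taxon link to a ChecklistBank API link."""
    path = urlparse(portal_url).path.rstrip("/")
    prefix, _, taxon_id = path.rpartition("/")
    if not prefix.endswith("/data/taxon") or not taxon_id:
        raise ValueError(f"not a Catalogue of Life taxon URL: {portal_url}")
    return API_TEMPLATE.format(dataset=dataset, taxon_id=taxon_id)


print(archival_taxon_url("https://www.catalogueoflife.org/data/taxon/64SCW"))
```

Whether a given id actually resolves under a given annual release still has to be checked against the API; the rewrite only changes where the link points.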

You may be interested in:

https://github.com/CatalogueOfLife/backend/issues/1083

gdower commented 1 year ago

@whitfarnum, I think @jhpoelen harvested the 2022 COL Annual Checklist, so you should have no 404 errors if you use the API links with COL2022 as the dataset_id. Otherwise, if it were the September release, you could potentially expect IDs to be broken in COL2022 for the names in the following diff of the Scarabs dataset between the Aug 2022 and Sept 2022 releases:

https://www.checklistbank.org/dataset/1027/diff?attempts=92..93

200: https://api.checklistbank.org/dataset/COL2022/taxon/3SGCB
404: https://api.checklistbank.org/dataset/9840/taxon/3SGCB

new url: 200: https://api.checklistbank.org/dataset/9840/taxon/9G4RN

But that should still be far fewer 404 errors than with links to the current COL release. I wouldn't link to monthly COL releases because eventually they will get deleted.

jhpoelen commented 1 year ago

@whitfarnum Your careful observations led to an exchange with one of the Catalogue of Life developers @ https://github.com/CatalogueOfLife/general/issues/98#issuecomment-1677075688 . Curious to hear your thoughts on this.