Closed by jhpoelen 2 years ago
The name alignment review process used was:
which uses Nomer v0.2.13 (April 2022) in combination with a pre-processed COL index - https://github.com/globalbioticinteractions/nomer/releases/download/0.2.13/catalogue_of_life_mapdb.zip
https://github.com/globalbioticinteractions/nomer/releases/tag/0.2.13
After installing the COL pre-computed index and Nomer 0.2.13, I was able to reproduce the suspected false negative match for "Aglais io" -
installing versions -
curl -L "https://github.com/globalbioticinteractions/nomer/releases/download/0.2.13/nomer.jar" > nomer.jar
curl -L "https://github.com/globalbioticinteractions/nomer/releases/download/0.2.13/catalogue_of_life_mapdb.zip" > catalogue_of_life_mapdb.zip
create nomer cache and unpack COL index
mkdir .nomer
(cd .nomer && unzip ../catalogue_of_life_mapdb.zip)
to reproduce the suspicious result:
echo -e "\tAglais io" | java -jar nomer.jar append col
yielded:
[...]
Aglais io NONE Aglais io
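To make results like the one above easier to scan, here is a minimal sketch that pulls the match relation out of each output line. It assumes that `nomer append` output is tab-separated with the relation (e.g. NONE) in the third column; that layout is inferred from the output shown here, not from Nomer documentation. The helper name `match_relation` is hypothetical.

```shell
# Hypothetical helper: print the third tab-separated column of each line on stdin.
# Assumption: nomer output columns are <provided id> <provided name> <relation> ...
match_relation() {
  awk -F '\t' '{ print $3 }'
}

# Sample line reconstructed from the output above (tab positions are an assumption).
printf '\tAglais io\tNONE\t\tAglais io\n' | match_relation
# prints: NONE
```

With Nomer installed, the same helper could be appended to the pipeline, e.g. `echo -e "\tAglais io" | java -jar nomer.jar append col | match_relation`.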
The same result was obtained after rebuilding the Catalogue of Life index using Nomer v0.2.13 -
$ echo -e "\tAglais io" | java -jar nomer.jar append col
[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [col]
[main] INFO org.globalbioticinteractions.nomer.match.CatalogueOfLifeTaxonService - [CATALOGUE_OF_LIFE] taxonomy importing...
[main] INFO org.globalbioticinteractions.nomer.match.ResourceServiceContentBased - using local Preston data dir: [/home/jorrit/tmp/alignment/./.nomer/data]
[main] INFO org.globalbioticinteractions.nomer.match.ResourceServiceContentBased - caching [zip:https://download.catalogueoflife.org/col/latest_coldp.zip!/NameUsage.tsv] at [/home/jorrit/tmp/alignment/./.nomer/tmp/nomer10203795427066328567.gz]...
[https://zenodo.org/recor...210ef59e06b640a3539cb5a] 100.0% of 78 bytes at 0.01 MB/s completed in < 1 minute
[https://zenodo.org/recor...eac1915f7c3be9748eda991] 100.0% of 22 kB at 5.46 MB/s completed in < 1 minute
[https://zenodo.org/recor...dd472e8ae9c3f66f4932c62] 100.0% of 78 bytes at 0.07 MB/s completed in < 1 minute
[https://zenodo.org/recor...4e871ad1ec0a8b283091e08] 100.0% of 26 kB at 13.07 MB/s completed in < 1 minute
[https://zenodo.org/recor...ac9a12e14bfa72ccb0a6828] 100.0% of 78 bytes at 0.07 MB/s completed in < 1 minute
[https://zenodo.org/recor...f20e2eedbb2fb585bdf0822] 100.0% of 32 kB at 15.91 MB/s completed in < 1 minute
[https://zenodo.org/recor...5bba06ad7b40eea7c8c9831] 100.0% of 78 bytes at 0.07 MB/s completed in < 1 minute
[https://zenodo.org/recor...ae02993ebf7a0fe30d87137] 100.0% of 2 kB at ? MB/s completed in < 1 minute
[https://zenodo.org/recor...7a57d9ac636bd2136ed64d8] 0.0% of 439 MB at 0.23 MB/s[https://zenodo.org/recor...7a57d9ac636bd2136ed64d8]
[...]
[main] INFO org.globalbioticinteractions.nomer.match.CatalogueOfLifeTaxonService - cache with [4438467] items built in [912.3] s or [4865.2] items/s.
[main] INFO org.globalbioticinteractions.nomer.match.CatalogueOfLifeTaxonService - [CATALOGUE_OF_LIFE] taxonomy imported.
Aglais io NONE Aglais io
using Nomer's Corpus of Taxonomic Resources v0.4
Poelen, Jorrit H. (2022). Nomer Corpus of Taxonomic Resources hash://sha256/6224f259190590c7aed4784de2b27b3005eea0042ae02993ebf7a0fe30d87137 (0.4) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6473194 -
$ java -jar nomer.jar properties | grep preston
nomer.preston.dir=
nomer.preston.remotes=https://zenodo.org/record/6473194/files
nomer.preston.version=hash://sha256/d58ab1acf350f056a75bde7f4175d14c5e4dfaf0bf20e2eedbb2fb585bdf0822
and, on retrieving the exact version of the CoL resource via:
preston cat 'zip:https://download.catalogueoflife.org/col/latest_coldp.zip!/NameUsage.tsv' --remote https://zenodo.org/record/6473194/files | grep "Aglais io"
yielded:
65S3F 1018 5TRRN synonym Aglais ioprotoformis Reuss, 1909 species Aglais ioprotoformis zoological
65RRF 1018 5TRRN synonym Aglais ioformis Reuss, 1909 species Aglais ioformis
So, for some reason, the version of COL included in Nomer Corpus of Taxonomic Resources v0.4 contains "Aglais ioprotoformis" and "Aglais ioformis", but not "Aglais io".
This is consistent with the results produced by Nomer v0.2.13 -
$ echo -e "\tAglais ioprotoformis\n\tAglais ioformis" | nomer append col
Aglais ioprotoformis SYNONYM_OF COL:5TRRN Aglais urticae (Linnaeus, 1758) species Biota | Animalia | Arthropoda | Insecta | Lepidoptera | Papilionoidea | Nymphalidae | Nymphalinae | Nymphalini | Aglais | Aglais urticae COL:5T6MX | COL:N | COL:RT | COL:H6 | COL:LP | COL:5G9 | COL:DGC | COL:93MY9 | COL:93N4T | COL:8ZZ5K | COL:5TRRN unranked | kingdom | phylum | class | order | superfamily | family | subfamily | tribe | genus | species https://www.catalogueoflife.org/data/taxon/5TRRN
Aglais ioformis SYNONYM_OF COL:5TRRN Aglais urticae (Linnaeus, 1758) species Biota | Animalia | Arthropoda | Insecta | Lepidoptera | Papilionoidea | Nymphalidae | Nymphalinae | Nymphalini | Aglais | Aglais urticae COL:5T6MX | COL:N | COL:RT | COL:H6 | COL:LP | COL:5G9 | COL:DGC | COL:93MY9 | COL:93N4T | COL:8ZZ5K | COL:5TRRN unranked | kingdom | phylum | class | order | superfamily | family | subfamily | tribe | genus | species https://www.catalogueoflife.org/data/taxon/5TRRN
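A side note on the `grep "Aglais io"` filter used above: it matches any line containing that substring, which is why "Aglais ioformis" and "Aglais ioprotoformis" show up even though "Aglais io" itself is absent. A minimal sketch of a stricter filter follows, using `grep -w` (whole-word matching, available in GNU and BSD grep). The helper name `only_whole_name` is hypothetical, and the sample rows are abbreviated to a few columns for illustration.

```shell
# Hypothetical helper: keep only lines where the name appears as whole words.
only_whole_name() {
  grep -w -- "$1"
}

# Abbreviated sample rows (id, status, name) taken from NameUsage.tsv results in this thread.
printf '65S3F\tsynonym\tAglais ioprotoformis Reuss, 1909\n65RRF\tsynonym\tAglais ioformis Reuss, 1909\n93Q3Q\taccepted\tAglais io (Linnaeus, 1758)\n' \
  | only_whole_name "Aglais io"
# prints only the 93Q3Q row
```

Note that a subspecies such as "Aglais io geisha" would still pass this filter, since "Aglais io" occurs there as whole words.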
And, when using a more recent version, Nomer's Corpus of Taxonomic Resources v0.7 -
Poelen, Jorrit H. (2022). Nomer Corpus of Taxonomic Resources hash://sha256/b3742bf43d9da0a8ed5522659199f47d68d31aaf46c90381190f324c1ac143f2 hash://md5/26a9b6c796567b3985e8bfe750ea2341 (0.7) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7196029
the name "Aglais io" does appear as an accepted name:
$ preston alias 'https://download.catalogueoflife.org/col/latest_coldp.zip' --remote https://zenodo.org/record/7196029/files/
<https://download.catalogueoflife.org/col/latest_coldp.zip> <http://purl.org/pav/hasVersion> <hash://sha256/5a7731841c26a76e8c5da2f9b413f413c8cdfcabe7a57d9ac636bd2136ed64d8> <urn:uuid:24882095-69b4-4a0a-b9aa-492db73a787d> .
<https://download.catalogueoflife.org/col/latest_coldp.zip> <http://purl.org/pav/hasVersion> <hash://sha256/428d1a32d0747ec2cc36cd276bcdda8e43a4cc452f6edc767eda2b0027d5f1e9> <urn:uuid:9ba86e80-c1de-480e-aeac-edac3d44c81b> .
<https://download.catalogueoflife.org/col/latest_coldp.zip> <http://purl.org/pav/hasVersion> <hash://sha256/9ac28297a996e02f6026c40d24e67f59f7f39d495bb45759ebc4adb475d51f63> <urn:uuid:f4c99a9e-401f-48cf-b742-23408d17a4f3> .
jorrit@larus:~/tmp/alignment$ preston cat 'zip:hash://sha256/9ac28297a996e02f6026c40d24e67f59f7f39d495bb45759ebc4adb475d51f63!/NameUsage.tsv' --remote https://zenodo.org/record/7196029/files/ | grep "Aglais io"
[https://zenodo.org/recor...bb45759ebc4adb475d51f63] 100.0% of 322 MB at 2.30 MB/s completed in 2 minute(s)
93Q3Q 55434 8ZZ5K 946GJ accepted Aglais io (Linnaeus, 1758) species Aglais io zoological acceptable false
94CDM 55434 93Q3Q 9599L accepted Aglais io geisha (Stichel, 1908) subspecies Aglais io geisha zoological acceptable false
65RRF 55434 5TRRN synonym Aglais ioformis Reuss, 1909 species Aglais ioformis zoological
65S3F 55434 5TRRN synonym Aglais ioprotoformis Reuss, 1909 species Aglais ioprotoformis zoological
So, somewhere between Nomer's Corpus of Taxonomic Resources v0.4 and v0.7, Catalogue of Life updated their name list to include the name "Aglais io (Linnaeus, 1758)".
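For reference, the `preston cat` call above used the last of the three content hashes listed by `preston alias`. A minimal sketch for extracting those hashes and building the `zip:...!/NameUsage.tsv` identifier follows. The helper names `alias_hashes` and `name_usage_uri` are hypothetical, and whether the listing order reflects recency is an assumption that is not confirmed here.

```shell
# Hypothetical helper: print every sha256 content hash found on stdin.
alias_hashes() {
  grep -o 'hash://sha256/[0-9a-f]*'
}

# Hypothetical helper: build the zip-entry identifier in the form used with `preston cat` above.
name_usage_uri() {
  printf 'zip:%s!/NameUsage.tsv\n' "$1"
}

# Sample statement copied from the `preston alias` output above.
line='<https://download.catalogueoflife.org/col/latest_coldp.zip> <http://purl.org/pav/hasVersion> <hash://sha256/9ac28297a996e02f6026c40d24e67f59f7f39d495bb45759ebc4adb475d51f63> <urn:uuid:f4c99a9e-401f-48cf-b742-23408d17a4f3> .'

# Assumption: take the last listed hash, as was done by hand above.
latest="$(printf '%s\n' "$line" | alias_hashes | tail -n 1)"
name_usage_uri "$latest"
```

The resulting identifier could then be passed to `preston cat` with the same `--remote` as above.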
See associated Catalogue of Life page at:
https://www.catalogueoflife.org/data/taxon/93Q3Q and attached screenshot.
According to the Catalogue of Life, the name Aglais io was sourced from the Global Lepidoptera Index [1], and updated as recently as 2022-10-19.
And, from the associated ChecklistBank page https://www.checklistbank.org/dataset/55434/about, it appears that the resource was first created on April 29th 2022, 1:01:35 am by @dhobern. See screenshot below. But I cannot figure out, from the Catalogue of Life pages, what changed between April 2022 and Oct 2022. In other words, from what I can tell, Catalogue of Life's public-facing resources do not make explicit claims about which resource contributed to which Catalogue of Life release.
References
[1] Beccaloni, G., Scoble, M., Kitching, I., Simonsen, T., Robinson, G., Pitkin, B., Hine, A., Lyal, C., Ollerenshaw, J., Wing, P., & Hobern, D. (2022). Global Lepidoptera Index. In O. Bánki, Y. Roskov, M. Döring, G. Ower, L. Vandepitte, D. Hobern, D. Remsen, P. Schalk, R. E. DeWalt, M. Keping, J. Miller, T. Orrell, R. Aalbu, R. Adlard, E. M. Adriaenssens, C. Aedo, E. Aescht, N. Akkari, S. Alexander, et al., Catalogue of Life Checklist (Version 2022-10-19). https://doi.org/10.48580/dfqf-49xk
Note that, as far as I can tell, the DOIs associated with checklists used by Catalogue of Life remain unchanged even when their content is updated. So, when citing a checklist, you don't cite a specific version of the checklist, but the concept of the checklist.
Hi @jhpoelen - it's good to see you looking into the provenance and transformation of source data into COL and how this relates to versioning. I would agree that the processes need to be much more transparent.
In regard to this example, I can make some comments. First, COL for many years relied on the NHM version of LepIndex for its Lepidoptera classification. This was certainly the most complete resource available, but deeply flawed in many ways.
A version of LepIndex was imported a few years ago into TaxonWorks and has been somewhat cleaned at least for some sections of the order, although butterflies and noctuoid moths still have MANY issues. You can read some more here: https://stangeia.hobern.net/global-lepidoptera-index/. Among other things, I have reviewed the names within GBIF for which the largest numbers of records were unresolved by COL/LepIndex and have fixed many of these, including Aglais io.
The update dates in ChecklistBank/COL relate to the dataset as a whole rather than individual species. This is undesirable, and COL does have mechanisms for datasets to supply actual per-record curation dates. You can see an example here for a moth from a family that I (currently) still curate outside TaxonWorks: https://www.catalogueoflife.org/data/taxon/98XPG - at present TaxonWorks is not exposing this level of detail. I should explore with them how they can offer it.
@dhobern thanks for taking the time to respond and for providing context for the [Aglais io] name review that Jeff Ollerton noticed.
I've added a note at https://github.com/bio-guoda/preston/issues/198 to have a peek at ways to track checklist bank and their related source archives. Am open to suggestions.
fyi @seltmann
In a newer version of Nomer using a newer version of Catalogue of Life, the name Aglais io is picked up as expected.
echo -e "\tAglais io" | nomer append col
produced:
Aglais io HAS_ACCEPTED_NAME COL:93Q3Q Aglais io (Linnaeus, 1758) species Biota | Animalia | Arthropoda | Insecta | Lepidoptera | Papilionoidea | Nymphalidae | Nymphalinae | Nymphalini | Aglais | Aglais io COL:5T6MX | COL:N | COL:RT | COL:H6 | COL:LP | COL:5G9 | COL:DGC | COL:93MY9 | COL:93N4T | COL:8ZZ5K | COL:93Q3Q unranked | kingdom | phylum | class | order | superfamily | family | subfamily | tribe | genus | species https://www.catalogueoflife.org/data/taxon/93Q3Q
with
$ nomer properties | grep preston
nomer.preston.dir=
nomer.preston.remotes=https://zenodo.org/record/7196029/files
nomer.preston.version=hash://sha256/b3742bf43d9da0a8ed5522659199f47d68d31aaf46c90381190f324c1ac143f2
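Since the two runs in this thread differ only in which corpus version Nomer points at, it can help to pull a single property out of the `nomer properties` output and compare it across setups. A minimal sketch, assuming the `key=value` layout shown above; the helper name `preston_property` is hypothetical.

```shell
# Hypothetical helper: print the value of one property from key=value lines on stdin.
# $1 is the property name, e.g. nomer.preston.version
preston_property() {
  awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2) }'
}

# Sample input copied from the properties shown above.
printf 'nomer.preston.dir=\nnomer.preston.remotes=https://zenodo.org/record/7196029/files\nnomer.preston.version=hash://sha256/b3742bf43d9da0a8ed5522659199f47d68d31aaf46c90381190f324c1ac143f2\n' \
  | preston_property nomer.preston.remotes
# prints: https://zenodo.org/record/7196029/files
```

With Nomer installed, this could be used as `nomer properties | preston_property nomer.preston.version`.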
Closing issue because the apparent Aglais io alignment inconsistency was due to the underlying versions of Catalogue of Life.
from exchange with Jeff Ollerton https://github.com/globalbioticinteractions/ollerton2022 -
According to a 2022-10-21 GloBI review, the name "Aglais io" could not be aligned with Catalogue of Life -
produced;
However, on using a recent version of Nomer and their Catalogue of Life, I was, like you suggested, able to match Aglais io.
yielded: