globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0

inconsistent name alignment review for [Aglais io] using different versions of Catalogue of Life matcher #124

Closed jhpoelen closed 1 year ago

jhpoelen commented 1 year ago

From an exchange with Jeff Ollerton (https://github.com/globalbioticinteractions/ollerton2022):

According to a 2022-10-21 GloBI review

$ curl --silent  https://depot.globalbioticinteractions.org/reviews/globalbioticinteractions/ollerton2022/README.txt | head -n20
   _____ _       ____ _____   _____            _                
  / ____| |     |  _ \_   _| |  __ \          (_)               
 | |  __| | ___ | |_) || |   | |__) |_____   ___  _____      __ 
 | | |_ | |/ _ \|  _ < | |   |  _  // _ \ \ / / |/ _ \ \ /\ / / 
 | |__| | | (_) | |_) || |_  | | \ \  __/\ V /| |  __/\ V  V /  
  \_____|_|\___/|____/_____| |_|  \_\___| \_/ |_|\___| \_/\_/   
 | |           |  ____| | |                                     
 | |__  _   _  | |__  | | |_ ___  _ __                          
 | '_ \| | | | |  __| | | __/ _ \| '_ \                         
 | |_) | |_| | | |____| | || (_) | | | |                        
 |_.__/ \__, | |______|_|\__\___/|_| |_|                        
         __/ |                                                  
        |___/                                                   
⚠️ Disclaimer: The results in this review should be considered
friendly, yet naive, notes from an unsophisticated robot. 
Please carefully review the results listed below and share issues/ideas
by email info at globalbioticinteractions.org or by opening an issue at 
https://github.com/globalbioticinteractions/globalbioticinteractions/issues .

Review of [globalbioticinteractions/ollerton2022] started at [2022-10-21T17:40:06+02:00].

the name "Aglais io" could not be aligned with the Catalogue of Life:

curl --silent https://depot.globalbioticinteractions.org/reviews/globalbioticinteractions/ollerton2022/indexed-names-resolved-col.tsv\
 | grep NONE\
 | grep "Aglais io"

produced:

    Aglais io   species     Lepidoptera | Nymphalidae | Aglais | Aglais io  order | family | genus | species    globalbioticinteractions/ollerton2022   Ollerton, J., Trunschke, J. ., Havens, K. ., Landaverde-González, P. ., Keller, A. ., Gilpin, A.-M. ., Rodrigo Rech, A. ., Baronio, G. J. ., Phillips, B. J., Mackin, C. ., Stanley, D. A., Treanore, E. ., Baker, E. ., Rotheray, E. L., Erickson, E. ., Fornoff, F. ., Brearley, F. Q. ., Ballantyne, G. ., Iossa, G. ., Stone, G. N., Bartomeus, I. ., Stockan, J. A., Leguizamón, J., Prendergast, K. ., Rowley, L., Giovanetti, M., de Oliveira Bueno, R., Wesselingh, R. A., Mallinger, R., Edmondson, S., Howard, S. R., Leonhardt, S. D., Rojas-Nossa, S. V., Brett, M., Joaqui, T., Antoniazzi, R., Burton, V. J., Feng, H.-H., Tian, Z.-X., Xu, Q., Zhang, C., Shi, C.-L., Huang, S.-Q., Cole, L. J., Bendifallah, L., Ellis, E. E., Hegland, S. J., Straffon Díaz, S., Lander, T. A. ., Mayr, A. V., Dawson, R. ., Eeraerts, M. ., Armbruster, W. S. ., Walton, B. ., Adjlane, N. ., Falk, S. ., Mata, L. ., Goncalves Geiger, A. ., Carvell, C. ., Wallace, C. ., Ratto, F. ., Barberis, M. ., Kahane, F. ., Connop, S. ., Stip, A. ., Sigrist, M. R. ., Vereecken, N. J. ., Klein, A.-M., Baldock, K. ., & Arnold, S. E. J. . (2022). Pollinator-flower interactions in gardens during the COVID-19 pandemic lockdown of 2020. Journal of Pollination Ecology, 31, 87–96. https://doi.org/10.26786/1920-7603(2022)695  https://github.com/globalbioticinteractions/ollerton2022/archive/7eb71e8e5026ec08c04a69a09860f8927061a8fd.zip   2022-10-21T15:40:02.623Z    fdf98a0bfb924fed5cb768249538e68493105d9d2dce957d278aff4dfc5b7442    0.12.4  NONE    Aglais io   

However, using a recent version of Nomer with a recent version of the Catalogue of Life, I was, as you suggested, able to match Aglais io.

curl --silent https://depot.globalbioticinteractions.org/reviews/globalbioticinteractions/ollerton2022/indexed-names-resolved-col.tsv \
 | grep NONE\
 | cut -f1,2\
 | nomer append col\
 | grep -v NONE\
 | grep "Aglais io"

yielded:

    Aglais io    HAS_ACCEPTED_NAME    COL:93Q3Q    Aglais io    (Linnaeus, 1758)    species        Biota | Animalia | Arthropoda | Insecta | Lepidoptera | Papilionoidea | Nymphalidae | Nymphalinae | Nymphalini | Aglais | Aglais io    COL:5T6MX | COL:N | COL:RT | COL:H6 | COL:LP | COL:5G9 | COL:DGC | COL:93MY9 | COL:93N4T | COL:8ZZ5K | COL:93Q3Q    unranked | kingdom | phylum | class | order | superfamily | family | subfamily | tribe | genus | species    https://www.catalogueoflife.org/data/taxon/93Q3Q
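
A side note on the `grep NONE` step above: it matches the text NONE anywhere on a line, so a row could in principle be selected because NONE happens to occur in a name or a citation. A minimal sketch of a stricter filter follows, assuming tab-separated input and that the caller knows which column holds the match type. The function name `filter_by_column` is hypothetical, and the column index 3 in the usage line is an assumption based on the Nomer output shown above, not a documented guarantee.

```shell
# Hypothetical helper: print rows whose tab-separated column COL equals VALUE exactly.
# Usage: filter_by_column COL VALUE < input.tsv
filter_by_column() {
  # Appending "" forces a string comparison rather than a numeric one.
  awk -F'\t' -v col="$1" -v value="$2" '($col "") == (value "")'
}
```

For example, `... | nomer append col | filter_by_column 3 NONE` would keep only rows whose third column is exactly NONE.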

jhpoelen commented 1 year ago

related to https://github.com/globalbioticinteractions/globalbioticinteractions/issues/816 .

jhpoelen commented 1 year ago

The name alignment review process used was:

https://github.com/globalbioticinteractions/globinizer/blob/348bf30914df2da3d9c0840f695768beceae8e1d/align-names.sh

which uses Nomer v0.2.13 (April 2022) in combination with a pre-processed COL index - https://github.com/globalbioticinteractions/nomer/releases/download/0.2.13/catalogue_of_life_mapdb.zip .

https://github.com/globalbioticinteractions/nomer/releases/tag/0.2.13

jhpoelen commented 1 year ago

After installing the COL pre-computed index and Nomer 0.2.13, I was able to reproduce the suspected false negative match for "Aglais io" -

installing versions -

curl -L "https://github.com/globalbioticinteractions/nomer/releases/download/0.2.13/nomer.jar" > nomer.jar
curl -L "https://github.com/globalbioticinteractions/nomer/releases/download/0.2.13/catalogue_of_life_mapdb.zip" > catalogue_of_life_mapdb.zip

create nomer cache and unpack COL index

mkdir -p .nomer
# unpack in a subshell so the working directory stays next to nomer.jar
(cd .nomer && unzip ../catalogue_of_life_mapdb.zip)

to reproduce the suspicious result:

echo -e "\tAglais io" | java -jar nomer.jar append col

yielded:

[...]
    Aglais io   NONE        Aglais io               
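
The leading `\t` in the `echo -e` command above leaves the first input column (the identifier) empty and puts the name in the second column. For more than one name, a small helper can build the same two-column input. This is a sketch only: `names_to_nomer_input` is a hypothetical name, and it assumes one name per line on standard input.

```shell
# Hypothetical helper: prefix every input line with a tab,
# producing "<empty id><TAB><name>" rows.
names_to_nomer_input() {
  awk '{ print "\t" $0 }'
}
```

Usage would look like `printf 'Aglais io\nAglais urticae\n' | names_to_nomer_input | java -jar nomer.jar append col`. Using `printf` also avoids relying on `echo -e`, which does not behave the same in every shell.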

jhpoelen commented 1 year ago

Same results were obtained after rebuilding the Catalogue of Life index using Nomer v0.2.13 -

$ echo -e "\tAglais io" | java -jar nomer.jar append col
[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [col]
[main] INFO org.globalbioticinteractions.nomer.match.CatalogueOfLifeTaxonService - [CATALOGUE_OF_LIFE] taxonomy importing...
[main] INFO org.globalbioticinteractions.nomer.match.ResourceServiceContentBased - using local Preston data dir: [/home/jorrit/tmp/alignment/./.nomer/data]
[main] INFO org.globalbioticinteractions.nomer.match.ResourceServiceContentBased - caching [zip:https://download.catalogueoflife.org/col/latest_coldp.zip!/NameUsage.tsv] at [/home/jorrit/tmp/alignment/./.nomer/tmp/nomer10203795427066328567.gz]...
[https://zenodo.org/recor...210ef59e06b640a3539cb5a] 100.0% of 78 bytes at 0.01 MB/s completed in < 1 minute
[https://zenodo.org/recor...eac1915f7c3be9748eda991] 100.0% of 22 kB at 5.46 MB/s completed in < 1 minute
[https://zenodo.org/recor...dd472e8ae9c3f66f4932c62] 100.0% of 78 bytes at 0.07 MB/s completed in < 1 minute
[https://zenodo.org/recor...4e871ad1ec0a8b283091e08] 100.0% of 26 kB at 13.07 MB/s completed in < 1 minute
[https://zenodo.org/recor...ac9a12e14bfa72ccb0a6828] 100.0% of 78 bytes at 0.07 MB/s completed in < 1 minute
[https://zenodo.org/recor...f20e2eedbb2fb585bdf0822] 100.0% of 32 kB at 15.91 MB/s completed in < 1 minute
[https://zenodo.org/recor...5bba06ad7b40eea7c8c9831] 100.0% of 78 bytes at 0.07 MB/s completed in < 1 minute
[https://zenodo.org/recor...ae02993ebf7a0fe30d87137] 100.0% of 2 kB at ? MB/s completed in < 1 minute
[https://zenodo.org/recor...7a57d9ac636bd2136ed64d8] 0.0% of 439 MB at 0.23 MB/s[https://zenodo.org/recor...7a57d9ac636bd2136ed64d8]
[...]
[main] INFO org.globalbioticinteractions.nomer.match.CatalogueOfLifeTaxonService - cache with [4438467] items built in [912.3] s or [4865.2] items/s.
[main] INFO org.globalbioticinteractions.nomer.match.CatalogueOfLifeTaxonService - [CATALOGUE_OF_LIFE] taxonomy imported.
    Aglais io   NONE        Aglais io                           

using Nomer's Corpus of Taxonomic Resources v0.4

Poelen, Jorrit H. (2022). Nomer Corpus of Taxonomic Resources hash://sha256/6224f259190590c7aed4784de2b27b3005eea0042ae02993ebf7a0fe30d87137 (0.4) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6473194 -

$  java -jar nomer.jar properties | grep preston
nomer.preston.dir=
nomer.preston.remotes=https://zenodo.org/record/6473194/files
nomer.preston.version=hash://sha256/d58ab1acf350f056a75bde7f4175d14c5e4dfaf0bf20e2eedbb2fb585bdf0822

jhpoelen commented 1 year ago

Retrieving the exact version of the CoL resource via:

preston cat 'zip:https://download.catalogueoflife.org/col/latest_coldp.zip!/NameUsage.tsv' --remote https://zenodo.org/record/6473194/files  | grep "Aglais io"

yielded:

65S3F   1018    5TRRN       synonym Aglais ioprotoformis    Reuss, 1909 species     Aglais      ioprotoformis                       zoological                                                                              
65RRF   1018    5TRRN       synonym Aglais ioformis Reuss, 1909 species     Aglais      ioformis                            

So, for some reason, the COL version included in Nomer Corpus of Taxonomic Resources v0.4 contained "Aglais ioprotoformis" and "Aglais ioformis", but not "Aglais io".
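
Note that `grep "Aglais io"` matches substrings, which is why "Aglais ioprotoformis" and "Aglais ioformis" show up in the output above. To check whether a name is present as an exact field value, something like the following could be used. It is a sketch: `rows_with_exact_field` is a hypothetical helper, it assumes tab-separated input, and it deliberately does not assume which column of NameUsage.tsv holds the scientific name.

```shell
# Hypothetical helper: print rows in which any tab-separated field
# is exactly equal to NAME (no substring matches).
# Usage: rows_with_exact_field NAME < NameUsage.tsv
rows_with_exact_field() {
  awk -F'\t' -v name="$1" '{
    for (i = 1; i <= NF; i++) {
      # Appending "" forces a string comparison.
      if (($i "") == (name "")) { print; next }
    }
  }'
}
```

Piping the `preston cat` output above through `rows_with_exact_field "Aglais io"` instead of `grep` should print nothing for the v0.4 corpus, given the two rows shown.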

This is consistent with the results produced by Nomer v0.2.13 -

$ echo -e "\tAglais ioprotoformis\n\tAglais ioformis" | nomer append col
    Aglais ioprotoformis    SYNONYM_OF  COL:5TRRN   Aglais urticae  (Linnaeus, 1758)    species     Biota | Animalia | Arthropoda | Insecta | Lepidoptera | Papilionoidea | Nymphalidae | Nymphalinae | Nymphalini | Aglais | Aglais urticae    COL:5T6MX | COL:N | COL:RT | COL:H6 | COL:LP | COL:5G9 | COL:DGC | COL:93MY9 | COL:93N4T | COL:8ZZ5K | COL:5TRRN    unranked | kingdom | phylum | class | order | superfamily | family | subfamily | tribe | genus | species        https://www.catalogueoflife.org/data/taxon/5TRRN
    Aglais ioformis SYNONYM_OF  COL:5TRRN   Aglais urticae  (Linnaeus, 1758)    species     Biota | Animalia | Arthropoda | Insecta | Lepidoptera | Papilionoidea | Nymphalidae | Nymphalinae | Nymphalini | Aglais | Aglais urticae    COL:5T6MX | COL:N | COL:RT | COL:H6 | COL:LP | COL:5G9 | COL:DGC | COL:93MY9 | COL:93N4T | COL:8ZZ5K | COL:5TRRN    unranked | kingdom | phylum | class | order | superfamily | family | subfamily | tribe | genus | species        https://www.catalogueoflife.org/data/taxon/5TRRN

jhpoelen commented 1 year ago

And, when using a more recent version of Nomer's Corpus of Taxonomic Resources (v0.7) -

Poelen, Jorrit H. (2022). Nomer Corpus of Taxonomic Resources hash://sha256/b3742bf43d9da0a8ed5522659199f47d68d31aaf46c90381190f324c1ac143f2 hash://md5/26a9b6c796567b3985e8bfe750ea2341 (0.7) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7196029

the name "Aglais io" does appear as an accepted name:

$ preston alias 'https://download.catalogueoflife.org/col/latest_coldp.zip' --remote https://zenodo.org/record/7196029/files/
<https://download.catalogueoflife.org/col/latest_coldp.zip> <http://purl.org/pav/hasVersion> <hash://sha256/5a7731841c26a76e8c5da2f9b413f413c8cdfcabe7a57d9ac636bd2136ed64d8> <urn:uuid:24882095-69b4-4a0a-b9aa-492db73a787d> .
<https://download.catalogueoflife.org/col/latest_coldp.zip> <http://purl.org/pav/hasVersion> <hash://sha256/428d1a32d0747ec2cc36cd276bcdda8e43a4cc452f6edc767eda2b0027d5f1e9> <urn:uuid:9ba86e80-c1de-480e-aeac-edac3d44c81b> .
<https://download.catalogueoflife.org/col/latest_coldp.zip> <http://purl.org/pav/hasVersion> <hash://sha256/9ac28297a996e02f6026c40d24e67f59f7f39d495bb45759ebc4adb475d51f63> <urn:uuid:f4c99a9e-401f-48cf-b742-23408d17a4f3> .
$ preston cat 'zip:hash://sha256/9ac28297a996e02f6026c40d24e67f59f7f39d495bb45759ebc4adb475d51f63!/NameUsage.tsv' --remote https://zenodo.org/record/7196029/files/ | grep "Aglais io"
[https://zenodo.org/recor...bb45759ebc4adb475d51f63] 100.0% of 322 MB at 2.30 MB/s completed in 2 minute(s)
93Q3Q   55434   8ZZ5K   946GJ   accepted    Aglais io   (Linnaeus, 1758)    species         Aglais      io                  zoological  acceptable                              false                                       
94CDM   55434   93Q3Q   9599L   accepted    Aglais io geisha    (Stichel, 1908) subspecies          Aglais      io  geisha          zoological  acceptable                              false                                       
65RRF   55434   5TRRN       synonym Aglais ioformis Reuss, 1909 species         Aglais      ioformis                        zoological                                                                              
65S3F   55434   5TRRN       synonym Aglais ioprotoformis    Reuss, 1909 species         Aglais      ioprotoformis                   zoological                                                                              

So, somewhere between Nomer's Corpus of Taxonomic Resources v0.4 and v0.7, the name "Aglais io (Linnaeus, 1758)" was added to the Catalogue of Life.
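
One way to pin down such additions is to compare the name lists of two snapshots directly. A minimal sketch, assuming each snapshot has already been reduced to a plain text file with one name per line (how to extract that column from NameUsage.tsv is left out here); `names_added` is a hypothetical helper name.

```shell
# Hypothetical helper: print names that occur in NEW_FILE but not in OLD_FILE.
# Usage: names_added OLD_FILE NEW_FILE
names_added() {
  old_sorted=$(mktemp)
  new_sorted=$(mktemp)
  # comm needs both inputs sorted in the same collation order.
  LC_ALL=C sort -u "$1" > "$old_sorted"
  LC_ALL=C sort -u "$2" > "$new_sorted"
  # -13 suppresses lines unique to the old file and lines common to both.
  LC_ALL=C comm -13 "$old_sorted" "$new_sorted"
  rm -f "$old_sorted" "$new_sorted"
}
```

Run against name lists taken from the v0.4 and v0.7 corpora, the output would be expected to include "Aglais io".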

See associated Catalogue of Life page at:

https://www.catalogueoflife.org/data/taxon/93Q3Q and attached screenshot.

[image: screenshot of the Catalogue of Life page for Aglais io]

jhpoelen commented 1 year ago

According to the Catalogue of Life, the name Aglais io was sourced from the Global Lepidoptera Index [1], and updated as recently as 2022-10-19 / 2022-10-19 .

image

image

And, from the associated ChecklistBank page https://www.checklistbank.org/dataset/55434/about, it appears that the resource was first created on April 29th 2022, 1:01:35 am by @dhobern . See screenshot below. But I cannot figure out, from the Catalogue of Life pages, what changed between April 2022 and Oct 2022. In other words, from what I can tell, the public-facing Catalogue of Life resources do not state explicitly which resource contributed to which Catalogue of Life release.

[image: screenshot of the ChecklistBank page for dataset 55434]

References

[1] Beccaloni, G., Scoble, M., Kitching, I., Simonsen, T., Robinson, G., Pitkin, B., Hine, A., Lyal, C., Ollerenshaw, J., Wing, P., & Hobern, D. (2022). Global Lepidoptera Index. In O. Bánki, Y. Roskov, M. Döring, G. Ower, L. Vandepitte, D. Hobern, D. Remsen, P. Schalk, R. E. DeWalt, M. Keping, J. Miller, T. Orrell, R. Aalbu, R. Adlard, E. M. Adriaenssens, C. Aedo, E. Aescht, N. Akkari, S. Alexander, et al., Catalogue of Life Checklist (Version 2022-10-19). https://doi.org/10.48580/dfqf-49xk

jhpoelen commented 1 year ago

Note that, as far as I can tell, the DOIs associated with checklists used by Catalogue of Life remain unchanged even when their content is updated. So, when citing a checklist, you don't cite a specific version of the checklist, but the concept of the checklist.
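
To refer to one specific version of a file regardless of its DOI, a content hash can be used, similar to the hash://sha256/... identifiers that appear elsewhere in this thread. A minimal sketch, assuming `sha256sum` (GNU coreutils) or `shasum` is installed; `content_id` is a hypothetical helper, not a Preston or Nomer command.

```shell
# Hypothetical helper: print a hash://sha256/... identifier for FILE.
# Usage: content_id FILE
content_id() {
  if command -v sha256sum >/dev/null 2>&1; then
    hex=$(sha256sum < "$1" | cut -d' ' -f1)
  else
    # Fallback for systems that ship shasum instead of sha256sum.
    hex=$(shasum -a 256 < "$1" | cut -d' ' -f1)
  fi
  printf 'hash://sha256/%s\n' "$hex"
}
```

Two downloads of the same checklist yield the same identifier only if their bytes are identical, so a changed identifier signals a content update even when the DOI stays the same.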

dhobern commented 1 year ago

Hi @jhpoelen - it's good to see you looking into the provenance and transformation of source data into COL and how this relates to versioning. I would agree that the processes need to be much more transparent.

In regard to this example, I can make some comments. First, COL for many years relied on the NHM version of LepIndex for its Lepidoptera classification. This was certainly the most complete resource available, but deeply flawed in many ways.

A version of LepIndex was imported a few years ago into TaxonWorks and has been somewhat cleaned at least for some sections of the order, although butterflies and noctuoid moths still have MANY issues. You can read some more here: https://stangeia.hobern.net/global-lepidoptera-index/. Among other things, I have reviewed the names within GBIF for which the largest numbers of records were unresolved by COL/LepIndex and have fixed many of these, including Aglais io.

The update dates in ChecklistBank/COL relate to the dataset as a whole rather than individual species. This is undesirable, and COL does have mechanisms for datasets to supply actual per-record curation dates. You can see an example here for a moth from a family that I (currently) still curate outside TaxonWorks: https://www.catalogueoflife.org/data/taxon/98XPG - at present TaxonWorks is not exposing this level of detail. I should explore with them how they can offer it.

jhpoelen commented 1 year ago

@dhobern thanks for taking the time to respond and for providing context for the [Aglais io] name review that Jeff Ollerton noticed.

I've added a note at https://github.com/bio-guoda/preston/issues/198 to have a peek at ways to track checklist bank and their related source archives. Am open to suggestions.

jhpoelen commented 1 year ago

fyi @seltmann

jhpoelen commented 1 year ago

In a newer version of Nomer using a newer version of Catalogue of Life, the name Aglais io is picked up as expected.

echo -e "\tAglais io" | nomer append col

produced:

    Aglais io   HAS_ACCEPTED_NAME   COL:93Q3Q   Aglais io   (Linnaeus, 1758)    species     Biota | Animalia | Arthropoda | Insecta | Lepidoptera | Papilionoidea | Nymphalidae | Nymphalinae | Nymphalini | Aglais | Aglais io COL:5T6MX | COL:N | COL:RT | COL:H6 | COL:LP | COL:5G9 | COL:DGC | COL:93MY9 | COL:93N4T | COL:8ZZ5K | COL:93Q3Q    unranked | kingdom | phylum | class | order | superfamily | family | subfamily | tribe | genus | species        https://www.catalogueoflife.org/data/taxon/93Q3Q

with

$ nomer properties | grep preston
nomer.preston.dir=
nomer.preston.remotes=https://zenodo.org/record/7196029/files
nomer.preston.version=hash://sha256/b3742bf43d9da0a8ed5522659199f47d68d31aaf46c90381190f324c1ac143f2
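
To keep a record of which corpus version produced a given result, the relevant values can be pulled out of the `nomer properties` output. A sketch, assuming the key=value format shown above; `property_value` is a hypothetical helper name.

```shell
# Hypothetical helper: print the value of KEY from key=value lines on stdin.
# Usage: nomer properties | property_value nomer.preston.version
property_value() {
  awk -F'=' -v key="$1" '$1 == key {
    # Strip everything up to and including the first "=",
    # so values that contain "=" are kept intact.
    sub(/^[^=]*=/, "")
    print
  }'
}
```

Storing that value next to the alignment output makes it easier to tell later whether two runs used the same corpus.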

jhpoelen commented 1 year ago

Closing this issue because the apparent Aglais io alignment inconsistency was due to differences between the underlying versions of Catalogue of Life.