globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0

support offline GBIF backbone matcher #40

Closed jhpoelen closed 3 years ago

jhpoelen commented 3 years ago

As suggested by @zedomel.

jhpoelen commented 3 years ago

Initial testing shows that about 1M names can be resolved by name in about 1 minute using preliminary versions of nomer's offline GBIF taxonomic backbone matchers:

# get a list of ~ 1M names
$ curl "https://zenodo.org/record/5021869/files/names.tsv.gz"  > names.tsv.gz

# do a first match to trigger building of the local index (takes a couple of minutes, depending on internet connection)
$ echo -e "\tHomo sapiens" | nomer append gbif-taxon
...
    Homo sapiens    SAME_AS    GBIF:2436436    Homo sapiens    species     Animalia | Chordata | Mammalia | Primates | Hominidae | Homo | Homo sapiens    GBIF:1 | GBIF:44 | GBIF:359 | GBIF:798 | GBIF:5483 | GBIF:2436435 | GBIF:2436436    kingdom | phylum | class | order | family | genus | species    http://www.gbif.org/species/2436436

# now match ~ 1M names against locally indexed GBIF backbone (no internet connection needed)
$ time cat names.tsv.gz | gunzip | cut -f1,2 | pv -l | nomer append gbif-taxon > /dev/null
...
1.06M 0:00:40 [26.0k/s] [                                         <=>          ]

real    0m41.385s
user    1m31.704s
sys     0m2.647s
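
The timing run above discards the matcher output. As a rough sketch of how the tab-separated result could be post-processed instead, the hypothetical helper below (`extract_matches` is not part of nomer) keeps only `SAME_AS` rows and prints the provided name next to the resolved GBIF identifier. It assumes the column layout seen in the "Homo sapiens" example: name in column 2, match relation in column 3, resolved identifier in column 4.

```shell
# Hypothetical helper, not part of nomer: keep only exact matches and
# print the provided name alongside the resolved identifier.
# Assumes tab-separated input with the provided name in column 2, the
# match relation in column 3 and the resolved id in column 4.
extract_matches() {
  awk -F'\t' 'BEGIN { OFS = "\t" } $3 == "SAME_AS" { print $2, $4 }'
}

# Try it on a hand-written sample line (no nomer installation needed).
printf '\tHomo sapiens\tSAME_AS\tGBIF:2436436\tHomo sapiens\tspecies\n' | extract_matches
```

If the layout assumption holds, the same function could be placed at the end of the pipeline shown earlier, after `nomer append gbif-taxon`, to write a two-column name-to-identifier table rather than sending the output to `/dev/null`.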

jhpoelen commented 3 years ago

Note that an initial implementation of the offline-enabled GBIF backbone taxonomy matcher has been included in:

https://github.com/globalbioticinteractions/nomer/releases/0.2.0

and an associated data publication was created to help facilitate the speedy construction of local indexes:

Poelen, Jorrit H. (2021). A Repackaged Taxonomic Backbone of Global Biodiversity Information Facility (GBIF) (0.2) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5222044

jhpoelen commented 3 years ago

Referenced in https://discourse.gbif.org/t/looking-for-offline-enabled-name-id-lookup-in-gbif-taxonomy-backbone-with-10k-matches-s/3019