Closed by jhpoelen 3 years ago
Initial testing shows that about 1M names can be matched in under a minute, once the local index has been built, using preliminary versions of nomer's offline GBIF taxonomic backbone matcher:
# get a list of ~ 1M names
$ curl "https://zenodo.org/record/5021869/files/names.tsv.gz" > names.tsv.gz
# do first match to trigger building of local index (takes a couple of minutes, depending on internet connection)
$ echo -e "\tHomo sapiens" | nomer append gbif-taxon
...
Homo sapiens SAME_AS GBIF:2436436 Homo sapiens species Animalia | Chordata | Mammalia | Primates | Hominidae | Homo | Homo sapiens GBIF:1 | GBIF:44 | GBIF:359 | GBIF:798 | GBIF:5483 | GBIF:2436435 | GBIF:2436436 kingdom | phylum | class | order | family | genus | species http://www.gbif.org/species/2436436
# now match ~ 1M names against locally indexed GBIF backbone (no internet connection needed)
$ time cat names.tsv.gz | gunzip | cut -f1,2 | pv -l | nomer append gbif-taxon > /dev/null
...
1.06M 0:00:40 [26.0k/s] [ <=> ]
real 0m41.385s
user 1m31.704s
sys 0m2.647s
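As a rough sanity check on the figures above, the throughput can be recomputed from the reported line count and wall-clock time. This is only a sketch: the helper names `throughput` and `cpu_ratio` are made up for illustration, and 1.06M is the rounded count printed by `pv`, so the result is approximate.

```shell
# Figures copied from the run above; 1.06M is pv's rounded line count.
lines=1060000
real_s=41.385   # real 0m41.385s
user_s=91.704   # user 1m31.704s

# Hypothetical helper: names matched per wall-clock second.
throughput() {
  awk -v n="$1" -v t="$2" 'BEGIN { printf "%.0f\n", n / t }'
}

# Hypothetical helper: ratio of user CPU time to wall-clock time.
cpu_ratio() {
  awk -v u="$1" -v r="$2" 'BEGIN { printf "%.1f\n", u / r }'
}

throughput "$lines" "$real_s"   # prints 25613
cpu_ratio "$user_s" "$real_s"   # prints 2.2
```

The first number is close to the 26.0k/s that `pv` reported. A user-to-real ratio above 1 suggests that the pipeline as a whole kept more than one CPU core busy during the run.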
Note that an initial implementation of the offline-enabled GBIF backbone taxonomy matcher is included in:
https://github.com/globalbioticinteractions/nomer/releases/0.2.0
and an associated data publication was created to speed up the construction of local indexes:
Poelen, Jorrit H. (2021). A Repackaged Taxonomic Backbone of Global Biodiversity Information Facility (GBIF) (0.2) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5222044
as suggested by @zedomel
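The commands above feed nomer two tab-separated columns, with the name in the second column and the first column left empty in the `Homo sapiens` example. Assuming that layout is what the matcher expects, a plain list of names (one per line) could be converted with a small filter like the one below. The function name `names_to_tsv` is hypothetical and not part of nomer.

```shell
# Hypothetical helper (not part of nomer): turn a plain list of names,
# one per line, into the two-column tab-separated form used above
# (empty first column, name in the second column).
names_to_tsv() {
  # NF is 0 for blank lines, so those are skipped;
  # every other line is prefixed with a tab.
  awk 'NF { printf "\t%s\n", $0 }'
}

# Small demonstration with two names and a blank line in between.
printf 'Homo sapiens\n\nAriopsis felis\n' | names_to_tsv
```

The output could then be piped into the matcher in the same way as in the example above, for instance `names_to_tsv < my-names.txt | nomer append gbif-taxon`, where `my-names.txt` is a placeholder file name.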