ropensci / taxizedb

Tools for Working with Taxonomic SQL Databases

Using taxize and taxizedb #52

Closed brunobrr closed 3 years ago

brunobrr commented 3 years ago

Dear Scott,

I would like to know if it is possible to use the taxonomic name resolver functions of taxize (e.g., gnr_resolve) with the taxonomic databases downloaded via the db_download_* functions from the taxizedb package.

Best regards, Bruno

sckott commented 3 years ago

You can't do exactly what the GNR web service does, as it has special permission to use some databases that are not open to the public. For the ones that are open and available in taxizedb, it is possible. However, I'm not exactly sure how it all works on the backend. The code the resolver points at, https://github.com/GlobalNamesArchitecture/gni, hasn't had activity since 2017, so I doubt that's the code behind GNR.

@dimus What are the pieces of code behind GNR? Just at a high level. What I'm trying to figure out is whether one could do something similar to what GNR is doing, but locally, by combining the distinct parts behind GNR with local taxonomic databases.

dimus commented 3 years ago

@brunobrr, @sckott ---

  1. GNR situation

It is complicated. The https://github.com/GlobalNamesArchitecture/gni code is dormant; however, it is still functional, and its database is regularly updated at https://resolver.globalnames.org. The code is not updated because it is too slow for our current tasks.

The 2nd iteration of the name-resolution service (https://github.com/GlobalNamesArchitecture/gnindex, the service is https://index.globalnames.org/) was written in Scala and still proved not fast enough: it was bulky, resource-hungry, slow in development, and hard to maintain. For this reason I started to move toward the Go language.

The 3rd iteration (https://github.com/gnames/gnames) is under development. It is the last of the big Go components, alongside name parsing (https://gitlab.com/gogna/gnparser) and name finding (https://github.com/gnames/gnfinder). When I'm done with name resolution in Go, the speed should be ~5000 names a second for the online service.

  2. Database availability. The GNR database is not closed anymore; the last holdout was IPNI, and they did open their name data for us. The site http://opendata.globalnames.org/dumps/ contains the 2018 version of the database (resolver-2018-06-11.sql.gz). On request I can update it to the current one.

  3. Adding new data. If you are missing an open name dataset, you can make a request at

https://github.com/GlobalNamesArchitecture/dwca_hunter/issues

I try to update the major resolver datasets twice a year, and the next time will be November-December 2020.

If you prefer to install the service locally, the 1st iteration is the best bet for now. I hope the 3rd one will be the easiest to install, as I already have requests to make it portable, but I don't expect it to be out until the 1st quarter of 2021. Data for all 3 iterations of the name resolvers is synchronized. (Data for the Scala version of the code, though, is about half a year behind, as it could not scale up (no pun intended) to the significant increase in names added in May-June 2020.)

sckott commented 3 years ago

Thanks @dimus ! Very helpful.

Some users could definitely run the Ruby service ("1st iteration") locally, but for those who don't have those skills, maybe we can approximate it?

Is the basic process like so:

dimus commented 3 years ago

As I have to work with a lot of OCR data, matching is quite complex!

The major ideas for matching are described here: https://github.com/gnames/gnmatcher

dimus commented 3 years ago

The gnmatcher data lives in the file system and is generated from a Postgres database. I guess I can make that data available at opendata.globalnames.org if people want to use the matching component separately. It does not tell you which dataset contains the data, but it gives you the matching canonical forms for a name-string.

Oh, and it is insanely fast: I got to 40k names a second, so now my problem is not matching speed but lookup speed for datasets.
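The matching idea described above (compare canonical forms rather than full name-strings, with a fuzzy fallback) can be sketched language-neutrally. This Python toy is only an illustration of the approach, not gnmatcher's actual code: the reference set and the crude `canonical()` heuristic are invented here, and real canonicalisation is what gnparser does.

```python
from difflib import get_close_matches

# Toy reference set of canonical forms. In gnmatcher these come from a
# Postgres dump of the resolver database; these entries are made up.
CANONICALS = {"Homo sapiens", "Pardosa moesta", "Quercus alba"}

def canonical(name_string):
    """Crude canonicalisation: keep the first two words (genus + epithet),
    dropping authorship like 'Linnaeus, 1758'."""
    return " ".join(name_string.split()[:2])

def match(name_string):
    """Try an exact match on the canonical form first, then fuzzy."""
    cand = canonical(name_string)
    if cand in CANONICALS:
        return cand, "exact"
    fuzzy = get_close_matches(cand, CANONICALS, n=1, cutoff=0.85)
    if fuzzy:
        return fuzzy[0], "fuzzy"
    return None

print(match("Homo sapiens Linnaeus, 1758"))  # exact canonical match
print(match("Pardosa moestaa"))              # OCR-style typo, fuzzy match
```

The real system additionally handles stemmed epithets, abbreviated genera, and much larger reference sets, which is where the lookup-speed concern mentioned above comes from.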

sckott commented 3 years ago

Thanks again for those details. Right, it's not surprising that matching is complex.

@brunobrr Your thoughts on the discussion? I don't know what your level of comfort is with these different technologies.

brunobrr commented 3 years ago

Thanks @sckott and @dimus for your answers! They clarified a lot. @sckott, I am not familiar with the programming languages you mentioned (Go, Scala, and Ruby), but I have been using R for a while. My colleagues and I are working on a workflow for cleaning the taxonomic, geographic, and temporal information of plant records using R. We think that taxizedb could fit that aim of the project well, because one can parse species names using downloaded databases.

I'll follow the developments that @dimus mentioned and hope we can use taxizedb to parse species names in the near future. Thank you both!

dimus commented 3 years ago

@brunobrr can you explain what problem you want to solve and what your specific needs are? It would help me understand better how/when/if I can help you.

Almost everything I make can be used as a command-line app, and it is quite possible these command-line apps are sufficient. A real-life example from Bob Mesibov (using gnparser in this case): https://www.datafix.com.au/BASHing/2019-01-20.html

sckott commented 3 years ago

@brunobrr I have wrapped Dmitry's gnparser tool in R at https://github.com/ropensci/rgnparser (it should be on CRAN soon), so you can do the name-parsing part from R while getting really fast speed from Go.

I could try to replicate the Global Names resolver tooling in R, but the part that would be tough to replicate is comparing names to databases.

brunobrr commented 3 years ago

@dimus @sckott thank you. We are trying to resolve species names compiled from several heterogeneous datasets using tools available in R. We are testing several approaches, and so far the most promising ones are the tools available in the packages taxize, taxizedb, taxadb, flora, and World Flora Online (Kindt 2020).

As we are handling thousands of species names from several datasets, we chose to adopt a three-step strategy to resolve species names (comments on this are welcome). The basic process goes like so:

1) Parse species names using one primary taxonomic authority;
2) Search for synonyms or accepted names of the names left unresolved in step 1, using another taxonomic authority;
3) Use the synonyms or accepted names retrieved in step 2 to parse species names against the primary taxonomic authority (step 1).

Accordingly, for now, we are using:

1) taxadb (of which @sckott is a coauthor) to match species names against one primary taxonomic authority (e.g. COL, GBIF, ITIS, etc.); however, taxadb does not perform fuzzy matching;
2) taxize to search for synonyms or accepted names of the unresolved names;
3) the highest-scoring synonyms or accepted names retrieved in step 2 to resolve names against the primary taxonomic authority.

My first question in this issue was whether it is possible to use taxizedb in combination with the GNR functionality implemented in the taxize package to resolve species names. In other words, resolving names using GNR and downloaded databases instead of querying databases through APIs.

We are working on this project and testing some packages and functions these days, so some points are not yet clear to me. Please let me know if you need further clarification.
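The three-step strategy above can be sketched language-neutrally like this. The lookup tables below are invented stand-ins for a primary authority (e.g. COL via taxadb) and a secondary synonym source (e.g. taxize); a real pipeline would call those packages instead of hard-coded dicts.

```python
# Invented stand-ins for the two authorities; a real pipeline would query
# taxadb (primary, e.g. COL) and taxize (synonyms) instead.
PRIMARY = {"Quercus alba", "Homo sapiens"}          # accepted names
SYNONYMS = {"Quercus albus": "Quercus alba"}        # secondary authority

def resolve(names):
    resolved, unresolved = {}, []
    # Step 1: match against the primary taxonomic authority.
    for n in names:
        if n in PRIMARY:
            resolved[n] = n
        else:
            unresolved.append(n)
    # Step 2: look up synonyms/accepted names for the leftovers.
    # Step 3: re-match the retrieved names against the primary authority.
    for n in unresolved:
        accepted = SYNONYMS.get(n)
        if accepted and accepted in PRIMARY:
            resolved[n] = accepted
    return resolved

print(resolve(["Quercus alba", "Quercus albus", "Fagus nope"]))
```

Names that fail all three steps (like the made-up "Fagus nope" here) simply drop out, which is where the missing fuzzy-matching step in taxadb would help.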

sckott commented 3 years ago

Thanks @brunobrr for the details. I assume you are doing all of this in R? Dmitry made a comment above about command-line tools, which I assume you would rather avoid if you're doing everything in R?

dimus commented 3 years ago

The folks at TaxonWorks use the pipes method in Ruby to achieve native speed with gnparser. All 3 command-line tools are 'pipeable'. Here is an example in Ruby of how to do it: https://gitlab.com/gogna/gnparser#pipes
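The pipe pattern is not Ruby-specific: the point is to feed many names to one child process over stdin instead of spawning a process per name. A minimal Python sketch of that pattern, using `tr` as a portable stand-in for the gnparser binary (with gnparser installed you would invoke it in its streaming mode instead; check its --help for the exact flags, which are not assumed here):

```python
import subprocess

# One child process handles the whole batch over a pipe, instead of one
# process launch per name -- the trick that gets near-native speed.
# 'tr a-z A-Z' is only a stand-in for the real gnparser invocation.
names = ["Homo sapiens Linnaeus, 1758", "Pardosa moesta Banks, 1892"]
proc = subprocess.run(
    ["tr", "a-z", "A-Z"],
    input="\n".join(names),
    capture_output=True,
    text=True,
    check=True,
)
parsed = proc.stdout.splitlines()
print(parsed)
```

For a long-running service you would keep the child alive with `subprocess.Popen` and write/read lines incrementally, which is what the Ruby pipes example linked above does.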

brunobrr commented 3 years ago

@sckott @dimus I am really impressed with rgnparser!!! It is extremely fast and functional! I have been using rgnparser to parse names and then look for accepted names using the taxadb and WorldFlora packages. Thanks a lot!

dimus commented 3 years ago

@brunobrr, I am happy gnparser is of help to you!

sckott commented 3 years ago

great, glad it works!