sckott / pytaxize

python port of taxize (taxonomy toolbelt) for R
https://sckott.github.io/pytaxize/
MIT License

Allow names or only taxonomic ids as input to higher level methods? #58

Open sckott opened 4 years ago

sckott commented 4 years ago

An aspect that's different from R taxize is that I didn't want to bring the interactive part to this package. That is, the taxize get* fxns prompt the user to pick a taxon whenever a taxon name has more than one result against a given data source. BUT that's not reproducible and requires an interactive session. The various higher level functions in R taxize, like classification(), accept not just ids but also taxonomic names, because they pass names to the get* fxns, which resolve each name to a single taxonomic id before fetching the classification. However, here we don't have the prompt thing, so I think higher level methods like Classification/Children should only allow taxonomic ids as input. Thoughts @Daniel-Davies?

Daniel-Davies commented 4 years ago

In a previous project, when I had this issue, I decided to use a "consensus" protocol on the results of the API. That is, from the list of results returned by the API, taking the most commonly occurring value is usually enough to satisfy the query. Take a classification example: trying GNR with "panthera tigris" returns 11 separate results; for "species", all of them agree on "panthera tigris". For genus, perhaps 6 results have "panthera" while 1 has "puma", so we take "panthera". Repeating this for each key gives an approximation of the classification of the entered name across the multiple sources, and it turns out to be reasonably robust.
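A minimal sketch of that "most-common-value-wins" idea, assuming each API result has already been flattened into a dict mapping rank to name. The function name `consensus_classification` and the sample records are hypothetical, made up for illustration; they are not pytaxize API and not real GNR output.

```python
from collections import Counter


def consensus_classification(records):
    """Return the most common value for each rank across all records.

    `records` is assumed to be a list of dicts such as
    {"genus": "Panthera", "species": "Panthera tigris"} (hypothetical shape).
    """
    votes = {}
    for record in records:
        for rank, name in record.items():
            if not name:
                continue  # skip sources that left this rank empty
            # Normalise case and whitespace so "Panthera" and "panthera" agree.
            votes.setdefault(rank, Counter())[name.strip().lower()] += 1
    # Counter.most_common(1) returns [(value, count)]; ties resolve to the
    # value seen first, which is an arbitrary choice in this sketch.
    return {rank: counts.most_common(1)[0][0] for rank, counts in votes.items()}


# Illustrative data only: six sources say "Panthera", one says "Puma".
records = [{"genus": "Panthera", "species": "Panthera tigris"}] * 6
records.append({"genus": "Puma", "species": "Panthera tigris"})

result = consensus_classification(records)
print(result)  # {'genus': 'panthera', 'species': 'panthera tigris'}
```

Ties are the obvious weak spot: with an even split the winner depends on source order, so a real implementation would probably want to report the disagreement rather than pick silently.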

I think the ID approach is good, since it gives the user the option of determinism, and it definitely needs to be a part of the package. However, if someone is willing to accept the risks, could they also try a "most-common-value-wins" approach? I don't have much training in taxonomy, so I don't know if this is valid...

sckott commented 4 years ago

That's a good idea for selecting names. We do that in the R get_ fxns: we look for an exact match, and if there is one, return that match. It could be more complicated than that, of course. So it sounds like, for the Ids class, we should avoid the interactive/prompt thing and make a best-effort attempt at returning a single id.
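A rough sketch of that non-interactive, exact-match selection, assuming candidates come back as dicts with `name` and `id` keys. The helper `pick_single_id`, the record shape, and the placeholder ids are all hypothetical; returning `None` when there are zero or several exact matches is an assumption of this sketch, not a decision made in this thread.

```python
def pick_single_id(query, candidates):
    """Return the id of the only exact name match, or None if ambiguous.

    `candidates` is assumed to be a list of dicts like
    {"name": "Panthera tigris", "id": "id-1"} (hypothetical shape).
    """
    wanted = query.strip().lower()
    exact = [c for c in candidates if c["name"].strip().lower() == wanted]
    if len(exact) == 1:
        return exact[0]["id"]
    # No prompt here: zero or multiple exact matches are left unresolved.
    return None


# Placeholder ids, for illustration only.
candidates = [
    {"name": "Panthera tigris", "id": "id-1"},
    {"name": "Panthera tigris altaica", "id": "id-2"},
]

print(pick_single_id("panthera tigris", candidates))  # id-1
print(pick_single_id("panthera", candidates))         # None
```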

For the higher level methods (e.g., classification), it sounds like we go with ONLY allowing IDs as inputs, correct? So users have to get IDs first, either using the Ids class or some other method.
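One way the ids-only rule could be enforced at the top of a higher level method is sketched below. The helper `require_ids` is hypothetical, and it assumes ids are purely numeric, which may not hold for every data source, so treat the check itself as a placeholder.

```python
def require_ids(values):
    """Return the inputs as ints, rejecting anything that looks like a name.

    Assumption: a taxonomic id is an integer or a string of digits.
    """
    ids = []
    for value in values:
        text = str(value).strip()
        if not text.isdigit():
            raise ValueError(
                f"expected a taxonomic id, got {value!r}; "
                "resolve names to ids first"
            )
        ids.append(int(text))
    return ids


print(require_ids([123, "456"]))  # [123, 456]

try:
    require_ids(["Panthera tigris"])
except ValueError as err:
    print(err)
```

Failing early with a message that points users back to name resolution keeps the higher level methods deterministic without a prompt.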