How to get a taxonomy table from taxizedb? #64

In January I ran into a problem with the taxize API because of the number of bacterial taxa (10k+) for which I want to retrieve taxonomy (I posted about my problem here:

People advised me to use taxizedb, which works offline and should fix my problem. However, when I try a simple command such as:

test = classification(name2taxid(c(taxa$specie_ID)))

taxa is a data frame with a single column named specie_ID, as follows:

> head(taxa$specie_ID)
[1] "Staphylococcus sp." "Acinetobacter sp." "Cutibacterium sp." "Sphingomonas sp." "Paenarthrobacter sp."
[6] "Paracoccus sp."

However, I receive an error:

> test = classification(name2taxid(c(taxa$specie_ID)))
Error in name2taxid(c(taxa$specie_ID)) :
  Some of the input names are ambiguous, try setting out_type to 'summary'

When I set out_type to "summary", I get this:

> test = classification(name2taxid(c(taxa$specie_ID), out_type="summary"))
Error in dplyr::summarize():
ℹ In argument: taxids = paste(.data$tax_id, collapse = "|").
ℹ In group 1: name = "Morganella sp.".
Caused by error in .data$tax_id:
! Column tax_id not found.
Backtrace:
  1. taxizedb::classification(name2taxid(c(taxa$specie_ID), out_type = "summary"))
    1. rlang:::abort_data_pronoun(x, call = y)

Apparently Morganella sp. is not recognized by taxizedb. I'm not particularly familiar with dplyr or with taxize, so I would just like to know how I could retrieve the taxonomy for each of my bacterial species, preferably as a table with columns like this:

Specie_ID Kingdom Phylum Class Order Family Genus

Thanks @GossypiumH for raising this issue.

The issue is caused by taxa that are linked to multiple taxids:

taxizedb::name2taxid("morganella", out_type = "summary")
#> # A tibble: 3 × 2
#>   name       id    
#>   <chr>      <chr> 
#> 1 morganella 581   
#> 2 morganella 90690 
#> 3 morganella 108061

A very small change to your approach should solve your issue: Run classification() on the id column of the name2taxid() output, not the whole object (maybe this is what you wanted to do in the first place, so it's just a typo thing?):

test = classification(name2taxid(c("morganella", "escherichia"), out_type = "summary")$id)

However, taxa with multiple taxids will inflate the number of elements in your results, which can cause problems in your downstream analysis. Because of this I would probably run name2taxid(out_type = "summary") first, resolve the taxa with multiple taxids (investigate them manually, choose one and remove the rest from the tibble) and then run classification() on the data set with distinct taxa. I imagine there shouldn't be many taxa with multiple taxids.
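A minimal sketch of that clean-up step in base R, assuming the summary output has the name and id columns shown above. The data frame is a hand-written stand-in for the name2taxid() result, so it runs without a local database. The 561 row for escherichia is my own addition, and the chosen vector is a hypothetical record of manual decisions, not a statement about which Morganella taxid is the right one for your data.

```r
# Stand-in for taxizedb::name2taxid(x, out_type = "summary").
# The morganella rows are copied from the output above; the escherichia row is added for illustration.
ids <- data.frame(
  name = c("morganella", "morganella", "morganella", "escherichia"),
  id   = c("581", "90690", "108061", "561"),
  stringsAsFactors = FALSE
)

# Names that map to more than one taxid
counts    <- table(ids$name)
ambiguous <- names(counts[counts > 1])

# Hypothetical manual decisions: exactly one taxid per ambiguous name
chosen <- c(morganella = "581")

# Keep unambiguous rows, plus the single chosen row for each ambiguous name
keep <- !(ids$name %in% ambiguous) |
  paste(ids$name, ids$id) %in% paste(names(chosen), chosen)
resolved <- ids[keep, ]

# Sanity check: every name now appears once
stopifnot(!anyDuplicated(resolved$name))
```

resolved$id can then be passed to classification() in the same way as in the one-liner above. Any ambiguous name that is missing from chosen is dropped entirely, so it is worth comparing ambiguous against names(chosen) before moving on.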

Do you think this approach could be feasible?
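For the table layout asked for in the original post, here is a rough sketch of flattening a classification() result into one row per taxid. It assumes the result is a named list with one data frame per taxid (columns name, rank and id) and that ids without a match come back as NA; please check both assumptions against your installed version. The list below is a hand-written mock, the "0" entry is a hypothetical unmatched id, and the rank labels (for example superkingdom rather than kingdom) depend on the NCBI release in your local database.

```r
# Mock of a classification() result; hand-written for illustration only
cls <- list(
  "561" = data.frame(
    name = c("Bacteria", "Pseudomonadota", "Gammaproteobacteria",
             "Enterobacterales", "Enterobacteriaceae", "Escherichia"),
    rank = c("superkingdom", "phylum", "class", "order", "family", "genus"),
    id   = c("2", "1224", "1236", "91347", "543", "561"),
    stringsAsFactors = FALSE
  ),
  "0" = NA  # hypothetical id with no match
)

# Ranks to keep as columns; adjust to the labels your database uses
ranks <- c("superkingdom", "phylum", "class", "order", "family", "genus")

# Turn one list element into a named character vector, one value per rank
to_row <- function(x) {
  if (!is.data.frame(x)) {
    return(setNames(rep(NA_character_, length(ranks)), ranks))
  }
  setNames(x$name[match(ranks, x$rank)], ranks)
}

# Stack the rows and add the taxid as the first column
wide <- as.data.frame(do.call(rbind, lapply(cls, to_row)),
                      stringsAsFactors = FALSE)
wide <- cbind(taxid = names(cls), wide)
rownames(wide) <- NULL
```

Ranks that are absent from a lineage end up as NA rather than shifting the other columns, because match() looks each rank up by label instead of by position.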

Thank you for your reply! I will try your solution; I hope I will not have too many taxa with multiple taxids.
