ropensci / taxizedb

Tools for Working with Taxonomic SQL Databases

How to get a taxonomy table from taxizedb? #64

Open GossypiumH opened 1 year ago

GossypiumH commented 1 year ago

Hello,

In January I ran into a problem with the taxize API because of the number of bacterial taxa for which I want to retrieve taxonomy (10k+). I posted about it here: https://github.com/ropensci/taxize/issues/907

People advised me to use taxizedb, since it works offline and should fix my problem. However, when I try a simple command such as:

test = classification(name2taxid(c(taxa$specie_ID)))

taxa is a data frame with a single column named specie_ID, as follows:

> head(taxa$specie_ID)
[1] "Staphylococcus sp." "Acinetobacter sp." "Cutibacterium sp." "Sphingomonas sp." "Paenarthrobacter sp."
[6] "Paracoccus sp."

However, I receive an error:

> test = classification(name2taxid(c(taxa$specie_ID)))
Error in name2taxid(c(taxa$specie_ID)) :
  Some of the input names are ambiguous, try setting out_type to 'summary'

When I set out_type to "summary", I get this:

> test = classification(name2taxid(c(taxa$specie_ID), out_type="summary"))
Error in `dplyr::summarize()`:
ℹ In argument: `taxids = paste(.data$tax_id, collapse = "|")`.
ℹ In group 1: `name = "Morganella sp."`.
Caused by error in `.data$tax_id`:
! Column `tax_id` not found in `.data`.
Backtrace:
  1. taxizedb::classification(name2taxid(c(taxa$specie_ID), out_type = "summary"))
    1. rlang:::abort_data_pronoun(x, call = y)

Apparently Morganella sp. is not recognized by taxizedb. I'm not particularly familiar with dplyr or with taxize, so I would just like to know how I could retrieve the taxonomy for each of my bacterial species, preferably as a table with columns like these:

Specie_ID Kingdom Phylum Class Order Family Genus

stitam commented 1 year ago

Thanks @GossypiumH for raising this issue.

The issue is caused by taxons that can be linked with multiple taxids:

taxizedb::name2taxid("morganella", out_type = "summary")
#> # A tibble: 3 × 2
#>   name       id    
#>   <chr>      <chr> 
#> 1 morganella 581   
#> 2 morganella 90690 
#> 3 morganella 108061

Created on 2023-03-01 with reprex v2.0.2

A very small change to your approach should solve your issue: Run classification() on the id column of the name2taxid() output, not the whole object (maybe this is what you wanted to do in the first place, so it's just a typo thing?):

test = classification(name2taxid(c("morganella", "escherichia"), out_type = "summary")$id)

However, taxons with multiple taxids will inflate the number of elements in your results, which can cause problems in your downstream analysis. Because of this I would probably run name2taxid(out_type = "summary") first, resolve taxons with multiple taxids (investigate them manually, choose one and remove the rest from the tibble) and then run classification() on the data set with distinct taxons. I imagine there shouldn't be many taxons with multiple taxids.
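The "resolve, then classify" step could look roughly like the sketch below. It runs on a small mock of the two-column (name, id) summary shown above, so it does not need a local database. The `chosen` vector is a hypothetical stand-in for the manual decisions, and the taxid picked for "morganella" is only an illustration, not a recommendation.

```r
# Mock of the summary table; in real use this would be the result of
# taxizedb::name2taxid(names, out_type = "summary").
ids <- data.frame(
  name = c("morganella", "morganella", "morganella", "escherichia"),
  id   = c("581", "90690", "108061", "561"),
  stringsAsFactors = FALSE
)

# Names that are linked to more than one taxid
counts <- table(ids$name)
ambiguous <- names(counts[counts > 1])

# Manual decisions: one taxid per ambiguous name (illustrative choice)
chosen <- c(morganella = "581")

# Keep unambiguous rows, plus the chosen row for each ambiguous name
is_chosen <- paste(ids$name, ids$id) %in% paste(names(chosen), chosen)
resolved <- ids[!(ids$name %in% ambiguous) | is_chosen, ]

# Every name should now map to exactly one taxid
stopifnot(!anyDuplicated(resolved$name))

# With a local database available, the next step would be:
# test <- taxizedb::classification(resolved$id)
```

Inspecting `ambiguous` first shows how many names actually need a manual decision before anything is dropped.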

Do you think this approach could be feasible?
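For the table layout asked for in the original question, one option is to flatten the classification() result into one row per taxid. The sketch below assumes that the result is a list named by taxid, where each element is a data frame with name, rank and id columns; that shape is an assumption here and is worth checking with str() on real output. The list `cl` is a hand-written mock, `lineage_row()` is a hypothetical helper, and the rank labels (for example "superkingdom") may differ between database versions.

```r
# Mock of the assumed classification() output (illustrative values only)
ranks <- c("superkingdom", "phylum", "class", "order", "family", "genus")
cl <- list(
  "561" = data.frame(
    name = c("Bacteria", "Pseudomonadota", "Gammaproteobacteria",
             "Enterobacterales", "Enterobacteriaceae", "Escherichia"),
    rank = ranks,
    id   = c("2", "1224", "1236", "91347", "543", "561"),
    stringsAsFactors = FALSE
  ),
  "581" = data.frame(
    name = c("Bacteria", "Pseudomonadota", "Gammaproteobacteria",
             "Enterobacterales", "Morganellaceae", "Morganella"),
    rank = ranks,
    id   = c("2", "1224", "1236", "91347", "1903414", "581"),
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper: pull the name at each wanted rank, NA where missing
lineage_row <- function(df, ranks) {
  if (!is.data.frame(df)) {
    out <- rep(NA_character_, length(ranks))  # e.g. taxid not found
  } else {
    out <- df$name[match(ranks, df$rank)]
  }
  names(out) <- ranks
  out
}

# One row per taxid, one column per rank
rows <- lapply(cl, lineage_row, ranks = ranks)
tax_table <- data.frame(
  taxid = names(cl),
  do.call(rbind, rows),
  stringsAsFactors = FALSE,
  row.names = NULL
)
print(tax_table)
```

The taxid column can then be used to join the table back to the original names, for instance with merge() against the name/id summary.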

GossypiumH commented 1 year ago

Hello,

Thank you for your reply! I will try your solution. I hope I will not have too many taxa with multiple taxids.

Cheers,