ropensci / taxizedb

Tools for Working with Taxonomic SQL Databases

How to unify the list generated after classification? #68

Open lauraDRH opened 1 year ago

lauraDRH commented 1 year ago

Hi!

I am struggling to put together the output of classification().

I have a list of IDs (ids) for which I want to retrieve the different taxonomic levels, so I ran this code:

taxa <- classification(ids, rank = "genus", db = "ncbi")

This worked completely fine, but it returned a list of data frames, one data frame per ID.

I would like to combine all the data frames into a single table with these columns: ID, phylum, class, order, family and genus, but I do not know how to merge them.

Thanks for the suggestions!

stitam commented 1 year ago

Thanks @lauraDRH for opening this issue!

Currently this needs some downstream processing:

# some organisms we are interested in
ids <- c("Desulfovibrio desulfuricans", "Nitrosomonas halophila")

# query classifications
taxa <- taxizedb::classification(ids)

# convert each tibble to wide format so we have a column for each rank
taxa_wide <- lapply(taxa, function(x) {
  # wide format with taxon names
  tidyr::pivot_wider(x[,1:2], names_from = rank, values_from = name)
  # wide format with taxon ids
  # tidyr::pivot_wider(x[,2:3], names_from = rank, values_from = id)
})

# bind the list of tibbles into a single tibble
tbl <- dplyr::bind_rows(taxa_wide)

# add a column called ID
tbl$ID <- ids

# move the ID column to the first position
tbl <- dplyr::relocate(tbl, ID)

tbl

Is this the format you were looking for?

lauraDRH commented 1 year ago

Hi! Thank you so much for your reply! That is exactly the format I was looking for, but when I run it with my samples this error and warning appear:

Error in x[, 2:3] : incorrect number of dimensions
In addition: Warning messages:
1: Values from `id` are not uniquely identified; output will contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
  {data} %>%
  dplyr::group_by(rank) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
  dplyr::filter(n > 1L) 

Before running it I had already removed duplicates from my ids vector, which looks like this (I am only showing the first 3 rows, but it has 3398 IDs):

ids
   [1]  270636  232991 1207504    1747 1359168 1173032 1410040 2023234   34062 1546149 1028307  752179  163011  335992
  [15] 1685378  585054  142479   86183    1280   47770     944  244596     301     470   70775  147802  418708 1463597
  [29] 1458253 2056700    1270  288426 1545044 1282664 1168035 1697053   28090 1033739  376175  171674 1190813     287

So I really do not understand that error, because there are supposed to be no duplicates.

stitam commented 1 year ago

Oh I see why this happens, e.g. here:

taxizedb::classification(270636)
#> $`270636`
#>                                  name         rank      id
#> 1                  cellular organisms      no rank  131567
#> 2                            Bacteria superkingdom       2
#> 3                 Terrabacteria group        clade 1783272
#> 4 Cyanobacteria/Melainabacteria group        clade 1798711
#> 5                       Cyanobacteria       phylum    1117
#> 6                        Spirulinales        order 1890443
#> 7                       Spirulinaceae       family 1890448
#> 8                           Spirulina        genus    1154
#> 9                     Spirulina major      species  270636
#> 
#> attr(,"class")
#> [1] "classification"
#> attr(,"db")
#> [1] "ncbi"

Created on 2023-07-07 with reprex v2.0.2

The rank clade occurs twice in this classification, so when you convert it to wide format the function cannot tell which value to write into the new clade column: 1783272 or 1798711. A potential workaround is to first filter to the ranks you are interested in (and hope none of these will be duplicated for any of your queries):

# query classifications
taxa <- taxizedb::classification(ids)

# filter taxon ranks
taxa_filtered <- lapply(taxa, function(x) dplyr::filter(x, rank %in% c("phylum", "order", "family", "genus")))

And then convert to wide format. Fingers crossed, let me know if this solves the issue!
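
For reference, here is a minimal sketch of the two steps together (filter, then pivot). It runs on a hand-typed stand-in for the classification() output, so it needs no local database: the first entry copies the 270636 output above, while the second is abbreviated by hand and only illustrative. The ranks vector and the use of .id to carry the list names into an ID column are choices made for this sketch, not something the package prescribes.

```r
library(dplyr)
library(tidyr)

# Hand-typed stand-in for the list returned by taxizedb::classification().
# First entry: copied from the 270636 output above.
# Second entry: abbreviated by hand, illustrative only (not a query result).
taxa <- list(
  `270636` = data.frame(
    name = c("cellular organisms", "Bacteria", "Terrabacteria group",
             "Cyanobacteria/Melainabacteria group", "Cyanobacteria",
             "Spirulinales", "Spirulinaceae", "Spirulina", "Spirulina major"),
    rank = c("no rank", "superkingdom", "clade", "clade", "phylum",
             "order", "family", "genus", "species"),
    id = c("131567", "2", "1783272", "1798711", "1117",
           "1890443", "1890448", "1154", "270636")
  ),
  `1280` = data.frame(
    name = c("Bacteria", "Terrabacteria group", "Bacilli", "Staphylococcus"),
    rank = c("superkingdom", "clade", "class", "genus"),
    id = c("2", "1783272", "91061", "1279")
  )
)

# ranks to keep, in the column order we want at the end
ranks <- c("phylum", "class", "order", "family", "genus")

taxa_wide <- lapply(taxa, function(x) {
  # drop ranks we do not need (this also removes the duplicated clade rows)
  x <- dplyr::filter(x, rank %in% ranks)
  # one column per remaining rank, filled with the taxon names
  tidyr::pivot_wider(x[, c("name", "rank")], names_from = rank, values_from = name)
})

# .id copies the list names into an ID column, so IDs and rows cannot get misaligned
tbl <- dplyr::bind_rows(taxa_wide, .id = "ID")

# put the columns in a fixed order; ranks missing from every entry are skipped
tbl <- dplyr::select(tbl, ID, dplyr::any_of(ranks))
tbl
```

Ranks that are absent for a given ID simply come out as NA in that row.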

lauraDRH commented 1 year ago

omg thank you so much! I totally missed that clade duplicate. That worked perfectly fine!!

Thanks for your time and for developing this package, it works perfectly :)

stitam commented 1 year ago

Instead of filtering taxon ranks you can also define a function that will collapse conflicting entries into a single entry:

taxa_wide <- lapply(taxa, function(x) {
  tidyr::pivot_wider(
    x[,1:2],
    names_from = rank,
    values_from = name,
    values_fn = function(x) paste(x, collapse = "|"))
})

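For 270636 this will put Terrabacteria group|Cyanobacteria/Melainabacteria group into the new clade column. If you pivot the ids instead (x[,2:3] with values_from = id), the clade column will contain 1783272|1798711.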
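
To check what ends up in the collapsed column, here is a small sketch on a hand-typed subset of the 270636 rows shown earlier (not a live query). The helper name collapse is just a label for this example. With values_from = name the clade column holds the joined names, and with x[, 2:3] and values_from = id it holds the joined ids.

```r
library(tidyr)

# Hand-typed subset of the 270636 classification shown earlier in this thread
x <- data.frame(
  name = c("Bacteria", "Terrabacteria group",
           "Cyanobacteria/Melainabacteria group", "Cyanobacteria", "Spirulina"),
  rank = c("superkingdom", "clade", "clade", "phylum", "genus"),
  id   = c("2", "1783272", "1798711", "1117", "1154")
)

# join all values that share a rank into a single string
collapse <- function(v) paste(v, collapse = "|")

# wide format with taxon names
wide_names <- tidyr::pivot_wider(
  x[, 1:2], names_from = rank, values_from = name, values_fn = collapse
)

# wide format with taxon ids
wide_ids <- tidyr::pivot_wider(
  x[, 2:3], names_from = rank, values_from = id, values_fn = collapse
)

wide_names$clade
#> [1] "Terrabacteria group|Cyanobacteria/Melainabacteria group"
wide_ids$clade
#> [1] "1783272|1798711"
```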

stitam commented 1 year ago

I think a utility function which converts the list output to wide format like we discussed here might be useful so I'll keep this issue open for now.
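
A rough sketch of what such a helper could look like, written in base R so it adds no dependencies. The name classification_to_wide and its arguments are hypothetical, and nothing like it is part of taxizedb at this point. It collapses duplicated ranks with a separator, as in the values_fn approach. It also returns an all-NA row for list entries that are not data frames. That rests on the assumption that classification() returns NA for an ID it cannot find, which would be one plausible source of the "incorrect number of dimensions" error reported earlier in this thread.

```r
# Hypothetical helper, NOT part of taxizedb: convert the list returned by
# classification() into one wide data frame with a row per queried ID.
classification_to_wide <- function(taxa,
                                   ranks = c("phylum", "class", "order", "family", "genus"),
                                   value = "name",
                                   sep = "|") {
  rows <- lapply(taxa, function(x) {
    # start with NA for every requested rank
    out <- setNames(rep(NA_character_, length(ranks)), ranks)
    # entries that are not data frames (assumed: NA for IDs not found) stay all-NA
    if (is.data.frame(x)) {
      for (r in ranks) {
        hits <- as.character(x[[value]][which(x$rank == r)])
        # duplicated ranks (e.g. clade) are collapsed into one string
        if (length(hits) > 0) out[[r]] <- paste(hits, collapse = sep)
      }
    }
    data.frame(as.list(out), check.names = FALSE, stringsAsFactors = FALSE)
  })
  tbl <- do.call(rbind, unname(rows))
  # the list names are the queried IDs; assumes the input list is named
  data.frame(ID = names(taxa), tbl, check.names = FALSE, stringsAsFactors = FALSE)
}

# Hand-typed stand-in for classification() output: the first entry copies the
# 270636 output above, the second mimics an ID that was not found (made-up ID).
taxa <- list(
  `270636` = data.frame(
    name = c("cellular organisms", "Bacteria", "Terrabacteria group",
             "Cyanobacteria/Melainabacteria group", "Cyanobacteria",
             "Spirulinales", "Spirulinaceae", "Spirulina", "Spirulina major"),
    rank = c("no rank", "superkingdom", "clade", "clade", "phylum",
             "order", "family", "genus", "species"),
    id = c("131567", "2", "1783272", "1798711", "1117",
           "1890443", "1890448", "1154", "270636")
  ),
  `999999999` = NA
)

# names per rank, one row per ID
classification_to_wide(taxa)

# ids instead of names, including the duplicated clade rank
classification_to_wide(taxa, ranks = c("clade", "genus"), value = "id")
```

Returning an all-NA row for missing entries, instead of dropping them, keeps the result aligned with the input IDs.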

lauraDRH commented 1 year ago

> I think a utility function which converts the list output to wide format like we discussed here might be useful so I'll keep this issue open for now.

Yes, I think it would be super useful, as probably a lot of users will need it in the end.

MKeao commented 3 months ago

Hi! Just wanted to say thanks, this fixed my issue today. I used the non-filtering option and it worked well.