waldronlab / curatedMetagenomicData

Curated Metagenomic Data of the Human Microbiome
https://waldronlab.io/curatedMetagenomicData
Artistic License 2.0
127 stars 28 forks source link

improved handling of mergeData when different data types are present #295

Closed antagomir closed 1 year ago

antagomir commented 1 year ago

Describe the bug

The mergeData function returns an error for all data sets that I tried to combine (LeeKA_2022, KaurK_2020, BritoIL_2016, AsnicarF_20). Not sure if this is a bug or a feature but I tried to follow the manpage. If the behavior is intended, the manpage could be potentially clarified.

To Reproduce

Let us load the library library(curatedMetagenomicData)

Let us load a data set x <- curatedMetagenomicData("LeeKA_2022*", dryrun = FALSE)

Now the first 5 elements in the list x are SE and the 6th element is a TreeSE object.

Then try to merge the data: (x |> mergeData())

This throws an error:

"Error: dataType of list elements is different"

Same happens also when we do this only with SE objects (same type):

Then try to merge the data: (x[1:2] |> mergeData())

This throws an error:

"Error: dataType of list elements is different"

Expected behavior

I expected that the mergeRows would have merged the list elements. The code was copy-pasted from the package homepage.

lwaldron commented 1 year ago

We could probably have a more informative error, but mergeData() wasn't intended for merging different data types (ie relative abundance with gene markers, pathway abundance, etc). What you want to do is actually a fair bit simpler than what mergeData() does since these objects have no variables in common. I'm not sure what your full use case is but for this example you could just use rbind() something like:

purrr::map(x, function(se) {names(assays(se)) <- "data"; return(se)}) |>
  purrr::reduce(rbind)

(the first line being necessary because rbind quite reasonably errors instead of combining different assays with different names, so we have to trick it by giving assays the same name)

Or if you want to merge data types of multiple datasets, use mergeData() on single data types like demoed on the package homepage, then rbind to merge those. Unless there are reasonably common and compelling use cases (?) I think this will be a documentation enhancement rather than expanding the functionality of mergeData() to allow combining unrelated data types into the same assay of a SummarizedExperiment.

antagomir commented 1 year ago

Ah yes, of course :)

schifferl commented 1 year ago

From the documentation, this is perfectly clear to me:

To merge the list elements returned from curatedMetagenomicData into a single SummarizedExperiment or TreeSummarizedExperiment object, users will use mergeData() provided elements are the same dataType

And the error message you've referenced spells it out again:

Error: dataType of list elements is different

The documentation and behavior are as intended

Rather than make changes, we will stay focused on releasing version 4.0.0.