Closed zhuzn closed 3 years ago
Hi @zhuzn, I've just investigated this myself and have confirmed this is a bug; thank you for reporting it. I'll work on a patch and send you an update when it's done.
Hi @zhuzn and @schifferl, I found the bug and may have a fix. The error is in the mergeData function, which is used inside returnSamples to merge matrices from multiple studies together. The specific line of code that gives rise to the error is
assays <- purrr::map(mergeList, SummarizedExperiment::assay) %>%
    purrr::map(base::as.data.frame) %>%
    purrr::map(tibble::rownames_to_column) %>%
    purrr::reduce(dplyr::full_join, by = "rowname") %>%
    tibble::column_to_rownames() %>%
    dplyr::mutate(dplyr::across(.fns = ~tidyr::replace_na(.x, 0))) %>%
    base::as.matrix() %>%
    S4Vectors::SimpleList() %>%
    magrittr::set_names(assay_name)
The specific part that throws the error is purrr::map(base::as.data.frame)
because, as the error message says, you cannot convert a dgTMatrix to a data.frame directly. You have to first convert it to a matrix, and then convert the matrix to a data frame. However, if you want to then merge multiple dataframes together, this becomes computationally very difficult. So I found a function native to the R package Seurat that directly merges sparse matrices together. The edited mergeData function is below.
mergeData = function (mergeList) {
    # merging needs at least two elements
    if (base::length(mergeList) == 1) {
        stop("mergeList contains only a single element", call. = FALSE)
    }

    # all elements must share a single assay name (dataType)
    assay_name <- purrr::map_chr(mergeList, SummarizedExperiment::assayNames) %>%
        base::unique()

    if (base::length(assay_name) != 1) {
        stop("dataType of list elements is different", call. = FALSE)
    }

    # sample ids (colnames) must not be shared between elements
    duplicate_colnames <- purrr::map(mergeList, base::colnames) %>%
        purrr::reduce(base::intersect)

    if (base::length(duplicate_colnames) != 0) {
        stop("colnames/sample_id values are not unique", call. = FALSE)
    }

    # merge the sparse assay matrices directly (RowMergeSparseMatrices is the Seurat function)
    asssays_list = purrr::map(mergeList, SummarizedExperiment::assay)
    assays = RowMergeSparseMatrices(asssays_list[[1]], asssays_list[-1])

    rowData <- purrr::map(mergeList, SummarizedExperiment::rowData) %>%
        purrr::map(base::as.data.frame) %>%
        purrr::map(tibble::rownames_to_column)

    join_by <- purrr::map(rowData, base::colnames) %>%
        purrr::reduce(base::intersect)

    rowData <- purrr::reduce(rowData, dplyr::full_join, by = join_by) %>%
        tibble::column_to_rownames() %>%
        S4Vectors::DataFrame()

    colData <- purrr::map(mergeList, SummarizedExperiment::colData) %>%
        purrr::map(base::as.data.frame) %>%
        purrr::map(tibble::rownames_to_column) %>%
        dplyr::bind_rows() %>%
        tibble::column_to_rownames() %>%
        S4Vectors::DataFrame()

    if (assay_name == "relative_abundance") {
        TreeSummarizedExperiment::TreeSummarizedExperiment(assays = assays, rowData = rowData, colData = colData, rowTree = phylogeneticTree)
    } else {
        SummarizedExperiment::SummarizedExperiment(assays = assays, rowData = rowData, colData = colData)
    }
}
To summarize, I replaced
assays <- purrr::map(mergeList, SummarizedExperiment::assay) %>%
    purrr::map(base::as.data.frame) %>%
    purrr::map(tibble::rownames_to_column) %>%
    purrr::reduce(dplyr::full_join, by = "rowname") %>%
    tibble::column_to_rownames() %>%
    dplyr::mutate(dplyr::across(.fns = ~tidyr::replace_na(.x, 0))) %>%
    base::as.matrix() %>%
    S4Vectors::SimpleList() %>%
    magrittr::set_names(assay_name)
with
asssays_list = purrr::map(mergeList, SummarizedExperiment::assay)
assays = RowMergeSparseMatrices(asssays_list[[1]], asssays_list[-1])
I am not sure if this is ideal but I hope this helps! It seems to work for me.
Best, Sam
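For anyone following along, the coercion problem described above can be reproduced on a toy matrix. This is only a minimal sketch: the matrix, its dimnames, and the variable names are made up for illustration, and it assumes the Matrix package is installed. It captures the outcome of the direct coercion rather than asserting a particular message, then shows the two-step route (sparse matrix to base matrix to data frame).

```r
library(Matrix)

# Toy triplet-form sparse matrix (class dgTMatrix), standing in for one
# study's assay. Row and column names are hypothetical.
m <- as(
    sparseMatrix(
        i = c(1, 2), j = c(1, 2), x = c(1.5, 2),
        dims = c(2, 2),
        dimnames = list(c("geneA", "geneB"), c("s1", "s2"))
    ),
    "TsparseMatrix"
)

# Direct coercion is the step that failed in this thread; capture the
# outcome (error message or result) instead of stopping the script.
direct <- tryCatch(
    base::as.data.frame(m),
    error = function(e) base::conditionMessage(e)
)

# Coerce to a dense base matrix first, then to a data frame.
df <- base::as.data.frame(base::as.matrix(m))
```

Note that as.matrix() realizes every zero in memory, so this route is only practical when the dense matrix fits in RAM.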
Hi @zhuzn, this bug has been patched in release and devel. The changes may take a day or so to show up in the Bioconductor builds. It was simply a matter of coercing to matrix first, as @szimmerman92 suggested. Also, @szimmerman92...
"However, if you want to then merge multiple dataframes together, this becomes computationally very difficult. So I found a function native to the R package Seurat that directly merges sparse matrices together."
much of the time is actually spent realizing the sparse matrix objects in memory; the merging itself should be very efficient. We'll skip adding Seurat for now because it would be a big dependency to add for a single function.
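If someone does want to keep the matrices sparse without pulling in Seurat, a row-union merge can be sketched with the Matrix package alone. This is only an illustration under stated assumptions: merge_sparse_by_row is a hypothetical helper (it is not part of this package or of Seurat), and it assumes numeric sparse matrices that all have row names and non-overlapping column names.

```r
library(Matrix)

# Hypothetical helper: merge a list of numeric sparse matrices on the
# union of their row names, filling absent rows with (implicit) zeros.
merge_sparse_by_row <- function(mats) {
    all_rows <- base::Reduce(base::union, base::lapply(mats, base::rownames))

    expanded <- base::lapply(mats, function(m) {
        # triplet form exposes 0-based i/j indices and the values x
        m <- as(m, "TsparseMatrix")
        # position of each of this matrix's rows within the row union
        row_map <- base::match(base::rownames(m), all_rows)
        sparseMatrix(
            i = row_map[m@i + 1L],
            j = m@j + 1L,
            x = m@x,
            dims = c(base::length(all_rows), base::ncol(m)),
            dimnames = list(all_rows, base::colnames(m))
        )
    })

    # every matrix now has identical rows, so they can be bound by column
    base::do.call(cbind, expanded)
}

# Usage on two toy matrices that share only the row "g2"
a <- sparseMatrix(i = c(1, 2), j = c(1, 1), x = c(1, 2), dims = c(2, 1),
                  dimnames = list(c("g1", "g2"), "s1"))
b <- sparseMatrix(i = c(1, 2), j = c(1, 1), x = c(3, 4), dims = c(2, 1),
                  dimnames = list(c("g2", "g3"), "s2"))
merged <- merge_sparse_by_row(list(a, b))
```

The result stays sparse throughout, so memory scales with the number of non-zero entries rather than with rows times columns.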
Finally, just be aware, @zhuzn, that your code at the top of the thread will attempt to merge gene_families from 10 studies, and you'll need a huge amount of memory (each matrix is ~4M rows). We don't have a nice solution for that problem just yet, but perhaps in the future. Anyway, to verify the bug was resolved, I ran all of the following examples.
sampleMetadata |>
dplyr::filter(stringr::str_detect(study_name, "AsnicarF_2017|LassalleF_2017")) |>
dplyr::select(where(~ !base::all(base::is.na(.x)))) |>
returnSamples("gene_families")
sampleMetadata |>
dplyr::filter(stringr::str_detect(study_name, "AsnicarF_2017|LassalleF_2017")) |>
dplyr::select(where(~ !base::all(base::is.na(.x)))) |>
returnSamples("marker_abundance")
sampleMetadata |>
dplyr::filter(stringr::str_detect(study_name, "AsnicarF_2017|LassalleF_2017")) |>
dplyr::select(where(~ !base::all(base::is.na(.x)))) |>
returnSamples("marker_presence")
sampleMetadata |>
dplyr::filter(stringr::str_detect(study_name, "AsnicarF_2017|LassalleF_2017")) |>
dplyr::select(where(~ !base::all(base::is.na(.x)))) |>
returnSamples("pathway_abundance")
sampleMetadata |>
dplyr::filter(stringr::str_detect(study_name, "AsnicarF_2017|LassalleF_2017")) |>
dplyr::select(where(~ !base::all(base::is.na(.x)))) |>
returnSamples("pathway_coverage")
sampleMetadata |>
dplyr::filter(stringr::str_detect(study_name, "AsnicarF_2017|LassalleF_2017")) |>
dplyr::select(where(~ !base::all(base::is.na(.x)))) |>
returnSamples("relative_abundance")
Describe the bug
sampleMetadata %>% dplyr::filter(country == "CHN") %>% returnSamples("gene_families")
To Reproduce
cannot coerce class ‘structure("dgTMatrix", package = "Matrix")’ to a data.frame