waldronlab / curatedMetagenomicData

Curated Metagenomic Data of the Human Microbiome
https://waldronlab.io/curatedMetagenomicData
Artistic License 2.0

cannot coerce class ‘structure("dgTMatrix", package = "Matrix")’ to a data.frame #253

Closed zhuzn closed 3 years ago

zhuzn commented 3 years ago

Describe the bug

sampleMetadata %>% dplyr::filter(country == "CHN") %>% returnSamples("gene_families")

To Reproduce

cannot coerce class ‘structure("dgTMatrix", package = "Matrix")’ to a data.frame

schifferl commented 3 years ago

Hi @zhuzn, I've just investigated this myself and have confirmed this is a bug; thank you for reporting it. I'll work on a patch and send you an update when it's done.

szimmerman92 commented 3 years ago

Hi @zhuzn and @schifferl, I found the bug and may have a fix. The error is in the mergeData function, which is used inside returnSamples to merge matrices from multiple studies together. The specific line of code that gives rise to the error is:

assays <- purrr::map(mergeList, SummarizedExperiment::assay) %>%
    purrr::map(base::as.data.frame) %>%
    purrr::map(tibble::rownames_to_column) %>%
    purrr::reduce(dplyr::full_join, by = "rowname") %>%
    tibble::column_to_rownames() %>%
    dplyr::mutate(dplyr::across(.fns = ~tidyr::replace_na(.x, 0))) %>%
    base::as.matrix() %>%
    S4Vectors::SimpleList() %>%
    magrittr::set_names(assay_name)

The specific part that throws the error is purrr::map(base::as.data.frame) because, as the error message says, you cannot convert a dgTMatrix to a data.frame directly. You have to first convert it to a matrix, and then convert the matrix to a data frame. However, if you want to then merge multiple dataframes together, this becomes computationally very difficult. So I found a function native to the R package Seurat, that directly merges sparse matrices together. The edited mergeData function is below.

mergeData = function (mergeList) {
    # merging needs at least two elements
    if (base::length(mergeList) == 1) {
        stop("mergeList contains only a single element", call. = FALSE)
    }

    # all elements must share a single assay name (dataType)
    assay_name <- purrr::map_chr(mergeList, SummarizedExperiment::assayNames) %>%
        base::unique()

    if (base::length(assay_name) != 1) {
        stop("dataType of list elements is different", call. = FALSE)
    }

    # sample identifiers must not repeat across elements
    duplicate_colnames <- purrr::map(mergeList, base::colnames) %>%
        purrr::reduce(base::intersect)

    if (base::length(duplicate_colnames) != 0) {
        stop("colnames/sample_id values are not unique", call. = FALSE)
    }

    # merge the sparse assays directly instead of going through data frames;
    # RowMergeSparseMatrices is the Seurat function mentioned above
    asssays_list = purrr::map(mergeList, SummarizedExperiment::assay)
    assays = RowMergeSparseMatrices(asssays_list[[1]], asssays_list[-1])

    rowData <- purrr::map(mergeList, SummarizedExperiment::rowData) %>%
        purrr::map(base::as.data.frame) %>%
        purrr::map(tibble::rownames_to_column)

    join_by <- purrr::map(rowData, base::colnames) %>%
        purrr::reduce(base::intersect)

    rowData <- purrr::reduce(rowData, dplyr::full_join, by = join_by) %>%
        tibble::column_to_rownames() %>%
        S4Vectors::DataFrame()

    colData <- purrr::map(mergeList, SummarizedExperiment::colData) %>%
        purrr::map(base::as.data.frame) %>%
        purrr::map(tibble::rownames_to_column) %>%
        dplyr::bind_rows() %>%
        tibble::column_to_rownames() %>%
        S4Vectors::DataFrame()

    if (assay_name == "relative_abundance") {
        TreeSummarizedExperiment::TreeSummarizedExperiment(assays = assays, rowData = rowData, colData = colData, rowTree = phylogeneticTree)
    } else {
        SummarizedExperiment::SummarizedExperiment(assays = assays, rowData = rowData, colData = colData)
    }
}

To summarize, I replaced

assays <- purrr::map(mergeList, SummarizedExperiment::assay) %>%
    purrr::map(base::as.data.frame) %>%
    purrr::map(tibble::rownames_to_column) %>%
    purrr::reduce(dplyr::full_join, by = "rowname") %>%
    tibble::column_to_rownames() %>%
    dplyr::mutate(dplyr::across(.fns = ~tidyr::replace_na(.x, 0))) %>%
    base::as.matrix() %>%
    S4Vectors::SimpleList() %>%
    magrittr::set_names(assay_name)

with

asssays_list = purrr::map(mergeList, SummarizedExperiment::assay)
assays = RowMergeSparseMatrices(asssays_list[[1]], asssays_list[-1])

I am not sure if this is ideal but I hope this helps! It seems to work for me.

Best, Sam
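For readers who want to see the idea of a row-wise sparse merge without installing Seurat, the following is a minimal sketch that uses only the Matrix package. It is not Seurat's implementation and not part of curatedMetagenomicData; the helper name merge_sparse_rows and the toy feature and sample names are hypothetical, and it assumes every input has row and column names and that column names are unique across inputs.

```r
library(Matrix)

# Hypothetical helper (not from Seurat or curatedMetagenomicData):
# combine sparse matrices column-wise on the union of their row names,
# without ever creating a dense matrix.
merge_sparse_rows <- function(mats) {
  all_rows <- Reduce(union, lapply(mats, rownames))
  # column offset of each input in the merged matrix
  offsets <- cumsum(c(0L, vapply(mats, ncol, integer(1))))
  triplets <- lapply(seq_along(mats), function(k) {
    m <- as(mats[[k]], "TsparseMatrix")  # triplet form; @i and @j are 0-based
    data.frame(i = match(rownames(m)[m@i + 1L], all_rows),
               j = m@j + 1L + offsets[k],
               x = m@x)
  })
  t_all <- do.call(rbind, triplets)
  sparseMatrix(i = t_all$i, j = t_all$j, x = t_all$x,
               dims = c(length(all_rows), offsets[length(offsets)]),
               dimnames = list(all_rows, unlist(lapply(mats, colnames))))
}

# Toy assays in triplet form (dgTMatrix); all names are made up
a <- sparseMatrix(i = c(1, 2), j = c(1, 2), x = c(5, 3), dims = c(2, 2),
                  dimnames = list(c("geneA", "geneB"), c("s1", "s2")),
                  repr = "T")
b <- sparseMatrix(i = c(1, 2), j = c(1, 1), x = c(7, 2), dims = c(2, 1),
                  dimnames = list(c("geneB", "geneC"), "s3"),
                  repr = "T")

merged_sparse <- merge_sparse_rows(list(a, b))
print(merged_sparse)
```

Features missing from a study simply have no stored entries in that study's columns, so they read as zero without an explicit fill step.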

schifferl commented 3 years ago

Hi @zhuzn, this bug has been patched in release and devel. The changes may take a day or so to show up in the Bioconductor builds. It was simply a matter of coercing to matrix first as @szimmerman92 suggested. Also, @szimmerman92...

"However, if you want to then merge multiple dataframes together, this becomes computationally very difficult. So I found a function native to the R package Seurat, that directly merges sparse matrices together."

most of the time is actually spent realizing the sparse matrix objects in memory; the merge itself should be very efficient. We'll skip adding Seurat for now because it would be a big dependency to add for a single function.
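To illustrate the coerce-to-matrix-first approach described in this comment, here is a minimal sketch on toy data using only Matrix and base R. It is not the actual patch to mergeData; the helper to_df and all feature and sample names are assumptions made for the example, and base merge() stands in for the dplyr::full_join step.

```r
library(Matrix)

# Toy sparse assays in triplet form (dgTMatrix); names are made up
a <- sparseMatrix(i = c(1, 2), j = c(1, 2), x = c(5, 3), dims = c(2, 2),
                  dimnames = list(c("geneA", "geneB"), c("s1", "s2")),
                  repr = "T")
b <- sparseMatrix(i = c(1, 2), j = c(1, 1), x = c(7, 2), dims = c(2, 1),
                  dimnames = list(c("geneB", "geneC"), "s3"),
                  repr = "T")

# Hypothetical helper: sparse -> dense matrix -> data.frame,
# with the row names copied into a "rowname" column for joining
to_df <- function(m) {
  df <- as.data.frame(as.matrix(m))
  df$rowname <- rownames(df)
  df
}

# Full outer join on the feature name, then fill absent features with 0
merged <- Reduce(function(x, y) merge(x, y, by = "rowname", all = TRUE),
                 lapply(list(a, b), to_df))
merged[is.na(merged)] <- 0

# Back to a plain numeric matrix with features as row names
result <- as.matrix(merged[, -1, drop = FALSE])
rownames(result) <- merged$rowname
print(result)
```

The as.matrix() call is where the sparse object is realized as a dense matrix in memory, which is the costly step for large assays.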

Finally, @zhuzn, be aware that your code at the top of the thread will attempt to merge gene_families from 10 studies, and you'll need a huge amount of memory (each matrix has ~4M rows). We don't have a nice solution for that problem just yet, but perhaps in the future. Anyway, to verify the bug was resolved, I ran all of the following examples.

sampleMetadata |>
    dplyr::filter(stringr::str_detect(study_name, "AsnicarF_2017|LassalleF_2017")) |>
    dplyr::select(where(~ !base::all(base::is.na(.x)))) |>
    returnSamples("gene_families")

sampleMetadata |>
    dplyr::filter(stringr::str_detect(study_name, "AsnicarF_2017|LassalleF_2017")) |>
    dplyr::select(where(~ !base::all(base::is.na(.x)))) |>
    returnSamples("marker_abundance")

sampleMetadata |>
    dplyr::filter(stringr::str_detect(study_name, "AsnicarF_2017|LassalleF_2017")) |>
    dplyr::select(where(~ !base::all(base::is.na(.x)))) |>
    returnSamples("marker_presence")

sampleMetadata |>
    dplyr::filter(stringr::str_detect(study_name, "AsnicarF_2017|LassalleF_2017")) |>
    dplyr::select(where(~ !base::all(base::is.na(.x)))) |>
    returnSamples("pathway_abundance")

sampleMetadata |>
    dplyr::filter(stringr::str_detect(study_name, "AsnicarF_2017|LassalleF_2017")) |>
    dplyr::select(where(~ !base::all(base::is.na(.x)))) |>
    returnSamples("pathway_coverage")

sampleMetadata |>
    dplyr::filter(stringr::str_detect(study_name, "AsnicarF_2017|LassalleF_2017")) |>
    dplyr::select(where(~ !base::all(base::is.na(.x)))) |>
    returnSamples("relative_abundance")
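
To put the gene_families memory caveat in perspective, a back-of-the-envelope estimate of the dense size can be sketched as follows. It assumes the matrix is stored as dense doubles (8 bytes per value) once realized; the row count is the approximate figure quoted in this thread, while the sample count is a hypothetical number chosen only for illustration.

```r
# Rough size in GiB of a dense double matrix: rows * columns * 8 bytes
dense_gib <- function(n_rows, n_cols) {
  n_rows * n_cols * 8 / 1024^3
}

n_rows <- 4e6     # approximate row count quoted in the thread
n_samples <- 100  # hypothetical sample count, not taken from the data

estimate <- dense_gib(n_rows, n_samples)
print(round(estimate, 2))
```

The estimate grows linearly with the number of samples, and intermediate copies made while merging would add to it.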