waldronlab / curatedMetagenomicData

Curated Metagenomic Data of the Human Microbiome
https://waldronlab.io/curatedMetagenomicData
Artistic License 2.0

"problem too large" when downloading gene_families to CSV #276

Closed: durrantmm closed this issue 2 years ago

durrantmm commented 2 years ago

Describe the bug

I get an error when I try to download the HMP_2019_ibdmdb.gene_families data table:

snapshotDate(): 2021-10-19
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'rownames': Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 102
Calls: %>% ... as.matrix -> as -> asMethod -> .handleSimpleError -> h
Execution halted

To Reproduce

This is the code that I ran:

require(curatedMetagenomicData)
require(dplyr)
require(readr)

study <- "HMP_2019_ibdmdb"
trexp_genefam <- curatedMetagenomicData(paste0(study, '.gene_families'), dryrun=F, rownames="short")
genefam <- assays(trexp_genefam[[1]])[[1]]
genefam_out <- cbind(data.frame(genefam=rownames(as.matrix(genefam))), data.frame(as.matrix(genefam))) %>% group_by()

dir.create("OUTPUT/genefams/", recursive=T)
write_tsv(genefam_out, paste0('OUTPUT/genefams/', study, '.genefams.tsv'))

Expected behavior

I was expecting this to save a table of the gene_families abundances. This works for studies with fewer samples; I believe the data table runs into some memory limit when I convert it into a data frame.

Thank you.
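[Editor's note: the Cholmod "problem too large" error is thrown when coercing a sparse Matrix object to a dense matrix whose total element count exceeds R's 2^31 - 1 limit, independent of available RAM. A minimal pre-check can be sketched as follows; the dimensions below are illustrative, not the actual size of HMP_2019_ibdmdb.gene_families.]

```r
# Dense coercion fails when nrow * ncol exceeds .Machine$integer.max (2^31 - 1).
n_features <- 2.5e6  # gene families can number in the millions (illustrative)
n_samples  <- 1600   # illustrative sample count, not the real study size
n_features * n_samples > .Machine$integer.max
# TRUE here, so as.matrix() on a sparse Matrix of this shape would hit the
# Cholmod 'problem too large' error before any RAM is actually consumed
```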

schifferl commented 2 years ago

Hi @durrantmm, you have mostly answered your own question here. Your hardware is insufficient to process gene_families data from the HMP_2019_ibdmdb study; this is not a curatedMetagenomicData issue. Also, you can write the TSV in far fewer lines, as follows.

curatedMetagenomicData::curatedMetagenomicData("HMP_2019_ibdmdb.gene_families", dryrun = FALSE) |>
    purrr::map(~ SummarizedExperiment::assay(.x)) |>
    purrr::map(~ base::as.matrix(.x)) |>
    purrr::map(~ tibble::as_tibble(.x, rownames = "rowname")) |>
    purrr::imap(~ readr::write_tsv(.x, base::paste(.y, "tsv", sep = ".")))

There is also curatedMetagenomicDataTerminal, which can do this from a terminal:

curatedMetagenomicData "HMP_2019_ibdmdb.gene_families" > HMP_2019_ibdmdb.gene_families.tsv

Or you could use it from another language directly (e.g. Python) by just reading from STDOUT.

lwaldron commented 2 years ago

I tried this on a Linux machine with 1TB of RAM and the same error occurs (see below with free -g). It would be worth seeing if there are ways to reduce the RAM footprint, or at least document how much RAM is expected to be required for datasets of different sizes.

(base) levi@supermicro:~/Downloads$ curatedMetagenomicData "HMP_2019_ibdmdb.gene_families" > HMP_2019_ibdmdb.gene_families.tsv
Error in asMethod(object) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 102
Calls: <Anonymous> ... <Anonymous> -> <Anonymous> -> as.matrix.Matrix -> as -> asMethod
Execution halted
(base) levi@supermicro:~/Downloads$ free -g
              total        used        free      shared  buff/cache   available
Mem:           1006          24         419           0         563         976
Swap:             0           0           0
(base) levi@supermicro:~/Downloads$ 
durrantmm commented 2 years ago

I was using a machine with 32 cores and 240 GB of RAM. Looks like the RAM requirements are unreasonable.

lwaldron commented 2 years ago

This seems to be related to a limitation in converting a sparse matrix to a dense one, regardless of the actual amount of RAM available. This page suggests some workarounds.

A simpler solution would be to save sparse matrices directly in the sparse Matrix Market file format (https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/externalFormats.html) and avoid the conversion altogether. I think this issue should be re-opened, since it is currently impossible to run the terminal client on the larger sparse matrices with any amount of memory. If you are working in R, it would be best to stick with sparse Matrix tools, at least until this huge matrix with a lot of zeros has been filtered down.
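[Editor's note: "sticking with sparse Matrix tools" until the matrix is filtered down can be sketched like this; the toy dimensions and the 10% prevalence threshold are illustrative assumptions, not project recommendations.]

```r
library(Matrix)

# Toy sparse matrix standing in for a gene_families assay (illustrative only)
set.seed(1)
mat <- rsparsematrix(1000, 50, density = 0.02)
rownames(mat) <- paste0("feature_", seq_len(nrow(mat)))

# Prevalence filter using sparse-aware rowSums: keep features that are
# nonzero in at least 10% of samples (threshold chosen for illustration)
keep <- Matrix::rowSums(mat != 0) >= 0.1 * ncol(mat)
filtered <- mat[keep, , drop = FALSE]

# Only densify once nrow * ncol is comfortably below 2^31 - 1
dense <- as.matrix(filtered)
```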

schifferl commented 2 years ago

If writing to MatrixMarket format were the desired outcome, it could be done in three lines:

curatedMetagenomicData::curatedMetagenomicData("WindTT_2020.gene_families", dryrun = FALSE) |>
    purrr::map(~ SummarizedExperiment::assay(.x)) |>
    purrr::imap(~ Matrix::writeMM(.x, base::paste(.y, "mtx", sep = ".")))

I can think of possible methods to get around the sparse/dense matrix issue, but neither curatedMetagenomicData nor curatedMetagenomicDataTerminal would be the place to implement such a solution. The terminal interface should be able to write the smaller gene_families tables to TSV without issue and, importantly, produces them in "wide" format for tasks like machine learning. Please close this when you are ready @lwaldron, and sorry I don't have a better solution for you @durrantmm.

schifferl commented 2 years ago

https://github.com/AllenInstitute/scrattch.io/blob/master/R/write_csv.R
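[Editor's note: the linked scrattch.io file appears to implement chunked writing of sparse matrices, which sidesteps the single giant densification by converting only a block of rows at a time, so no dense object ever approaches the 2^31 - 1 element limit. A rough sketch of that idea follows; the helper name and chunk size are made up here and are not scrattch.io's API.]

```r
library(Matrix)

# Write a sparse matrix to TSV one block of rows at a time, so peak memory
# is bounded by chunk_rows * ncol(mat) dense values rather than the whole matrix.
write_sparse_tsv_chunked <- function(mat, path, chunk_rows = 1000L) {
  con <- file(path, open = "w")
  on.exit(close(con))
  writeLines(paste(c("rowname", colnames(mat)), collapse = "\t"), con)
  for (start in seq(1L, nrow(mat), by = chunk_rows)) {
    end <- min(start + chunk_rows - 1L, nrow(mat))
    block <- as.matrix(mat[start:end, , drop = FALSE])  # small dense block
    writeLines(paste(rownames(mat)[start:end],
                     apply(block, 1, paste, collapse = "\t"),
                     sep = "\t"), con)
  }
}
```

Called on the assay() of a gene_families SummarizedExperiment, this keeps the dense footprint to chunk_rows * ncol doubles regardless of how large the full matrix is.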