Closed by @durrantmm 2 years ago.
Hi @durrantmm, you have mostly answered your own question here. Your hardware is insufficient to process gene_families data from the HMP_2019_ibdmdb study – this is not a curatedMetagenomicData issue. Also, you can accomplish writing to TSV in far fewer lines, as follows.
```r
curatedMetagenomicData::curatedMetagenomicData("HMP_2019_ibdmdb.gene_families", dryrun = FALSE) |>
    purrr::map(~ SummarizedExperiment::assay(.x)) |>
    purrr::map(~ base::as.matrix(.x)) |>
    purrr::map(~ tibble::as_tibble(.x, rownames = "rowname")) |>
    purrr::imap(~ readr::write_tsv(.x, base::paste(.y, "tsv", sep = ".")))
```
There is also curatedMetagenomicDataTerminal to do this from a terminal.

```sh
curatedMetagenomicData "HMP_2019_ibdmdb.gene_families" > HMP_2019_ibdmdb.gene_families.tsv
```

Or you could use it from another language directly (e.g. Python) by just reading from STDOUT.
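The STDOUT route can be sketched in a few lines of Python, streaming the terminal client's TSV output row by row rather than loading the whole table. This is a minimal sketch under stated assumptions: `summarize_tsv` is an illustrative helper, not part of curatedMetagenomicData, and the pipeline in the comment assumes the terminal client is on your PATH.

```python
import csv
import io
import sys

def summarize_tsv(stream):
    """Stream a wide TSV (features x samples) and return (n_rows, n_cols)
    without holding the whole table in memory."""
    reader = csv.reader(stream, delimiter="\t")
    header = next(reader)                 # first row: rowname + sample IDs
    n_rows = sum(1 for _ in reader)       # count remaining feature rows lazily
    return n_rows, len(header)

if __name__ == "__main__":
    # e.g. curatedMetagenomicData "HMP_2019_ibdmdb.gene_families" | python summarize.py
    rows, cols = summarize_tsv(sys.stdin)
    print(f"{rows} features x {cols} columns")
```

Because the rows are consumed lazily, memory use stays flat regardless of how many features the study has.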
I tried this on a Linux machine with 1 TB of RAM and the same error occurs (see the output of `free -g` below). It would be worth seeing if there are ways to reduce the RAM footprint, or at least documenting how much RAM is expected to be required for datasets of different sizes.
```
(base) levi@supermicro:~/Downloads$ curatedMetagenomicData "HMP_2019_ibdmdb.gene_families" > HMP_2019_ibdmdb.gene_families.tsv
Error in asMethod(object) :
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 102
Calls: <Anonymous> ... <Anonymous> -> <Anonymous> -> as.matrix.Matrix -> as -> asMethod
Execution halted
(base) levi@supermicro:~/Downloads$ free -g
              total        used        free      shared  buff/cache   available
Mem:           1006          24         419           0         563         976
Swap:             0           0           0
(base) levi@supermicro:~/Downloads$
```
I was using a machine with 32 cores and 240 GB of RAM. Looks like the RAM requirements are unreasonable.
This seems to be related to a limitation in converting a sparse to a dense matrix, regardless of the actual amount of RAM available. This page suggests some workarounds.

A simpler solution would be to save sparse matrices directly in the sparse Matrix Market file format (https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/externalFormats.html) and avoid conversion altogether. I think this issue should be re-opened, since it is currently impossible to run the terminal client on the larger sparse matrices with any amount of memory. If you are working in R, it would be best to stick with sparse Matrix tools, at least until this huge matrix with a lot of zeros is filtered down.
If writing to Matrix Market format was the desired outcome, it can be done in three lines.

```r
curatedMetagenomicData::curatedMetagenomicData("WindTT_2020.gene_families", dryrun = FALSE) |>
    purrr::map(~ SummarizedExperiment::assay(.x)) |>
    purrr::imap(~ Matrix::writeMM(.x, base::paste(.y, "mtx", sep = ".")))
```
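On the consuming side, those `.mtx` files can be read from another language while staying sparse. A hedged sketch with Python and SciPy (assumes `scipy` is installed; `drop_empty_rows` is an illustrative helper for the "filter the zeros down" step, not part of any package):

```python
from scipy.io import mmread
from scipy.sparse import csr_matrix

def drop_empty_rows(m):
    """Keep only rows with at least one nonzero entry, never densifying."""
    m = csr_matrix(m)
    return m[m.getnnz(axis=1) > 0]

# Usage (filename matches the writeMM call above):
# filtered = drop_empty_rows(mmread("WindTT_2020.gene_families.mtx"))
```

Because the matrix stays in CSR form throughout, memory scales with the number of nonzeros rather than the full dense dimensions that trip up CHOLMOD.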
I can think of possible methods to get around the sparse/dense matrix issue, but neither curatedMetagenomicData nor curatedMetagenomicDataTerminal would be the place to implement such a solution. The terminal interface should be able to write the smaller gene_families tables to TSV without issue and, importantly, produces them in "wide" format for tasks like machine learning. Please close this when you are ready @lwaldron, and sorry I don't have a better solution for you @durrantmm.
**Describe the bug**
I get an error when I try to download the `HMP_2019_ibdmdb.gene_families` data table.

**To Reproduce**
This is the code that I ran:

**Expected behavior**
I was expecting this to save a table of the gene_families abundances. This works for studies with fewer samples. I believe the data table runs into some memory limit when I convert it into a data frame.
Thank you.