nkoneill / DeTREM

MuSiC modified to use single-nuclei RNA-Seq reference
GNU General Public License v3.0

Memory requirements #2

Closed · gevro closed this issue 4 months ago

gevro commented 4 months ago

Hi, with 100k single cells + 36 bulk samples for 52k genes, I'm crashing due to memory limitations with 128 GB of RAM.

How much memory do you recommend / is necessary?

Thanks

nkoneill commented 4 months ago

Hi gevro, I have not hit a memory issue before, and that doesn't sound like an unreasonably large amount of data. Can you check whether the same analysis runs in the original MuSiC? (It should be very similar and easy to run.) If it works there but does not work here, I can help you troubleshoot. Best, Nick

gevro commented 4 months ago

It looks like MuSiC is having the same problem. Not sure what the issue is...

gevro commented 4 months ago

I have two more questions:

  1. Is DeTREM updated to use the latest version of MuSiC?
  2. How is DeTREM different from MuSiC2? Is the idea of what they are trying to do similar?

Thanks!

gevro commented 4 months ago

Also, I can get it to run with a 500-cell single-cell reference with 128 GB of RAM, but it crashes with a 5,000-cell reference. And I have a total of 120,000 cells, so I don't see how I can run DeTREM. There must be some memory issue somewhere?

gevro commented 4 months ago

Also, with MuSiC I am able to run with a reference of up to ~100,000 cells. So it seems that DeTREM has some additional memory issue that prevents more than ~500 cells from being used.

gevro commented 4 months ago

Note, it crashes even before outputting the message "Creating Relative Abundance Matrix...". So the memory overload is happening in the basic steps of setting up the data.

nkoneill commented 4 months ago

Hi gevro,

1: No, DeTREM was created from MuSiC version 0.2.0. Their update appears to have added functionality for SingleCellExperiment objects, which DeTREM lacks.

2: DeTREM is designed to adjust for strong reference/target differences, specifically snRNAseq vs bulk RNA-seq. MuSiC2 is meant to adjust for differences that affect a subset of samples, such as using a reference of healthy subjects to deconvolute a target dataset that includes disease states.

nkoneill commented 4 months ago

I followed your comments in the MuSiC GitHub repo. Is that now working fine with your large reference?

gevro commented 4 months ago

Yes, I can now get MuSiC v1 to work with a reference of up to 100,000 single cells with 128 GB of RAM.

But DeTREM crashes due to memory overload with more than ~500 single cells.

nkoneill commented 4 months ago

It looks like you were having memory issues with MuSiC earlier; may I ask what you did to solve them in the original program?

gevro commented 4 months ago

I solved the MuSiC issue by subsetting to 100,000 cells instead of the full 120,000.
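Something like this is what I mean by subsetting (a rough sketch; sc.eset and the "cellType" pData column are placeholder names for my reference object):

library(Biobase)

set.seed(1)
target.n <- 100000
ct <- as.character(pData(sc.eset)[["cellType"]])
## sample within each cell type so the cell-type proportions are preserved
keep <- unlist(lapply(split(seq_len(ncol(sc.eset)), ct), function(idx) {
  n <- max(1, round(length(idx) * target.n / ncol(sc.eset)))
  idx[sample.int(length(idx), n)]   # sample.int avoids the sample() scalar pitfall
}))
sc.eset.sub <- sc.eset[, keep]      # ExpressionSet subsetting keeps pData in sync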

Regardless, is it OK for me to run DeTREM with the latest version of MuSiC loaded, or does DeTREM contain its own version of MuSiC internally?

If there are any steps in the initial pre-processing that duplicate the data structures, I think that would overload the memory. The full 120,000-cell single-cell reference is ~50 GB in memory, and more than two copies of it would be expected to crash R.
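A quick way to sanity-check the sizes and the peak usage (a sketch; sc.eset is a placeholder for the reference object):

library(Biobase)

print(object.size(exprs(sc.eset)), units = "Gb")   # size of the expression matrix alone
gc(reset = TRUE)                                   # reset the "max used" counters
## ... run the step being tested here ...
gc()                                               # "max used" now shows the peak since the reset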

I have a guess about what is happening. In these two steps in music_basis, it seems like at least two additional copies of the single-cell reference are being made in memory?

if(non.zero){ ## eliminate non expressed genes
  nz.gene = rownames(x)[( rowSums(exprs(x)) != 0 )]
  x <- x[nz.gene, , drop = FALSE]
}

clusters <- as.character(pData(x)[, clusters])
samples <- as.character(pData(x)[, samples])

M.theta <- sapply(unique(clusters), function(ct){
  my.rowMeans(sapply(unique(samples), function(sid){
    y = exprs(x)[, clusters %in% ct & samples %in% sid, drop = FALSE]
    rowSums(y)/sum(y)
  }), na.rm = TRUE)
})

gevro commented 4 months ago

i.e., when 'x' is passed to music_basis as the single-cell reference, as soon as x is modified it becomes an additional in-memory copy on top of the original object I passed to the main DeTREM function.

And then M.theta will contain yet another copy of the data from x.

I could try removing non-expressed genes before calling DeTREM; maybe that would help, even though these lines will still duplicate the data.
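Something like this is what I have in mind for the pre-filtering (a sketch; sc.eset is a placeholder name for my reference ExpressionSet):

library(Biobase)

nz <- rowSums(exprs(sc.eset)) > 0
sc.eset <- sc.eset[nz, ]   # overwrite in place so only one filtered copy remains
gc()                       # encourage R to release the memory that was freed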

nkoneill commented 4 months ago

DeTREM contains its own version of MuSiC, so it's possible that having both loaded is causing this issue. I haven't run into it myself, though. Could you try running this without loading the MuSiC library, to test?
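A minimal way to test that in your current session (a sketch, assuming the package attaches as DeTREM):

"MuSiC" %in% loadedNamespaces()              # should be FALSE before running DeTREM
if ("package:MuSiC" %in% search()) {
  detach("package:MuSiC", unload = TRUE)     # or, safer, just restart with a fresh R session
}
library(DeTREM)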

gevro commented 4 months ago

OK, I will try running DeTREM without loading MuSiC.

Note, another possibility is this line in utils.R:

DeTREM:
  bulk.gene = rownames(bulk.eset)[rowMeans(exprs(bulk.eset)) != 0]

MuSiC:
  bulk.gene = bulk.gene = rownames(bulk.mtx)[rowMeans(bulk.mtx) != 0]

You can see that the latest version of MuSiC calculates rowMeans directly on the input bulk.mtx matrix, whereas DeTREM calls exprs() on the ExpressionSet input and then calculates rowMeans on the result. It could be that this temporarily doubles the memory required.
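For example, doing the same filtering outside DeTREM and touching the expression matrix only once (a sketch; bulk.eset is a placeholder for my bulk ExpressionSet, and whether this actually lowers the peak depends on R's copy-on-modify behaviour):

library(Biobase)

bulk.mtx <- exprs(bulk.eset)                              # pull the matrix out once
bulk.gene <- rownames(bulk.mtx)[rowMeans(bulk.mtx) != 0]  # same filter as in utils.R
bulk.eset <- bulk.eset[bulk.gene, ]
rm(bulk.mtx); gc()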

nkoneill commented 4 months ago

The current version of MuSiC appears to do the same things:

(At the start of music_basis)

if(non.zero){ ## eliminate non expressed genes
  x <- x[rowSums(counts(x)) > 0, ]
}

(Later on, but before "Creating Relative Abundance Matrix...")

M.theta <- sapply(unique(clusters), function(ct){
  my.rowMeans(sapply(unique(samples), function(sid){
    y = counts(x)[, clusters %in% ct & samples %in% sid]
    if(is.null(dim(y))){
      return(y/sum(y))
    }else{
      return(rowSums(y)/sum(y))
    }
  }), na.rm = TRUE)
})

I'm not sure what would be different in music_basis prior to that line.. hmm..

nkoneill commented 4 months ago

(quoting gevro's note above about the rowMeans(exprs(bulk.eset)) line in utils.R)

Ahh, interesting.

gevro commented 4 months ago

Sorry, one more question.

I'm using a healthy single-nucleus RNA-seq reference, and I want to deconvolute bulk RNA-seq from 3 different groups: healthy and two different disease states.

Is DeTREM actually the best approach here, or MuSiC2?

gevro commented 4 months ago

Looks like not loading the MuSiC library before loading DeTREM fixed the problem! At least now it is working with 5,000 cells. I will see how high I can go.

nkoneill commented 4 months ago

Oh, lovely! Dang, the update they posted three months ago may have brought that issue back up. Thanks for identifying it.

gevro commented 4 months ago

Yup! Also, do you have an answer to my question in the preceding post? Thanks!

nkoneill commented 4 months ago

(quoting gevro's question above about whether DeTREM or MuSiC2 is the better fit for a healthy snRNA-seq reference and healthy + two disease bulk groups)

Tough question. I'd recommend trying both. If the results from MuSiC2 are reasonable, I would use that. If its results have a lot of missing estimates (likely because of snRNA-seq issues), I would use DeTREM, but it sounds like that reference isn't perfectly appropriate and may lead to issues or inaccuracy when deconvolving the disease states. At the very least I would recommend comparing 'missingness' (the number of estimates equal to 0) between the healthy and disease groups. If those are very different and MuSiC2 didn't work out, you may have to find a new reference.
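As a rough sketch of that missingness check (est and group are placeholder names for the estimated proportion matrix, samples by cell types, and the corresponding sample group labels):

missingness <- sapply(split(seq_len(nrow(est)), group), function(idx) {
  colMeans(est[idx, , drop = FALSE] == 0)   # fraction of zero estimates per cell type
})
missingness   # rows = cell types, columns = groups; large gaps between columns are a red flag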

nkoneill commented 4 months ago

Best of luck!

gevro commented 4 months ago

OK, thanks! MuSiC2 has its own bugs, so I need to figure those out first :-)

gevro commented 4 months ago

Just thought of one more question: is it better to run DeTREM separately for each of the group types in my bulk data?

i.e. DeTREM for my healthy group, then separately for my disease group 1, then separately for my disease group 2. Or just DeTREM on all the groups together in one run?

nkoneill commented 4 months ago

I'd recommend running them all at once. The separate approach will lead to different gene weights in each run and may make the results hard to compare.

gevro commented 4 months ago

I'm still maxing out at ~50,000 cells, but my data is 120,000 cells. Is there any other trick for reducing memory requirements? Is there any way to run DeTREM with sparse matrices or BPCells objects?

Thanks!

nkoneill commented 4 months ago

Unfortunately I am not aware of any way to reduce the memory load here. The code explicitly refers to slots of an ExpressionSet (sc.eset@assayData$exprs), so unless you have a sparse-matrix object that behaves exactly like an ExpressionSet, I don't think it will work.
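If your data currently lives in a sparse matrix, it would have to be densified before it can be wrapped in an ExpressionSet, which is exactly where the memory goes. A rough sketch (sc.sparse and meta are placeholder names; meta is a per-cell data.frame whose rownames match the matrix's column names):

library(Biobase)
library(Matrix)

dense <- as.matrix(sc.sparse)            # this allocation is the full dense-matrix cost (~50 GB in your case)
sc.eset <- ExpressionSet(
  assayData = dense,
  phenoData = AnnotatedDataFrame(meta)   # per-cell cluster/sample annotations
)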

gevro commented 4 months ago

Thanks.