ggloor / ALDEx_bioc

ALDEx_bioc is the working directory for updating Bioconductor

ALDEx2 on large dataset gives memory not mapped error #55

Closed: Rridley7 closed this issue 1 year ago

Rridley7 commented 1 year ago

Hello, thanks for your work on this great tool! I am currently running into issues with a large, sparse dataset that I am trying to run through ALDEx2. I am running on a SLURM cluster with 745 GB of RAM available; my data currently has 545,000 features (genes) and 291 observations (samples). I have successfully run smaller datasets without issue, including the tutorial, so I do not think the installation is the problem. The command I am running is fairly simple apart from the size of the data:

dr_deg = aldex(r_df, condits_deg, mc.samples=40,denom='zero',verbose=T , useMC=T)

I get the output:

aldex.clr: generating Monte-Carlo instances and clr values
operating in serial mode
removed rows with sums equal to zero
computing zero removal
data format is OK
dirichlet samples complete
transformation complete
aldex.ttest: doing t-test
running tests for each MC instance:
|------------(25%)----------(50%)----------(75%)----------|
aldex.effect: calculating effect sizes
operating in serial mode
sanity check complete

 *** caught segfault ***
address 0x2aace27e8d80, cause 'memory not mapped'

Traceback:
 1: Rfast::rowMedians(cl2p)
 2: aldex.effect(x, include.sample.summary = include.sample.summary, verbose = verbose, paired.test = paired.test)
 3: aldex(r_df, condits_deg, mc.samples = 40, denom = "zero", verbose = T,     useMC = T)
An irrecoverable exception occurred. R is aborting now ...
/usr/local/pace-apps/lmod/lmod/init/sh: line 27: 258896 Segmentation fault      apptainer exec /usr/local/pace-apps/manual/packages/singularity_images/bioconductor_docker_3_15.sif Rscript "$@"

When I look at the maximum memory usage for the job, it peaks at ~160 GB, so I do not see available memory as the problem, unless it relates to the memory available to a single thread?

ggloor commented 1 year ago

Interesting. It is dying in the very first step of the aldex.effect function, which is a cbind() to build the data matrix, followed by rowMedians(). If you are using the most recent version, you should try useMC=F and see if that makes a difference, although it should not, since the output reports that it is operating in serial mode. I have run over 300K features on my 32 GB laptop without incident. I would need a minimal example to figure out what the issue is.

Rridley7 commented 1 year ago

That is indeed interesting. I have attached the input matrix (r_df) and treatment conditions (condits_deg) exactly as they were passed into the script: https://www.dropbox.com/t/zkfrXwH9FEMCBkns (I can also attach text files if that is easier.)

Rridley7 commented 1 year ago

Additionally, I did test useMC=F and received the same error.

Rridley7 commented 1 year ago

Update: I have found that it will run with an extremely low number of Monte Carlo samples (currently set to 2). Is there a way to properly average results across multiple runs at this value, or is it too low to be stable?

ggloor commented 1 year ago

Thanks for this piece of information. With a dataset of your size you can get reasonable estimates from small numbers of MC replicates. You could run the analysis 4 times with mc.samples=2 and average across the runs to simulate the result of 8 MC replicates. Given your large number of samples, this would be similar to drawing 128 MC replicates from a dataset with sample sizes in the 10-20 range.
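
A minimal sketch of that workaround (illustrative only; the object names other than r_df and condits_deg are made up, and it assumes the default t-test columns returned by aldex()):

library(ALDEx2)

n.runs <- 4   # 4 runs x mc.samples=2 is roughly equivalent to 8 MC replicates
res.list <- vector("list", n.runs)
for (i in 1:n.runs) {
  res.list[[i]] <- aldex(r_df, condits_deg, mc.samples=2, denom='zero', verbose=T)
}

# average the effect sizes and BH-adjusted p-values across the runs
effect.avg <- rowMeans(sapply(res.list, function(r) r$effect))
we.eBH.avg <- rowMeans(sapply(res.list, function(r) r$we.eBH))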

ggloor commented 1 year ago

Could you re-send the link? And yes, text files would be best. I can then figure out where the memory bottleneck is.

Rridley7 commented 1 year ago

Updated link: https://www.dropbox.com/t/BHnoGd3KeAox1Zqm
Text versions: https://www.dropbox.com/t/MjMHkDc8vZDWuSHq

When you average across the runs, how does the averaging still control the BH-adjusted p-values for multiple testing? Or is the averaging acceptable because of the Monte Carlo sampling?

ggloor commented 1 year ago

Yes, the average across a small number of MC replicates gives (within measurement error) the same result as a large number of MC replicates. I used the test data and ran 2 MC replicates 100 times, then made a second analysis using the standard number of replicates (128). I then compared the average FDR from the 100 runs of 2 replicates against the single 128-replicate run.

[figure: average FDR across 100 runs of 2 MC replicates plotted against a single run of 128 MC replicates]

Rridley7 commented 1 year ago

I see, that makes sense. One follow-up: is this equivalence expected to hold for data with reasonably high sample variance?

ggloor commented 1 year ago

This is not expected to change with any dataset type. The dataset I used has a mixture of very high variance and very low variance features, and it is actually the most difficult dataset to analyze that I've come across. Effectively, by setting mc.samples=2 I have emulated manually what ALDEx2 does in the background in the aldex.effect and aldex.ttest functions. My code is attached below; it is very hacky.

So you should be good to go

e.list <- list()
t.list <- list()

reps <- 100

# run the CLR transform, effect-size and t-test steps 100 times with 2 MC replicates each
for(i in 1:reps){
  x <- aldex.clr(selex, conds, mc.samples=2)
  e.list[[i]] <- aldex.effect(x)
  t.list[[i]] <- aldex.ttest(x)
}

# collect the per-run BH-adjusted p-values, effect sizes and dispersions
p.vals <- matrix(data=NA, ncol=reps, nrow=1600)
e.vals <- matrix(data=NA, ncol=reps, nrow=1600)
win <- matrix(data=NA, ncol=reps, nrow=1600)
btw <- matrix(data=NA, ncol=reps, nrow=1600)

for(i in 1:reps){
  p.vals[,i] <- t.list[[i]]$we.eBH
  e.vals[,i] <- e.list[[i]]$effect
  win[,i] <- e.list[[i]]$diff.win
  btw[,i] <- e.list[[i]]$diff.btw
}

# x.all holds the standard analysis with 128 MC replicates, for comparison
plot(x.all$diff.win, x.all$diff.btw)
points(rowMeans(win), rowMeans(btw), col='red', cex=0.6)

plot(x.all$effect, rowMeans(e.vals))
abline(0,1)

plot(x.all$we.eBH, rowMeans(p.vals), log='xy')
abline(0,1)

Rridley7 commented 1 year ago

Got it, thank you for this explanation!

ggloor commented 1 year ago

Finally figured it out. With your dataset, the aldex.clr() function hits R's vector length limit when a large number of MC instances is requested; the limit is 2^31 - 1 (roughly 2 billion) elements for code that does not support long vectors. There are work-arounds, but they are probably not a lot faster than the solution we came up with.
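
For a rough sense of scale, a back-of-the-envelope check (illustrative only; it assumes the combined CLR matrix built by aldex.clr() has one column per sample per MC instance, as the cbind() in the traceback suggests):

# 545,000 features x 291 samples x 40 MC instances
545000 * 291 * 40   # ~6.3e9 elements, well past the 2^31 - 1 (~2.1e9) limit

# the same data with mc.samples = 2
545000 * 291 * 2    # ~3.2e8 elements, comfortably under the limit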

Rridley7 commented 1 year ago

Got it, thank you!