Closed Rridley7 closed 1 year ago
Interesting. It is dying in the very first step of the aldex.effect function, which is the cbind() call that builds the data matrix, followed by rowMedians(). If you are using the most recent version, try useMC=F and see if that makes a difference, although it should not, since it reports that it is using serial mode. I have run over 300K features on my 32 GB laptop without incident. I would need a minimal example to figure out what the issue is.
That indeed is interesting. I have attached here the input matrix (r_df) and treatment conditions (condits_deg) just as they were input into the script: https://www.dropbox.com/t/zkfrXwH9FEMCBkns (I can also attach text files if that is easier)
Additionally, I did test useMC=F and received the same error.
Update: I have found that it will run with an extremely low number of Monte Carlo samples (currently set to 2). Is there a way to properly average results across multiple runs with this value, or is it too low to be stable?
Thanks for this piece of information. With a dataset of your size you can get reasonable estimates with a small number of MC replicates. You could run the analysis 4 times and average across the runs to simulate the results from 8 MC replicates. Given your large number of samples, this would be similar to drawing 128 MC replicates from sample sizes in the 10-20 range.
Could you re-send the link? And yes, text files would be best. I can then figure out where the memory bottleneck is.
Updated link: https://www.dropbox.com/t/BHnoGd3KeAox1Zqm Text versions: https://www.dropbox.com/t/MjMHkDc8vZDWuSHq When you average across the runs, does the averaging still control the multiple-testing-adjusted p-values? Or is the averaging okay because of the Monte Carlo sampling?
Yes, the average across a small number of MC replicates gives (within measurement error) the same result as a large number of MC replicates. I took the test data and ran 2 MC replicates 100 times, and made a second analysis where I used the standard number of replicates (128). Then I compared the average FDR between the 2x100 replicates and the single 128-replicate run.
I see, that makes sense. One follow-up: is this equivalence expected to change for data with reasonably high sample variance?
This is not expected to change with any dataset type. The dataset I used has a mixture of very high variance and very low variance parts; it is actually the most difficult dataset to analyze that I've come across. Effectively, by setting mc.samples=2 I am manually emulating what ALDEx2 does in the background in the aldex.effect and aldex.ttest functions. My code is attached; it is very hacky.
So you should be good to go
```r
# Run the 2-MC-replicate analysis 100 times, saving the effect and t-test tables
e.list <- list()
t.list <- list()
reps <- 100

for(i in 1:reps){
  x <- aldex.clr(selex, conds, mc.samples=2)
  e.list[[i]] <- aldex.effect(x)
  t.list[[i]] <- aldex.ttest(x)
}

# Collect the per-run outputs into matrices (the selex test data has 1600 features)
p.vals <- matrix(data=NA, ncol=reps, nrow=1600)
e.vals <- matrix(data=NA, ncol=reps, nrow=1600)
win    <- matrix(data=NA, ncol=reps, nrow=1600)
btw    <- matrix(data=NA, ncol=reps, nrow=1600)

for(i in 1:reps){
  p.vals[,i] <- t.list[[i]]$we.eBH
  e.vals[,i] <- e.list[[i]]$effect
  win[,i]    <- e.list[[i]]$diff.win
  btw[,i]    <- e.list[[i]]$diff.btw
}

# Reference analysis with the standard number of MC replicates
x.all <- aldex(selex, conds, mc.samples=128)

# Compare the averaged 2-replicate runs (red) against the 128-replicate run
plot(x.all$diff.win, x.all$diff.btw)
points(rowMeans(win), rowMeans(btw), col='red', cex=0.6)

plot(x.all$effect, rowMeans(e.vals))
abline(0, 1)

plot(x.all$we.eBH, rowMeans(p.vals), log='xy')
abline(0, 1)
```
Got it, thank you for this explanation!
Finally figured it out. The aldex.clr() function with your dataset is running up against R's vector length limit with a large number of MC instances; a standard R vector tops out at 2^31 - 1 (about 2.1 billion) elements. There are work-arounds, but they are probably not much faster than the solution we came up with.
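To make the limit concrete, here is the arithmetic for this dataset (a sketch: the exact internal layout aldex.clr() uses for the Monte Carlo instances is an assumption, but the element count is what matters):

```r
# Element count of the Monte Carlo instance matrix:
# features x (samples * mc.samples)
features   <- 545000
samples    <- 291
mc.samples <- 40

n_elements <- features * samples * mc.samples   # 6,343,800,000 elements
r_limit    <- 2^31 - 1                          # max length of a standard R vector

n_elements > r_limit   # TRUE: cannot be held in a standard R vector/matrix
```

At mc.samples=2 the count drops to about 317 million elements, which is comfortably under the limit, consistent with the observed behavior.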
Got it, thank you!
Hello, thanks for your work on this great tool! I am currently running into issues with a large, sparse dataset that I am looking to run in ALDEx2. I am running on a SLURM cluster with 745 GB of RAM available; my data currently has 545000 features (genes) and 291 observations. I have successfully run smaller datasets without issue, including the tutorial, so I think the installation is not the problem. The command I am running is fairly simple apart from the size of the data:
```r
dr_deg = aldex(r_df, condits_deg, mc.samples=40, denom='zero', verbose=T, useMC=T)
```
I get the output:
When I look at the maximum memory usage for the job, I see ~160 GB max memory used, so I do not think available memory is the issue, unless it relates to the memory available to a single thread?
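A back-of-the-envelope estimate supports the idea that total RAM is not the constraint here. Assuming the Monte Carlo instances are stored as double-precision values (8 bytes per element; actual usage inside aldex.clr() will be higher with intermediate copies), the raw storage is:

```r
# Rough memory footprint of the Monte Carlo instances for this dataset
features   <- 545000
samples    <- 291
mc.samples <- 40

bytes <- features * samples * mc.samples * 8
bytes / 2^30   # ~47 GiB, well under the 745 GB available on the node
```

So the job fits in memory several times over, which points away from RAM and toward some other limit.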