mtmorgan / DirichletMultinomial

https://mtmorgan.github.io/DirichletMultinomial/

Can dmn be run in parts? #7

Status: Open · marcinschmidt opened this issue 1 year ago

marcinschmidt commented 1 year ago

I've got quite a large dataset I want to analyse with dmn. Running it with fit <- mclapply(1:20, dmn, count=count, verbose=TRUE) on my desktop did not complete within 30 days (using all 4 cores); a power outage probably cancelled the calculation, as the system had been rebooted. I divided the dataset into parts and also ran it on a server. Some parts finished, but there is a 7-day limit and some needed more time. I would prefer to run the data as a full dataset.

Can I replace fit <- mclapply(1:20, dmn, count=count, verbose=TRUE) with

fit1 <- mclapply(1:7, dmn, count=count, verbose=TRUE)
fit2 <- mclapply(8:14, dmn, count=count, verbose=TRUE)
fit3 <- mclapply(15:20, dmn, count=count, verbose=TRUE)

How do I combine fit1 (1:7), fit2 (8:14), and fit3 (15:20) into fit (1:20)?

mtmorgan commented 1 year ago

mclapply just returns a list, so combining is just c(fit1, fit2, fit3). The vignette outlines additional steps to extract and work with individual components of the objects returned by dmn.
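For concreteness, a minimal sketch of the combining step; the laplace()-based choice of the best number of components follows the package vignette, and fit1/fit2/fit3 are the lists from the question above:

library(DirichletMultinomial)

fit <- c(fit1, fit2, fit3)        # a single list of 20 DMN fits, k = 1:20

## as in the vignette: pick the k that minimizes the Laplace approximation
lplc <- sapply(fit, laplace)
best <- fit[[which.min(lplc)]]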

I wonder how big your data is? I also wonder whether the long running time is due to the size of the data or to some other limitation, e.g., memory use.

Also, is there something you could do upstream to make the data smaller, e.g., some kind of dimensional reduction before doing the 'full' analysis? I have not worked in this space for a while, so I don't know whether that is a good idea or not.
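If you do try to shrink the data first, one common pre-filter (an illustration only, not something the package requires) is to drop rare taxa, i.e. columns observed in only a small fraction of samples, before fitting; the 10% threshold below is arbitrary:

library(parallel)
library(DirichletMultinomial)

## keep taxa (columns) present in at least 10% of samples -- threshold is a placeholder
keep <- colSums(count > 0) >= 0.10 * nrow(count)
count_small <- count[, keep]

fit <- mclapply(1:20, dmn, count = count_small, verbose = TRUE)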

marcinschmidt commented 1 year ago

Hi! I ran my data in chunks of dimensions [189, 8693], [191, 8693], and [197, 8693]. On the server I used lately, benchmarkme::get_ram() returns 201 GB and parallel::detectCores() returns 48; plot(benchmarkme::benchmark_std()) ranks it 192 out of 749, 419 out of 747, and 392 out of 747 machines (three benchmark plots omitted).

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Does that mean anything specific to you? I'm a biologist... and that is the most powerful machine I can use. Dimensional reduction might indeed be a solution; I will give it a try. When I submit my job to the queue (SLURM) I use:

#SBATCH --nodes=1                # node count
#SBATCH --ntasks-per-node=4
#SBATCH --mem=38gb
#SBATCH --time=6-23:59:00          # total run time limit (D-HH:MM:SS)

I might try increasing the number of nodes and the memory to 128 or even 256 GB, but the time limit is 7 days anyway. Let me know if you have any ideas. Best regards, Marcin
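One possible workaround for the 7-day wall-clock limit, following the splitting idea above: give each value of k its own SLURM array task (e.g. sbatch --array=1-20), save each fit to disk, and combine them afterwards with c(). A minimal sketch only; the file names and count.rds are placeholders, not part of the package:

## run_one_k.R -- executed once per SLURM array task
library(DirichletMultinomial)

k <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID"))
count <- readRDS("count.rds")                   # count matrix saved once beforehand
fit_k <- dmn(count, k, verbose = TRUE)
saveRDS(fit_k, sprintf("fit_k%02d.rds", k))

## afterwards, in a short follow-up job or interactive session:
## fit <- lapply(1:20, function(k) readRDS(sprintf("fit_k%02d.rds", k)))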