kaskr / adcomp

AD computation with Template Model Builder (TMB)

Crash when number of cores > 40 #303

Closed: tiboloic closed this issue 4 years ago

tiboloic commented 4 years ago

Hi,

I am running a big model that requires lots of memory on Google Cloud Platform. The model runs fine on 40-core nodes but crashes on 80-core nodes.

runExample('linreg_parallel')
Running example linreg_parallel
library(TMB)
dyn.load(dynlib("linreg_parallel"))

Simulate data

set.seed(123)
x <- seq(0, 10, length = 50001)
data <- list(Y = rnorm(length(x)) + x, x = x)
parameters <- list(a=0, b=0, logSigma=0)

Fit model

obj <- MakeADFun(data, parameters, DLL="linreg_parallel")
40 regions found.
Using 40 threads
opt <- nlminb(obj$par, obj$fn, obj$gr)
outer mgc: 1667230
...

On 80 cores, using exactly the same software:

runExample('linreg_parallel')
Running example linreg_parallel
library(TMB)
dyn.load(dynlib("linreg_parallel"))

Simulate data

set.seed(123)
x <- seq(0, 10, length = 50001)
data <- list(Y = rnorm(length(x)) + x, x = x)
parameters <- list(a=0, b=0, logSigma=0)

Fit model

obj <- MakeADFun(data, parameters, DLL="linreg_parallel")
80 regions found.
Using 80 threads
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted

Is there an easy way to limit the number of cores used when running a parallel template?

Unfortunately, the amount of memory needed to run my model is only available on nodes with 80+ cores.

Thanks for the great work

kaskr commented 4 years ago

If you need more than 48 threads, you must set e.g.:

#define CPPAD_MAX_NUM_THREADS 100

right before

#include <TMB.hpp>
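
A minimal sketch of where the define sits in a parallel template (modeled on the linreg_parallel example shipped with TMB; the model body is only for illustration):

#define CPPAD_MAX_NUM_THREADS 100  // must appear before the TMB header
#include <TMB.hpp>

template<class Type>
Type objective_function<Type>::operator() ()
{
  DATA_VECTOR(Y);
  DATA_VECTOR(x);
  PARAMETER(a);
  PARAMETER(b);
  PARAMETER(logSigma);
  // parallel_accumulator splits the negative log-likelihood sum across threads
  parallel_accumulator<Type> nll(this);
  for (int i = 0; i < x.size(); i++)
    nll -= dnorm(Y[i], a + b * x[i], exp(logSigma), true);
  return nll;
}

Remember to recompile the template after adding the define.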

If that doesn't solve the problem, you may try reducing memory by disabling parallel taping:

 TMB:::config(tape.parallel=FALSE)
kaskr commented 4 years ago

I didn't really answer the question:

Is there an easy way to limit the number of cores used when running a parallel template?

From R, you can set e.g.:

TMB:::openmp(40)

Or you can set the environment variable OMP_NUM_THREADS.

tiboloic commented 4 years ago

Brilliant! My model is running. Many thanks for your help.

May I ask some further questions on the performance/peak-memory trade-off?

  1. What is the likely effect on peak memory usage of using the option atomic=FALSE in MakeADFun()? I am asking because, for my model, most of the taping time and memory usage seems to be spent constructing the atomic D_lgamma (my likelihood is a multinomial sampling).

  2. Is peak memory usage when taping likely to increase linearly with the number of CPUs when using parallel taping? More generally, knowing the peak memory usage with only 1 CPU, is there a rule of thumb to estimate peak memory usage with 100 CPUs?

  3. Would a serialization strategy be viable, such as:

All the best

kklot commented 4 years ago
1. What is the likely effect on peak memory usage of using the option atomic=FALSE in MakeADFun()? I am asking because, for my model, most of the taping time and memory usage seems to be spent constructing the atomic D_lgamma (my likelihood is a multinomial sampling).

I have similar problems with a dataset of ~1 million observations, and constructing D_lgamma used up 40-50 GB. @tiboloic, have you found anything regarding this issue?

kaskr commented 4 years ago

The atomic argument to MakeADFun doesn't have any effect and will be deprecated in the future. @kklot Did you try disabling parallel taping as described above? If that sorts out the memory issues, you may want to consider using the PARALLEL_REGION macro instead of parallel_accumulator - see the example here. The downside is that you have to mark all accumulations and thread-local temporaries manually (in contrast to parallel_accumulator, which works automatically but is less efficient).
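
A rough sketch of the PARALLEL_REGION style, again using the simple linear regression model purely for illustration (the accumulation pattern is the part that carries over to other models):

#include <TMB.hpp>

template<class Type>
Type objective_function<Type>::operator() ()
{
  DATA_VECTOR(Y);
  DATA_VECTOR(x);
  PARAMETER(a);
  PARAMETER(b);
  PARAMETER(logSigma);
  Type nll = 0;  // plain accumulator instead of parallel_accumulator<Type>
  for (int i = 0; i < Y.size(); i++)
    // Only the statements belonging to the current thread's region are taped
    PARALLEL_REGION nll -= dnorm(Y[i], a + b * x[i], exp(logSigma), true);
  return nll;
}

Anything accumulated into nll must go through a marked statement, and temporaries computed inside a region should stay local to that region.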

kklot commented 4 years ago

Yes, disabling parallel taping has sorted out the memory issues. Thanks a lot for pointing me to PARALLEL_REGION; I did not know about it.