statistikat / simPop

Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information

Using argument nr_cpus for parallelization might lead to long run time #27

Open JohannesGuss opened 1 year ago

JohannesGuss commented 1 year ago

Running the following code leads to very long run times, or it does not terminate at all:

###############################
library(data.table)
devtools::install_github("statistikat/simPop")
library(simPop)
# examples for xgboost

## load the demo data set
data(eusilcS)

## create the structure
inp <- specifyInput(data = eusilcS,
          hhid = "db030", # variable with cluster information
          strata = "db040",
          weight = "db090" # variable with sampling weights
)
simPop <- simStructure(data=inp,
            method="direct",
            basicHHvars=c("age", "rb090"))
simPop

model_params <- list(max.depth = 10, eta = 0.5, nrounds = 5, objective = "multi:softprob")
simPop <- simCategorical(simPop,
             additional = c("pl030", "pb220a"),
             method = "xgboost",
             model_params = model_params)
simPop

Changing the last function call to disable parallelisation (nr_cpus = 1) returns results immediately:

simPop <- simCategorical(simPop,
             additional = c("pl030", "pb220a"),
             method = "xgboost",
             nr_cpus = 1,                                   # <- disable parallelisation
             model_params = model_params)
simPop

I propose trying other packages for parallelisation, such as parallelly (https://cran.r-project.org/web/packages/parallelly/) or future.apply (https://cran.r-project.org/web/packages/future.apply).
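
For illustration, here is a minimal sketch of the general pattern with future.apply. It is not simPop's implementation and I have not tested it inside simCategorical. The function fit_one and the list chunks are hypothetical placeholders for the per-stratum work, and the cap of two workers is an arbitrary choice for the example.

```r
library(future.apply)  # also attaches the future package, which provides plan()

# Hypothetical stand-in for one unit of work (e.g. fitting a model for one stratum).
fit_one <- function(x) sum(x^2)

# Hypothetical input: one chunk of data per stratum.
chunks <- list(a = 1:3, b = 4:6, c = 7:9)

# Use at most two workers, and never more than the cores reported by parallelly.
n_workers <- min(2L, as.integer(parallelly::availableCores()))

# Run the work in separate background R sessions.
plan(multisession, workers = n_workers)
res_par <- future_lapply(chunks, fit_one)

# Switch back to sequential evaluation, which also shuts the workers down.
plan(sequential)

# Sequential reference run to check that both paths agree.
res_seq <- lapply(chunks, fit_one)
stopifnot(identical(res_par, res_seq))
```

With this pattern the backend is selected by plan(), so the same calling code can fall back to plan(sequential), which would correspond to nr_cpus = 1 here.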