Open topepo opened 2 years ago
FYI I'll talk a little about this on the tidyverse blog in a day or so.
I have a couple of questions on the relationship between h2o.init(nthreads) and h2o.grid(parallelism) and could use help from @ledell.
My understanding was that parallelism was specific to parallelizing over grid searches that build many models at once.
On the other hand, nthreads controls the physical threads (not specifically designed for tuning?). It is used to speed up individual model operations with or without a grid search, such as searching for an optimal split.
So internally, is parallelism achieved by setting a higher nthreads? If users want to parallelize over both the parameter grid and individual internal model operations, should we set both options, or is parallelism alone enough?
My understanding was that parallelism was specific to parallelizing over grid searches that build many models at once.
My guess is that the parallelism argument tells the server how many different models to train at once. I think that threading is about making an individual model faster.
@topepo Your understanding is correct.
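A minimal sketch of how the two settings discussed above might be combined in the h2o R package. It assumes a local H2O cluster can be started; the iris data, the GBM algorithm, the grid values, and the grid_id are placeholders chosen only for illustration and are not from this thread. As far as I understand, models built concurrently draw on the same thread pool, so both options can be set together.

```r
library(h2o)

# nthreads: number of threads the H2O cluster may use; these speed up the
# work inside each individual model build (-1 means "use all available cores").
h2o.init(nthreads = -1)

# Placeholder data, for illustration only.
train <- as.h2o(iris)

# Hypothetical 2 x 2 grid, so four candidate models in total.
hyper_params <- list(ntrees = c(10, 20), max_depth = c(2, 3))

# parallelism: how many grid models the server builds at the same time.
# The default of 1 builds them one after another.
grid <- h2o.grid(
  algorithm = "gbm",
  grid_id = "gbm_parallel_demo",
  x = setdiff(names(iris), "Species"),
  y = "Species",
  training_frame = train,
  hyper_params = hyper_params,
  parallelism = 2
)

# Number of models the grid search produced.
n_models <- length(grid@model_ids)
```

Here nthreads is set once when the cluster starts, while parallelism is chosen per grid search, so the two can be tuned independently.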
@topepo I think the blog post seems fine; however, I want to run it by some other H2O folks to see if there are any potential issues. The doMC approach seems nice since it seems you don't need to make copies of the data. About 5 years ago someone asked me how to do this on StackOverflow with doParallel, and I gave the advice to create multiple H2O cores on each cluster, since I think the data had to be duplicated anyway and that seemed cleaner than training M models across N cores in parallel. Maybe I need to update my answer...
agua has a development vignette about parallel processing. We should add this to it. @qiushiyan
@ledell can you tell us what you think about it?
I might add one more configuration (using foreach and the default for h2o.grid()).