topepo / agua-h2o-benchmark

MIT License

Add this to a vignette #1

Open topepo opened 2 years ago

topepo commented 2 years ago

agua has a development vignette about parallel processing.

We should add this to it. @qiushiyan

@ledell can you tell us what you think about it?

I might add one more configuration (using foreach and the default for h2o.grid()).
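Roughly, that extra configuration would look something like the sketch below: foreach workers (via doMC) handle the outer loop, while h2o.grid() is left at its default parallelism. The data set, grid, and worker count here are only placeholders.

```r
library(h2o)
library(foreach)
library(doMC)

registerDoMC(cores = 4)       # placeholder worker count
h2o.init(nthreads = -1)       # one shared H2O cluster using all available threads

cars <- as.h2o(mtcars)
hyper_params <- list(ntrees = c(25, 50), max_depth = c(3, 5))

# Each forked worker submits one grid to the same H2O cluster; h2o.grid()
# itself trains the models within a grid sequentially (its default parallelism).
res <- foreach(i = 1:4) %dopar% {
  h2o.grid(
    algorithm = "gbm",
    grid_id = paste0("grid_", i),
    x = setdiff(names(cars), "mpg"),
    y = "mpg",
    training_frame = cars,
    hyper_params = hyper_params
  )
}
```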

topepo commented 2 years ago

FYI I'll talk a little about this on the tidyverse blog in a day or so.

qiushiyan commented 2 years ago

I have a couple of questions on the relationship between h2o.init(nthreads) and h2o.grid(parallelism) and could use help from @ledell.

My understanding was that parallelism is specifically for parallelizing a grid search, where many models are built at once.

On the other hand, nthreads controls the physical threads (not specifically designed for tuning?). It is used to speed up operations within an individual model, with or without a grid search, such as searching for an optimal split.

So internally, is parallelism achieved by setting a higher nthreads? If users want to parallelize over both the parameter grid and the internal model operations, should we set both options, or is parallelism alone enough?

topepo commented 2 years ago

> My understanding was that parallelism is specifically for parallelizing a grid search, where many models are built at once.

My guess is that the parallelism argument tells the server how many different models to train at once. I think that threading is about making an individual model faster.
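In code, the two knobs would be used roughly like this (the argument names are from the h2o package; the values are just illustrative):

```r
library(h2o)

# nthreads: how many CPU threads the H2O server can spend making a single
# model faster (e.g., evaluating candidate splits in parallel); -1 = all cores.
h2o.init(nthreads = -1)

cars <- as.h2o(mtcars)

# parallelism: how many models from the grid the server trains at once.
# The default (1) trains them sequentially; 0 lets H2O choose a level itself.
grid <- h2o.grid(
  algorithm = "gbm",
  x = setdiff(names(cars), "mpg"),
  y = "mpg",
  training_frame = cars,
  hyper_params = list(ntrees = c(25, 50), max_depth = c(3, 5)),
  parallelism = 2
)
```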

ledell commented 2 years ago

@topepo Your understanding is correct.

ledell commented 2 years ago

@topepo I think the blog post seems fine, but I want to run it by some other H2O folks to see if there are any potential issues. The doMC approach seems nice since it looks like you don't need to make copies of the data. About 5 years ago someone asked me on StackOverflow how to do this with doParallel, and I advised creating a separate H2O cluster for each worker, since I think the data had to be duplicated anyway and that seemed cleaner than training M models across N cores in parallel. Maybe I need to update my answer...
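A rough sketch of that older doParallel setup (this is a reconstruction, not the original StackOverflow answer; the ports, worker count, and model call are placeholders):

```r
library(h2o)
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

# Each worker starts its own H2O cluster on a distinct port, so the data is
# copied into every cluster rather than shared.
res <- foreach(port = c(54321, 54331), .packages = "h2o") %dopar% {
  h2o.init(nthreads = 2, port = port)
  dat <- as.h2o(mtcars)
  m <- h2o.gbm(x = setdiff(names(dat), "mpg"), y = "mpg", training_frame = dat)
  h2o.rmse(m)   # return a plain number to the main R session
}

stopCluster(cl)
```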