Closed joshyam-k closed 8 months ago
Hey @graysonwhite just opening an issue for us to have a space to document our process working through this problem.
Were you able to run future::plan("multicore")
in RStudio? When setting that option myself, I get the following warning:
Warning message: In supportsMulticoreAndRStudio(...) : [ONE-TIME WARNING] Forked processing ('multicore') is not supported when running R from RStudio because it is considered unstable. For more details, how to control forked processing or not, and how to silence this warning in future R sessions, see ?parallelly::supportsMulticore
Correct. RStudio considers this process unstable, so I ran it through the terminal.
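For reference, a minimal sketch of running the forked plan from a terminal R session (the worker count is illustrative, not what the package uses):

```r
# Run from a terminal R session, not RStudio: forked ('multicore')
# processing is disabled inside RStudio because it is considered unstable.
library(future)

# Check whether forked processing is supported on this platform/session
parallelly::supportsMulticore()

# Set up forked parallel workers (worker count chosen arbitrarily here)
plan(multicore, workers = 4)
```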
Closing this for now, as it should no longer be an issue since we don't pass the population data to boot_rep.
Grayson recently noticed that parallelization wasn't working as expected when trying to use
unit_zi
on a very large population data set (~18 million rows!). Namely, when doing multisession future resolution with
future::plan("multisession")
, the process would crash and report the following error:
Error: vector memory exhausted (limit reached?)
My rough understanding of how parallel processing works with the future package is as follows: a "future" is an abstraction for a value that may become available at some point in the future. In our setting, futures are pieces of R code to be evaluated, and more specifically the bootstrap repetitions used in our MSE estimation procedure. We speed up the bootstrap by having the
future
package create a future for each bootstrap rep and then resolve those futures in whatever manner we specify. We can point to exactly where this happens in our code: each
mapping of
boot_rep
will be its own future that will need to be resolved. In the case where we use the "multisession" option for future resolution, those futures are resolved in new R sessions launched by the future package. BUT, in order for those futures to be resolved, all of the necessary variables/data need to be copied into each new R session. In our case each
boot_rep
requires the object
boot_pop_data
, which contains as many rows as the population dataset. So, since the future package launches a separate R session for each future and has to copy that extremely large object into each one, we end up with the memory exhaustion error. I noticed that things run just fine when I use
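Roughly, the pattern in question looks like the following sketch (here boot_rep and boot_pop_data are small stand-ins for the package's actual function and population object, and the sizes are toy values):

```r
library(future)
library(furrr)

plan(multisession, workers = 4)  # each worker is a fresh R session

# stand-in for the real bootstrap population (~18 million rows in practice)
boot_pop_data <- data.frame(y = rnorm(1e4))

# stand-in for the real bootstrap-rep function
boot_rep <- function(i, boot_pop_data) {
  mean(sample(boot_pop_data$y, replace = TRUE))
}

# each call to boot_rep becomes its own future; the future framework
# detects boot_pop_data as a required global and exports (copies) it
# to every worker session
res <- future_map(1:100, boot_rep, boot_pop_data = boot_pop_data)
```

With the real 18-million-row object, that export step is what blows past the memory limit.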
future::plan("multicore")
and I think the reason lies in a point made in the future package documentation: forked ("multicore") processing has lower overhead because the forked child processes share the parent session's memory (copy-on-write) rather than each receiving a serialized copy. Somehow this lessened overhead when forking lets us sneak under the vector memory limit when doing multicore future resolution, which makes me wonder whether it isn't necessarily the size of
boot_pop_data
that's causing us problems, but rather the manner in which it's copied over. I'll spend some more time tinkering with this, but the reality is that we need that bootstrap population for each future to get resolved...
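One rough way to probe whether it's the copying rather than the size per se: measure the object and resolve the same future under both plans (a diagnostic sketch, not code from the package):

```r
library(future)

big <- data.frame(x = rnorm(1e6))
print(object.size(big), units = "MB")

# multisession: 'big' is serialized and copied into each worker session
plan(multisession, workers = 2)
f1 <- future(sum(big$x))
value(f1)

# multicore (unix, non-RStudio): workers are forked from the main
# process, so 'big' is shared copy-on-write with no upfront copy
plan(multicore, workers = 2)
f2 <- future(sum(big$x))
value(f2)
```

Relatedly, the future package caps the total size of exported globals via options(future.globals.maxSize = ...), which is worth checking whenever a large object must be shipped to multisession workers.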