harvard-ufds / saeczi

Small Area Estimation for Continuous Zero Inflated data
https://harvard-ufds.github.io/saeczi/

troubleshooting "exhausted memory" issue in futures #4

Closed: joshyam-k closed this issue 8 months ago

joshyam-k commented 12 months ago

Grayson recently noticed that parallelization wasn't working as expected when trying to use unit_zi on a very large population data set (~ 18 million rows!). Namely, when using multisession future resolution (future::plan("multisession")), the process would crash and report the following error: Error: vector memory exhausted (limit reached?).

My rough understanding of how parallel processing works with the future package is as follows: a "future" is an abstraction for a value that may become available at some point in the future. In our setting, futures are pieces of R code to be evaluated, more specifically the bootstrap repetitions used in our MSE estimation procedure. We speed up the bootstrap by having the future package create a future for each bootstrap rep and then resolve those futures in whatever manner we specify. We can point to exactly where this happens in our code:

x |> future_map(~ {
  p()                   # report progress for this rep
  boot_rep(             # run a single bootstrap repetition
    boot_pop_data,      # bootstrap population: the very large object
    samp_dat,
    domain_level,
    boot_lin_formula,
    boot_log_formula
  )
},
.options = furrr_options(seed = TRUE))

Each mapping of boot_rep will be its own future that needs to be resolved. When we use the "multisession" option for future resolution, those futures are resolved in new R sessions launched by the future package. BUT, in order for those futures to be resolved, all of the necessary variables/data need to be copied over into each new R session. In our case each boot_rep requires the object boot_pop_data, which contains as many rows as the population dataset. So, since the future package launches a separate R session for each future and has to copy that extremely large object over to each one, we end up with the memory exhaustion error.
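To make that copying concrete, here is a small standalone sketch (placeholder object names, not package code) of a multisession plan exporting a large global to every worker session:

library(future)
library(furrr)

# Placeholder for boot_pop_data: a large object living only in the main session
big_pop <- data.frame(x = rnorm(2e6))

plan(multisession, workers = 4)

res <- future_map(1:8, ~ {
  # big_pop is a global of this function, so future serializes a full copy of
  # it and ships it to each background R session before the future can run
  mean(big_pop$x) + .x
}, .options = furrr_options(seed = TRUE))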

I noticed that things run just fine when I use future::plan("multicore"), and I think the reason lies in this sentence from the future package documentation:

"Forking an R process can be faster than working with a separate R session running in the background. One reason is that the overhead of exporting large globals to the background session can be greater than when forking, and therefore shared memory, is used. "

Somehow this reduced overhead when forking lets us sneak under the vector memory limit under multicore future resolution, which makes me wonder whether it isn't the size of boot_pop_data itself that's causing problems so much as the manner in which it gets copied over. I'll spend some more time tinkering with this, but the reality is that each future needs that bootstrap population in order to get resolved...
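For comparison, switching the plan is the only code change needed to try forked workers (outside RStudio, on macOS/Linux); a minimal sketch:

library(future)

# Forked ("multicore") workers are copies of the parent R process and read
# large objects like boot_pop_data through copy-on-write shared memory, so
# nothing has to be serialized and re-allocated separately for each worker.
plan(multicore, workers = 4)

# ...then run the same future_map() bootstrap as before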

joshyam-k commented 12 months ago

Hey @graysonwhite, just opening an issue so we have a space to document our process working through this problem.

graysonwhite commented 12 months ago

Were you able to run future::plan("multicore") in RStudio? When setting that option for myself, I get the following warning:

Warning message: In supportsMulticoreAndRStudio(...) : [ONE-TIME WARNING] Forked processing ('multicore') is not supported when running R from RStudio because it is considered unstable. For more details, how to control forked processing or not, and how to silence this warning in future R sessions, see ?parallelly::supportsMulticore

joshyam-k commented 12 months ago

Correct. RStudio views this process as unstable, so I ran it through the terminal.
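For anyone else hitting this, the parallelly package ships a helper that reports whether forked processing is available in the current session; a quick check:

library(parallelly)

# Returns TRUE in a plain terminal R session on macOS/Linux, and FALSE inside
# RStudio, where future::plan("multicore") falls back and emits the warning
# Grayson saw above.
supportsMulticore()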

joshyam-k commented 8 months ago

Closing this for now as it should no longer be an issue since we don't pass the population data to boot_rep.
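For the record, the general shape of that kind of change is sketched below; the column name domain, the helper one_boot_rep(), and B are placeholders rather than the package's actual code. The idea is to reduce the population to the small per-domain pieces each rep needs before the parallel map, so the full boot_pop_data never becomes a global of the futures:

library(dplyr)
library(furrr)
library(future)

B <- 500  # number of bootstrap reps (placeholder)

# Summarise the large population once, in the main session
pop_by_domain <- boot_pop_data |>
  group_by(domain) |>                                  # "domain": placeholder column
  summarise(across(where(is.numeric), mean), n = n())

plan(multisession, workers = 4)

boot_ests <- future_map(seq_len(B), ~ {
  # only the small summary table is exported to each worker session
  one_boot_rep(pop_by_domain, samp_dat)                # one_boot_rep(): hypothetical
}, .options = furrr_options(seed = TRUE))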