mihaiconstantin / parabar

An `R` package for parallelizing tasks, tracking their progress, and displaying accurate progress bars.
https://parabar.mihaiconstantin.com

Export package to backend? #29

Closed · Hallo951 closed this issue 1 year ago

Hallo951 commented 1 year ago

Hi,

I tried to execute a function from the "spTimer" package in parallel using par_sapply. However, this did not work because I first need to make the "spTimer" package available on the backend. I only found the export function in your package, which can be used to pass variables to the backend, but how do I pass a whole package? The "parallel" package has the function clusterEvalQ for this, but I could not find an equivalent in your package.
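
For reference, this is the kind of workflow I mean from the "parallel" package (just a minimal sketch with a placeholder cluster size):

library(parallel)

# Create a PSOCK cluster with two workers.
cl <- makePSOCKcluster(2)

# Load the package on every node of the cluster.
clusterEvalQ(cl, library(spTimer))

# ... run parSapply(cl, ...) etc. here ...

# Shut the cluster down.
stopCluster(cl)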

Or is the problem somewhere else and I'm just not doing it right?

Best,
Frank

mihaiconstantin commented 1 year ago

Try running

parabar::evaluate(backend, {
    # Load the library on each node.
    library(spTimer)
})

before the parabar::par_sapply function call, where backend is the object returned by start_backend. This will ensure each node in the cluster has the library loaded. The evaluate function is documented here.
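
In context, the whole flow looks roughly like this (a minimal sketch; the core count and the toy task are placeholders):

library(parabar)

# Start an asynchronous backend.
backend <- start_backend(cores = 2, cluster_type = "psock", backend_type = "async")

# Load the library on each node in the cluster.
evaluate(backend, {
    library(spTimer)
})

# Run a task in parallel (a toy task in this sketch).
output <- par_sapply(backend, x = 1:10, fun = function(x) x + 1)

# Stop the backend.
stop_backend(backend)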

I will respond to your other questions in a bit.

Hallo951 commented 1 year ago

Thank you, but unfortunately it doesn't work with that either. I always get ZERO back. Here is my example:

i <- 1
# the variable "best_model" is a list of spTimer regression models (similar glm models but more complicated)
# the variable "data_predict_split" is a list of data.frames for prediction with the spTimer models

# Start an asynchronous backend
backend <- start_backend(cores = 2, cluster_type = "psock", backend_type = "async")

# export variables and package to cluster
parabar::export(backend, c("i","best_model")) # variables
parabar::evaluate(backend, c(library(spTimer))) # package

# Enable progress tracking
set_option("progress_track", TRUE)

# Change the progress bar options
configure_bar(type = "basic", style = 3)

# Run a task in parallel
pred <- par_sapply(backend, data_predict_split[predict_pass_selected[[1]]], function(x) {
        spTimer::predict.spT(best_model[[i]], tol.dist = 0.0, newdata = x, newcoords = unique(x[,colnames(x) %in% c("coord_x","coord_y")]))
      })

stop_backend(backend)

If I do the same with a foreach loop without a progress bar, it works. Here is the example:

# create a new cluster for parallel foreach loop
cl <- makePSOCKcluster(anz_cores_predict, outfile=""); registerDoParallel(cl) 

i <- 1

# parallel prediction
pred <- 
        foreach(x = data_predict_split[predict_pass_selected[[1]]], .packages = c("spTimer")) %dopar% {
          spTimer::predict.spT(best_model[[i]], tol.dist = 0.0, newdata = x, newcoords = unique(x[,colnames(x) %in% c("coord_x","coord_y")]))
        }

stopCluster(cl)

I know this is not a real minimal example, but maybe you can still find a solution on this basis. Basically, in the foreach loop I run a prediction in a parallel environment based on a regression model, so only the regression model and the data for the prediction are passed to the cluster.

Here is the structure of "data_predict_split[predict_pass_selected[[1]]]":

$ 1 :'data.frame':  49 obs. of  28 variables:
  ..$ ID                           : chr [1:49] "704" "704" "704" "704" ...
  ..$ Time                         : num [1:49] 2013 2014 2015 2016 2017 ...
  ..$ coord_x                      : num [1:49] 305072 305072 305072 305072 305072 ...
  ..$ coord_y                      : num [1:49] 5696220 5696220 5696220 5696220 5696220 ...
  ..$ air_temperature_max_median   : num [1:49] -1.74 1.341 -0.904 -0.491 0.486 ...
  ..$ air_temperature_max_std      : num [1:49] 1.49 -1.557 -0.697 0.933 0.165 ...
  ..$ air_temperature_mean_median  : num [1:49] -0.707 1.104 -0.216 -1.94 0.962 ...
  ..$ air_temperature_mean_std     : num [1:49] 1.566 -1.404 -0.714 0.941 0.362 ...
  ..$ air_temperature_min_median   : num [1:49] -0.578 1.992 -0.387 -1.35 0.695 ...
  ..$ air_temperature_min_std      : num [1:49] 1.694 -1.018 -0.647 1.021 0.72 ...
  ..$ frost_days_count             : num [1:49] 1.946 -1.446 -0.192 0.384 0.114 ...
  ..$ gwfa_median                  : num [1:49] -0.913 -0.4 -0.276 -0.302 -0.329 ...
  ..$ gwfa_sd                      : num [1:49] -1.4 -1.34 -1.37 -1.36 -1.36 ...
  ..$ hot_days_count               : num [1:49] -0.6269 -1.2513 1.5932 0.0392 -1.2929 ...
  ..$ ice_days_count               : num [1:49] 2.162 0.166 -1.008 -0.285 0.284 ...
  ..$ precipGE10mm_days_count      : num [1:49] 1.435 0.109 -0.858 -0.416 0.551 ...
  ..$ precipGE20mm_days_count      : num [1:49] 1.5672 -0.0659 -0.0659 -0.8825 0.7506 ...
  ..$ precipGE30mm_days_count      : num [1:49] -0.197 1.8401 0.0795 0.0795 0.0795 ...
  ..$ precipitation_hyras_de_median: num [1:49] 1.54424 -0.84426 -0.23575 -0.00462 0.2634 ...
  ..$ precipitation_hyras_de_std   : num [1:49] 0.569 1.668 0.294 -0.846 0.128 ...
  ..$ radiation_global_median      : num [1:49] -0.812 -0.59 -0.578 -0.645 -0.504 ...
  ..$ radiation_global_std         : num [1:49] -0.3198 -0.8205 0.0708 -0.4972 -1.009 ...
  ..$ snowcover_days_count         : num [1:49] 2.26 -0.375 -0.516 -0.388 0.317 ...
  ..$ summer_days_count            : num [1:49] -1.006 -1.397 -0.225 0.752 -0.42 ...
  ..$ sunshine_duration_median     : num [1:49] -1.195 -0.26 -0.277 -0.44 -0.733 ...
  ..$ sunshine_duration_std        : num [1:49] 0.551 -1.439 -0.184 0.211 -0.797 ...
  ..$ Auenlehmmaechtigkeit         : num [1:49] -0.092 -0.092 -0.092 -0.092 -0.092 ...
  ..$ DGM1                         : num [1:49] -1.36 -1.36 -1.36 -1.36 -1.36 ...
 $ 2 :'data.frame': 56 obs. of  28 variables:
  ..$ ID                           : chr [1:56] "711" "711" "711" "711" ...
  ..$ Time                         : num [1:56] 2013 2014 2015 2016 2017 ...
  ..$ coord_x                      : num [1:56] 305076 305076 305076 305076 305076 ...
  ..$ coord_y                      : num [1:56] 5695857 5695857 5695857 5695857 5695857 ...
  ..$ air_temperature_max_median   : num [1:56] -1.732 1.349 -0.794 -0.483 0.496 ...
  ..$ air_temperature_max_std      : num [1:56] 1.493 -1.556 -0.695 0.935 0.169 ...
  ..$ air_temperature_mean_median  : num [1:56] -0.712 1.104 -0.216 -1.945 0.962 ...
  ..$ air_temperature_mean_std     : num [1:56] 1.563 -1.413 -0.713 0.939 0.36 ...
  ..$ air_temperature_min_median   : num [1:56] -0.577 1.989 -0.386 -1.348 0.685 ...
  ..$ air_temperature_min_std      : num [1:56] 1.676 -1.035 -0.653 1.008 0.707 ...
  ..$ frost_days_count             : num [1:56] 1.9518 -1.4408 -0.1975 0.3812 0.0975 ...
  ..$ gwfa_median                  : num [1:56] -0.6035 0.0412 0.1364 0.0693 -0.0371 ...
  ..$ gwfa_sd                      : num [1:56] -1.43 -1.34 -1.36 -1.36 -1.37 ...
  ..$ hot_days_count               : num [1:56] -0.6269 -1.2272 1.5932 0.0392 -1.2929 ...
  ..$ ice_days_count               : num [1:56] 2.158 0.162 -1.012 -0.307 0.28 ...
  ..$ precipGE10mm_days_count      : num [1:56] 1.483 0.157 -0.858 -0.416 0.599 ...
  ..$ precipGE20mm_days_count      : num [1:56] 1.5672 -0.0659 -0.0659 -0.8825 0.7506 ...
  ..$ precipGE30mm_days_count      : num [1:56] 0.0795 1.8401 0.0795 0.0795 0.0795 ...
  ..$ precipitation_hyras_de_median: num [1:56] 1.5365 -0.8027 0.0836 0.0463 0.2518 ...
  ..$ precipitation_hyras_de_std   : num [1:56] 0.595 1.62 0.313 -0.846 0.166 ...
  ..$ radiation_global_median      : num [1:56] -0.807 -0.593 -0.579 -0.648 -0.504 ...
  ..$ radiation_global_std         : num [1:56] -0.3239 -0.8245 0.0655 -0.5067 -1.01 ...
  ..$ snowcover_days_count         : num [1:56] 2.244 -0.378 -0.519 -0.399 0.307 ...
  ..$ summer_days_count            : num [1:56] -1.006 -1.397 -0.225 0.752 -0.42 ...
  ..$ sunshine_duration_median     : num [1:56] -1.204 -0.26 -0.283 -0.45 -0.756 ...
  ..$ sunshine_duration_std        : num [1:56] 0.546 -1.447 -0.196 0.2 -0.805 ...
  ..$ Auenlehmmaechtigkeit         : num [1:56] 0.116 0.116 0.116 0.116 0.116 ...
  ..$ DGM1                         : num [1:56] -1.33 -1.33 -1.33 -1.33 -1.33 ...

Here is the output of "best_model[[i]]":

[screenshot of the model output attached in the original issue]

The result of the foreach loop is a list of spTimer predict objects.

mihaiconstantin commented 1 year ago

I find it hard to understand what you want to achieve without a reproducible example (e.g., see ?save). Since it is not clear to me what you are trying to parallelize, I can only give you some general pointers. Try to restructure your code such that it works with the built-in R parallel package, more specifically with parallel::parSapply or parallel::parLapply. For this part, perhaps StackOverflow can help. Then it is just a matter of replacing parallel::parSapply with parabar::par_sapply to gain progress tracking. Here is the general idea:

# Start an asynchronous backend.
backend <- start_backend(cores = 2, cluster_type = "psock", backend_type = "async")

# Your repetitions or conditions.
repetitions <- 1:1000

# Run the task in parallel.
output <- par_sapply(
    backend,

    # Specify your repetitions or conditions.
    x = repetitions,

    # Apply a function to every element in the `repetitions` vector.
    fun = function(x) {
        # In this case add `1` to every element.
        x + 1
    }
)

# Expect that all elements of the repetition vector have `1` added.
all(output == repetitions + 1)

# Stop the backend.
stop_backend(backend)
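
For comparison, here is the same toy task using only the built-in parallel package; once something like this works for your use case, swapping parallel::parSapply for parabar::par_sapply is all that is left (a minimal sketch):

library(parallel)

# Create a cluster with two workers.
cluster <- makePSOCKcluster(2)

# Your repetitions or conditions.
repetitions <- 1:1000

# Run the same task on the cluster.
output <- parSapply(cluster, X = repetitions, FUN = function(x) {
    # In this case add `1` to every element.
    x + 1
})

# Expect that all elements of the repetition vector have `1` added.
all(output == repetitions + 1)

# Stop the cluster.
stopCluster(cluster)
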
mihaiconstantin commented 1 year ago

I think you should now be able to achieve what you need with support for par_lapply (i.e., via #31) and par_apply (i.e., via #36).
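
For instance, since your foreach version returns a list, something along the lines of the following sketch (untested; it just adapts the snippet you posted to par_lapply) should be close to what you need:

# Start an asynchronous backend.
backend <- start_backend(cores = 2, cluster_type = "psock", backend_type = "async")

# Load the library on each node and export the required variables.
evaluate(backend, {
    library(spTimer)
})
export(backend, c("i", "best_model"))

# `par_lapply` returns a list, just like the `foreach` loop.
pred <- par_lapply(backend, x = data_predict_split[predict_pass_selected[[1]]], fun = function(x) {
    spTimer::predict.spT(
        best_model[[i]],
        tol.dist = 0.0,
        newdata = x,
        newcoords = unique(x[, colnames(x) %in% c("coord_x", "coord_y")])
    )
})

# Stop the backend.
stop_backend(backend)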

Feel free to reopen if you still need help! However, I think questions regarding structuring code for parallelization are better asked on StackOverflow with a minimal reproducible example.