psolymos / pbapply

Adding progress bar to '*apply' functions in R
https://peter.solymos.org/pbapply/
156 stars 6 forks source link

Futures: output and conditions are now relayed, but with glitch #59

Open HenrikBengtsson opened 1 year ago

HenrikBengtsson commented 1 year ago

With the support for futures, you now get automatic support for relaying of the standard output, messages, warnings, and other conditions, e.g. with

library(pbapply)
future::plan("multisession")

my_sqrt <- function(x) {
  if (x %% 10 == 0) message("x = ", x)
  sqrt(x)
}

we get:

> pboptions(type = "none")
> y <- pbsapply(1:100, FUN = my_sqrt, cl = "future")
x = 10
x = 20
x = 30
x = 40
x = 50
x = 60
x = 70
x = 80
x = 90
x = 100
> str(y)
 num [1:100] 1 1.41 1.73 2 2.24 ...

However, it looks like your progress bar is not prepared for having output generated in between the 0% step and the completion, e.g.

> pboptions(type = "txt")
> y <- pbsapply(1:100, FUN = my_sqrt, cl = "future")
  |++++                                              |   8%x = 10
  |++++++++                                          |  15%x = 20
  |++++++++++++                                      |  23%x = 30
  |+++++++++++++++                                   |  31%x = 40
  |+++++++++++++++++++++++                           |  46%x = 50
  |+++++++++++++++++++++++++++                       |  54%x = 60
  |+++++++++++++++++++++++++++++++                   |  62%x = 70
  |+++++++++++++++++++++++++++++++++++               |  69%x = 80
  |++++++++++++++++++++++++++++++++++++++++++        |  85%x = 90
  |++++++++++++++++++++++++++++++++++++++++++++++    |  92%x = 100
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
> 

The quick fix to avoid this, is to disable relaying of stdout and conditions in futures, e.g.

rval[i] <- list(future.apply::future_lapply(X[Split[[i]]], FUN, ..., future.stdout = FALSE, future.conditions = character(0L)))

That is how all other parallel frameworks work, i.e. they silently swallow any output.

Now, to actually relaying them, which is more useful, you have to buffer standard output and all conditions in:

https://github.com/psolymos/pbapply/blob/53aa5415157b709f2474a8e61f2cc472bb8eb575/R/pblapply.R#L73-L76

I'm thinking something like:

captureStdoutAndConditions <- function(expr, envir = parent.frame()) {
  expr <- substitute(expr)
  conditions <- list()
  withCallingHandlers({
    stdout <- utils::capture.output({
      value <- eval(expr, envir = envir)
    })
  }, condition = function(cond) {
    conditions <<- c(conditions, list(cond))
  })
  list(value = value, stdout = stdout, conditions = conditions)
}

relayStdoutAndConditions <- function(res) {
  cat(res$stdout)
  for (condition in res$conditions) {
    if (inherits(condition, "warning")) {
        warning(condition)
    } else if (inherits(condition, "message")) {
        message(condition)
    } else if (inherits(condition, "condition")) {
        signalCondition(condition)
    }
  }
  invisible(res)
}

and then use:

for (i in seq_len(B)) {
    res <- captureStdoutAndConditions({
      future.apply::future_lapply(X[Split[[i]]], FUN, ...)
    })
    rval[i] <- list(res$value)
    hide_pb(pb)
    relayStdoutAndConditions(res)
    unhide_pb(pb)
    setpb(pb, i)
}

PS. I've got future.mapreduce on the roadmap. One goal is to provide an API for others to build on and to avoid having to do manually do the above.

psolymos commented 1 year ago

Thanks @HenrikBengtsson for the outline, this is pretty cool. The progress bar is written to the main process's STDOUT (which is buffered, that's why we need to flush). Thus collecting output from the future processes makes a lot of sense, otherwise it can be quite hard to troubleshoot.

I'll add the silencer as a quick fix and play around with the other option.

psolymos commented 1 year ago

This looks like a more general issue, not just with future. Future prints the warning (STDERR), however, parallel::mclapply prints the messages to STDOUT, which is not much different regarding the output, but might require a different mechanism to capture that. However, clusters do not send info back to the main process:

Screenshot 2022-12-09 at 9 20 17 PM
psolymos commented 1 year ago

It looks like that parallel::mclapply write conditions to STDOUT of the main process, only the STDOUT of the forked processes can be silenced by setting mc.silent = TRUE.

To make the two options (future and mclapply) behave similarly, I only set future.stdout = FALSE but I did not modify how conditions are handled.

HenrikBengtsson commented 1 year ago

It looks like that parallel::mclapply write conditions to STDOUT of the main process

Actually not. See mclapply example in https://henrikbengtsson.github.io/future-tutorial-user2022/appendix.html#appendix-standard-output-by-other-parallel-map-reduce-apis

psolymos commented 1 year ago

Hmm. This is an eye opening read. Basically, if you want consistency, the "recommended" tools (what is in parallel) are the worst choice.