Closed HenrikBengtsson closed 3 years ago
I reproduced it, I found where the error was triggered, and I found the variable that differed between the two runs. I should be able to figure out what happens, but it is non-trivial and very hard to understand. I'm investigating.
I fixed it. There was a bug in my code that could only be reached with a combination of stop_early = FALSE, a failure at the last object to process, and unlucky timing when calling resolved(). I found the issue and I understand the issue, but I'm still wondering how we reached it by changing the timeout setting.
Thanks for reporting
> I fixed it.
Awesome.
> I found the issue, I understand the issue but I'm still wondering how we reached it by changing the timeout setting
Good. Yeah, the fact that different timeout lengths changed the behavior is odd, since resolved()
is guaranteed to return either TRUE or FALSE. It could indicate a race condition, but I don't see how that would be the case.
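For context, here is a minimal sketch of the resolved() semantics being discussed, using the public future API (the sleep duration and return value are illustrative, not from the thread):

```r
library(future)
plan(multisession)

# Launch a future that takes roughly a second to complete
f <- future({ Sys.sleep(1); 42 })

# resolved() is a non-blocking poll: it always returns TRUE or FALSE,
# so a still-running future simply yields FALSE rather than an error
while (!resolved(f)) Sys.sleep(0.05)

value(f)  # 42, once the future has resolved
```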
Either way, thanks for fixing it so quickly. Your package's tests, together with those of all the other ~180 packages, are invaluable when it comes to fixing bugs, making improvements, and introducing new features in the future framework.
> the fact that different timeout lengths changed the behavior is odd, since resolved() is guaranteed to return either TRUE or FALSE.
The idea is that I have a for loop that looks like this:

```r
for (i in x) {
  f[[i]] <- future(expr[i])
  for (j in 1:i) {
    states[j] <- test_state(f[[j]])
    update_state_monitoring(states)
  }
}
```
where test_state() checks whether the future is resolved and whether it was OK or triggered a warning or an error. update_state_monitoring() displays, in real time, a map showing the spatial regions that are processing, waiting, or that were processed OK or with a warning or error. In case of an error, there is a tryCatch block in test_state() to fail gracefully and allow skipping a region, which then won't be in the output.
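A hypothetical sketch of such a test_state(), assuming the shape described above; this is my reconstruction for illustration, not lidR's actual code, and the state constants other than PROCESSING are names I made up:

```r
library(future)

# Hypothetical state codes (names assumed for illustration)
PROCESSING <- 0L; OK <- 1L; WARN <- 2L; FAIL <- 3L

test_state <- function(f) {
  if (!resolved(f)) return(PROCESSING)  # still running: don't block
  # value() re-signals any condition captured in the future, so a
  # tryCatch here lets a single failed region be skipped gracefully
  tryCatch({
    value(f)
    OK
  }, warning = function(w) WARN,
     error   = function(e) FAIL)
}
```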
The problem is that, in parallel, this loop may have ended while some elements are still processing in the background (or remotely). Thus there is a second loop like this:
```r
while (any(states == PROCESSING)) {
  idx <- which(states == PROCESSING)
  for (j in idx) {
    states[j] <- test_state(f[[j]])
    update_state_monitoring(states)
  }
  sleep()
}
```
In my case, test_state() in this second loop was slightly different: the test for failing nicely was incomplete and failed to handle one case. The function test(), used in the unit tests, triggers an error by design at the second iteration to exercise some testing; it is so fast that the future was resolved in the first loop. But with the new timeout it was no longer resolved in the first loop and reached the second loop, where there was a bug.
The error was a combination of the resolved() timing, the use of stop_early = FALSE to skip errors (I never use this option), and the fact that it fails on the last object, because it is a unit test; in real life it is unlikely to fail on the last object if the 100, 500, or 1000 previous ones were OK. We were actually very lucky to catch it.
> Your package's tests, together with all the other 180 packages, are invaluable when it comes to bug fixing,
And the ability to run either local or remote parallel tasks in a single line of code without fully understanding how it works is even more valuable :wink:
In the next release of future, we're shortening how long future::resolve() will wait for results before giving up (and returning FALSE). Basically, we'll update the default from resolved(..., timeout = 0.2) to resolved(..., timeout = 0.01). This will improve performance, since we'll spend less time polling futures.

When running reverse package dependency checks, everything seemed to work at first, but when I forced all package checks to use 'multisession' by default (instead of 'sequential'), I spotted a problem with lidR:::cluster_apply(). Here's a minimal reproducible example adapted from testthat/test-catalog-engine-generic.R:
If I run with a timeout of 0.01 seconds, I get:
If I run with a timeout of 0.2 seconds, it works.
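For readers following along, here is a hedged sketch of the kind of polling loop where that per-call timeout matters; the loop is illustrative, not lidR's code. Each resolved() call on an unresolved multisession future can wait up to the timeout before giving up and returning FALSE, so a shorter default means less time blocked per poll:

```r
library(future)
plan(multisession)

# Four futures finishing at different times (durations are illustrative)
fs <- lapply(1:4, function(i) future({ Sys.sleep(i); i }))

# Poll until every future has completed; each resolved() call blocks
# for at most its (internal) timeout before answering FALSE
while (!all(vapply(fs, resolved, logical(1)))) {
  Sys.sleep(0.1)
}

vapply(fs, value, numeric(1))  # 1 2 3 4
```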
Could you please check and see if you can figure out why this happens?
PS. This is the only package out of ~180 that fails this way.
PS2. Although you should be able to trust all future packages to work the same, there are of course corner cases, etc. If you want to be extra certain that things work, I recommend that all package developers check their packages also with