shikokuchuo / mirai

mirai - Minimalist Async Evaluation Framework for R
https://shikokuchuo.net/mirai/
GNU General Public License v3.0

Environment subscripting error with single-task daemons #43

Closed: wlandau closed this issue 1 year ago

wlandau commented 1 year ago

I have been trying to troubleshoot https://github.com/wlandau/crew/issues/51, and I ran across an issue where workers with maxtasks = 1 sometimes return tasks showing "Error in envir[[\".expr\"]]: subscript out of bounds". Here is a reproducible example. I ran mirai 0.8.1.9003 with nanonext 0.8.0.9001 on R 4.2.1 on an Ubuntu machine.

library(mirai)
library(nanonext)
daemons(n = 1L, url = "ws://127.0.0.1:5000")
tasks <- lapply(seq_len(100L), function(x) {
  mirai(x, x = x)
})
results <- list()
px <- NULL
launches <- 0L
while(length(results) < 100L) {
  if (is.null(px) || !px$is_alive()) {
    px <- callr::r_bg(function() {
      mirai::server("ws://127.0.0.1:5000", maxtasks = 1L)
    })
    launches <- launches + 1L
  }
  done <- integer(0L)
  for (i in seq_along(tasks)) {
    if (!.unresolved(tasks[[i]])) {
      done <- c(done, i)
      results[[length(results) + 1L]] <- tasks[[i]]
    }
  }
  tasks[done] <- NULL
}
print(launches)
#> [1] 100
data <- as.character(lapply(results, function(x) x$data))
print(data)
#> [1] "1"                                                   
#> [2] "2"                                                   
#> [3] "3"                                                   
#> [4] "4"                                                   
#> [5] "5"                                                   
#> [6] "6"                                                   
#> [7] "7"                                                   
#> [8] "Error in envir[[\".expr\"]]: subscript out of bounds"
#> ...
sum(grepl("^Error", data))
#> [1] 24
daemons(0L)

shikokuchuo commented 1 year ago

The above does not reproduce for me. I get "1" to "100" perfectly each time. I run R 4.2.3 on Ubuntu 22.04 and I've tested multiple times using both RStudio and an interactive R prompt. I even tested on a low-powered Windows 10 netbook with an Intel Atom processor and it does not produce the above error. In fact I have never seen it before. Is it possible for you to come up with a more minimal example to help narrow this down? Thanks!

wlandau commented 1 year ago

You're right, I cannot reproduce this on my MacBook. I think it could be something strange about my Ubuntu machine. Here is a slightly smaller version of the same reprex.

library(mirai)
library(nanonext)
daemons(n = 1L, url = "ws://127.0.0.1:5000")
launches <- 0L
pids <- integer(0L)
while (length(pids) < 100L) {
  if (!exists("px") || !px$is_alive()) {
    px <- callr::r_bg(\() mirai::server("ws://127.0.0.1:5000", maxtasks = 1L))
    launches <- launches + 1L
  }
  if (!exists("m") || !.unresolved(m)) {
    if (exists("m")) pids <- c(pids, m$data)
    m <- mirai(ps::ps_pid())
  }
}
print(launches)
print(pids)
daemons(n = 0L)

shikokuchuo commented 1 year ago

Yes, the above works for me as well, all unique PIDs. Just to eliminate one simple possibility - do you still get the odd behaviour on your Ubuntu setup if you use unresolved() instead of .unresolved()? This could be one of those corner cases where .unresolved() doesn't result in the desired behaviour (which is why it isn't the main unresolved checker).

wlandau commented 1 year ago

Thanks for the suggestion. I tried unresolved() instead of .unresolved(), and I still saw multiple instances of "Error in envir[[\".expr\"]]: subscript out of bounds" on my Ubuntu machine.
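
To see which tasks failed rather than only how many, the grepl() check from the first reprex can be extended to report positions. This is a minimal base R sketch: the `data` vector holds illustrative stand-in values, not actual output from the Ubuntu machine, and the names `failed` and `failure_rate` are hypothetical.

```r
# Illustrative stand-in for the character vector of collected task results.
data <- c("1", "2", "Error in envir[[\".expr\"]]: subscript out of bounds", "4")

# Positions of the results that came back as an error string.
failed <- which(grepl("^Error", data))

# Fraction of tasks that returned an error.
failure_rate <- length(failed) / length(data)

print(failed)
print(failure_rate)
```

Knowing the positions makes it easier to check whether the failures cluster around particular launches or are spread evenly.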

shikokuchuo commented 1 year ago

OK, it was worth a try. nanonext 0.8.1 is now on CRAN, and there are assorted improvements in mirai 0.8.1.9004. Nothing that would address the above though.

shikokuchuo commented 1 year ago

Given I can't reproduce, I don't really want to handle this situation specifically. At the minimum we should know what provokes it.

However, in terms of behaviour - instead of sending back the error, would it be better if the server exited instead? In that case, the task would get re-sent to another server. You would get more launches, but you would still get all your results.

wlandau commented 1 year ago

Hmmm I would prefer the existing error. I worry about masking the core problem as a silent efficiency issue. I wonder if another diagnostic might help along with the main error.

wlandau commented 1 year ago

My personal Ubuntu machine might have something strange going on with it, and it seems unlikely for this to appear on other machines.

Also, if another user catches it in a different scenario, that's valuable info that would help us track it down. With a silent relaunch, we might miss the chance.

shikokuchuo commented 1 year ago

When you get the chance, please can you try with f8eb9a9 (v0.8.1.9005) on your Ubuntu machine. Thanks.

wlandau commented 1 year ago

Completely fixed on my Ubuntu machine! 100 launches and 100 unique PIDs from https://github.com/shikokuchuo/mirai/issues/43#issuecomment-1485043330. Thank you so much.

shikokuchuo commented 1 year ago

Awesome! My mistake - I was reasoning about the code sequentially. In truth, the NNG code at the C level is highly asynchronous, so things can complete out of order. It just needed a little extra synchronisation, in the form of call_aio(), before continuing with the R code!
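
For readers unfamiliar with the pattern, here is a minimal sketch of waiting on asynchronous operations with call_aio() from nanonext. It is not the change made in f8eb9a9; the inproc URL, the socket names and the one-second timeouts are assumptions chosen purely for illustration.

```r
library(nanonext)

# Two paired sockets in the same process (the URL is an arbitrary example).
s1 <- socket("pair", listen = "inproc://call-aio-sketch")
s2 <- socket("pair", dial = "inproc://call-aio-sketch")

# send_aio() returns immediately; the send completes in the background.
res <- send_aio(s1, "hello", timeout = 1000)

# call_aio() blocks until the async operation has actually completed,
# so later R code cannot run ahead of it.
call_aio(res)

# The same applies on the receiving side.
msg <- recv_aio(s2, timeout = 1000)
call_aio(msg)
print(msg$data)

close(s1)
close(s2)
```

Without the call_aio() calls, the R code would carry on while the underlying operations might still be in flight, which is the kind of ordering assumption described above.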