mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0
145 stars 26 forks source link

`v0.9.0` backwards compatibility #303

Closed mschubert closed 11 months ago

mschubert commented 11 months ago

Currently, there are 4 packages which depend on clustermq on CRAN:

The historicalborrowlong and raveio tests pass with the new release, which they also should because the main R API is unchanged.

For drake and targets, the differences in the worker API break the tests. This is also expected, because the worker protocol/API changed.

So, provide a deprecated compatibility layer for the old worker API to give these packages a chance to adapt. As far as I can tell, the corresponding usage in these packages is:

wlandau commented 11 months ago

Thanks for the FYI, @mschubert. I will update targets and drake when clustermq 0.9.0 with the new API is on CRAN.

What exactly has changed? Can I just change the names of the methods I am calling, or are there less trivial changes to the protocol? Does https://mschubert.github.io/clustermq/articles/technicaldocs.html#main-event-loop reflect the changes? (It has been a long time since I looked at this.)

mschubert commented 11 months ago

Thanks @wlandau! The changes are mostly under the hood, for instance, most of the network protocol is completely changed and now handled with Rcpp. They manuals have not been updated yet.

The worker API changes (compatibility layer here), briefly, are:

I tested the current CRAN releases of both drake and targets, which now (v0.8.914) pass when checking the tarballs against R CMD check --as-cran.

Please let me know if you have any queries or concerns! There will likely be further minor adjustments until the v0.9 release on CRAN, so please also be aware of that.

mschubert commented 11 months ago

I consider this completed. Please reopen if there are any issues

wlandau commented 9 months ago

I see that version 0.9.0 is now on CRAN. The interface is much simpler and easier to understand.

However, I am having issues using it. On my M2 MacBook, I cannot seem to start multiprocess workers. Reprex:

options(clustermq.scheduler = "multiprocess")
library(clustermq)
w <- workers(
  n_jobs = 1L,
  verbose = TRUE,
  log_workers = TRUE # I still do not see the error messages.
) 
Sys.sleep(2)
w$workers_running
#> [1] 0 # should be 1
sessionInfo()
#> R version 4.3.0 (2023-04-21)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Ventura 13.5.2
#> 
#> Matrix products: default
#> BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/Indiana/Indianapolis
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] clustermq_0.9.0
#> 
#> loaded via a namespace (and not attached):
#> [1] processx_3.8.2    compiler_4.3.0    R6_2.5.1          tools_4.3.0       rstudioapi_0.15.0 Rcpp_1.0.11      
#> [7] codetools_0.2-19  callr_3.7.3       ps_1.7.5
wlandau commented 9 months ago

In addition: I tried integrating the new interface into https://github.com/ropensci/targets/blob/clustermq9/R/class_clustermq.R. Most of my tests seem to run okay on Linux, but the test at https://github.com/ropensci/targets/blob/fbcd33165381473e329488b4e9cbbba39ecbc7a8/tests/hpc/test-parallel.R#L5-L19 segfaults with the multiprocess backend, and in https://github.com/ropensci/targets/blob/fbcd33165381473e329488b4e9cbbba39ecbc7a8/tests/hpc/test-parallel.R#L23-L34 the send_shutdown() method does not seem to shut down any workers.

wlandau commented 9 months ago

Update: I found the worker logs on my Mac and saw:

2023-09-25 04:48:24.388871 | Master: tcp://MY_HOSTNAME:7074
2023-09-25 04:48:24.391963 | connecting to: tcp://MY_HOSTNAME:7074
Error : Connection failed after 10001 ms
mschubert commented 9 months ago

workers_running are only counted after a worker is registered, so you need to call w$recv() first (yielding NULL) - so this is expected, but the documentation should still be at bit clearer.

Are you using send_shutdown the same way as here (i.e., in the main loop), to which the following w$recv() registers the shutdown?

For the segfault: is this before or after the log timeout?

mschubert commented 9 months ago

Apparently, the cleanup is still a bit bugged:

> w = workers(1)
Starting 1 processes ...
> w$workers_running # worker is not registered yet
0
> w$recv() # no work done yet
NULL
> w$workers_running
1
> w$send_shutdown()
> w$workers_running() # sent only the message, worker is not shut down yet
1
> # w$recv() # causes segfault [bug]
> w$cleanup()
> w$workers_running # says 1, should say 0 [bug]
> w$recv() # times out, should say 0 [bug]
wlandau commented 9 months ago

workers_running are only counted after a worker is registered, so you need to call w$recv() first (yielding NULL) - so this is expected, but the documentation should still be at bit clearer.

I tried this on my Mac:

options(clustermq.scheduler = "multiprocess")
library(clustermq)
w <- workers(n_jobs = 1L)
w$recv()

But it silently hangs at recv().

Are you using send_shutdown the same way as here (i.e., in the main loop), to which the following w$recv() registers the shutdown?

Yes, targets calls the analogous wait_or_shutdown() method at https://github.com/ropensci/targets/blob/fbcd33165381473e329488b4e9cbbba39ecbc7a8/R/class_clustermq.R#L195C5-L203, then the next call to iterate() begins with recv(): https://github.com/ropensci/targets/blob/fbcd33165381473e329488b4e9cbbba39ecbc7a8/R/class_clustermq.R#L233

For the segfault: is this before or after the log timeout?

Looking at the logs, I actually think these issues are different. The Mac OS logs show timeouts when I get hanging. It's hard to reproduce the segfaults on Linux. Trying again on that platform, I see "Error in w$?poll: Unexpected peer disconnect" in the log file, and on the main process I see "callr subprocess failed: could not start R, exited with non-zero status, has crashed or killed".

wlandau commented 9 months ago

I have also been seeing new segfaults and crashes on GitHub Actions since the release of 0.9.0: https://github.com/ropensci/targets/actions/runs/6296787129/job/17092456552 using the new interface and https://github.com/ropensci/targets/actions/runs/6296146489/job/17090587515 using the old interface, although I haven't had time to empirically isolated either to clustemrq.

mschubert commented 9 months ago

Yes, something is not quite right yet. If you find a way to reproduce cmq-only without the compatibility layer please open an issue!

wlandau commented 9 months ago

The following clustermq-only reprex is a simplified version of what targets is trying to do. (I omit w$cleanup() to test the w$send_shutdown().) Not every run segfaults, but many runs do.

options(clustermq.scheduler = "multiprocess")
library(clustermq)
w <- workers(2L, log_worker = TRUE)
queue <- seq_len(10L)
running <- integer(0L)
done <- integer(0L)
while (length(done) < 100L) {
  result <- w$recv()
  if (!is.null(result)) {
    message("done task ", result)
    done <- c(done, result)
    running <- setdiff(running, result)
  }
  if (length(running) < 2L && length(queue) > 0L) {
    next_task <- queue[1L]
    message("send task ", next_task)
    queue <- queue[-1L]
    running <- c(running, next_task)
    w$send(cmd = index, index = next_task)
  } else if (length(queue) > 0L) {
    w$send_wait()
  } else {
    w$send_shutdown()
  }
}

On a segfault, the error log of the worker reads:

2023-09-25 06:34:27.520023 | Master: tcp://haggunenon:7807
2023-09-25 06:34:27.521489 | connecting to: tcp://haggunenon:7807
2023-09-25 06:34:27.581130 | > call 1 (0.007s wait)
2023-09-25 06:34:27.637110 | > call 2 (0.003s wait)
2023-09-25 06:34:27.709159 | > call 3 (0.020s wait)
2023-09-25 06:34:27.761167 | > call 4 (0.003s wait)
Error in w$poll() : Unexpected peer disconnect

I am using Ubuntu for this test. (On Mac OS, as I have said, w$recv() hangs in a much simpler example.)

R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /home/landau/R/R-4.3.0/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] clustermq_0.9.0

loaded via a namespace (and not attached):
[1] compiler_4.3.0   R6_2.5.1         tools_4.3.0      rstudioapi_0.14 
[5] Rcpp_1.0.11      codetools_0.2-19
wlandau commented 9 months ago

If you find a way to reproduce cmq-only without the compatibility layer please open an issue!

Thanks, just submitted #306.

mschubert commented 9 months ago

I tried this on my Mac: [...] But it silently hangs at recv().

New issue here: https://github.com/mschubert/clustermq/issues/311