Closed mschubert closed 11 months ago
Thanks for the FYI, @mschubert. I will update targets and drake when clustermq 0.9.0 with the new API is on CRAN.
What exactly has changed? Can I just change the names of the methods I am calling, or are there less trivial changes to the protocol? Does https://mschubert.github.io/clustermq/articles/technicaldocs.html#main-event-loop reflect the changes? (It has been a long time since I looked at this.)
Thanks @wlandau! The changes are mostly under the hood: for instance, most of the network protocol has completely changed and is now handled with Rcpp. The manuals have not been updated yet.
The worker API changes (compatibility layer here), briefly, are:
- The workers function now creates a Pool object, which can contain different kinds of workers (the implementation of the latter is not yet complete)
- set_common_data and send_common_data no longer exist; instead, use workers$env(var=value, ...). These objects are propagated to the workers automatically (greedy), so there is no need to send them explicitly. Similarly, the token concept no longer exists and there are no required objects in the worker environment
- send_shutdown_worker has been renamed to send_shutdown
- send_call is now send, which always sends an unevaluated call using NSE (tracking which call has which result is not completed yet); so, for instance, for the main clustermq::Q, working on a chunk now uses send(work_chunk(...)) instead of the DO_CHUNK messages
- receive_data is now recv
- workers$finalize() is now private, according to the R6 recommendation (I will likely revert this until v1.0)
I tested the current CRAN releases of both drake and targets, which now (v0.8.914) pass when checking the tarballs against R CMD check --as-cran.
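To make the send and env points above more concrete, here is a minimal base-R sketch of the general idea: a call is captured unevaluated (NSE) and later evaluated in an environment that already holds the shared objects. This is not clustermq code and does not reflect how the package implements it; worker_env, set_env, capture_call, and run_call are hypothetical names made up for illustration.

```r
# Toy stand-in for a worker-side environment holding shared objects.
# It mimics the idea behind workers$env(var = value, ...); it is NOT clustermq code.
worker_env <- new.env(parent = globalenv())

# Hypothetical helper: store shared objects once, instead of sending them per call.
set_env <- function(...) {
  list2env(list(...), envir = worker_env)
  invisible(NULL)
}

# Hypothetical helper: capture the call unevaluated (NSE), the way a
# send()-style method could do before shipping it to a worker.
capture_call <- function(cmd, ...) {
  list(expr = substitute(cmd), args = list(...))
}

# Hypothetical worker-side step: evaluate the captured call so that both the
# per-call arguments and the shared objects are visible.
run_call <- function(msg) {
  eval(msg$expr, envir = msg$args, enclos = worker_env)
}

set_env(offset = 100)
msg <- capture_call(index + offset, index = 3)
print(msg$expr)  # index + offset (still unevaluated)
result <- run_call(msg)
print(result)    # 103
```

The shape capture_call(index + offset, index = 3) deliberately mirrors a call of the form w$send(cmd = ..., index = ...): the expression travels unevaluated, and its free variables are resolved only where it is run.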
Please let me know if you have any queries or concerns! There will likely be further minor adjustments until the v0.9 release on CRAN, so please also be aware of that.
I consider this completed. Please reopen if there are any issues
I see that version 0.9.0 is now on CRAN. The interface is much simpler and easier to understand.
However, I am having issues using it. On my M2 MacBook, I cannot seem to start multiprocess workers. Reprex:
options(clustermq.scheduler = "multiprocess")
library(clustermq)
w <- workers(
  n_jobs = 1L,
  verbose = TRUE,
  log_workers = TRUE # I still do not see the error messages.
)
Sys.sleep(2)
w$workers_running
#> [1] 0 # should be 1
sessionInfo()
#> R version 4.3.0 (2023-04-21)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Ventura 13.5.2
#>
#> Matrix products: default
#> BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/Indiana/Indianapolis
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] clustermq_0.9.0
#>
#> loaded via a namespace (and not attached):
#> [1] processx_3.8.2 compiler_4.3.0 R6_2.5.1 tools_4.3.0 rstudioapi_0.15.0 Rcpp_1.0.11
#> [7] codetools_0.2-19 callr_3.7.3 ps_1.7.5
In addition: I tried integrating the new interface into https://github.com/ropensci/targets/blob/clustermq9/R/class_clustermq.R. Most of my tests seem to run okay on Linux, but the test at https://github.com/ropensci/targets/blob/fbcd33165381473e329488b4e9cbbba39ecbc7a8/tests/hpc/test-parallel.R#L5-L19 segfaults with the multiprocess backend, and in https://github.com/ropensci/targets/blob/fbcd33165381473e329488b4e9cbbba39ecbc7a8/tests/hpc/test-parallel.R#L23-L34 the send_shutdown() method does not seem to shut down any workers.
Update: I found the worker logs on my Mac and saw:
2023-09-25 04:48:24.388871 | Master: tcp://MY_HOSTNAME:7074
2023-09-25 04:48:24.391963 | connecting to: tcp://MY_HOSTNAME:7074
Error : Connection failed after 10001 ms
workers_running are only counted after a worker is registered, so you need to call w$recv() first (yielding NULL) - so this is expected, but the documentation should still be a bit clearer.
Are you using send_shutdown the same way as here (i.e., in the main loop), to which the following w$recv() registers the shutdown?
For the segfault: is this before or after the log timeout?
Apparently, the cleanup is still a bit bugged:
> w = workers(1)
Starting 1 processes ...
> w$workers_running # worker is not registered yet
0
> w$recv() # no work done yet
NULL
> w$workers_running
1
> w$send_shutdown()
> w$workers_running # sent only the message, worker is not shut down yet
1
> # w$recv() # causes segfault [bug]
> w$cleanup()
> w$workers_running # says 1, should say 0 [bug]
> w$recv() # times out, should say 0 [bug]
workers_running are only counted after a worker is registered, so you need to call w$recv() first (yielding NULL) - so this is expected, but the documentation should still be a bit clearer.
I tried this on my Mac:
options(clustermq.scheduler = "multiprocess")
library(clustermq)
w <- workers(n_jobs = 1L)
w$recv()
But it silently hangs at recv().
Are you using send_shutdown the same way as here (i.e., in the main loop), to which the following w$recv() registers the shutdown?
Yes, targets calls the analogous wait_or_shutdown() method at https://github.com/ropensci/targets/blob/fbcd33165381473e329488b4e9cbbba39ecbc7a8/R/class_clustermq.R#L195C5-L203, then the next call to iterate() begins with recv(): https://github.com/ropensci/targets/blob/fbcd33165381473e329488b4e9cbbba39ecbc7a8/R/class_clustermq.R#L233
For the segfault: is this before or after the log timeout?
Looking at the logs, I actually think these issues are different. The Mac OS logs show timeouts when it hangs. It's hard to reproduce the segfaults on Linux. Trying again on that platform, I see "Error in w$?poll: Unexpected peer disconnect" in the log file, and on the main process I see "callr subprocess failed: could not start R, exited with non-zero status, has crashed or killed".
I have also been seeing new segfaults and crashes on GitHub Actions since the release of 0.9.0: https://github.com/ropensci/targets/actions/runs/6296787129/job/17092456552 using the new interface and https://github.com/ropensci/targets/actions/runs/6296146489/job/17090587515 using the old interface, although I haven't had time to empirically isolate either to clustermq.
Yes, something is not quite right yet. If you find a way to reproduce cmq-only without the compatibility layer please open an issue!
The following clustermq-only reprex is a simplified version of what targets is trying to do. (I omit w$cleanup() to test the w$send_shutdown().) Not every run segfaults, but many runs do.
options(clustermq.scheduler = "multiprocess")
library(clustermq)
w <- workers(2L, log_worker = TRUE)
queue <- seq_len(10L)
running <- integer(0L)
done <- integer(0L)
while (length(done) < 100L) {
  result <- w$recv()
  if (!is.null(result)) {
    message("done task ", result)
    done <- c(done, result)
    running <- setdiff(running, result)
  }
  if (length(running) < 2L && length(queue) > 0L) {
    next_task <- queue[1L]
    message("send task ", next_task)
    queue <- queue[-1L]
    running <- c(running, next_task)
    w$send(cmd = index, index = next_task)
  } else if (length(queue) > 0L) {
    w$send_wait()
  } else {
    w$send_shutdown()
  }
}
On a segfault, the error log of the worker reads:
2023-09-25 06:34:27.520023 | Master: tcp://haggunenon:7807
2023-09-25 06:34:27.521489 | connecting to: tcp://haggunenon:7807
2023-09-25 06:34:27.581130 | > call 1 (0.007s wait)
2023-09-25 06:34:27.637110 | > call 2 (0.003s wait)
2023-09-25 06:34:27.709159 | > call 3 (0.020s wait)
2023-09-25 06:34:27.761167 | > call 4 (0.003s wait)
Error in w$poll() : Unexpected peer disconnect
I am using Ubuntu for this test. (On Mac OS, as I have said, w$recv() hangs in a much simpler example.)
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS
Matrix products: default
BLAS: /home/landau/R/R-4.3.0/lib/libRblas.so
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/New_York
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] clustermq_0.9.0
loaded via a namespace (and not attached):
[1] compiler_4.3.0 R6_2.5.1 tools_4.3.0 rstudioapi_0.14
[5] Rcpp_1.0.11 codetools_0.2-19
If you find a way to reproduce cmq-only without the compatibility layer please open an issue!
Thanks, just submitted #306.
I tried this on my Mac: [...] But it silently hangs at recv().
New issue here: https://github.com/mschubert/clustermq/issues/311
Currently, there are 4 packages which depend on clustermq on CRAN:
The historicalborrowlong and raveio tests pass with the new release, which they also should because the main R API is unchanged.
For drake and targets, the differences in the worker API break the tests. This is also expected, because the worker protocol/API changed.
So, provide a deprecated compatibility layer for the old worker API to give these packages a chance to adapt. As far as I can tell, the corresponding usage in these packages is: