mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0
145 stars 26 forks source link

Receiving data after shutting down workers results in a segfault #306

Closed wlandau closed 9 months ago

wlandau commented 9 months ago

The following clustermq-only reprex is a simplified version of what targets is trying to do. (I omit w$cleanup() to test the w$send_shutdown().) Not every run segfaults, but many runs do.

options(clustermq.scheduler = "multiprocess")
library(clustermq)
w <- workers(2L, log_worker = TRUE)
queue <- seq_len(10L)
running <- integer(0L)
done <- integer(0L)
while (length(done) < 100L) {
  result <- w$recv()
  if (!is.null(result)) {
    message("done task ", result)
    done <- c(done, result)
    running <- setdiff(running, result)
  }
  if (length(running) < 2L && length(queue) > 0L) {
    next_task <- queue[1L]
    message("send task ", next_task)
    queue <- queue[-1L]
    running <- c(running, next_task)
    w$send(cmd = index, index = next_task)
  } else if (length(queue) > 0L) {
    w$send_wait()
  } else {
    w$send_shutdown()
  }
}

On a segfault, the error log of the worker reads:

2023-09-25 06:34:27.520023 | Master: tcp://haggunenon:7807
2023-09-25 06:34:27.521489 | connecting to: tcp://haggunenon:7807
2023-09-25 06:34:27.581130 | > call 1 (0.007s wait)
2023-09-25 06:34:27.637110 | > call 2 (0.003s wait)
2023-09-25 06:34:27.709159 | > call 3 (0.020s wait)
2023-09-25 06:34:27.761167 | > call 4 (0.003s wait)
Error in w$poll() : Unexpected peer disconnect

I am using Ubuntu for this test. (On Mac OS, as I have said, w$recv() hangs in a much simpler example.)

R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /home/landau/R/R-4.3.0/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] clustermq_0.9.0

loaded via a namespace (and not attached):
[1] compiler_4.3.0   R6_2.5.1         tools_4.3.0      rstudioapi_0.14 
[5] Rcpp_1.0.11      codetools_0.2-19
wlandau commented 9 months ago

@mschubert, you had suggested in https://github.com/mschubert/clustermq/issues/303#issuecomment-1733277895 that I post a new issue to follow up on specific problems using #303, so I hope this helps.

luwidmer commented 9 months ago

I just tried @wlandau's example (I added a message() before the send shutdown for clarity) on Windows, and it deterministically dies with an assertion failure:

Rscript clustermq-090-test.R
send task 1
done task 1
send task 2
done task 2
send task 3
send task 4
done task 3
send task 5
done task 4
send task 6
done task 5
send task 7
done task 6
send task 8
done task 7
send task 9
done task 8
send task 10
done task 9
send_shutdown
Assertion failed: check () (../zeromq-4.3.4/src/msg.cpp:414)

This is on R 4.3.0, see sessionInfo()

> sessionInfo()
R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=German_Switzerland.utf8  LC_CTYPE=German_Switzerland.utf8    LC_MONETARY=German_Switzerland.utf8
[4] LC_NUMERIC=C                        LC_TIME=German_Switzerland.utf8    

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] clustermq_0.9.0

loaded via a namespace (and not attached):
[1] compiler_4.3.0   tools_4.3.0      Rcpp_1.0.11      codetools_0.2-19
mschubert commented 9 months ago

You seem to have a bug in your example code, where you keep trying to receive data from workers after they are all shut down (loop goes to 100, tasks go to 10).

Minimal code to reproduce the same behavior:

options(clustermq.scheduler = "multiprocess")
library(clustermq)
w <- workers(1L, log_worker = TRUE)
w$recv()
w$send_shutdown()
w$recv() # invalid vector index

However, this should throw an error in R, not crash the session.

wlandau commented 9 months ago

You seem to have a bug in your example code, where you keep trying to receive data from workers after they are all shut down (loop goes to 100, tasks go to 10).

Hmm... I tried to fix the original example to avoid calling a shutdown too many times:

options(clustermq.scheduler = "multiprocess")
library(clustermq)
w <- workers(2L, log_worker = TRUE)
active <- 2L
queue <- seq_len(10L)
running <- integer(0L)
done <- integer(0L)
while (length(done) < 100L) {
  result <- w$recv()
  if (!is.null(result)) {
    message("done task ", result)
    done <- c(done, result)
    running <- setdiff(running, result)
  }
  if (length(running) < 2L && length(queue) > 0L) {
    next_task <- queue[1L]
    message("send task ", next_task)
    queue <- queue[-1L]
    running <- c(running, next_task)
    w$send(cmd = index, index = next_task)
  } else if (length(queue) > 0L) {
    w$send_wait()
  } else if (active > 0L) {
    w$send_shutdown()
    active <- active - 1L
  }
}

It hung for several minutes without printing any messages to the R console. The log files show:

2023-10-10 15:51:44.776181 | Master: tcp://CENSORED:9786
2023-10-10 15:51:44.779649 | connecting to: tcp://CENSORED:9786
Error : Connection failed after 10001 ms

and

2023-10-10 15:51:44.776096 | Master: tcp://CENSORED:9786
2023-10-10 15:51:44.779632 | connecting to: tcp://CENSORED:9786
Error : Connection failed after 10001 ms

I used the CRAN version because I could not compile the development version.

remotes::install_github("mschubert/clustermq")
Using github PAT from envvar GITHUB_PAT
Downloading GitHub repo mschubert/clustermq@HEAD
'/usr/bin/git' clone --depth 1 --no-hardlinks --recurse-submodules https://github.com/zeromq/libzmq.git /var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//RtmpGZmLLp/remotes16e914d76c6fb/mschubert-clustermq-ed2bf6e/src/libzmq
Cloning into '/var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//RtmpGZmLLp/remotes16e914d76c6fb/mschubert-clustermq-ed2bf6e/src/libzmq'...
'/usr/bin/git' clone --depth 1 --no-hardlinks --recurse-submodules https://github.com/zeromq/cppzmq.git /var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//RtmpGZmLLp/remotes16e914d76c6fb/mschubert-clustermq-ed2bf6e/src/cppzmq
Cloning into '/var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//RtmpGZmLLp/remotes16e914d76c6fb/mschubert-clustermq-ed2bf6e/src/cppzmq'...
── R CMD build ──────────────────────────────────────────────────────────────────────────
✔  checking for file ‘/private/var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T/RtmpGZmLLp/remotes16e914d76c6fb/mschubert-clustermq-ed2bf6e/DESCRIPTION’ ...
─  preparing ‘clustermq’: (1.1s)
✔  checking DESCRIPTION meta-information ...
─  cleaning src
─  running ‘cleanup’
─  checking for LF line-endings in source and make files and shell scripts (1.1s)
─  checking for empty or unneeded directories (2s)
   Removed empty directory ‘clustermq/src/libzmq/build_qnx/nto/aarch64/le’
   Removed empty directory ‘clustermq/src/libzmq/build_qnx/nto/aarch64’
   Removed empty directory ‘clustermq/src/libzmq/build_qnx/nto/x86_64/o’
   Removed empty directory ‘clustermq/src/libzmq/build_qnx/nto/x86_64’
   Removed empty directory ‘clustermq/src/libzmq/build_qnx/nto’
   Removed empty directory ‘clustermq/src/libzmq/builds/openwrt’
─  building ‘clustermq_0.9.0.tar.gz’

* installing *source* package ‘clustermq’ ...
** using staged installation
* no system libzmq found -> using bundled libzmq
autoreconf: export WARNINGS=
autoreconf: Entering directory '.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal -I config --force -I config
m4:configure.ac:9: ERROR: end of file in string
autom4te: error: /opt/homebrew/opt/m4/bin/m4 failed with exit status: 1
aclocal: error: /opt/homebrew/Cellar/autoconf/2.71/bin/autom4te failed with exit status: 1
autoreconf: error: aclocal failed with exit status: 1
autogen.sh: error: autoreconf exited with status 1
./configure: line 61: die: command not found
./configure: line 64: ./configure: No such file or directory
make: *** No targets specified and no makefile found.  Stop.
ERROR: configuration failed for package ‘clustermq’
* removing ‘/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/clustermq’
* restoring previous ‘/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/clustermq’
Warning messages:
1: In utils::install.packages(pkgs = pkgs, lib = lib, repos = myrepos,  :
  installation of package ‘/var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//RtmpGZmLLp/file16e913101b078/clustermq_0.9.0.tar.gz’ had non-zero exit status
2: In utils::install.packages(pkgs = pkgs, lib = lib, repos = myrepos,  :
  installation of package ‘/var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//RtmpGZmLLp/file16e913101b078/clustermq_0.9.0.tar.gz’ had non-zero exit status
mschubert commented 9 months ago

Ah, I'm rewriting the configure script and this will have some rough edges until everything is done.

Can you try https://github.com/mschubert/clustermq/commit/5612364c52f17ba98b241a3f1f7e067c02bad3fe from Github?

The "unable to connect" on macOS is a mystery to me: see https://github.com/mschubert/clustermq/issues/311. But this happens on CI too.

wlandau commented 9 months ago

Thanks for working on this. I got a similar compilation error:

> remotes::install_github("mschubert/clustermq", ref = "5612364c52f17ba98b241a3f1f7e067c02bad3fe")
Using github PAT from envvar GITHUB_PAT
Downloading GitHub repo mschubert/clustermq@5612364c52f17ba98b241a3f1f7e067c02bad3fe
'/usr/bin/git' clone --depth 1 --no-hardlinks --recurse-submodules https://github.com/zeromq/libzmq.git /var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//Rtmp0lFXsv/remotesdea51cd77bdf/mschubert-clustermq-5612364/src/libzmq
Cloning into '/var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//Rtmp0lFXsv/remotesdea51cd77bdf/mschubert-clustermq-5612364/src/libzmq'...
'/usr/bin/git' clone --depth 1 --no-hardlinks --recurse-submodules https://github.com/zeromq/cppzmq.git /var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//Rtmp0lFXsv/remotesdea51cd77bdf/mschubert-clustermq-5612364/src/cppzmq
Cloning into '/var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//Rtmp0lFXsv/remotesdea51cd77bdf/mschubert-clustermq-5612364/src/cppzmq'...
── R CMD build ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
✔  checking for file ‘/private/var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T/Rtmp0lFXsv/remotesdea51cd77bdf/mschubert-clustermq-5612364/DESCRIPTION’ ...
─  preparing ‘clustermq’: (1.6s)
✔  checking DESCRIPTION meta-information
─  cleaning src
─  running ‘cleanup’
─  checking for LF line-endings in source and make files and shell scripts (549ms)
─  checking for empty or unneeded directories (2.1s)
   Removed empty directory ‘clustermq/src/libzmq/build_qnx/nto/aarch64/le’
   Removed empty directory ‘clustermq/src/libzmq/build_qnx/nto/aarch64’
   Removed empty directory ‘clustermq/src/libzmq/build_qnx/nto/x86_64/o’
   Removed empty directory ‘clustermq/src/libzmq/build_qnx/nto/x86_64’
   Removed empty directory ‘clustermq/src/libzmq/build_qnx/nto’
   Removed empty directory ‘clustermq/src/libzmq/builds/openwrt’
─  building ‘clustermq_0.9.0.tar.gz’

* installing *source* package ‘clustermq’ ...
** using staged installation
sed: include/zmq_utils.h.orig: No such file or directory
autogen.sh: error: could not find autoreconf.  autoconf and automake are required to run autogen.sh.
./configure: line 35: die: command not found
./configure: line 38: ./configure: No such file or directory
make: *** No targets specified and no makefile found.  Stop.
ERROR: configuration failed for package ‘clustermq’
* removing ‘/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/clustermq’
Warning messages:
1: In utils::install.packages(pkgs = pkgs, lib = lib, repos = myrepos,  :
  installation of package ‘/var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//Rtmp0lFXsv/filedea533c9091/clustermq_0.9.0.tar.gz’ had non-zero exit status
2: In utils::install.packages(pkgs = pkgs, lib = lib, repos = myrepos,  :
  installation of package ‘/var/folders/4v/vh7xp8553lsbl49svl48g7p00000gp/T//Rtmp0lFXsv/filedea533c9091/clustermq_0.9.0.tar.gz’ had non-zero exit status
mschubert commented 9 months ago

autoconf and automake are required to run autogen.sh

I see now: coreutils and automake are required to compile from Github

wlandau commented 9 months ago

I installed creutils and automake, and I still saw a compilation error with sed: include/zmq_utils.h.orig: No such file or directory. Maybe https://github.com/mschubert/clustermq/pull/312 will help?

mschubert commented 9 months ago

It's a typo in the configure script that's been fixed in the current HEAD (but it didn't stop compilation on GHA CI, so I'm not sure why it did for you)