mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

Q() segfaults on Windows & Linux with multiprocess scheduler with clustermq 0.9.0 #308

Closed: luwidmer closed this issue 9 months ago

luwidmer commented 9 months ago

I also get segfaults with the following even simpler Q() example (simpler than #306), both on Linux and Windows:

options(clustermq.scheduler = "multiprocess")
library(clustermq)
fun <- function(x) {x}

fun(1)
Q(fun = fun, x = 1:1000, n_jobs = 2)
Q(fun = fun, x = 1:1000, n_jobs = 2)

On Linux with R 4.1.0, this results in

Starting 2 processes ...
Running 1,000 calculations (5 objs/19.3 Kb common; 1 calls/chunk) ...
[===================================================>] 100% (2/2 wrk) eta:  0sAssertion failed: check () (src/msg.cpp:387)
Aborted

On Windows with R 4.3.0 this results in the same error as for @wlandau's example in #306:

Starting 2 processes ...
Running 1,000 calculations (5 objs/19.3 Kb common; 1 calls/chunk) ...
[===================================================>] 100% (2/2 wrk) eta:  0sAssertion failed: check () (../zeromq-4.3.4/src/msg.cpp:414)

luwidmer commented 9 months ago

This seems to be related to worker shutdown, given that it always happens after all jobs have completed. I can also provoke it with LSF, but it is harder to reproduce there and needs Q() in a loop.
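For anyone trying to reproduce this, "Q() in a loop" could be sketched roughly as below. This is only an illustration under assumptions: `stress_q` and its `runner` argument are hypothetical names, not part of clustermq, and the round count is arbitrary. If the assertion fires, the R session aborts, so the loop simply never finishes.

```r
# Hypothetical stress helper (not part of clustermq): call Q() repeatedly
# so that worker startup and shutdown happen many times in one session.
# `runner` defaults to clustermq::Q; the default is only evaluated when no
# runner is supplied, so a stub can stand in when clustermq is unavailable.
stress_q <- function(n_rounds = 100, n_calls = 1000, n_jobs = 2,
                     runner = clustermq::Q) {
  fun <- function(x) x
  for (i in seq_len(n_rounds)) {
    res <- runner(fun = fun, x = seq_len(n_calls), n_jobs = n_jobs)
    # Expect one result per call
    stopifnot(length(res) == n_calls)
  }
  invisible(n_rounds)
}

# Usage, with the scheduler option from the report above:
# options(clustermq.scheduler = "multiprocess")
# stress_q(n_rounds = 1000)
```

Setting the scheduler option before the first call mirrors the order used in the original example.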

luwidmer commented 9 months ago

@mschubert are you able to reproduce this as well?

mschubert commented 9 months ago

Yes: I can (occasionally) reproduce, and I'll try to track it down as soon as possible.

I'm also happy to report that I've got internet again at the place I moved to :sweat_smile:

mschubert commented 9 months ago

@luwidmer Can you check if it still occurs with the current git version?

remotes::install_github("mschubert/clustermq@master")

luwidmer commented 9 months ago

Starting 2 processes ...
Running 1,000 calculations (5 objs/19.3 Kb common; 1 calls/chunk) ...
[===================================================>] 100% (2/2 wrk) eta:  0sAssertion failed: check () (../zeromq-4.3.4/src/msg.cpp:414)

Unfortunately yes (I modified the version number in DESCRIPTION to be 0.9.0.12345 and that version indeed got loaded)
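A quick way to confirm which build is installed, as a sketch: `has_min_version` is a hypothetical helper, and the version string is the marker mentioned above. Note that `packageVersion()` reads the installed package's DESCRIPTION, so it is best run in a fresh session right after installing.

```r
# Hypothetical helper: TRUE if the installed version of `pkg` is at least
# `min_version`. Uses utils::packageVersion(), which reads DESCRIPTION.
has_min_version <- function(pkg, min_version) {
  utils::packageVersion(pkg) >= package_version(min_version)
}

# Usage, after installing the patched build:
# utils::packageVersion("clustermq")
# has_min_version("clustermq", "0.9.0.12345")
```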

mschubert commented 9 months ago

I fixed another bug in https://github.com/mschubert/clustermq/commit/5612364c52f17ba98b241a3f1f7e067c02bad3fe, which may be the cause of this crash as well. Can you confirm if this now works? (same git install command as above)

luwidmer commented 9 months ago

I just really tried to provoke it with 1000s of Q() calls and could not, so the fix seems to have done it. Superb, @mschubert! It might make sense to push this as 0.9.1 if no other big issues pop up?

mschubert commented 9 months ago

Great, thanks!

Yes, the plan is to push 0.9.1 within the next few days; there are still some other issues to fix.