mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

Minimal SSH localhost test stalls on Ubuntu but not CentOS #112

Closed HenrikBengtsson closed 5 years ago

HenrikBengtsson commented 5 years ago

I'd like to set up tests, etc., based on localhost workers without having to rely on schedulers being installed on the system. It sounds like SSH workers would do this.

The following works on a CentOS 7 machine with R 3.5.1, clustermq 0.8.5, rzmq 0.9.4, and libzmq.so.3:

> options(clustermq.scheduler="ssh", clustermq.ssh.host="localhost")
> library(clustermq)
> y <- Q(identity, 42, n_jobs=1)
Connecting localhost via SSH ...
Sending common data ...
Submitting 1 worker jobs (ID: 6366) ...
Running 1 calculations (1 calls/chunk) ...
Running 1 calculations (1 calls/chunk) ...
Master: [17.5s 1.2% CPU]; Worker: [avg 26.8% CPU, max 247.8 Mb]
> str(y)
List of 1
 $ : num 42

It works, although it's surprisingly slow (15-20 seconds). Is that due to the ZeroMQ setup or something else?

However, with clustermq 0.8.5, rzmq 0.9.4, and libzmq.so.5, it stalls on both Ubuntu 16.04 with R 3.4.4 and Ubuntu 18.04 with R 3.5.1:

> y <- Q(identity, 42, n_jobs=1)
Connecting localhost via SSH ...
Sending common data ...
Submitting 1 worker jobs (ID: 7110) ...
^C
[ ... stalls ... ]

Interrupting it reveals:

^C

Enter a frame number, or 0 to exit   

1: Q(identity, 42, n_jobs = 1)
2: Q_rows(fun = fun, df = df, const = const, export = export, seed = seed, memory = memory, templa
3: master(qsys = workers, iter = df, rettype = rettype, fail_on_error = fail_on_error, chunk_size 
4: qsys$receive_data(timeout = timeout)
5: rzmq::poll.socket(list(private$socket), list("read"), timeout = msec)
6: (function () 
{
    utils::recover()
})()

(FYI, I'm using the function in frame 6 as my error option.)
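For completeness, that error option amounts to something along these lines (a minimal sketch matching frame 6 above):

# drop into the interactive frame browser on error (or interrupt),
# which is what produces the "Enter a frame number" prompt above
options(error = function() utils::recover())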

Any suggestions?

Also, may I suggest an option to disable the reverse SSH tunneling when running on localhost/127.0.0.1?

wlandau commented 5 years ago

What about options(clustermq.scheduler = "multicore")? I use it to create localhost workers in drake's test suite.
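For reference, a minimal localhost test with that option looks something like this (mirroring the identity example above; progress messages omitted, result as on the CentOS run):

> options(clustermq.scheduler = "multicore")
> library(clustermq)
> y <- Q(identity, 42, n_jobs=1)
> str(y)
List of 1
 $ : num 42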

HenrikBengtsson commented 5 years ago

... options(clustermq.scheduler = "multicore") ...

That works great, thanks. (It's not easy to find out what options are supported from the help/wiki.)

Although my immediate needs are taken care of, I'll leave this issue open since I think it would be useful to validate/test against a localhost SSH worker.

mschubert commented 5 years ago

It's not easy to find out what options are supported from the help/wiki

The multicore backend is indeed hardly documented. I added this now to wiki/Configuration#testing-locally.

Local SSH should work (as it does for you on CentOS).

Can you log the error on Ubuntu using this guide?

I've started implementing local automated testing for SSH, but this is not running yet.

Also, may I suggest an option to disable the reverse SSH tunneling when running on localhost/127.0.0.1?

I'm undecided whether local SSH tests should include the tunnel or not. Travis could just set up the tunnel, but R CMD check will never have the keys set up. It might be a good idea not to tunnel for testing, even though real local processing should always be done with multicore.
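As a rough idea, a test could probe for passwordless SSH to localhost and skip otherwise. A sketch (the helper below is hypothetical, not part of clustermq):

# hypothetical helper: TRUE only if a non-interactive SSH connection to
# localhost succeeds, i.e. keys are set up and no password prompt is needed
can_ssh_localhost <- function() {
  status <- system2("ssh",
                    c("-o", "BatchMode=yes", "-o", "ConnectTimeout=2",
                      "localhost", "true"),
                    stdout = FALSE, stderr = FALSE)
  identical(status, 0L)
}

testthat::test_that("localhost SSH worker round-trip", {
  testthat::skip_if_not(can_ssh_localhost(), "no passwordless SSH to localhost")
  options(clustermq.scheduler = "ssh", clustermq.ssh.host = "localhost")
  y <- clustermq::Q(identity, 42, n_jobs = 1)
  testthat::expect_equal(y, list(42))
})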

HenrikBengtsson commented 5 years ago

After adding

options(clustermq.ssh.log = "~/ssh_proxy.log")

to my ~/.Rprofile, and then running:

$ R --vanilla
[...]
> options(clustermq.scheduler="ssh", clustermq.ssh.host="localhost")
> library(clustermq)
> y <- Q(identity, 42, n_jobs=1)
Sending common data ...
Submitting 1 worker jobs (ID: 6978) ...
Running 1 calculations (1 calls/chunk) ...
^C
Error in rzmq::poll.socket(list(private$socket), list("read"), timeout = msec) : 
  The operation was interrupted by delivery of a signal before any events were available.
> traceback()
5: rzmq::poll.socket(list(private$socket), list("read"), timeout = msec)
4: qsys$receive_data(timeout = timeout)
3: master(qsys = workers, iter = df, rettype = rettype, fail_on_error = fail_on_error, 
       chunk_size = chunk_size, timeout = timeout)
2: Q_rows(fun = fun, df = df, const = const, export = export, seed = seed, 
       memory = memory, template = template, n_jobs = n_jobs, job_size = job_size, 
       rettype = rettype, fail_on_error = fail_on_error, workers = workers, 
       log_worker = log_worker, chunk_size = chunk_size, timeout = timeout)
1: Q(identity, 42, n_jobs = 1)
> 
$ cat ~/ssh_proxy.log
R version 3.5.2 (2018-12-20) -- "Eggshell Igloo"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> clustermq:::ssh_proxy(ctl=50886, job=54842)
master ctl listening at: tcp://localhost:50886
forwarding local network from: tcp://hb-x1:9720
sent PROXY_UP to master ctl
received common data:function (x) x
sent PROXY_READY to master ctl
received: PROXY_CMDqsys$submit_jobs(n_jobs = 1)

It looks like the worker does indeed launch and manages to communicate with the parent process, but then they get stuck.

mschubert commented 5 years ago

Note to self: I think this is because submit_jobs uses the same setup again when running on localhost, i.e. it tries to establish another SSH connection.

Not sure how to handle this yet. Probably best to disable the "SSH" qsys from within ssh_proxy (multi-hop connections should be handled by SSH itself, not by clustermq).
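To illustrate the idea (names here are hypothetical, not the actual clustermq internals), the scheduler selection could refuse to pick the SSH qsys again once it is already running inside the proxy:

# hypothetical sketch of the proposed guard: if we are already inside
# ssh_proxy, fall back to multicore instead of nesting another SSH hop
select_scheduler <- function(scheduler = getOption("clustermq.scheduler", "multicore"),
                             inside_proxy = isTRUE(getOption("clustermq.inside_proxy"))) {
  if (identical(scheduler, "ssh") && inside_proxy) {
    message("already inside ssh_proxy; using multicore instead of nested SSH")
    scheduler <- "multicore"
  }
  scheduler
}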

mschubert commented 5 years ago

Basic SSH connections are now tested on Travis (#136).