mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

clustermq.ssh.timeout #175

Closed by mhesselbarth 4 years ago

mhesselbarth commented 4 years ago

I have had some trouble with the SSH connection lately and found that there might be an option (clustermq.ssh.timeout) available. However, I couldn't find any documentation on how and where to set it. I guess the local .Rprofile should do the trick?

Any help would be highly appreciated.

mhesselbarth commented 4 years ago

I tried to increase all timeout settings, but I am still getting the error that the workers reach a timeout and are terminated.

Example from my log file:

Error in clustermq:::worker("tcp://gwdu103:8922")
   Timeout reached, terminating

mschubert commented 4 years ago

The documentation is in this PR, but I haven't deployed the update on the web page yet (because it's not yet released).

What does your SSH log say?

Note that clustermq.ssh.timeout is for SSH startup, while the worker timeout is likely during runtime.

Are you transferring large amounts of data over SSH? This could be one reason. Another could be that your SSH connection gets disconnected altogether, which may be solvable by changing the default timeouts.
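
For anyone wondering where these settings go, here is a minimal sketch of a local ~/.Rprofile. It assumes the SSH options are set through base R's options() like the other clustermq.* settings. The host name, log path, and timeout value are placeholders for illustration, not recommended values.

```r
# ~/.Rprofile on the local machine (illustrative values only)
options(
    clustermq.scheduler = "ssh",
    clustermq.ssh.host = "user@host",     # placeholder: your login on the remote cluster
    clustermq.ssh.log = "~/cmq_ssh.log",  # placeholder path: SSH log to check when debugging
    clustermq.ssh.timeout = 30            # assumed to be in seconds; applies to SSH startup only
)
```

Since clustermq.ssh.timeout only covers SSH startup, raising it would not be expected to help with a worker that times out during runtime.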

mhesselbarth commented 4 years ago

Thank you very much for your help.

I think the reason was that the data transferred over SSH was too large (about 1 GB).

mattwarkentin commented 3 years ago

@mschubert If the data being transferred via SSH are large, let's say 1 GB+, is there a way to increase the worker timeout? I never have an issue with the SSH startup, but sometimes the data I'm sending are on the larger side, and my workers time out.

Other than not sending large data over SSH, any suggestions for how to work around any timeout issues?

mschubert commented 3 years ago

@mattwarkentin You can set clustermq.worker.timeout.

I still need to document the options better.
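
As a rough sketch, the option could be set with options() before workers are started. Two assumptions here are not confirmed in this thread: that the value is in seconds, and that the option has to be visible to the R session in which the worker runs (so with the SSH connector it may need to go into the .Rprofile on the remote host rather than the local one). The value shown is only an example.

```r
# .Rprofile on the machine where the workers run (assumption: the worker reads this option)
options(
    clustermq.worker.timeout = 3600  # example value, assumed to be in seconds
)
```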

mattwarkentin commented 3 years ago

My attempt in #218