mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

Report early worker failures back to master (and warn on possible R base ABI mismatch) #155

Closed NathanSkene closed 4 years ago

NathanSkene commented 5 years ago

Hi,

Firstly, many thanks for developing this great piece of software.

I've just been working on setting it up to connect via SSH to a SLURM cluster. I was finding that it would hang on "Sending common data ...". After checking the logs on the server I saw:

*** Successfully loaded .Rprofile ***
> clustermq:::ssh_proxy(ctl=54382, job=54802)
master ctl listening at: tcp://localhost:54382
forwarding local network from: tcp://longleaf-login2:9655
sent PROXY_UP to master ctl
Error in unserialize(ans) : 
  cannot read workspace version 3 written by R 3.6.0; need R 3.5.0 or newer
Calls: <Anonymous> -> <Anonymous> -> unserialize
Execution halted

It might be worth adding a check of the R version first, so that it can fail more gracefully if the wrong version of R is loaded?

mschubert commented 5 years ago

Thank you for flagging this!

What I want to fix generally is that if the worker encounters an early error, it still sends this back to the master loop.
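A minimal sketch of that idea, assuming a hypothetical `startup` function that performs the worker setup and a hypothetical `report` callback that delivers a message to the master (neither is a clustermq function, and the `WORKER_ERROR` id is made up for illustration):

```r
# Hypothetical wrapper: `startup` and `report` are placeholders,
# not part of the clustermq API.
run_with_error_report <- function(startup, report) {
  tryCatch(
    startup(),
    error = function(e) {
      # Forward a plain-text description of the failure so the master
      # can display it instead of waiting indefinitely.
      report(list(id = "WORKER_ERROR", msg = conditionMessage(e)))
      invisible(NULL)
    }
  )
}
```

This only helps as long as the channel used by `report` still works when the error occurs.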

In this particular case, I'll see what I can do. It is a bit more complicated because

  1. Serialization is the backbone of all master-worker communication. If this breaks, we cannot send messages, so sending an error message will not work either
  2. Some changes are just not documented between R versions. For instance, this is a breaking change on a stable (>=1.0) API, so it should occur only on major (x.0.0) version bumps. Yet, R routinely breaks functionality in minor versions, and it is impossible to anticipate which or when

So maybe the best approach is to display a warning if the SSH R has a different major or minor version than the master process.
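Such a check could look roughly like the following sketch. The helper name `warn_on_version_mismatch` is hypothetical, and it assumes the remote R version has already been obtained as a string such as `"3.4.1"`; how that string reaches the master is left open here.

```r
# Hypothetical helper, not part of clustermq: compares only the
# major.minor part of two R version strings.
warn_on_version_mismatch <- function(remote,
                                     local = as.character(getRversion())) {
  major_minor <- function(v) {
    parts <- strsplit(as.character(v), ".", fixed = TRUE)[[1]]
    paste(parts[1:2], collapse = ".")
  }
  ok <- identical(major_minor(local), major_minor(remote))
  if (!ok) {
    warning(sprintf("R version mismatch: local %s, remote %s",
                    as.character(local), as.character(remote)),
            call. = FALSE)
  }
  invisible(ok)
}
```

Patch-level differences (e.g. 3.6.0 vs. 3.6.1) pass silently in this sketch; only a differing major or minor version triggers the warning.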

Can you check whether a message serialized with your R<3.5.0 can be unserialized with R>=3.6.0?

i.e.:

saveRDS(serialize(1:10, NULL), "test.rds") # on your server w/ R<3.5.0
unserialize(readRDS("test.rds")) # on your local machine with current R

NathanSkene commented 5 years ago

Thanks for looking into it!

I ran this line on the server running R 3.4.1:

saveRDS(serialize(1:10, NULL), "test.rds")

I downloaded the file and opened it locally with R 3.6.0. It worked fine.
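That covers the old-writer, new-reader direction. The direction that failed in the log is the reverse one: R 3.6.0 writes serialization format version 3 by default, which R older than 3.5.0 cannot read. A small sketch, to be run on the newer R, showing that `serialize()` accepts a `version` argument and can still produce format version 2. The helper `serialization_version` is made up for illustration and assumes the default binary (XDR) layout, where the format version follows the two-byte header as a 4-byte big-endian integer.

```r
x <- 1:10

# Explicitly request the older and the newer serialization format
payload_v2 <- serialize(x, NULL, version = 2)
payload_v3 <- serialize(x, NULL, version = 3)

# Illustrative helper: with the default binary (XDR) encoding, bytes 3-6
# hold the format version, so the last of them is the version number.
serialization_version <- function(payload) as.integer(payload[6])

# Both payloads restore the same values on the newer R
roundtrip_ok <- identical(unserialize(payload_v2), x) &&
  identical(unserialize(payload_v3), x)
```

Saving `payload_v2` with `saveRDS(payload_v2, "test.rds", version = 2)` and reading it on the server with R 3.4.1 would test the reverse direction in the same way as above.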

mschubert commented 4 years ago

This will be addressed by #150.