HenrikBengtsson opened this issue 8 years ago
If R is the right layer to run this, then I wonder if some of @jeroenooms's OpenCPU things can work here.
Otherwise something like Condor sitting in the middle may help.
In either case the security issues probably need dealing with, though I expect this will always have to be treated as granting essentially complete access to the other machine.
I'm working on some things to help recreate environments on remote computers (sets of packages, R scripts, dependent files, etc) which might potentially be helpful in this situation.
This paper is interesting too: http://arxiv.org/abs/1412.6890
@richfitz, "Condor" == HTCondor (named Condor 1988-2012), correct?
Another thought, in order to get a first poor man's version going:
Ignoring firewall issues, one could use a basic ad-hoc cluster setup with parallel::makePSOCKcluster() running in the background over SSH key-based access, where the receiving end gives access to Rscript by adding something like:
command="/path/to/Rscript -e 'parallel:::.slaveRSOCK()' MASTER=<user external IP> PORT=11708 OUT=/dev/null TIMEOUT=2592000 METHODS=TRUE XDR=TRUE" ssh-rsa <long key of remote user>
to ~/.ssh/authorized_keys on the compute node, for each user allowed to use that node. This would of course rely on trust, but in several use cases that is good enough. Maybe one could locate public keys from, say, users' GitHub accounts (e.g. https://github.com/HenrikBengtsson.keys) and/or elsewhere to auto-generate the above entry.
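For context, a rough sketch of the master side of such an ad-hoc cluster, assuming key-based SSH access is already set up; the worker host names are hypothetical placeholders:

```r
library(parallel)

workers <- c("node1.example.org", "node2.example.org")   # hypothetical compute nodes

cl <- makePSOCKcluster(
  workers,
  master      = "<user external IP>",   # address of this machine as seen by the workers
  port        = 11708,
  outfile     = "/dev/null",
  timeout     = 2592000,                # 30 days, matching the authorized_keys entry above
  useXDR      = TRUE,
  homogeneous = FALSE                   # use whatever Rscript is on each worker's PATH
)

## Sanity check: ask each worker for its host name.
parSapply(cl, seq_along(workers), function(i) Sys.info()[["nodename"]])
stopCluster(cl)
```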
Depending on what latency is acceptable (think jobs running for > 5 minutes), another scheme would be to have a shared central online repository where "jobs" are deposited and then collected/operated on, on a volunteer basis. This would require much more polling, but could avoid lots of firewall issues. Could Dropbox, Google Drive etc. be used for this? What about the BitTorrent peer-to-peer file-sharing protocol?
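A rough sketch of what a volunteering worker could look like under this scheme, assuming a shared folder (e.g. a Dropbox or Google Drive mount); the paths and job file format below are hypothetical:

```r
## Hypothetical layout: jobs are .rds files containing list(fun, args);
## results are written back as .rds files with the same name.
jobs_dir    <- "~/Dropbox/r-jobs/pending"
results_dir <- "~/Dropbox/r-jobs/results"

poll_once <- function() {
  files <- list.files(jobs_dir, pattern = "\\.rds$", full.names = TRUE)
  for (f in files) {
    job <- readRDS(f)
    res <- try(do.call(job$fun, job$args), silent = TRUE)
    saveRDS(res, file.path(results_dir, basename(f)))
    file.remove(f)   # NB: a real version needs a claim/rename step so that
                     # two volunteering nodes don't pick up the same job
  }
}

repeat {
  poll_once()
  Sys.sleep(60)      # high latency is fine for jobs running > 5 minutes
}
```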
(I've updated the top comment with links mentioned in the thread.)
Yeah, that's the one!
For the ad-hoc bits, what you describe sounds a bit like what I have set up previously with rrqueue and am currently working on in queuer. The latter currently uses a driver compatible with filesystem polling. Alternatively, the BatchJobs package overlaps significantly with some of what is needed here (I believe you are familiar with that last package :stuck_out_tongue_closed_eyes:). (The reason queuer exists is to abstract over ways of storing information and polling (db/filesystem) and to work with a lightweight way of describing how to recreate an environment on a remote machine.)
Another option for nodes finding each other would be to use something like etcd, for which @sckott wrote an R interface package, and then use a discovery approach to find the other nodes.
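A hedged sketch of what that discovery could look like, assuming an etcd server reachable at 127.0.0.1:2379 and its v2 keys HTTP API (the raw httr calls could be replaced by the R client package mentioned above):

```r
library(httr)

etcd <- "http://127.0.0.1:2379"

## Register this node under a shared prefix, with a TTL so that entries
## for nodes that disappear eventually expire.
register_node <- function(name, address, ttl = 60) {
  PUT(sprintf("%s/v2/keys/r-cluster/%s", etcd, name),
      body = list(value = address, ttl = ttl),
      encode = "form")
}

## List the addresses of all nodes currently registered under the prefix.
discover_nodes <- function() {
  res <- content(GET(sprintf("%s/v2/keys/r-cluster", etcd)))
  vapply(res$node$nodes, function(n) n$value, character(1))
}

register_node("node1", "node1.example.org:11708")   # hypothetical node
discover_nodes()
```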
Related: file sync across those nodes is going to be hard. This is an issue for any job that depends on data. Spin out into a separate issue?
This one is really growing with lots of interesting challenges, but that also makes it the more fun too.
About file sync across nodes: yes, this could probably be a topic by itself. Is rsync() a good name? ;)
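Half-joking aside, a toy version of such an rsync() helper (hypothetical, and assuming the system rsync binary plus SSH access on both ends) could be as thin as:

```r
## Push a local directory to a remote node; just a wrapper around the
## system rsync binary, so all the usual rsync/SSH caveats apply.
rsync <- function(src, dest, host, user = Sys.info()[["user"]]) {
  system2("rsync",
          args = c("-az", "--delete",
                   shQuote(path.expand(src)),
                   shQuote(sprintf("%s@%s:%s", user, host, dest))))
}

## e.g. rsync("~/projects/analysis/", "~/analysis/", host = "node1.example.org")
```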
In the simplest setup of an R cluster, I envision that streaming of serialized objects could facilitate most communication needs, including arguments, values and graphics.
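For illustration, a minimal sketch of such streaming over a plain socket, using base R's serialize()/unserialize() on a binary connection (host name and port are hypothetical; run the receiving chunk on one node and the sending chunk on another):

```r
## Receiving node: listen for one connection and read a serialized object.
con <- socketConnection(port = 11709, server = TRUE, blocking = TRUE, open = "rb")
msg <- unserialize(con)   # could be arguments, a result, or a recorded plot
close(con)

## Sending node: connect and stream a serialized object to the receiver.
con <- socketConnection("node1.example.org", port = 11709,
                        blocking = TRUE, open = "wb")
serialize(list(fun = "mean", args = list(x = rnorm(10L))), connection = con)
close(con)
```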
I'm just throwing this one out there:
What would it take to implement a peer-to-peer cluster for evaluating R code?
Examples:
Random things that need to be considered:
Existing frameworks that could possibly be utilized:
EDIT: Added links mentioned in comments below to this initial post. /HB 20160308