HenrikBengtsson opened this issue 8 years ago
If R is the right layer to run this, then I wonder if some of @jeroenooms's OpenCPU things can work here.
Otherwise something like Condor sitting in the middle may help.
In either case the security issues probably need dealing with, though I expect this will always have to be treated as granting essentially complete access to the other machine.
I'm working on some things to help recreate environments on remote computers (sets of packages, R scripts, dependent files, etc) which might potentially be helpful in this situation.
This paper is interesting too: http://arxiv.org/abs/1412.6890
@richfitz, "Condor" == HTCondor (named Condor 1988-2012), correct?
Another thought, in order to get a first poor man's version going:
Ignoring firewall issues, one could use a basic ad-hoc cluster setup with parallel::makePSOCKcluster() running in the background over SSH key-based access, where the receiving end gives access to Rscript by adding something like:
command="/path/to/Rscript -e 'parallel:::.slaveRSOCK()' MASTER=<user external IP> PORT=11708 OUT=/dev/null TIMEOUT=2592000 METHODS=TRUE XDR=TRUE" ssh-rsa <long key of remote user>
to ~/.ssh/authorized_keys on the compute node, for each user allowed to use that node. This would of course rely on trust, but in several use cases that is good enough. Maybe one could locate public keys from, say, users' GitHub accounts (e.g. https://github.com/HenrikBengtsson.keys) and/or elsewhere to auto-generate the above entry.
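For context, a rough sketch of the master side of such an ad-hoc cluster, assuming key-based SSH access is already set up; the worker host names are hypothetical placeholders:

```r
library(parallel)

workers <- c("node1.example.org", "node2.example.org")   # hypothetical compute nodes

cl <- makePSOCKcluster(
  workers,
  master      = "<user external IP>",   # address of this machine as seen by the workers
  port        = 11708,
  outfile     = "/dev/null",
  timeout     = 2592000,                # 30 days, matching the authorized_keys entry above
  useXDR      = TRUE,
  homogeneous = FALSE                   # use whatever Rscript is on each worker's PATH
)

## Sanity check: ask each worker for its host name.
parSapply(cl, seq_along(workers), function(i) Sys.info()[["nodename"]])
stopCluster(cl)
```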
Depending on what latency is acceptable (think jobs running for > 5 minutes), another scheme would be to have a shared central online repository where "jobs" are deposited and then collected/operated on, on a volunteer basis. This would require much more polling, but could avoid lots of firewall issues. Could Dropbox, Google Drive etc. be used for this? What about the BitTorrent peer-to-peer file-sharing protocol?
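A rough sketch of what a volunteering worker could look like under this scheme, assuming a shared folder (e.g. a Dropbox or Google Drive mount); the paths and job file format below are hypothetical:

```r
## Hypothetical layout: jobs are .rds files containing list(fun, args);
## results are written back as .rds files with the same name.
jobs_dir    <- "~/Dropbox/r-jobs/pending"
results_dir <- "~/Dropbox/r-jobs/results"

poll_once <- function() {
  files <- list.files(jobs_dir, pattern = "\\.rds$", full.names = TRUE)
  for (f in files) {
    job <- readRDS(f)
    res <- try(do.call(job$fun, job$args), silent = TRUE)
    saveRDS(res, file.path(results_dir, basename(f)))
    file.remove(f)   # NB: a real version needs a claim/rename step so that
                     # two volunteering nodes don't pick up the same job
  }
}

repeat {
  poll_once()
  Sys.sleep(60)      # high latency is fine for jobs running > 5 minutes
}
```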
(I've updated the top comment with links mentioned in the thread.)
Yeah, that's the one!
For the ad-hoc bits, what you describe sounds a bit like what I have set up previously with rrqueue and am currently working on in queuer. The latter currently uses a driver compatible with filesystem polling. Alternatively, the BatchJobs package overlaps significantly with some of what is needed here (I believe you are familiar with that last package :stuck_out_tongue_closed_eyes:). (The reason queuer exists is to abstract over ways of storing information and polling (db/filesystem) and to work with a lightweight way of describing how to recreate an environment on a remote machine.)
Another option for nodes finding each other would be to use something like etcd, for which @sckott wrote an R interface package, and then use a discovery approach to find the other nodes.
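A hedged sketch of what that discovery could look like, assuming an etcd server reachable at 127.0.0.1:2379 and its v2 keys HTTP API (the raw httr calls could be replaced by the R client package mentioned above):

```r
library(httr)

etcd <- "http://127.0.0.1:2379"

## Register this node under a shared prefix, with a TTL so that entries
## for nodes that disappear eventually expire.
register_node <- function(name, address, ttl = 60) {
  PUT(sprintf("%s/v2/keys/r-cluster/%s", etcd, name),
      body = list(value = address, ttl = ttl),
      encode = "form")
}

## List the addresses of all nodes currently registered under the prefix.
discover_nodes <- function() {
  res <- content(GET(sprintf("%s/v2/keys/r-cluster", etcd)))
  vapply(res$node$nodes, function(n) n$value, character(1))
}

register_node("node1", "node1.example.org:11708")   # hypothetical node
discover_nodes()
```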
Related: file sync across those nodes is going to be hard. This is an issue for any job that depends on data. Spin out into a separate issue?
This one is really growing with lots of interesting challenges, but that also makes it the more fun too.
About file sync across nodes: yes, this could probably be a topic by itself. Is rsync() a good name? ;)
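Half-joking aside, a toy version of such an rsync() helper (hypothetical, and assuming the system rsync binary plus SSH access on both ends) could be as thin as:

```r
## Push a local directory to a remote node; just a wrapper around the
## system rsync binary, so all the usual rsync/SSH caveats apply.
rsync <- function(src, dest, host, user = Sys.info()[["user"]]) {
  system2("rsync",
          args = c("-az", "--delete",
                   shQuote(path.expand(src)),
                   shQuote(sprintf("%s@%s:%s", user, host, dest))))
}

## e.g. rsync("~/projects/analysis/", "~/analysis/", host = "node1.example.org")
```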
In the simplest setup of an R cluster, I envision that streaming of serialized objects could facilitate most communication needs, including arguments, values and graphics.
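For illustration, a minimal sketch of such streaming over a plain socket, using base R's serialize()/unserialize() on a binary connection (host name and port are hypothetical; run the receiving chunk on one node and the sending chunk on another):

```r
## Receiving node: listen for one connection and read a serialized object.
con <- socketConnection(port = 11709, server = TRUE, blocking = TRUE, open = "rb")
msg <- unserialize(con)   # could be arguments, a result, or a recorded plot
close(con)

## Sending node: connect and stream a serialized object to the receiver.
con <- socketConnection("node1.example.org", port = 11709,
                        blocking = TRUE, open = "wb")
serialize(list(fun = "mean", args = list(x = rnorm(10L))), connection = con)
close(con)
```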
I'm just throwing this one out there:
What would it take to implement a peer-to-peer cluster for evaluating R code?
Examples:
Random things that need to be considered:
Existing frameworks that could possibly be utilized:
EDIT: Added links mentioned in comments below to this initial post. /HB 20160308