R package for packaging workloads, uploading and running on AWS

ropensci / auunconf

repository for the Australian rOpenSci unconference 2016!

18 stars 4 forks source link

R package for packaging workloads, uploading and running on AWS #34

Open Daniel-t opened 8 years ago

Daniel-t commented 8 years ago

In past projects I've often found myself constrained by the resources of my local machine and wondered why can't I package up my scripts and data (similar to publishing a shiny app) upload it to Amazon, start an EC2 instance start and run my job, then download the result.

Of course I could do this manually, but that would get tedious, I often work offline so I can't work in AWS full time, and I only need the full compute power/memory of an EC2 instance until I'm ready to work on the full dataset.

So my proposal is an R package which:

packages up data and scripts
uploads it to Amazon
starts a server with an appropriate configuration
runs the package and saves the output somewhere
stops the instance

There may be some points of convergence with #12.

jonocarroll commented 8 years ago

I find RStudio server, running on an permanent AWS instance, works pretty well for this. I have a script to send a script/data to an S3 bucket, then it's a matter of logging into RStudio via a browser and hitting source/go. Or the whole thing via command line. I'm not sure what more you would automate.

On Thursday, 7 April 2016, Daniel-t notifications@github.com wrote:

In past projects I've often found myself constrained by the resources of my local machine and wondered why can't I package up my scripts and data (similar to publishing a shiny app) upload it to Amazon, start an EC2 instance start and run my job, then download the result.

Of course I could do this manually, but that would get tedious, I often work offline so I can't work in AWS full time, and I only need the full compute power/memory of an EC2 instance until I'm ready to work on the full dataset.

So my proposal is an R package which:

packages up data and scripts

uploads it to Amazon

starts a server with an appropriate configuration

runs the package and saves the output somewhere

stops the instance

There may be some points of convergence with #12 https://github.com/ropensci/auunconf/issues/12.

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/ropensci/auunconf/issues/34

MilesMcBain commented 8 years ago

I can see where @Daniel-t is coming from. Although it's easy to get R up on AWS that's really just the start. If I wanted to get set up for a big task on a meaty instance, and minimise costs, I would go about it like this:

Create a micro instance from Louis Aslett's AMI
Install package dependencies (could take a little while)
Upload code + data
Do a small scale test runs to make sure dependencies setup correctly
Create an image of instance (can take a while)
Deploy image on meaty compute node
Run job + Save results.
Stop meaty node ($$$)

I've never run a cluster job, but I imagine it would work similarly except with a second phase of testing on multiple micros after the imaging.

I played around with Docker/Rocker on the weekend and it struck me that this could really cut down the effort to deploy big tasks. If there were some way I could dockerize an r-project with all its decencies from my dev environment, then I could do testing locally and just deploy the docker image directly on AWS. See: EC2 Container Service. This is something of use in the context of #12 as well.

With this ability in hand, we could then focus our automation efforts on the ECS API to facilitate easy deployment/cleanup of containers from within R. Possible extensions are deployment on other cloud computing services.

jonocarroll commented 8 years ago

If you're looking to automate things, AWS have a command-line interface that is quite powerful, and if setup correctly, seems able to handle instance deployment.

I use the aws-cli to manage a free micro instance, but presumably it can work to manage starting/stopping also.

MilesMcBain commented 8 years ago

There's an R wrapper being written for that interface that seems to be coming along, as mentioned in #12:(https://github.com/cloudyr).

I did a bit of poking around re docker and found this: dockertest which may be a good start for creating containers from R projects.

On that same jaunt I discovered the Google Cloud Compute platform which also supports containers and has some different pricing models (maybe better?).

Daniel-t commented 8 years ago

I'm not sure docker would be the correct technology in this case. In the majority of situations you'd only be looking at running a single application on a server (e.g. 'R'), adding docker would be another layer of abstraction which I cant see providing much value over a straight server instance.

Load/create code + data locally (or whereever)
Do a small scale test runs to make sure dependencies setup correctly
Package everything up (similar to the way shiny apps are created)
upload to S3
Deploy a generic R server image on meaty compute node
Automatically Run job + Save results back to S3.
Stop meaty node ($$$)
read results and logs back into the local R

Regarding @MilesMcBain comment around process flow, I was thinking something more streamlined again that didn't require the prework to be done on a cloud (although it could be)

I based this proposal on AWS because it's what I'm most familiar with, ideally the solution should be extensible to other platforms (e.g. MS Azure, Google, etc).

For the (distant) future, it would also be nice to be able to able to leverage AWS spot instances (cheap excess capacity) for running jobs. For example an AWS server with 32 vCPUs and 60GB of memory (c3.x8large) normally costs U$2.117 an hour (in Sydney), however a spot price for the same is presently $0.3234 per hour. There are issues with Spot instances, in that they get stopped if the spot price goes over your current bid, however for some pieces of analysis they would serve nicely.

MilesMcBain commented 8 years ago

Running a single application on a server is the canonical docker use-case. See: Dockerfile Best Practices.

However if we don't want to muddy the waters with Docker right off that bat, at the very least we would want to ensure r package dependencies are migrated to the cloud automatically to save us setting them up. Packrat will be of use for this.

@Daniel-t , you make a great point about spot instances. Google have similar concept with their pre-emptable instances. It would be cool if the API that helps you deploy your R project in the cloud could also help you make it resistant to being terminated before job completion. I think this in itself is a very interesting sub-problem.

dpagendam commented 8 years ago

I like your idea @Daniel-t and I think it aligns well with the AWS/SNOW idea that was proposed previously for the conference. At present I do a lot of work on AWS and had similar thoughts to you: I wanted to be able to spin up EC2 instances using my ssh key, check whether they had launched, export functions to those R instances, give them jobs to do and get back the results (all from R!) with a collection of easy-to-use tools. I have achieved this using a combination of SNOW (you could just have a single AWS worker that is running your jobs or a cluster of workers), shell scripts and AWS CLI scripts, but I think this sort of thing could be packaged up into a much nicer R package than the collection of clunky scripts that I currently have running on my machine. In essence, I build a cluster of AWS workers but where my laptop operates as the head node. I'd love to talk to you and other like-minded cluster-holics about creating something that takes all the hard work out of running big jobs on AWS from R. It'd also be great if this idea could be generalised so that it would be useful to users of other cloud compute systems (Google, Microsoft etc).

jeffreyhanson commented 8 years ago

This sounds really cool. I'm familiar with snow and networking with ssh. I don't have much experience with AWS or Docker though.

Here's just some random thoughts I have.

To streamline it with the basic snow/parallel package, we'll need S3 methods for the following functions: makeCluster (spawn the cluster), clusterEvalQ (load packages; initialise stuff), clusterExport (send objects to the cluster), and stopCluster (kill the cluster).
I agree with @MilesMcBain, from what I've head of Docker--albeit not much--this is exactly what it's for. And if we use the Rstudio Docker image this simplifies the installation of R and other dependencies.
I think it would be great if we also had a registerDoParallel method so we could do run plyr functionals using AWS.
I don't think it would be feasible to use an image with every single R package on CRAN. So we would need an extra function to install packages on the AWS workers. And/or, if we wanted to be super-sneaky, in the docker image, we could overwrite the library and require functions in the AWS workers so that they first check if the library is installed, and if it's missing, install it from the Rstudio CRAN mirror. So for instance if clusterEvalQ(clust; library(ggplot2)) is run and the image doesn't have ggplot2 then ggplot2 is installed on the workers automatically.
Sending/receiving R objects from the cloud. How are the R objects sent to the workers? Are we saving .rds files and ssh'ing those to and from the workers? If so, it might be worth letting users specify a compression level (eg. the compress argument in saveRDS) so if they are working with large datasets then can reduce bandwidth.

dpagendam commented 8 years ago

The biggest issue I have with my current approach (and one that I'm not network savy enough to have overcome so far) is that to operate a cluster in this way with my local machine as the head node, I have to open ports on my firewall so that the SNOW workers can send back their results. Obviously, this is fine when I am at home and am the network administrator, but doesn't work when I am at work. I'm wondering if anyone has any bright ideas (maybe using a reverse ssh tunnel for example) on how to overcome the firewall problem for SNOW without the need to forward ports?

Daniel-t commented 8 years ago

My original thoughts around this proposal (note I was thinking single instances, not clusters) was to:

upload the script & data to a location near where the servers would be (e.g. S3 on Amazon)
start the server(s)
The servers automatically pull in the data, run the script, save the outputs and runtime messages (for debugging) back to S3 (again in AWS; critical parameters can be passed as userdata)
servers terminate on job completion The idea being to minimise the amount of time using the expensive nodes, the benefits of this approach are:
once started the initiating host can be turned off until after the job is finished and then download the results
If spot instances (or other discounted options) are used, they can start when the price is right.
No need to mess around with SSH, keys etc

Regarding spot instance resiliency mentioned by @MilesMcBain, AWS recently introduced classes of spot instances which are guaranteed to run for a certain period of time (from memory: 1/3/6 hours), the other options for being resistant to termination are 'bidding' at a higher price than everyone else (you only pay the minimum required to get your instance) or having some checkpointing in the code so that if the server does get terminated it can restart with minimal losses.

I'm not familiar with SNOW and only passing familiarity with Docker (I wrote my first docker file last weekend).

Regarding the use of Docker in this use case, I'm still not convinced that it provides much benefit on a server that is intended to exist only for the duration of the job, however I don't see it as being a hindrance so happy to go with thoughts of the group. (FYI, installing rStudio with dependencies on most linux distributions is two commands).

dfalster commented 8 years ago

I was very sorry to miss the auunconf event, but hopefully next time. Anyway, good news is that over the last 2 years, I've been working in a project that has among other things built tools that enable the type of work flows being discussed in this thread. In particular:

Easily spinning up an AWS cluster
Uploading and queuing R jobs

To spin up the cluster, we use a new tool called clusterous, developed by a team at SIRCA. The aim we set the team at SIRCA was to make this process easier for scientists, and enable them to easily upload their code, data and retrieve results. Clusterous is language agnostic, can be used with any number of workflows. To enable R-based workflows, @richfitz developed a couple of tools:

dockertest -- to build a docker container for your R project. On running clusterous, this gets loaded onto the AWS worker machines. The benefit of bringing your own docker container is that you can build and test workers locally, before pushing it up.
rrqueue -- allows you to queue jobs from your laptop onto the AWS cluster.

Other features include logging of output, ability to query task run times, etc.

It's all still a little rough, but we have now successfully used the tools and workflow in several projects. @jscamac and I will hopefully put together a minimal demo in the near future, demonstrating how they all come together.

I'd be keen to hear from anyone who is interested in using these tools, what features might be missing, and whether we might build on this basis in future unconf events.