ropensci / auunconf

repository for the Australian rOpenSci unconference 2016!

R package for automatically spinning up a cluster on Amazon EC2 for use with SNOW/Snowfall #12

Open dpagendam opened 8 years ago

dpagendam commented 8 years ago

This is an idea for an R package that I started thinking about some time ago, and I have made a good start on some scripts that could be used as a basis for it. Essentially, the idea is to create an R package that lets researchers easily use other R packages like SNOW and Snowfall (for parallel computing on a cluster) on the Amazon Web Services (AWS) Elastic Compute Cloud (EC2) platform. The package would aim to make spinning up a cluster on AWS as simple as possible, with functions that take care of the process and return the IP addresses of the workers. These can then be handed to SNOW/Snowfall for easy parallel computation.
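
To illustrate the "return the IP addresses of the workers" step, here is a minimal sketch of the logic, written in Python only so it can be run stand-alone. It assumes the cluster has already been launched and that the instance listing has the nested shape of an EC2 DescribeInstances response (`Reservations`, then `Instances`, each with `State` and `PublicIpAddress`). The function name `running_worker_ips` and the sample data are hypothetical and not taken from any existing package.

```python
from typing import Any, Dict, List


def running_worker_ips(response: Dict[str, Any]) -> List[str]:
    """Collect the public IPs of running instances.

    `response` is assumed to follow the nested layout of an EC2
    DescribeInstances result. Instances that are not yet running,
    or that have no public address, are skipped.
    """
    ips: List[str] = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            state = instance.get("State", {}).get("Name")
            ip = instance.get("PublicIpAddress")
            if state == "running" and ip:
                ips.append(ip)
    return ips


# Hypothetical response, trimmed to the fields used above.
example_response = {
    "Reservations": [
        {
            "Instances": [
                {
                    "InstanceId": "i-aaa",
                    "State": {"Name": "running"},
                    "PublicIpAddress": "203.0.113.10",
                },
                # Still booting: no public address yet, so it is skipped.
                {"InstanceId": "i-bbb", "State": {"Name": "pending"}},
            ]
        },
        {
            "Instances": [
                {
                    "InstanceId": "i-ccc",
                    "State": {"Name": "running"},
                    "PublicIpAddress": "203.0.113.11",
                }
            ]
        },
    ]
}

worker_ips = running_worker_ips(example_response)
print(worker_ips)
```

In an R package, the equivalent list of addresses is what would be passed on as the host names for a socket cluster, for example to snow's `makeSOCKcluster()`. Filtering on the instance state matters because workers that are still pending cannot accept connections yet.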

benmarwick commented 8 years ago

Yes, good idea, there's a lot of scope for simplifying the use of AWS with R. There's some prior discussion here: https://discuss.ropensci.org/t/r-interface-for-aws-services/215/5, an archived CRAN package (https://cran.r-project.org/src/contrib/Archive/AWS.tools/), and something on github (https://github.com/armstrtw/Rawscli).

MilesMcBain commented 8 years ago

Nice idea! @tierneyn and I were discussing a related idea recently, which was an R package to simplify the interface with batch processing HPC systems like most universities have. Different styles of processing but the higher level issues are the same. I would be keen to participate in a discussion about the design of this type of package (the one you have proposed).

dpagendam commented 8 years ago

Great to hear that there is some interest in this idea! Thanks for the links @benmarwick - as I understand it, AWS.tools is now defunct, but Rawscli should be useful in providing easy access to the AWS Command Line Utility (CLI). Good find!

@MilesMcBain it sounds like these are very closely related ideas so it'd be great to keep you and @tierneyn in the discussion and to think about whether we can generalise it to clusters other than just AWS.

benmarwick commented 8 years ago

I just spotted the cloudyr project and its AWS EC2 Client Package. This claims to be "a simple client package for the Amazon Web Services (AWS) Elastic Cloud Compute (EC2) REST API, which can be used to monitor use of AWS web services".

mattwatts commented 8 years ago

I'm interested in similar functionality for Nimrod clusters. We're using Nimrod clusters to provide scalable computing in web apps.

ghost commented 8 years ago

Louis Aslett maintains an RStudio Server Amazon Machine Image with many common tools and dependencies built in (e.g. Dropbox integration, LaTeX, ODBC drivers, OpenGL, BLAS, RStan). Launching an instance only takes a couple of minutes. Once it's launched, the RStudio IDE can be accessed from the browser with a simple login. I imagine one could then start using the snow/snowfall packages without any further setup?

dfalster commented 8 years ago

I was very sorry to miss the auunconf event, but hopefully next time. Anyway, the good news is that over the last 2 years I've been working on a project that has, among other things, built tools that enable the type of workflows being discussed in this thread and the related #34 thread. In particular:

  1. Easily spinning up an AWS cluster
  2. Uploading and queuing R jobs

I posted some more information about these in the related thread at #34.