ropensci / ozunconf17

Website for 2017 rOpenSci Ozunconf
http://ozunconf17.ropensci.org/

Lambda-boosted genetic algorithms #5

Open jeffreyhanson opened 7 years ago

jeffreyhanson commented 7 years ago

Lambda-boosted genetic algorithms

Summary

We encounter problems in our everyday lives. These problems can be relatively simple to solve (e.g. finding the smallest element in a set) or far more difficult (e.g. the travelling salesperson problem). One approach to finding near-optimal solutions to complex problems is genetic algorithms. Genetic algorithms solve problems by simulating a population of candidate solutions and natural selection processes to find a good solution. Each candidate solution is evaluated using an objective function, which returns a single value measuring the quality of that solution. These algorithms have been used to tackle a range of problems in scientific and commercial contexts.

Genetic algorithms have been implemented in the GA R package. This package is fantastic and offers enormous functionality. However, large-scale problems can take a very long time to solve, because evaluating a complex objective function for a single candidate solution can take several minutes, and solving these problems can involve 10,000+ evaluations. Although the GA R package offers the functionality to solve problems using parallel processing (so that multiple candidate solutions can be evaluated simultaneously), researchers often have limited access to compute resources. As a consequence, there has been increasing interest in using cloud resources to overcome limited compute resources.
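For context, the parallel evaluation mentioned above is enabled through the `parallel` argument of `GA::ga()`; a minimal sketch (the argument values here are purely illustrative):

```r
# Sketch: GA's built-in parallel support.
# parallel = TRUE uses all detected cores; a number picks a core count.
library(GA)

f <- function(x) -(abs(x) + cos(x))  # same toy objective used later in this proposal

result <- ga(
  type = "real-valued",
  fitness = f,
  min = -20, max = 20,
  parallel = 2,   # evaluate candidate solutions on 2 local cores
  maxiter = 50
)
```

This works well when cores are available on one machine; the point of this proposal is to go beyond that limit with serverless infrastructure.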

I propose using Amazon Web Services (AWS) serverless compute infrastructure to overcome this limitation. Simply put, one can use AWS Lambda to execute thousands of R commands in parallel for a relatively small cost (e.g. approx. 900 compute hours free per month with 128 MB RAM). The catch? Each R command must complete within 5 minutes, and the compute resources are extremely limited (i.e. 1536 MB maximum RAM, 512 MB total disk space for R and additional packages). In the context of genetic algorithms, the time-based limitation means that a single evaluation of the objective function must complete within 5 minutes, which is fairly reasonable in my line of work. The main trick will be setting up a minimal R environment. Fortunately, there has already been substantial progress made towards setting up R on AWS Lambda, so this project could utilise existing methods for achieving this. We will also probably need to come up with strategies to reduce package installation sizes. To deal with moderately sized data, we could look into using Amazon's cloud storage service (S3) to host data and load it dynamically each time the service is called.
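To sketch what reducing package installation sizes might look like (the function name `prune_install` and the list of droppable directories are my own assumptions, not an existing API):

```r
# Sketch: install a package into a staging library, then strip components
# that an AWS Lambda deployment does not need (docs, help pages, vignettes).
# Directory names follow the standard installed-package layout.
prune_install <- function(pkg, lib) {
  dir.create(lib, recursive = TRUE, showWarnings = FALSE)
  install.packages(pkg, lib = lib, repos = "https://cloud.r-project.org")
  droppable <- c("doc", "help", "html", "demo", "tests", "vignettes")
  unlink(file.path(lib, pkg, droppable), recursive = TRUE)
  # compiled shared objects could be shrunk further by stripping debug
  # symbols, e.g. system("strip --strip-debug <lib>/<pkg>/libs/*.so"),
  # per the Eddelbuettel blog post linked below
  invisible(NULL)
}
```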

Proposed development

It would be great if we could develop an R package that makes it easy for people to execute R code using AWS Lambda. However, I feel that this goal may be too ambitious given our time constraints. Therefore, I propose that we work on developing the following functions and see how we go.

#' Install pruned package
#'
#' @param x \code{character} vector with package names
#'   to install.
#'
#' @return None.
#'
#' @seealso \code{\link[utils]{install.packages}}.
#'
install_pruned_package <- function(x) {
  # insert magic here
  invisible(NULL)
}

#' Initialise AWS Lambda with new R environment
#'
#' @param key \code{character} Amazon Web Services key.
#'
#' @param packages \code{character} vector with package names
#'   to install.
#'
#' @param varlist \code{character} vector with object names
#'   to save and automatically load in the R AWS Lambda environment.
#'
#' @details A minimal version of each package is installed
#'   to reduce footprint. This includes removing documentation
#'   and reducing the size of binary objects following Dirk
#'   Eddelbuettel's advice
#'   (http://dirk.eddelbuettel.com/blog/2017/08/14#009_compact_shared_libraries).
#'
#' @seealso This function is intended to be an AWS Lambda equivalent
#'   for \code{\link[parallel]{makeCluster}} and
#'   \code{\link[parallel]{clusterExport}}. The
#'   \code{\link{install_pruned_package}} function is used
#'   to install a minimal version of the packages in argument
#'   \code{packages}.
#'
#' @return \code{character} URL for AWS Lambda
makeLambda <- function(key, packages = NULL, varlist = NULL) {
  # insert magic here
  return(NULL)
}

#' Execute code on R AWS lambda environment
#'
#' @param url \code{character} for AWS Lambda service with
#'   R environment initialised.
#'
#' @param expr \code{expression} that should be evaluated
#'    in the R AWS Lambda environment.
#'
#' @return The output from the service.
#'
#' @seealso This function is intended to be an AWS Lambda equivalent
#'   for \code{\link[parallel]{clusterEvalQ}}.
#'
lambdaEvalQ <- function(url, expr) {
  # insert magic here
  return(result)
}
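One way `lambdaEvalQ` could ship an expression to the service is to serialize it with base R; a sketch of the round trip (the HTTP transport itself, e.g. via the httr package, is omitted and would be an implementation choice):

```r
# Sketch: serialize an R expression to raw bytes (the payload we would
# POST to the Lambda endpoint), then unserialize and evaluate it on the
# "server" side. Only base R is used here.
expr <- quote(sum(1:10))
payload <- serialize(expr, connection = NULL)   # raw vector to send
# ... on the Lambda side ...
received <- unserialize(payload)
eval(received)  # evaluates to 55
```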

Proposed usage

To put the above functions in context, I've put together an example that uses genetic algorithms to solve a simple 1-dimensional problem (via GA::ga), followed by an example that uses the proposed functions to solve the same problem using AWS Lambda. Note that the objective function doesn't rely on any additional packages, to keep things simple.

## create optimization problem data
# create objective function
obj <- function(x) -(abs(x)+cos(x))

# plot the curve
# we want to find the value of x that gives the highest y
curve(obj, -20, 20)

## example using the GA package
result1 <- GA::ga(type = "real-valued", fitness = obj, min = -20, max = 20)

## example using AWS Lambda
# note that "XXXXXXXXXXX" is used to represent a valid AWS key
url <- makeLambda("XXXXXXXXXXX", varlist = c("obj"))
obj_wrapper <- function(x) lambdaEvalQ(url, obj(x))
result2 <- GA::ga(type = "real-valued", fitness = obj_wrapper, min = -20, max = 20)

I would love to hear back if anyone has ideas, suggestions, or comments for improving this project proposal.

Useful resources

njtierney commented 6 years ago

@jeffreyhanson this looks really great — nice job putting together such a thorough ozunconf idea. Do you happen to know if there is an example problem + dataset that could be used here? Or would that be the "case study for using AWS Lambda in scientific contexts"?

jeffreyhanson commented 6 years ago

Yeah, we could find a benchmark data set online and compare standard genetic algorithm performance to genetic algorithms that use AWS Lambda to reduce run time. I think the broader appeal is using AWS Lambda to farm out lots of small batch processing jobs.

timchurches commented 6 years ago

I'm very keen to explore ways to use AWS Lambda from R - it's an incredibly cheap way to harness huge amounts of computing power, if you can work out how to fit your problem within the constraints it deliberately imposes.

stephstammel commented 6 years ago

I'd be very keen to help out with this in some capacity - I've been looking at GAs lately and think they have enormous capacity that's currently under-utilised in my field.

luca-scr commented 6 years ago

Hi, I'm the developer of the GA R package. Your proposal looks very interesting and may widen the application of GAs in several domains. I have never used AWS before, but if there is something to do on the GA side, let me know.