signaturescience / focustools

Forecasting COVID-19 in the US
https://signaturescience.github.io/focustools/
GNU General Public License v3.0

configure bootstrapping script for aws ec2 instance #23

Closed vpnagraj closed 3 years ago

vpnagraj commented 3 years ago

part of the planned pipeline automation will be a pre-configured ec2 instance with the code/data necessary to generate a weekly forecast. all of this data can be provisioned programmatically with a bootstrapping script

some of the sub-tasks here:

note: a bootstrapping script is passed as "user data" and runs only on first boot. what that means is that if we stop the instance and start it again, the script will not be run on restart. so if we pack everything into the script (commands to install focustools, initiate the wrapper bash script to generate forecasts, and then a command to shut down the instance), we need to completely terminate the instance and launch a fresh one each time. which is fine. just need to keep that in mind moving forward.
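a rough sketch of what a launch-per-forecast call could look like with the aws cli. this is not from the project: the function name, AMI ID, instance type and instance profile name are all hypothetical, and it assumes the instance profile grants access to the S3 bucket the script uses. with `--instance-initiated-shutdown-behavior terminate`, a shutdown issued from inside the instance terminates it instead of stopping it, so each run starts from a fresh boot and the user data runs again.

```shell
#!/bin/sh
# Sketch only: launch a one-shot instance that runs the bootstrap script as user data.
# Arguments (all hypothetical values supplied by the caller):
#   $1 AMI ID, $2 instance type, $3 IAM instance profile name, $4 path to bootstrap script
launch_forecast_instance() {
  # Rebuild the positional parameters as the full aws command line.
  set -- aws ec2 run-instances \
    --image-id "$1" \
    --instance-type "$2" \
    --iam-instance-profile "Name=$3" \
    --user-data "file://$4" \
    --instance-initiated-shutdown-behavior terminate

  if [ "${DRY_RUN:-0}" = 1 ]; then
    # Print the command instead of running it.
    echo "$*"
  else
    "$@"
  fi
}
```

running it with `DRY_RUN=1` prints the command so it can be checked before anything is launched.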

example of bootstrapping script from another project:

https://github.com/signaturescience/kube/blob/master/launch/kube-bootstrap.sh

vpnagraj commented 3 years ago

i've done the work here.

not sure of the best place to actually store the bootstrap script. but when launched as "user data" with an EC2 instance, the bootstrap script below will:

  1. Install all software deps
  2. Install focustools package from prebuilt tarball hosted on S3 (placeholder {BUCKET-NAME} in script)
  3. Run the analysis pipeline (see focus-pipeline.R at bottom of comment, for example)
  4. Copy the results file to an S3 bucket (placeholder {BUCKET-NAME} in script)
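one way to fill in the {BUCKET-NAME} placeholder before launching is a small sed substitution. this is a minimal sketch, not something from the project; the function name `render_bootstrap` and any bucket name passed to it are hypothetical.

```shell
#!/bin/sh
# Sketch only: write a copy of a script to stdout with the {BUCKET-NAME}
# placeholder replaced.
#   $1 path to the template script, $2 bucket name (hypothetical, caller-supplied)
# S3 bucket names contain only lowercase letters, digits, dots and hyphens,
# so the name cannot clash with the "/" delimiter used by sed here.
render_bootstrap() {
  sed "s/{BUCKET-NAME}/$2/g" "$1"
}
```

for example, `render_bootstrap focus-bootstrap.sh my-forecast-bucket > rendered.sh` would produce a script that can be passed as user data as-is.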

a few notes:

focus-bootstrap.sh

#!/bin/bash

## install R and system dependencies
sudo apt-get update

sudo apt-get install -y \
  r-base \
  git \
  libcurl4-openssl-dev \
  gdebi-core \
  libssl-dev \
  libsasl2-dev \
  libxml2-dev \
  libcairo2-dev \
  pandoc \
  zlib1g-dev \
  python3-pip \
  python3-venv \
  python3-setuptools

## install aws cli
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

## install miniconda for validation process
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda

## install virtualenv for validation process
sudo -H pip3 install virtualenv

## install packages
sudo R -e "install.packages(c('magrittr','dplyr','readr','tidyr','yaml','glue','fabletools', 'fable','tsibble','MMWRweek','httr','lubridate','purrr','reticulate','tibble','lifecycle','devtools','here','feasts','urca'), INSTALL_opts = '--no-lock')"

## run virtualenv create so validation will work
sudo R -e "reticulate::virtualenv_create()"

## get the package
aws s3 cp s3://{BUCKET-NAME}/focustools_0.0.0.9000.tar.gz .

## and install it
R CMD INSTALL focustools_0.0.0.9000.tar.gz

## get the pipeline script
aws s3 cp s3://{BUCKET-NAME}/focus-pipeline.R .

## and run it ...
mkdir submissions
Rscript focus-pipeline.R

## write the system log so we can debug if need be
cat /var/log/syslog > "submissions/$(date +'%Y_%m_%d_%I_%M_%p').log"

## copy the submission file
aws s3 cp --recursive submissions s3://{BUCKET-NAME}/submissions

## stop the instance!
shutdown -h now

focus-pipeline.R

library(focustools)
library(dplyr)
library(readr)
library(purrr)
library(fabletools)
library(fable)

## get data at the national scale from jhu source
usac <-  get_cases(source="jhu",  granularity = "national")
usad <- get_deaths(source="jhu",  granularity = "national")

## use the focustools helper to prep the tsibble format
usa <-  
  dplyr::inner_join(usac, usad, by = c("epiyear", "epiweek","location")) %>% 
  make_tsibble(chop=TRUE)

fit.icases <- usa %>% model(arima = ARIMA(icases, stepwise=FALSE, approximation=FALSE))
fit.ideaths <- usa %>% model(linear_caselag3 = TSLM(ideaths ~ lag(icases, 3)))

## generate incident case forecast
icases_forecast <- ts_forecast(fit.icases, outcome = "icases", horizon = 4)
icases_forecast

## need to get future cases to pass to ideaths forecast
future_cases <- ts_futurecases(usa, icases_forecast, horizon = 4)
# Forecast incident deaths based on best guess for cases
ideaths_forecast <- ts_forecast(fit.ideaths,  outcome = "ideaths", new_data = future_cases)
ideaths_forecast

## generate cumulative forecasts
cdeaths_forecast <- ts_forecast(outcome = "cdeaths", .data = usa, inc_forecast = ideaths_forecast)
cdeaths_forecast

## create submission object
submission <-
  list(format_for_submission(icases_forecast, target_name = "inc case"),
       format_for_submission(ideaths_forecast, target_name = "inc death"),
       format_for_submission(cdeaths_forecast, target_name = "cum death")) %>%
  reduce(bind_rows) %>%
  arrange(target)

submission

## set up file path for submission file
## forcing the date in the file name and submission contents to be this monday
ffile <- file.path(paste0("submissions/", Sys.Date(), "-SigSci-TS.csv"))
submission %>%
  #mutate(forecast_date = this_monday()) %>%
  write_csv(ffile)

validation <- validate_forecast(ffile, install=TRUE)
## readr >= 1.4.0 names this argument `file` (`path` is deprecated)
write_lines(validation, file = paste0("submissions/", Sys.Date(), "-validation.txt"))

vpnagraj commented 3 years ago

leaving this open while i address the last note:

the boot process should write submission csv, a log file AND a validation text file (but that's not quite working yet because of python installation problems ...)

if i can fix that i will adjust the code in the comment above accordingly.
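a possible sanity check for the three expected outputs, which could run just before the copy to S3. this is a sketch under assumptions: the function name `check_outputs` is hypothetical, and the file patterns are inferred from the names used in the bootstrap and pipeline scripts above.

```shell
#!/bin/sh
# Sketch only: report which of the expected output files are missing.
#   $1 directory holding the outputs (e.g. submissions)
# Returns 0 when a submission csv, a validation text file and a log file
# are all present, 1 otherwise.
check_outputs() {
  dir="$1"
  missing=0
  for pattern in '*-SigSci-TS.csv' '*-validation.txt' '*.log'; do
    # ls exits non-zero when the glob matches nothing.
    if ! ls "$dir"/$pattern >/dev/null 2>&1; then
      echo "missing: $pattern"
      missing=1
    fi
  done
  return "$missing"
}
```

calling `check_outputs submissions` after the Rscript step would print a line for each output that was not produced, which ends up in the system log on the instance.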

vpnagraj commented 3 years ago

this is fully working now.

i added the following line to the bootstrap script above:

## run virtualenv create so validation will work
sudo R -e "reticulate::virtualenv_create()"

that call seeds the python installation with a virtualenv (defaults to r-reticulate) so that reticulate has somewhere to install packages (which it needs to do when it eventually calls validate_forecast() in the R pipeline script)
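to confirm the virtualenv actually got created, a check along these lines could be added to the script. it is only a sketch: the function name is hypothetical, and it assumes reticulate's default virtualenv root (WORKON_HOME if set, otherwise ~/.virtualenvs) and a Linux layout with the interpreter at bin/python. it needs to run as the same user that ran virtualenv_create(), since the default root lives under that user's home directory.

```shell
#!/bin/sh
# Sketch only: succeed if an r-reticulate virtualenv exists under the given root.
#   $1 virtualenv root (optional); defaults to $WORKON_HOME, then ~/.virtualenvs
has_reticulate_venv() {
  venv_root="${1:-${WORKON_HOME:-$HOME/.virtualenvs}}"
  # A virtualenv on Linux keeps its interpreter at <env>/bin/python.
  [ -x "$venv_root/r-reticulate/bin/python" ]
}
```

something like `has_reticulate_venv || echo "r-reticulate virtualenv not found"` right after the virtualenv_create() line would leave a hint in the log if the python setup breaks again.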