rstudio / vetiver-r

Version, share, deploy, and monitor models
https://rstudio.github.io/vetiver-r/

Integration with best model from agua #265

Open Colleca opened 11 months ago

Colleca commented 11 months ago

Hello, I'm wondering if there is any way to deploy the best-performing model from a search with the agua/h2o packages. For example, you might have a workflow that first runs H2O AutoML and builds the leaderboard; your code then selects the best model from that leaderboard, pins it to a board with vetiver, and makes predictions from it.

Please label this as a feature request if it isn't doable yet. I think it would be really cool!

juliasilge commented 11 months ago

That is a great question @Colleca. As of today, you would need to do some customization of your generated Plumber file as well as your Dockerfile, because H2O uses software beyond only R and system libraries.

Your plumber file would look something like this:

library(pins)
library(plumber)
library(rapidoc)
library(vetiver)
library(h2o)

h2o.init()

b <- board_connect() ## or your board
v <- vetiver_pin_read(b, "model-name")

#* @plumber
function(pr) {
    pr %>% vetiver_api(v)
}

We do already have support for H2O in bundle so we should be able to roundtrip the H2O model to/from disk correctly.

I don't think that H2O is supported on Posit Connect in a very straightforward manner right now because of the Java requirement, but you should be able to build a Dockerfile for some deployment targets. I am not quickly finding any good examples so we might need to get help from the H2O team.

Here are some docs to look at for H2O on SageMaker (docs are only Python, no R).
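As a starting point for the Plumber customization described above, vetiver can write the generated file to disk so it can be inspected and edited. The sketch below is only an illustration: it uses a stand-in `lm` model and a temporary board (both assumptions, not the H2O model or a Connect board from this thread), and the pin name `cars_linear` is hypothetical. The `library(h2o)` and `h2o.init()` lines would still need to be added by hand to the generated file.

```r
library(pins)
library(vetiver)

# Stand-in model and temporary board, for illustration only
b <- board_temp(versioned = TRUE)
cars_lm <- lm(mpg ~ ., data = mtcars)
v <- vetiver_model(cars_lm, "cars_linear")
vetiver_pin_write(b, v)

# Write the generated Plumber file to a temporary path
plumber_file <- tempfile(fileext = ".R")
vetiver_write_plumber(b, "cars_linear", file = plumber_file)

# Print the generated file so it is clear where custom lines would go
plumber_lines <- readLines(plumber_file)
writeLines(plumber_lines)
```

The same file can then be passed to `vetiver_write_docker()` to produce a Dockerfile, which would need its own edits for the Java dependency.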

Colleca commented 11 months ago

Thanks for your comment, @juliasilge. This seems to be a decent starting point: you bundle the H2O AutoML fit and save it to Posit Connect, then pull it from the server and unbundle it, and both versions make the same predictions.

library(tidymodels)
library(recipes)
library(agua)
library(tidyverse)
library(h2o)
library(bundle)
library(pins)

h2o.init()

data(concrete)
set.seed(4595)
concrete_split <- initial_split(concrete, strata = compressive_strength)
concrete_train <- training(concrete_split)
concrete_test <- testing(concrete_split)

auto_spec <-
    auto_ml() %>%
    set_engine("h2o", max_runtime_secs = 120, seed = 1) %>%
    set_mode("regression")

normalized_rec <-
    recipe(compressive_strength ~ ., data = concrete_train) %>%
    step_normalize(all_predictors())

auto_wflow <-
    workflow() %>%
    add_model(auto_spec) %>%
    add_recipe(normalized_rec)

auto_fit <- fit(auto_wflow, data = concrete_train)

best_model <- bundle(auto_fit)

model_board <- board_connect()

model_board %>% pin_write(best_model, name = "posit/concrete_h2o", type = "rds")

read_in_model <- model_board %>% pin_read("posit/concrete_h2o") %>% unbundle()

model_predictions_local <- predict(auto_fit, concrete_test)

model_predictions_saved <- predict(read_in_model, concrete_test)

identical(model_predictions_local, model_predictions_saved)

Colleca commented 11 months ago

I should point out that pinning it with vetiver does produce an error, so as far as Posit Connect is concerned it just sees an .rds data object, not the awesomeness of a pinned vetiver model object.

juliasilge commented 11 months ago

Can you share the error you get when you try to pin with vetiver, i.e. vetiver_pin_write()? I can successfully pin the model to Connect:

library(tidymodels)
library(recipes)
library(agua)
#> 
#> Attaching package: 'agua'
#> The following object is masked from 'package:workflowsets':
#> 
#>     rank_results
library(h2o)
#> 
#> ----------------------------------------------------------------------
#> 
#> Your next step is to start H2O:
#>     > h2o.init()
#> 
#> For H2O package documentation, ask for help:
#>     > ??h2o
#> 
#> After starting H2O, you can use the Web UI at http://localhost:54321
#> For more information visit https://docs.h2o.ai
#> 
#> ----------------------------------------------------------------------
#> 
#> Attaching package: 'h2o'
#> The following objects are masked from 'package:stats':
#> 
#>     cor, sd, var
#> The following objects are masked from 'package:base':
#> 
#>     &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
#>     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
#>     log10, log1p, log2, round, signif, trunc
library(vetiver)
#> 
#> Attaching package: 'vetiver'
#> The following object is masked from 'package:tune':
#> 
#>     load_pkgs
library(pins)

h2o.init()
#>  Connection successful!
#> 
#> R is connected to the H2O cluster: 
#>     H2O cluster uptime:         4 minutes 57 seconds 
#>     H2O cluster timezone:       America/Denver 
#>     H2O data parsing timezone:  UTC 
#>     H2O cluster version:        3.42.0.2 
#>     H2O cluster version age:    4 months and 18 days 
#>     H2O cluster name:           H2O_started_from_R_juliasilge_aqp711 
#>     H2O cluster total nodes:    1 
#>     H2O cluster total memory:   3.12 GB 
#>     H2O cluster total cores:    8 
#>     H2O cluster allowed cores:  8 
#>     H2O cluster healthy:        TRUE 
#>     H2O Connection ip:          localhost 
#>     H2O Connection port:        54321 
#>     H2O Connection proxy:       NA 
#>     H2O Internal Security:      FALSE 
#>     R Version:                  R version 4.3.2 (2023-10-31)
#> Warning in h2o.clusterInfo(): 
#> Your H2O cluster version is (4 months and 18 days) old. There may be a newer version available.
#> Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html

data(concrete)
set.seed(4595)
concrete_split <- initial_split(concrete, strata = compressive_strength)
concrete_train <- training(concrete_split)
concrete_test <- testing(concrete_split)

auto_spec <-
    auto_ml() |>
    set_engine("h2o", max_runtime_secs = 120, seed = 1) |>
    set_mode("regression")

normalized_rec <-
    recipe(compressive_strength ~ ., data = concrete_train) |>
    step_normalize(all_predictors())

auto_wflow <-
    workflow() |>
    add_model(auto_spec) |>
    add_recipe(normalized_rec)

auto_fit <- fit(auto_wflow, data = concrete_train)
v <- vetiver_model(auto_fit, "julia.silge/concrete_h2o")
v
#> 
#> ── julia.silge/concrete_h2o ─ <bundled_workflow> model for deployment 
#> A h2o regression modeling workflow using 8 features

model_board <- board_connect()
#> Connecting to Posit Connect 2023.10.0 at <https://colorado.posit.co/rsc>
model_board |> vetiver_pin_write(v)
#> Writing to pin 'julia.silge/concrete_h2o'
#> 
#> Create a Model Card for your published model
#> • Model Cards provide a framework for transparent, responsible reporting
#> • Use the vetiver `.Rmd` template as a place to start
#> This message is displayed once per session.

Created on 2023-12-13 with reprex v2.0.2

Colleca commented 11 months ago

Thanks Julia, I think I resolved the issue by updating my packages. I'm no longer getting an error when pinning with vetiver_pin_write(). After resolving that, I'm noticing a new issue: I get different predictions from the in-memory version than from the pinned version. I'm not really sure what's going on.

library(tidymodels)
library(recipes)
library(agua)
library(tidyverse)
library(h2o)
library(vetiver)
library(tictoc)
library(pins)

h2o.init()

data(concrete)
set.seed(4595)
concrete_split <- initial_split(concrete, strata = compressive_strength)
concrete_train <- training(concrete_split)
concrete_test <- testing(concrete_split)

auto_spec <-
    auto_ml() |>
    set_engine("h2o", max_runtime_secs = 120, seed = 1) |>
    set_mode("regression")

normalized_rec <-
    recipe(compressive_strength ~ ., data = concrete_train) |>
    step_normalize(all_predictors())

auto_wflow <-
    workflow() |>
    add_model(auto_spec) |>
    add_recipe(normalized_rec)

auto_fit <- fit(auto_wflow, data = concrete_train)

v <- vetiver_model(auto_fit, "posit/concrete_h2o")

model_board <- board_connect()
#> Connecting to Posit Connect 2023.10.0 at https://colorado.posit.co/rsc

model_board |> vetiver_pin_write(v)

read_in_model <- model_board %>% vetiver_pin_read("posit/concrete_h2o")

model_predictions_local <- predict(auto_fit, concrete_test) # make predictions from in-session model

model_predictions_vetiver <- predict(read_in_model, concrete_test) # make predictions from read-in model

identical(model_predictions_local, model_predictions_vetiver) # they should be the same

# For making API prediction

vetiver_deploy_rsconnect(model_board, "posit/concrete_model", appTitle = "concrete_model")

endpoint <- vetiver_endpoint("theserver/cnct/concrete_model/predict")

apiKey <- "theapikey"

# Sending a test observation

test_ob <- concrete_test[1, ]
tic()
predict(endpoint, test_ob, httr::add_headers(Authorization = paste("Key", apiKey)))
toc()

tic()
predict(auto_fit, test_ob)
toc()
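When two sets of predictions differ, it can help to check whether the mismatch is only floating-point noise or a real discrepancy, since `identical()` requires exact equality. The base R sketch below assumes both prediction objects are data frames with a numeric `.pred` column (the usual shape for tidymodels regression predictions). The values are made up for illustration and are not output from this thread.

```r
# Made-up stand-ins for the local and pinned predictions
preds_local  <- data.frame(.pred = c(30.1, 41.7, 22.4))
preds_pinned <- data.frame(.pred = c(30.1, 41.7, 22.4 + 1e-12))

# identical() requires exact equality, down to the last bit
exact_match <- identical(preds_local, preds_pinned)

# all.equal() allows a small numeric tolerance (about 1.5e-8 by default)
close_match <- isTRUE(all.equal(preds_local, preds_pinned))

# Size of the largest disagreement between the two sets of predictions
max_abs_diff <- max(abs(preds_local$.pred - preds_pinned$.pred))
```

If `close_match` is `TRUE` while `exact_match` is `FALSE`, the two models agree up to numeric tolerance; a large `max_abs_diff` would instead point to the two objects not being the same model.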

juliasilge commented 10 months ago

I don't see a difference between predictions from the local and pinned versions of the H2O models:

library(tidymodels)
library(recipes)
library(agua)
#> 
#> Attaching package: 'agua'
#> The following object is masked from 'package:workflowsets':
#> 
#>     rank_results
library(h2o)
#> 
#> ----------------------------------------------------------------------
#> 
#> Your next step is to start H2O:
#>     > h2o.init()
#> 
#> For H2O package documentation, ask for help:
#>     > ??h2o
#> 
#> After starting H2O, you can use the Web UI at http://localhost:54321
#> For more information visit https://docs.h2o.ai
#> 
#> ----------------------------------------------------------------------
#> 
#> Attaching package: 'h2o'
#> The following objects are masked from 'package:stats':
#> 
#>     cor, sd, var
#> The following objects are masked from 'package:base':
#> 
#>     &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
#>     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
#>     log10, log1p, log2, round, signif, trunc
library(vetiver)
#> 
#> Attaching package: 'vetiver'
#> The following object is masked from 'package:tune':
#> 
#>     load_pkgs
library(pins)

h2o.init()
#>  Connection successful!
#> 
#> R is connected to the H2O cluster: 
#>     H2O cluster uptime:         4 minutes 1 seconds 
#>     H2O cluster timezone:       America/Denver 
#>     H2O data parsing timezone:  UTC 
#>     H2O cluster version:        3.42.0.2 
#>     H2O cluster version age:    5 months and 10 days 
#>     H2O cluster name:           H2O_started_from_R_juliasilge_eqf117 
#>     H2O cluster total nodes:    1 
#>     H2O cluster total memory:   3.41 GB 
#>     H2O cluster total cores:    8 
#>     H2O cluster allowed cores:  8 
#>     H2O cluster healthy:        TRUE 
#>     H2O Connection ip:          localhost 
#>     H2O Connection port:        54321 
#>     H2O Connection proxy:       NA 
#>     H2O Internal Security:      FALSE 
#>     R Version:                  R version 4.3.2 (2023-10-31)
#> Warning in h2o.clusterInfo(): 
#> Your H2O cluster version is (5 months and 10 days) old. There may be a newer version available.
#> Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html

data(concrete)
set.seed(4595)
concrete_split <- initial_split(concrete, strata = compressive_strength)
concrete_train <- training(concrete_split)
concrete_test <- testing(concrete_split)

auto_spec <-
    auto_ml() |>
    set_engine("h2o", max_runtime_secs = 120, seed = 1) |>
    set_mode("regression")

normalized_rec <-
    recipe(compressive_strength ~ ., data = concrete_train) |>
    step_normalize(all_predictors())

auto_wflow <-
    workflow() |>
    add_model(auto_spec) |>
    add_recipe(normalized_rec)

auto_fit <- fit(auto_wflow, data = concrete_train)
v1 <- vetiver_model(auto_fit, "julia.silge/concrete_h2o")
v1
#> 
#> ── julia.silge/concrete_h2o ─ <bundled_workflow> model for deployment 
#> A h2o regression modeling workflow using 8 features

model_board <- board_connect()
#> Connecting to Posit Connect 2023.10.0 at <https://colorado.posit.co/rsc>
model_board |> vetiver_pin_write(v1)
#> Writing to pin 'julia.silge/concrete_h2o'
#> 
#> Create a Model Card for your published model
#> • Model Cards provide a framework for transparent, responsible reporting
#> • Use the vetiver `.Rmd` template as a place to start
#> This message is displayed once per session.

v2 <- model_board |> vetiver_pin_read("julia.silge/concrete_h2o")
v2
#> 
#> ── julia.silge/concrete_h2o ─ <bundled_workflow> model for deployment 
#> A h2o regression modeling workflow using 8 features

preds1 <- predict(v1, concrete_test)
preds2 <- predict(v2, concrete_test)

identical(preds1, preds2)
#> [1] TRUE

Created on 2024-01-05 with reprex v2.0.2

Could you update your example to use the reprex package? Using reprex makes it easier to see both the input and output, and for us to re-run the code in a local session. Thanks! 🙌
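For anyone unfamiliar with it, a minimal sketch of wrapping code in reprex looks like the following. The expression inside `reprex()` is a placeholder, not the modeling code from this thread, and `html_preview = FALSE` is set here only so the call does not try to open a preview window.

```r
library(reprex)

# Placeholder code; replace the body with the code you want to share
out <- reprex({
  x <- c(1, 2, 3)
  mean(x)
}, venue = "gh", html_preview = FALSE)

# `out` holds the rendered Markdown, with output on lines prefixed by "#>"
writeLines(out)
```

In an interactive session the rendered result is typically also placed on the clipboard, ready to paste into a GitHub comment.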