r-lib / carrier

Create standalone functions for remote execution

Problem crating a lightgbm model #11

Open Keyeoh opened 1 month ago

Keyeoh commented 1 month ago

Hi,

First of all, thank you for your amazing work. I have been able to log a lightgbm model in mlflow using a "crated" function. The problem is that when I load the crated model in a new, clean session, I run into errors because some of the dependencies are not there.

I started declaring by hand each of the dependencies that were raising errors, to see if I could arrive at a decent compromise, but I run into problems once I reach some FFI code.

Say I have a fitted workflow model...

> final_fit
══ Workflow [trained] ═══════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ──────────────────
0 Recipe Steps

── Model ────────────────
LightGBM Model (1174 trees)
Objective: binary
Fitted to dataset with 25 columns
> 

...and I try to crate a function for prediction in the following way:

c_model <- crate(
    function(new_obs) get_predictions(model, new_obs),
    model = final_fit,
    get_predictions = rlang::set_env(rchurn2:::get_predictions),
    predict = rlang::set_env(workflows:::predict.workflow),
    is_trained_workflow = rlang::set_env(workflows:::is_trained_workflow),
    validate_is_workflow = rlang::set_env(workflows:::validate_is_workflow),
    check_dots_empty = rlang::set_env(rlang::check_dots_empty),
    ellipsis_dots = rlang::set_env(rlang:::ellipsis_dots),
    ffi_ellipsis_dots = rlang:::ffi_ellipsis_dots,
    caller_env = rlang::set_env(rlang::caller_env)
)

Then, when I try to call it in a clean session, it gives me an error I just cannot understand:

callr::r(
    function (d, cmod) {
        cmod(d)
    },
    args = list(
        d = splits[['tst']],
        cmod = c_model
    )
)

Error: 
! in callr subprocess.
Caused by error in `.Call(ffi_ellipsis_dots, env)`:
! NULL value passed as symbol address
Type .Last.error to see the more details.
> 

I just wanted to know whether I am trying to do too much (that is, whether it is possible to crate functions that depend on FFI code), or whether it would be better to simply ensure that all the needed dependencies are available on the machine that will run the inference part. The latter would be the easy and comfortable path, but I wanted to ask because I think crate is a great tool, and this is not such a corner case.

Thanks a lot in advance.

Gus.

simonpcouch commented 1 month ago

Noting that there are some possibly related conversations at https://github.com/rstudio/bundle/issues/55 and linked issues, though this error seems more crate/rlang-related (likely because the crated environment captures `rlang:::ffi_ellipsis_dots`, a pointer to a compiled routine, and that pointer doesn't survive serialization into a fresh subprocess). Here's a reprex:

library(tidymodels)
library(bonsai)
library(carrier)

fit <- 
  boost_tree("classification", engine = "lightgbm") %>%
  fit(Class ~ A + B, two_class_dat)

fit
#> parsnip model object
#> 
#> LightGBM Model (100 trees)
#> Objective: binary
#> Fitted to dataset with 2 columns

c_model <- crate(
  predict,
  model = fit,
  predict = rlang::set_env(workflows:::predict.workflow),
  is_trained_workflow = rlang::set_env(workflows:::is_trained_workflow),
  validate_is_workflow = rlang::set_env(workflows:::validate_is_workflow),
  check_dots_empty = rlang::set_env(rlang::check_dots_empty),
  ellipsis_dots = rlang::set_env(rlang:::ellipsis_dots),
  ffi_ellipsis_dots = rlang:::ffi_ellipsis_dots,
  caller_env = rlang::set_env(rlang::caller_env)
)

callr::r(
  function(d, cmod) {
    cmod(d)
  },
  args = list(
    d = two_class_dat,
    cmod = c_model
  )
)
#> Error: ! in callr subprocess.
#> Caused by error in `.Call(ffi_ellipsis_dots, env)`:
#> ! NULL value passed as symbol address

Created on 2024-05-17 with reprex v2.1.0

simonpcouch commented 1 month ago

Something like this should do the trick!

library(tidymodels)
library(bonsai)
library(carrier)

fit <- 
  workflow(
    Class ~ A + B,
    boost_tree("classification", engine = "lightgbm")
  ) %>%
  fit(two_class_dat)

fit
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: boost_tree()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> Class ~ A + B
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> LightGBM Model (100 trees)
#> Objective: binary
#> Fitted to dataset with 2 columns

c_model <- crate(
  function(new_data, ...) workflows:::predict.workflow(model, new_data, ...),
  model = fit
)

callr::r(
  function(d, cmod) {
    cmod(d)
  },
  args = list(
    d = two_class_dat,
    cmod = c_model
  )
)
#> # A tibble: 791 × 1
#>    .pred_class
#>    <fct>      
#>  1 Class1     
#>  2 Class1     
#>  3 Class2     
#>  4 Class2     
#>  5 Class1     
#>  6 Class2     
#>  7 Class2     
#>  8 Class2     
#>  9 Class1     
#> 10 Class2     
#> # ℹ 781 more rows

Created on 2024-05-17 with reprex v2.1.0

Keyeoh commented 1 month ago

Thanks for your answer, @simonpcouch! It does work out of the box.

The problem is, and maybe I am missing something, that I still cannot achieve what I had in mind at first (that is, a function containing all of its dependencies, so that a vanilla R installation on the target system could run it as-is).

I have serialized the model and the data used in your example:

library(tidymodels)
library(bonsai)
library(callr)
library(carrier)
library(lightgbm)
#> 
#> Attaching package: 'lightgbm'
#> The following object is masked from 'package:dplyr':
#> 
#>     slice

fit <-
    workflow(Class ~ A + B, boost_tree("classification", engine = "lightgbm")) %>%
    fit(two_class_dat)

fit
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: boost_tree()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> Class ~ A + B
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> LightGBM Model (100 trees)
#> Objective: binary
#> Fitted to dataset with 2 columns

c_model <- crate(
    function(new_data, ...) workflows:::predict.workflow(model, new_data, ...),
    model = fit
)

callr::r(
    function(d, cmod) {
        cmod(d)
    },
    args = list(
        d = two_class_dat,
        cmod = c_model
    )
)
#> # A tibble: 791 × 1
#>    .pred_class
#>    <fct>      
#>  1 Class1     
#>  2 Class1     
#>  3 Class2     
#>  4 Class2     
#>  5 Class1     
#>  6 Class2     
#>  7 Class2     
#>  8 Class2     
#>  9 Class1     
#> 10 Class2     
#> # ℹ 781 more rows

saveRDS(c_model, 'model.rds')
saveRDS(two_class_dat, 'data.rds')

Created on 2024-05-19 with reprex v2.1.0

Then I de-serialized them in a new, clean session without the packages installed, and I get an error asking me to install the dependencies. This seems logical, and it is certainly more informative than the FFI-related error we started with, but it makes me wonder again whether what I want to achieve makes any sense.

d <- readRDS('../prueba_crate/data.rds')
m <- readRDS('../prueba_crate/model.rds')
m(d)
#> Error in `check_installs()`:
#> ! This engine requires some package installs: 'lightgbm, bonsai'

Created on 2024-05-19 with reprex v2.1.0

My conclusion, I guess, is that the idea I had in mind is too complex and maybe not worth the effort. Since I already have renv.lock and requirements.txt files that I use in the training phase, I think I can reproduce the environment for inference in the same way, as sketched below.
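
A minimal sketch of what I have in mind (assuming the training-time renv.lock ships alongside model.rds and data.rds; the file names here are just placeholders):

# restore the training-time library before loading the crated model
renv::restore(lockfile = "renv.lock", prompt = FALSE)

d <- readRDS("data.rds")
m <- readRDS("model.rds")
m(d)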

Thanks a lot for your help. Anyway, if you have any tips on the feasibility of such an approach, I'm all ears. :) Always happy to learn from you! :)

Regards, Gus.

simonpcouch commented 1 month ago

Glad the answer was helpful!

One tool that may be helpful as you put together your production environment: workflows (and other tidymodels objects) have required_pkgs() methods, which return the packages needed to predict() with the supplied object.
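
A quick sketch with the workflow from the reprexes above (not run here; the exact output depends on your package versions, but judging from the check_installs() error it should include at least 'lightgbm' and 'bonsai'):

# with tidymodels and bonsai loaded as in the reprexes above
required_pkgs(fit)

You could then feed that character vector into whatever you use to set up the inference environment (a Dockerfile, an renv lockfile, etc.).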