rstudio / mleap

R Interface to MLeap
http://spark.rstudio.com/guides/mleap/
Apache License 2.0

how to handle feature vectorizer for sample input? #33

Closed: geoHeil closed this issue 5 years ago

geoHeil commented 5 years ago

I currently get:

.IllegalArgumentException: Field "features" does not exist.
Available fields: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb, big_hp, features_1, features_2_62, features_16_46

with the following example:

library(sparklyr)
library(mleap)
library(dplyr)  # provides copy_to() and glimpse() used below
# install if you do not already have it
# spark_install(version = "2.3.3")
spark <- spark_connect(master = "local", version = "2.3.3")
mtcars_tbl <- sdf_copy_to(spark, mtcars, overwrite = TRUE)
# Create a pipeline and fit it
pipeline <- ml_pipeline(spark) %>%
  ft_binarizer("hp", "big_hp", threshold = 100) %>%
  ft_vector_assembler(c("big_hp", "wt", "qsec"), "features") %>%
  ml_gbt_regressor(label_col = "mpg")
pipeline_model <- ml_fit(pipeline, mtcars_tbl)

# Export model
model_path <- file.path(tempdir(), "mtcars_model.zip")

sample_input_mt <- data.frame(
  mpg= 21.0,
  cyl = 6,
  disp = 160.0,
  hp = 110,
  drat = 3.90,
  wt = 2.620,
  qsec= 16.46,
  vs= 0,
  am=1,
  gear= 4,
  carb=4,
  stringsAsFactors = FALSE
)
sample_input_mt_tbl <- copy_to(spark, sample_input_mt, overwrite = TRUE)

# this fails as features added during feature engineering are not yet part of the sample data
ml_write_bundle(pipeline_model, sample_input_mt_tbl, path= model_path, overwrite = TRUE)

# let's look for the features being added
pipeline_above <- ml_pipeline(spark) %>%
  ft_binarizer("hp", "big_hp", threshold = 100) %>%
  ft_vector_assembler(c("big_hp", "wt", "qsec"), "features") 

fitted_pipeline_above <- ml_fit(pipeline_above, sample_input_mt_tbl)
fitted_pipeline_above %>% 
  ml_transform(sample_input_mt_tbl) %>%
  glimpse()

# and try again
input_new <- data.frame(
  mpg= 21.0,
  cyl = 6,
  disp = 160.0,
  hp = 110,
  drat = 3.90,
  wt = 2.620,
  qsec= 16.46,
  vs= 0,
  am=1,
  gear= 4,
  carb=4,
  big_hp=1,
  features=list(1.00, 2.62, 16.46),
  stringsAsFactors = FALSE
)
input_new <- copy_to(spark, input_new, overwrite = TRUE)

# this still fails: data.frame() spreads the list over three columns, so there is no single "features" column
ml_write_bundle(pipeline_model, input_new, path= model_path, overwrite = TRUE)

The problem must be `features=list(1.00, 2.62, 16.46)`, which gets unnested into separate columns on the R side instead of producing a single vector column like `$ features [<1.00, 2.62, 16.46>]`.

How can I get it to be of the same type?
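
The flattening described above can be reproduced in base R without a Spark connection. The following is a minimal sketch that only illustrates how `data.frame()` treats a list argument; the `flattened` and `nested` names are illustrative. The `I()` variant shows R's list-column semantics and is not claimed to produce a Spark ML vector column when copied to Spark.

```r
# Base R only; no Spark connection needed.
# A list passed to data.frame() is spread across columns, one per element,
# instead of being kept as a single vector-valued cell.
flattened <- data.frame(
  features = list(1.00, 2.62, 16.46),
  stringsAsFactors = FALSE
)
ncol(flattened)    # one column per list element
names(flattened)   # the generated column names

# Wrapping the list in I() keeps a single list-column holding the whole vector.
# This is an illustration of R semantics, not a confirmed way to feed mleap.
nested <- data.frame(features = I(list(c(1.00, 2.62, 16.46))))
ncol(nested)
str(nested$features[[1]])
```

Comparing `names(flattened)` with the "Available fields" in the error message shows where the three `features_*` fields come from.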

kevinykuo commented 5 years ago

@geoHeil in the latest development version you only need to pass the raw sample input rather than a "transformed" sample dataset (see https://github.com/rstudio/mleap/pull/32 and https://github.com/rstudio/mleap/issues/31). Since your only inputs are hp, wt, and qsec, you should be able to do something like this:

input_new <- data.frame(
  hp = 110,
  wt = 2.620,
  qsec= 16.46,
  stringsAsFactors = FALSE
)
input_new <- copy_to(spark, input_new, overwrite = TRUE)

# this should be OK
ml_write_bundle(pipeline_model, input_new, path= model_path, overwrite = TRUE)
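
One way to build such a sample input without retyping values is to take a single row of the training data and keep only the raw columns the pipeline reads. This is a sketch in base R; `input_cols` and `sample_input` are illustrative names, not part of sparklyr or mleap, and the result would still need to be copied to Spark with `copy_to()` as in the snippet above.

```r
# Raw columns consumed by the pipeline stages, before any feature engineering.
# `input_cols` is an illustrative name, not part of sparklyr or mleap.
input_cols <- c("hp", "wt", "qsec")

# Use the first row of the built-in mtcars data as the sample input.
# drop = FALSE keeps the result a data frame even for a single row.
sample_input <- mtcars[1, input_cols, drop = FALSE]
rownames(sample_input) <- NULL

sample_input
```

Keeping the column list in one place makes it easier to update the sample input if the columns passed to `ft_binarizer()` or `ft_vector_assembler()` change.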

geoHeil commented 5 years ago

Thanks, this seems to work.