koalaverse / vip

Variable Importance Plots (VIPs)
https://koalaverse.github.io/vip/

problem obtaining VIP for SVM results #159

Open kyleGrealis opened 3 months ago

kyleGrealis commented 3 months ago

Hi! My boss and I are stuck on a problem and are looking for advice. Here is a reprex that was posted to Stack Overflow outlining our issue.

TL;DR: we need VIP info for an SVM model that is doing classification.

Thank you for your time!

bgreenwell commented 3 months ago

Thanks @kyleGrealis, I posted a solution on the Stack Overflow post! Let me know if you still have issues. I'll leave this open so I can think of a proper "fix" for vip() and tidymodels workflows; in short, it's tricky since tidymodels wraps the fitted model in its own class. Here's the reprex from my end:

library(vip)
#> 
#> Attaching package: 'vip'
#> The following object is masked from 'package:utils':
#> 
#>     vi
library(MASS)
library(tidymodels)

data(Boston, package = "MASS")

# Make a classification outcome
df <- Boston |> 
  mutate(is_big = factor(if_else(medv > 22, 1, 0)))

# Split the data into train and test set
set.seed(7)
splits <- initial_split(df)
train <- training(splits)
test <- testing(splits)

# Preprocess with recipe
rec <- recipe(
  formula = is_big ~ .,
  data = train
) 

svm_spec <- svm_rbf(margin = 0.0937, cost = 20, rbf_sigma = 0.0208) %>%
  set_engine("kernlab") %>%
  set_mode("classification")

# Putting into workflow
svr_fit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(svm_spec) %>%
  fit(data = train)

# Extract the raw underlying fit
original_fit <- workflows::extract_fit_engine(svr_fit)

# Prediction wrapper should return a vector of probabilities for the second class
pfun <- function(object, newdata) {
  kernlab::predict(object, newdata, type = "probabilities")[, 2L]
}

# Sanity check
original_fit %>%
  pfun(train) %>%
  head()
#> [1] 0.0181795600 0.0907770250 0.0471238327 0.0002242344 0.0086771825
#> [6] 0.9999855454

# Now this should work
original_fit %>%
  vip(
    method = "permute",
    nsim = 5,
    target = "is_big", metric = "roc_auc", event_level = "second",
    pred_wrapper = pfun,
    train = train
  )


# Alternatively, you can define a prediction wrapper for the workflow object 
# directly; vip() seems to be buggy with tidymodels workflows, so vi() is called here instead
svr_fit %>%
  vi(
    method = "permute",
    nsim = 5,
    target = "is_big", metric = "roc_auc", event_level = "second",
    pred_wrapper = function(object, newdata) {
      predict(object, newdata, type = "prob")[[".pred_1"]]
    },
    train = train
  )
#> # A tibble: 14 × 3
#>    Variable Importance    StDev
#>    <chr>         <dbl>    <dbl>
#>  1 medv      0.341     0.0306  
#>  2 rad       0.0106    0.00233 
#>  3 ptratio   0.0102    0.00270 
#>  4 rm        0.00719   0.00141 
#>  5 age       0.00480   0.00215 
#>  6 lstat     0.00315   0.00165 
#>  7 dis       0.00210   0.000703
#>  8 nox       0.00199   0.000971
#>  9 tax       0.00197   0.000934
#> 10 chas      0.00132   0.000322
#> 11 crim      0.000374  0.000587
#> 12 black     0.000284  0.000388
#> 13 indus     0.000221  0.000488
#> 14 zn        0.0000681 0.000306

Created on 2024-08-06 with reprex v2.1.0

Careful though! Your example includes leakage: the binary outcome is_big is a direct function of medv, which is also included as an input. That is why medv gets such a large importance score.
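One way to avoid the leakage noted above is to drop medv once the outcome has been derived from it. The following is only a minimal sketch under that assumption; the name df_noleak is hypothetical and not part of the original reprex, and whether medv should be excluded entirely depends on your real dataset.

```r
library(dplyr)

data(Boston, package = "MASS")

# Build the binary outcome from medv, then drop medv so it cannot
# leak into the predictors (df_noleak is a hypothetical name)
df_noleak <- Boston |>
  mutate(is_big = factor(if_else(medv > 22, 1, 0))) |>
  dplyr::select(-medv)  # explicit namespace, since MASS also exports select()

# medv should be gone and is_big should be present
names(df_noleak)
```

The rest of the reprex should then run unchanged with df_noleak in place of df, and the importance table would cover the remaining 13 predictors only.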

kyleGrealis commented 3 months ago

Thank you for your time answering this! And I appreciate your disclaimer here too. Applied to our working dataset, this solution produced exactly what we're looking for. So MANY thanks to you!!