Open kyleGrealis opened 3 months ago
Thanks, @kyleGrealis! I posted a solution on the Stack Overflow post. Let me know if you still have issues. I'll leave this open so I can think of a proper "fix" for vip() and tidymodels workflows; in short, it's tricky since tidymodels wraps the fitted model in its own class. Here's the reprex from my end:
library(vip)
#>
#> Attaching package: 'vip'
#> The following object is masked from 'package:utils':
#>
#> vi
library(MASS)
library(tidymodels)
data(Boston, package = "MASS")
# Make a classification outcome
df <- Boston |>
  mutate(is_big = factor(if_else(medv > 22, 1, 0)))
# Split the data into train and test set
set.seed(7)
splits <- initial_split(df)
train <- training(splits)
test <- testing(splits)
# Preprocess with recipe
rec <- recipe(
  formula = is_big ~ .,
  data = train
)
svm_spec <- svm_rbf(margin = 0.0937, cost = 20, rbf_sigma = 0.0208) %>%
  set_engine("kernlab") %>%
  set_mode("classification")
# Putting into workflow
svr_fit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(svm_spec) %>%
  fit(data = train)
# Extract the raw underlying fit
original_fit <- workflows::extract_fit_engine(svr_fit)
# Prediction wrapper should return a vector of probabilities for the second class
pfun <- function(object, newdata) {
  kernlab::predict(object, newdata, type = "probabilities")[, 2L]
}
# Sanity check
original_fit %>%
  pfun(train) %>%
  head()
#> [1] 0.0181795600 0.0907770250 0.0471238327 0.0002242344 0.0086771825
#> [6] 0.9999855454
# Now this should work
original_fit %>%
  vip(
    method = "permute",
    nsim = 5,
    target = "is_big", metric = "roc_auc", event_level = "second",
    pred_wrapper = pfun,
    train = train
  )
# Alternatively, you can define a prediction wrapper for the workflow object
# directly; vip() seems to be bugged with tidymodels workflows
svr_fit %>%
  vi(
    method = "permute",
    nsim = 5,
    target = "is_big", metric = "roc_auc", event_level = "second",
    pred_wrapper = function(object, newdata) {
      predict(object, newdata, type = "prob")[[".pred_1"]]
    },
    train = train
  )
#> # A tibble: 14 × 3
#>    Variable Importance    StDev
#>    <chr>         <dbl>    <dbl>
#>  1 medv      0.341     0.0306
#>  2 rad       0.0106    0.00233
#>  3 ptratio   0.0102    0.00270
#>  4 rm        0.00719   0.00141
#>  5 age       0.00480   0.00215
#>  6 lstat     0.00315   0.00165
#>  7 dis       0.00210   0.000703
#>  8 nox       0.00199   0.000971
#>  9 tax       0.00197   0.000934
#> 10 chas      0.00132   0.000322
#> 11 crim      0.000374  0.000587
#> 12 black     0.000284  0.000388
#> 13 indus     0.000221  0.000488
#> 14 zn        0.0000681 0.000306
Created on 2024-08-06 with reprex v2.1.0
Careful, though! Your example includes leakage: your binary outcome is a direct function of medv, which is also included as an input, hence the large importance score for medv.
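One way to act on the leakage point is to drop medv from the predictors before fitting. The following is only a minimal base R sketch, assuming the same Boston data and is_big definition as in the reprex; the name df_no_leak is hypothetical and not part of the original code:

```r
# Rebuild the reprex data without the leaky predictor (base R only).
data(Boston, package = "MASS")

df <- Boston
# Same outcome definition as in the reprex: 1 if medv > 22, else 0
df$is_big <- factor(ifelse(df$medv > 22, 1, 0))

# Drop medv so the outcome is no longer a direct function of an input.
# `df_no_leak` is a hypothetical name used only for this sketch.
df_no_leak <- df[, setdiff(names(df), "medv")]

# medv should be gone; is_big and the other 13 predictors remain
names(df_no_leak)
```

Splitting and fitting the workflow on df_no_leak instead of df (or, alternatively, removing medv inside the recipe, for example with recipes::step_rm(medv)) should keep the outcome's source column from dominating the permutation importance scores.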
Thank you for your time answering this! And I appreciate your disclaimer here too. Applied to our working dataset, this solution produced exactly what we're looking for. So MANY thanks to you!!
Hi! My boss and I are stuck on a problem and are looking for advice. Here is a reprex that was posted to Stack Overflow outlining our issue.
TL;DR: getting VIP info for an SVM model that is doing classification
Thank you for your time!