plot.gbt plot for categorical variables

benmarchi commented 4 years ago

Thanks for sharing a great package. I am interested in generating some PDPs for a project that uses XGBoost for fitting, where many of the predictors are categorical. I was having trouble getting pdp::partial to work with one-hot encoded categorical predictors, so I was super excited to see that you had an implementation.

Here is the relevant code chunk from plot.gbt:

if (is.factor(mod_dat[[pn]])) {
    fn <- paste0(pn, levels(mod_dat[[pn]]))[-1]
    effects <- rep(NA, length(fn))
    nr <- length(fn)
    for (i in seq_len(nr)) {
        seed <- x$seed
        pdi <-  pdp::partial(
            x$model, pred.var = fn[i], plot = FALSE,
            prob = x$type == "classification", train = dtx
        )
        effects[i] <- pdi[pdi[[1]] == 1, 2]
    }
    pgrid <- as.data.frame(matrix(0, ncol = nr))
    colnames(pgrid) <- fn
    base <-  pdp::partial(
        x$model, pred.var = fn,
        pred.grid = pgrid, plot = FALSE,
        prob = x$type == "classification", train = dtx
    )[1, "yhat"]
    pd <- data.frame(label = levels(mod_dat[[pn]]), yhat = c(base, effects)) %>%
        mutate(label = factor(label, levels = label))
    colnames(pd)[1] <- pn
    plot_list[[pn]] <- ggplot(pd, aes_string(x = pn, y = "yhat")) +
        geom_point() +
        labs(y = "")
}

My question is about how you compute the marginal contribution of each factor level in a categorical variable. From what I can see, you loop through each column in the encoded model.matrix that corresponds to a level of the categorical variable and then use pdp::partial to calculate the PDP for that encoded feature. My hesitation with this method is that by computing the partial dependence for each encoded level independently, you may not be getting the true marginal contribution, because the contribution of each factor level is not isolated.

Take the following model.matrix of a variable, var1, with three factor levels, A, B, C, as an example:

df <- data.frame("var1B" = c(1,0,1,0,0), "var1C" = c(0,0,0,1,0))

#   var1B var1C
# 1     1     0
# 2     0     0
# 3     1     0
# 4     0     1
# 5     0     0

When pdp::partial is applied to each encoded column individually, it can generate observations that are impossible. For example, to compute the PDP for var1B, all values of var1B are first set to 0 and then all are set to 1. Setting everything to 0 doesn't necessarily cause any issues. However, when every var1B is set to 1, we can end up with impossible observations. In this toy example, the issue appears on row 4, where var1B = var1C = 1. Physically, this means that var1 is both B and C, which is not possible. So, do you think a more appropriate implementation for encoded categorical variables would be to reset all the other encoded columns to zero before computing the PDP?
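
To make the problem concrete, here is a minimal base R sketch that reproduces the row 4 conflict in the toy data. The names forced and impossible are hypothetical and only used for illustration; the overwrite step mimics what a single-column PDP grid does, it does not call pdp::partial itself.

```r
# Toy one-hot encoding of var1 (levels A, B, C; A is the reference level)
df <- data.frame(var1B = c(1, 0, 1, 0, 0), var1C = c(0, 0, 0, 1, 0))

# Mimic the var1B = 1 grid point of a one-column PDP:
# overwrite var1B in every row and leave all other columns untouched
forced <- df
forced$var1B <- 1

# Rows with more than one active dummy do not encode any valid level of var1
impossible <- which(rowSums(forced[, c("var1B", "var1C")]) > 1)
impossible  # row 4, where var1B = var1C = 1
```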

This could be accomplished by slightly modifying the inner loop for categorical variables in plot.gbt:

for (i in seq_len(nr)) {
    seed <- x$seed
    dtxCat <- dtx
    dtxCat[, setdiff(fn, fn[i])] <- 0
    pdi <-  pdp::partial(
        x$model, pred.var = fn[i], plot = FALSE,
        prob = x$type == "classification", train = dtxCat
    )
    effects[i] <- pdi[pdi[[1]] == 1, 2]
}
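
As a rough check of why the reset matters, the sketch below compares the two approaches on the toy data. Everything here is an assumption made for illustration: predict_toy is a hypothetical stand-in for a fitted model (with an interaction term that only matters on impossible rows), and pd_at_one is a hand-rolled average prediction, not a call to pdp::partial.

```r
# Hypothetical stand-in for a fitted model; the var1B * var1C term only
# affects rows where both dummies are 1, which cannot occur in real data
predict_toy <- function(d) {
  1 + 2 * d$var1B + 3 * d$var1C - 4 * d$var1B * d$var1C
}

df <- data.frame(var1B = c(1, 0, 1, 0, 0), var1C = c(0, 0, 0, 1, 0))
fn <- c("var1B", "var1C")

# Average prediction with one dummy forced to 1;
# with reset = TRUE the sibling dummies are zeroed first
pd_at_one <- function(d, col, siblings, reset = FALSE) {
  if (reset) d[, setdiff(siblings, col)] <- 0
  d[[col]] <- 1
  mean(predict_toy(d))
}

naive <- pd_at_one(df, "var1B", fn)                # 2.8, row 4 is impossible
reset <- pd_at_one(df, "var1B", fn, reset = TRUE)  # 3, every row is a valid "B"
```

With the reset, every row in the evaluation data represents var1 = B, so the average is no longer pulled by the impossible row.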

What are your thoughts?

vnijs commented 4 years ago

Interesting suggestion @benmarchi. This was a first attempt at getting PDPs for categorical variables, and there are indeed likely ways to improve it. What I wanted to do was check how this issue is addressed with PDPs for Random Forest models that use ranger. Have you perhaps looked at that implementation? It might also be worthwhile to reach out to the pdp author for suggestions.

vnijs commented 4 years ago

@benmarchi I implemented your suggestion. Please try it out:

install.packages("radiant.update", repos = "https://radiant-rstats.github.io/minicran/")
radiant.update::radiant.update()
remotes::install_github("radiant-rstats/radiant.model")

benmarchi commented 4 years ago

Excellent! I will give it a try.

Also, I have not had a chance yet, but I am planning on looking into how other tree-based packages deal with PDPs. I will provide an update if I find out anything useful.

vnijs commented 4 years ago

That sounds good @benmarchi. I'll keep this issue open for a while then.

vnijs commented 1 year ago

FYI, I have moved to permutation importance for (almost) all models in Radiant. The one for xgboost is a bit trickier, but the basics should work.
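
For readers unfamiliar with the approach, permutation importance can be sketched in base R as follows. This is a minimal illustration and not Radiant's implementation: the toy data, the lm() stand-in model, and the names rmse and perm_importance are all assumptions made for the example. Because the original factor column is shuffled as a whole, the impossible dummy combinations discussed above cannot arise.

```r
set.seed(1234)

# Toy data: y depends on a three-level factor and one numeric predictor;
# the "noise" column is unrelated to y
n <- 200
dat <- data.frame(
  var1 = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  x2 = rnorm(n),
  noise = rnorm(n)
)
level_effect <- c(A = 0, B = 2, C = -1)
dat$y <- unname(level_effect[as.character(dat$var1)]) +
  0.5 * dat$x2 + rnorm(n, sd = 0.1)

# Stand-in model (assumption); any model with a predict() method would do
fit <- lm(y ~ var1 + x2 + noise, data = dat)
rmse <- function(d) sqrt(mean((d$y - predict(fit, newdata = d))^2))
base_rmse <- rmse(dat)

# Importance = increase in RMSE after shuffling one predictor column
perm_importance <- sapply(c("var1", "x2", "noise"), function(v) {
  shuffled <- dat
  shuffled[[v]] <- sample(shuffled[[v]])
  rmse(shuffled) - base_rmse
})
perm_importance
```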
