Difficulty Understanding Boolean Predictors

alexhallam commented 5 years ago

I am having difficulty with interpretation in the situation where the outcome is continuous and the predictors are Boolean.

If I were to think of effects in terms of a linear model I would turn coefficients on or off depending on whether the predictors were a 1 or a 0. Breakdown does not seem to do this.

In the example below I have chosen a point that has 0s assigned to the predictors. Again, in a linear model prediction this would simply result in setting the coefficients of these predictors to 0. With breakdown I am seeing a negative effect for versicolor and setosa.

How am I supposed to interpret the output in this situation?
Is there a way to show that, since these values are 0 for this observation, they are not contributing to the final prediction?

library(reprex)
library(tidyverse)
library(breakDown)

# data prep
iris_dummy <- iris %>% 
  mutate(setosa = ifelse(Species == "setosa", 1, 0),
         versicolor = ifelse(Species == "versicolor", 1, 0)) %>% 
  select(-Species)

# fit model
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + setosa + versicolor, data = iris_dummy)

set.seed(42)

# pick an observation
no <- iris_dummy[sample(nrow(iris_dummy), 1), ]

# use broken
br <- broken(fit, no)

# the example `no` is not setosa or versicolor yet has breakdown effects
no
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width setosa versicolor
#> 138          6.4         3.1          5.5         1.8      0          0

plot(br)

Created on 2018-10-19 by the reprex package (v0.2.0).

pbiecek commented 5 years ago

Good question. Let me use a simpler dataset with just one binary variable.

The true model is y = x + rnorm(), where x is a binary variable:

binary <- rbinom(1000,1,0.5)
y <- binary + rnorm(1000)

The fitted model is y = 0.019 + 0.966 * binary.

In break down, effects are relative to the average model response. Here the average is about 0.5. For binary = 1 the model response is close to 1, so the contribution of binary = 1 is about +0.5. For binary = 0 the model response is close to 0, so the contribution of binary = 0 is about -0.5.
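To make the arithmetic concrete, here is a minimal sketch in base R that does not call breakDown at all. It assumes that, for a linear model, a variable's contribution is the model response for the observation minus the average model response. The seed and the names `avg`, `contrib_1`, `contrib_0`, `formula_1` and `formula_0` are illustrative choices, not part of the original example.

```r
# Simulate the one-variable example; the seed is an arbitrary choice
set.seed(1)
binary <- rbinom(1000, 1, 0.5)
y <- binary + rnorm(1000)
fit <- lm(y ~ binary)

# Average model response over the training data
avg <- mean(predict(fit))

# Contribution = model response for the observation minus the average response
contrib_1 <- unname(predict(fit, newdata = data.frame(binary = 1)) - avg)
contrib_0 <- unname(predict(fit, newdata = data.frame(binary = 0)) - avg)

# For a linear model this is the same as coefficient * (x - mean(x))
b <- coef(fit)[["binary"]]
formula_1 <- b * (1 - mean(binary))
formula_0 <- b * (0 - mean(binary))

print(c(contrib_1 = contrib_1, contrib_0 = contrib_0))
```

Both values should come out near +0.5 and -0.5. Neither is zero, because a zero contribution corresponds to the average response, not to binary = 0.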

In your case, the contribution of -0.146 for versicolor = 0 means that the effect of versicolor = 0 is below the average effect of versicolor over the dataset (since it is a binary variable, that average is taken over just two values, but that does not matter).
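The same reasoning can be checked on the iris model from the question. This is a sketch in base R, assuming that the centered terms returned by `predict(..., type = "terms")` are the quantity shown for `lm` models; that link to `broken()` is my assumption, and `obs`, `terms_obs`, `baseline` and `by_hand` are illustrative names.

```r
# Rebuild the dummy-coded data from the question using base R only
iris_dummy <- transform(iris,
                        setosa = as.numeric(Species == "setosa"),
                        versicolor = as.numeric(Species == "versicolor"))
iris_dummy$Species <- NULL

fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + setosa + versicolor,
          data = iris_dummy)

# Row 138 is the observation printed in the question (setosa = 0, versicolor = 0)
obs <- iris_dummy[138, ]

# Centered per-term contributions; the "constant" attribute is the average response
terms_obs <- predict(fit, newdata = obs, type = "terms")
baseline <- attr(terms_obs, "constant")

# The versicolor contribution by hand: coefficient * (value - dataset mean)
by_hand <- coef(fit)[["versicolor"]] * (obs$versicolor - mean(iris_dummy$versicolor))

print(terms_obs)
print(by_hand)
```

Since one third of the rows are versicolor, `by_hand` is the versicolor coefficient times (0 - 1/3), which should land close to the -0.146 quoted above. The baseline plus the sum of all terms gives back the ordinary prediction for this row.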

Btw, you can use the Species variable in the model directly:

fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, data = iris)
no <- iris[138, ]  # the virginica observation from the question (row 1 is setosa)
br <- broken(fit, no)
plot(br)

Now it is easier to see that Species = virginica results in lower predictions. In the dummy-coded model this negative effect of virginica is only partially visible, because it shows up indirectly through versicolor = 0 and setosa = 0.
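A short sketch of why the two parameterizations tell the same story, again in base R and again assuming the centered terms from `predict(..., type = "terms")` are the relevant contributions. The names `fit_dummy`, `fit_factor`, `dummy_total` and `species_total` are illustrative.

```r
# Dummy-coded model from the question (virginica is the implicit reference level)
iris_dummy <- transform(iris,
                        setosa = as.numeric(Species == "setosa"),
                        versicolor = as.numeric(Species == "versicolor"))
fit_dummy <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + setosa + versicolor,
                data = iris_dummy)

# The same model with Species as a single factor term
fit_factor <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, data = iris)

# Centered contributions for row 138, a virginica flower
terms_dummy <- predict(fit_dummy, newdata = iris_dummy[138, ], type = "terms")
terms_factor <- predict(fit_factor, newdata = iris[138, ], type = "terms")

# The two dummy contributions add up to the single Species contribution
dummy_total <- sum(terms_dummy[1, c("setosa", "versicolor")])
species_total <- unname(terms_factor[1, "Species"])

print(c(dummy_total = dummy_total, species_total = species_total))
```

The two totals agree because both fits describe the same model. The factor version simply reports one negative Species = virginica contribution instead of splitting it across setosa = 0 and versicolor = 0.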

Hope it helps

alexhallam commented 5 years ago

I think using Species in the model results in a plot that makes much more sense. Thank you!

pbiecek / breakDown

Difficulty Understanding Boolean Predictors #26