pbiecek / breakDown

Model Agnostics breakDown plots
https://pbiecek.github.io/breakDown/

Something off with coefficients? #9

Open DemGrg opened 6 years ago

DemGrg commented 6 years ago

Hi, I don't understand how the broken function calculates the coefficients. Or is something off?

This is my test result from the lm function:

summary(model)

Call:
lm(formula = TotalCharges ~ ., data = data_in_test)

Residuals:
     Min       1Q   Median       3Q      Max
-1943.33  -453.71   -94.64   490.26  1887.26

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)    -2162.4583    21.9717 -98.420  < 2e-16
MonthlyCharges    36.1234     0.3080 117.301  < 2e-16
tenure            65.3606     0.3683 177.476  < 2e-16
SeniorCitizen    -86.7050    24.3449  -3.562 0.000371

Test user:

-2162.4583 + data_in_test[analysed_user,]$MonthlyCharges * 36.1234 + data_in_test[analysed_user,]$tenure * 65.3606 + data_in_test[analysed_user,]$SeniorCitizen * (-86.7050)

[1] 721.2045

While broken gives the following (you can see that the intercept is different):

lm_br
                       contribution
(Intercept)                2283.300
tenure = 3                -1923.025
MonthlyCharges = 74.4       346.850
SeniorCitizen = 0            14.081
final_prognosis             721.206
baseline:  0

Wouldn't one expect the contributions in a waterfall plot to be simply the terms of Y = intercept + beta*value ... etc., taken from the summary output?

pbiecek commented 6 years ago

Yes, there is actually a pretty cool reason why you do not want beta*value as separate contributions (see below).

In the broken object, the contributions are calculated as beta*centered(value).
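As a rough check against the numbers posted in the question, the sketch below takes the reported coefficients and the reported broken contributions and solves contribution = beta * (x - mean) for the mean of each predictor. The variable names `implied_mean` and `shifted_intercept` are made up for this illustration, and the means are only implied by the rounded outputs in this thread; they are not reported anywhere in it.

```r
# Coefficients and intercept as printed by summary(model) in the question
beta <- c(MonthlyCharges = 36.1234, tenure = 65.3606, SeniorCitizen = -86.7050)
intercept <- -2162.4583

# Values of the analysed user and the contributions printed by broken()
x <- c(MonthlyCharges = 74.4, tenure = 3, SeniorCitizen = 0)
contribution <- c(MonthlyCharges = 346.850, tenure = -1923.025,
                  SeniorCitizen = 14.081)

# The hand calculation from the question: intercept + sum(beta * value)
raw_prediction <- intercept + sum(beta * x)

# Assumption: contribution = beta * (x - mean), so mean = x - contribution / beta
implied_mean <- x - contribution / beta

# Centering moves sum(beta * mean) into the intercept
shifted_intercept <- intercept + sum(beta * implied_mean)

print(raw_prediction)
print(implied_mean)
print(shifted_intercept)
print(shifted_intercept + sum(contribution))
```

Up to rounding in the printed outputs, the shifted intercept should land on the (Intercept) row of the broken output, and both decompositions should add up to the same final prognosis.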

This makes the contributions resistant to shifting of the X variables. For example, you will get the same breakDown plot whether temperature is measured in Celsius or Fahrenheit. The beta coefficients take care of scale, but location has to be handled separately. Also, since the values are centered, the intercept is shifted as well.
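A minimal sketch of this invariance, under stated assumptions: the built-in `airquality` data (where `Temp` is in degrees Fahrenheit) stands in for real data, and `centered_contrib` is a hypothetical helper written for this illustration, not a function of the breakDown package.

```r
# Assumption: built-in `airquality` data; Temp is in degrees Fahrenheit.
aq <- na.omit(airquality[, c("Ozone", "Temp", "Wind")])
aq$TempC <- (aq$Temp - 32) * 5 / 9   # same variable, shifted and rescaled

fit_f <- lm(Ozone ~ Temp + Wind, data = aq)
fit_c <- lm(Ozone ~ TempC + Wind, data = aq)

# Hypothetical helper: beta * (x - mean(x)) per predictor, with the
# intercept shifted to the prediction at the predictor means.
centered_contrib <- function(fit, data, row, vars) {
  beta <- coef(fit)
  means <- colMeans(data[, vars])
  x <- unlist(data[row, vars])
  c("(Intercept)" = unname(beta["(Intercept)"] + sum(beta[vars] * means)),
    beta[vars] * (x - means))
}

obs <- 1
cf <- centered_contrib(fit_f, aq, obs, c("Temp", "Wind"))
cc <- centered_contrib(fit_c, aq, obs, c("TempC", "Wind"))

# Raw beta * value terms depend on the unit of temperature ...
raw_f <- unname(coef(fit_f)["Temp"] * aq$Temp[obs])
raw_c <- unname(coef(fit_c)["TempC"] * aq$TempC[obs])
print(c(raw_fahrenheit = raw_f, raw_celsius = raw_c))

# ... while the centered contributions and the shifted intercept do not.
print(rbind(fahrenheit = unname(cf), celsius = unname(cc)))

# Either way the pieces add up to the model prediction.
stopifnot(isTRUE(all.equal(sum(cf), unname(predict(fit_f, aq[obs, ])))))
```

The two rows of the printed matrix should agree up to floating-point error, whereas the two raw terms differ because the Celsius scale has a different zero point.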

It is easy to get such individual contributions. In the breakDown package they come from the following call (no extra calculations are needed):

predict.lm(model, newdata, type = "terms")

alathrop commented 6 years ago

Thank you for the explanation! May I suggest giving the user the option to use either the centered or the regular x values, as well as providing some explanation in the documentation? This is a great chart, but it is confusing without any explanation of the use of type = "terms".

pbiecek commented 6 years ago

Yes, some documentation is required. Winter semester has just ended so I will have some time to work on it.

larmarange commented 6 years ago

Dear @pbiecek,

Following @alathrop, it would be great to have an option to show the uncentered terms (beta*value) directly rather than the centered contributions.

I completely understand your point of view. But in other contexts such a plot would be relevant, e.g. for teaching purposes. When teaching, I often need to explain to my students how a single prediction is obtained from a model, in particular when explaining how to interpret interactions.

Thanks for this package

larmarange commented 6 years ago

Maybe some code could be helpful. I have tried the following.


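betas <- function(object, newdata) {
  # Design matrix for the new observations, without the response column
  tt <- terms(object)
  Terms <- delete.response(tt)
  mm <- model.matrix(Terms, newdata)
  ass <- attr(mm, "assign")        # term index of each design-matrix column
  tl <- attr(Terms, "term.labels")

  # Multiply every column by its coefficient. A plain `co * mm` recycles
  # `co` in column-major order, which matches coefficients to columns
  # only when `mm` has a single row.
  co <- coef(object)
  pred <- sweep(mm, 2, co, "*")

  ret <- matrix(NA_real_, nrow = nrow(mm), ncol = length(tl))
  colnames(ret) <- tl
  rownames(ret) <- rownames(mm)

  # Sum the columns belonging to each term (e.g. the dummies of a factor)
  for (i in seq_along(tl)) {
    ret[, i] <- rowSums(pred[, ass == i, drop = FALSE], na.rm = TRUE)
  }
  # Intercept part, stored in the same attribute that
  # predict(..., type = "terms") uses
  attr(ret, "constant") <- rowSums(pred[, ass == 0, drop = FALSE], na.rm = TRUE)

  ret
}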

At the beginning of broken.glm, simply use ny <- betas(model, new_observation) instead of the call to predict, and the rest of the function will keep working.

Would you consider adding such options?
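To illustrate the difference between the two decompositions, here is a small sketch on the built-in `mtcars` data with an interaction term. The `betas` defined here is a compact variant of the function proposed in this comment, written for this illustration; it uses `sweep()` so that it also handles more than one row. The model formula and the choice of rows are assumptions made only for the example.

```r
# Compact variant of the proposed betas(); sweep() multiplies each
# model-matrix column by its coefficient, so several rows work too.
betas <- function(object, newdata) {
  Terms <- delete.response(terms(object))
  mm <- model.matrix(Terms, newdata)
  ass <- attr(mm, "assign")
  tl <- attr(Terms, "term.labels")
  pred <- sweep(mm, 2, coef(object), "*")
  ret <- sapply(seq_along(tl), function(i)
    rowSums(pred[, ass == i, drop = FALSE], na.rm = TRUE))
  ret <- matrix(ret, nrow = nrow(mm), dimnames = list(rownames(mm), tl))
  attr(ret, "constant") <- rowSums(pred[, ass == 0, drop = FALSE], na.rm = TRUE)
  ret
}

# Assumption: mtcars with an interaction, chosen only for illustration
fit <- lm(mpg ~ wt * hp, data = mtcars)
new_obs <- mtcars[1:2, ]

raw <- betas(fit, new_obs)                         # beta * value per term
centered <- predict(fit, new_obs, type = "terms")  # beta * (value - mean)

print(raw)
print(centered)

# Both decompositions add up to the same predictions.
pred <- unname(predict(fit, new_obs))
stopifnot(isTRUE(all.equal(unname(rowSums(raw) + attr(raw, "constant")), pred)))
stopifnot(isTRUE(all.equal(
  unname(rowSums(centered) + attr(centered, "constant")), pred)))
```

The raw terms match what one would compute by hand from the summary output, which is the pedagogical use case; the centered terms are the ones that stay stable when a predictor is shifted.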

larmarange commented 6 years ago

I have prepared a pull request, just in case

pbiecek commented 6 years ago

Thanks, merged. Rendered examples are here: https://pbiecek.github.io/breakDown/reference/broken.lm.html

larmarange commented 6 years ago

thanks