Suggestion for change of how predicted values are defined

lang-benjamin commented 4 years ago

[ ] a possible bug
[ ] a question about package functionality
[x] a suggested code or documentation change, improvement to the code, or feature request

When calculating predictions with cplot() what = "prediction" results in predictions holding values of the data at their mean or mode. Although the question what we are trying to estimate is not unique, in the context of AME it seems more reasonable to me to produce different predictions. My proposal is based on the work from Bartus https://journals.sagepub.com/doi/pdf/10.1177/1536867X0500500303

Let us assume we have a binary variable x_i. According to equation (5) the AME estimator can be written as AMEi = 1/n sum{k = 1}^n [F(beta x^k | x_i^k = 1) - F(beta x^k | x_i^k = 0)]

or equivalently

AMEi = 1/n sum{k = 1}^n F(beta x^k | xi^k = 1) - 1/n sum{k = 1}^n F(beta x^k | x_i^k = 0).

In the latter representation we may consider each of the two terms as (conditional) predicted value. The first is for x_i = 1 and the second for x_i = 0. Those individual predicted values are consistent with the corresponding AME estimate in the sense that the difference exactly equals the AME estimate. Currently, the difference of the two predicted values is not the same as the AME estimate.

For example

library("margins")
fm <- glm(vs ~ hp + wt + factor(am), data = mtcars, family = binomial)
ans <- cplot(fm, x = "am", draw = FALSE, type = "response")
ans$yvals[[1]] - ans$yvals[[2]]  # not equal to margins(fm, variables = "am")

Applying the above mentioned formula we get:

am_1 <- mean(predict(fm, newdata = within(mtcars, am <- 1), type = "response"))
am_0 <- mean(predict(fm, newdata = within(mtcars, am <- 0), type = "response"))
am_1 - am_0 # is identical to the AME produced for am
# -0.2386167

I realize that am_1 and am_0 are quite different than the current individual predictions (yvals), but I wanted to nevertheless propose to use those instead. Happy to hear your thoughts on that.

Remarks:

Note that I was not sure whether the cplot call as above but with what = "effect" is what I was looking for. However, the result has zero observations in it, so I assume it is not.
This should analogously generalize to categorical variables with more than 2 levels (and to continuous variables)

lang-benjamin commented 4 years ago

I realised that what I have proposed above is exactly the predictive margins concept from stata. Although this is not ported into the margins package I consider those predictions more consistent with the whole package. So, in the example above my suggestion is to still use prediction(fm, at = list(am = c(0, 1))) (and I would actually reconsider adding the predictive margins option as part of this package directly).

leeper commented 4 years ago

Hi @langbe it's not going to be part of marigns, unfortunately, as I want the concepts to be in different packages for clarity. But you're right that this is what should be happening. cplot() came before the prediction package supported Stata-style predictive margins. My plan is to deprecate cplot() as it currently exists and replace it with a ggplot2-based plotting approach. As part of that, the new plotting interface will do as you suggest.

I've unfortunately had negative bandwidth for open source work due to the pandemic.

leeper / margins

Suggestion for change of how predicted values are defined #144