vincentarelbundock / marginaleffects

R package to compute and plot predictions, slopes, marginal means, and comparisons (contrasts, risk ratios, odds, etc.) for over 100 classes of statistical and ML models. Conduct linear and non-linear hypothesis tests, or equivalence tests. Calculate uncertainty estimates using the delta method, bootstrapping, or simulation-based inference
https://marginaleffects.com
Other
427 stars 45 forks source link

Handling of spline variables with missing data when using the magrittr pipe #802

Closed drcaprosser closed 1 year ago

drcaprosser commented 1 year ago

Hello,

I've encountered what seems to be a fairly specific bug in how marginaleffects is handing spline variables when making predictions with missing values in the original variable when the data is being piped in with the magrittr pipe.

When there is missing data and I use the magrittr pipe to feed in the data, marginaleffects makes predictions at the at the values of the first spline, rather than the original variable.

Confusingly, it works just fine when you specify the dataset in the lm() call rather than pipe it in, or use the base pipe (|>), and the magrittr pipe also works fine when there are no NAs in the data.

I've simulated a little example of this behaviour:

library(tidyverse) 
library(splines)
library(marginaleffects)

set.seed(5623)

# simulate data with missingness 
df <- tibble(x = sample(seq(0, 10, 1), replace = TRUE, size = 500),
             y = x - .05*x^2 + rnorm(500)) %>% 
  mutate(x = na_if(x, rbinom(500, 1, .1)))

# Include data in lm call (this works)
lm(y ~ bs(x), df) %>% 
  avg_predictions(by = "x")

# Pipe in data with magrittr pipe (this doesn't work)
df %>% 
lm(y ~ bs(x), .) %>% 
  avg_predictions(by = "x")

# Pipe in data with magrittr pipe removing NAs (this works)
df %>% 
  filter(!is.na(x)) %>% 
  lm(y ~ bs(x), .) %>% 
  avg_predictions(by = "x")

# Pipe in data with base pipe (this works)
df |>
  lm(y ~ bs(x), data = _) |>
  avg_predictions(by = "x")

When it works, the predictions look like this:

image

When it doesn't work, the predictions look like this:

image

The x values here are the values of the first spline:

image

While I've figured out a work around for my own purposes, I thought I'd file an issue in case it flagged a more general problem that might affect other things.

P.S. Thanks for the amazing package!

vincentarelbundock commented 1 year ago

Thanks for the report!

Could you install the latest version of marginaleffects from GitHub and see if that fixes the issue? (Make sure to restart R completely after installing the package.)

On my computer, the two commands below give the same result (as expected).

library(dplyr)
library(splines)
library(marginaleffects)
set.seed(5623)

df <- tibble(x = sample(seq(0, 10, 1), replace = TRUE, size = 500),
             y = x - .05*x^2 + rnorm(500)) %>% 
  mutate(x = na_if(x, rbinom(500, 1, .1)))

lm(y ~ bs(x), df) %>% avg_predictions(by = "x")
# 
#   x Estimate Std. Error     z Pr(>|z|)     S  2.5 % 97.5 %
#  10    5.126     0.1269 40.41   <0.001   Inf  4.877  5.374
#   3    2.548     0.0776 32.84   <0.001 783.3  2.396  2.700
#   2    1.743     0.0775 22.49   <0.001 369.6  1.591  1.895
#   4    3.216     0.0725 44.36   <0.001   Inf  3.074  3.358
#   8    4.798     0.0777 61.79   <0.001   Inf  4.646  4.950
#   1    0.789     0.1125  7.01   <0.001  38.6  0.568  1.009
#   9    4.989     0.0752 66.31   <0.001   Inf  4.841  5.136
#   5    3.761     0.0652 57.65   <0.001   Inf  3.634  3.889
#   7    4.539     0.0768 59.07   <0.001   Inf  4.388  4.689
#   6    4.198     0.0681 61.67   <0.001   Inf  4.064  4.331
#   0   -0.330     0.2278 -1.45    0.148   2.8 -0.776  0.117
# 
# Columns: x, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high

df %>% 
lm(y ~ bs(x), .) %>% 
  avg_predictions(by = "x")
# 
#   x Estimate Std. Error     z Pr(>|z|)     S  2.5 % 97.5 %
#  10    5.126     0.1269 40.41   <0.001   Inf  4.877  5.374
#   3    2.548     0.0776 32.84   <0.001 783.3  2.396  2.700
#   2    1.743     0.0775 22.49   <0.001 369.6  1.591  1.895
#   4    3.216     0.0725 44.36   <0.001   Inf  3.074  3.358
#   8    4.798     0.0777 61.79   <0.001   Inf  4.646  4.950
#   1    0.789     0.1125  7.01   <0.001  38.6  0.568  1.009
#   9    4.989     0.0752 66.31   <0.001   Inf  4.841  5.136
#   5    3.761     0.0652 57.65   <0.001   Inf  3.634  3.889
#   7    4.539     0.0768 59.07   <0.001   Inf  4.388  4.689
#   6    4.198     0.0681 61.67   <0.001   Inf  4.064  4.331
#   0   -0.330     0.2278 -1.45    0.148   2.8 -0.776  0.117
# 
# Columns: x, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
drcaprosser commented 1 year ago

Ah indeed it does seem to work fine with the development version. I had the latest CRAN version installed, so something you change since then seems to have fixed whatever the problem was. Apologies for the slightly unnecessary report and thanks again!

vincentarelbundock commented 1 year ago

Great news! Thank for confirming!