tidymodels / infer

An R package for tidyverse-friendly statistical inference
https://infer.tidymodels.org
Other
721 stars 80 forks source link

Confidence intervals for multiple explanatory variables #507

Closed wmorgan485 closed 10 months ago

wmorgan485 commented 11 months ago

< -- ~~~~~~~~~~~~ -->

The problem

I'm having trouble with computing confidence intervals using the infer pipeline example for multiple explanatory variables here. When following the example code, the confidence intervals do not include the point estimates. Removing the hypothesize() command and setting generate() to type = "bootstrap" seems to fix the problem.

Reproducible example

# Checking infer example for multiple explanatory variables CIs
# https://infer.netlify.app/articles/observed_stat_examples.html

library(tidyverse)
library(infer)
library(ISLR)
library(skimr)

credit_slim <- Credit %>% # modified from Chapter 6 of moderndive.com
  as_tibble() %>%
  select(ID, debt = Balance, credit_limit = Limit,
         income = Income, credit_rating = Rating, age = Age)

credit_slim %>%
  select(debt, credit_limit, age) %>%
  cor() # debt is correlated with credit_limit, but not age
#>                     debt credit_limit         age
#> debt         1.000000000    0.8616973 0.001835119
#> credit_limit 0.861697267    1.0000000 0.100887922
#> age          0.001835119    0.1008879 1.000000000

set.seed(1)
obs_fit <- credit_slim %>% 
  specify(debt~credit_limit+age) %>% 
  fit()
obs_fit
#> # A tibble: 3 × 2
#>   term         estimate
#>   <chr>           <dbl>
#> 1 intercept    -173.   
#> 2 credit_limit    0.173
#> 3 age            -2.29

# code based on example for Confidence intervals: Multiple explanatory variables
null_dist <- credit_slim %>% 
  specify(debt~credit_limit+age) %>% 
  hypothesise(null = "independence") %>% 
  generate(reps = 1000, type = "permute") %>% 
  fit()

get_pvalue(null_dist, obs_stat = obs_fit, direction = "two-sided") 
#> Warning: Please be cautious in reporting a p-value of 0. This result is an
#> approximation based on the number of `reps` chosen in the `generate()` step.
#> See `?get_p_value()` for more information.

#> Warning: Please be cautious in reporting a p-value of 0. This result is an
#> approximation based on the number of `reps` chosen in the `generate()` step.
#> See `?get_p_value()` for more information.
#> # A tibble: 3 × 2
#>   term         p_value
#>   <chr>          <dbl>
#> 1 age            0.086
#> 2 credit_limit   0    
#> 3 intercept      0
# p-values as expected

get_confidence_interval(null_dist, point_estimate = obs_fit)
#> Using `level = 0.95` to compute confidence interval.
#> # A tibble: 3 × 3
#>   term         lower_ci upper_ci
#>   <chr>           <dbl>    <dbl>
#> 1 age           -2.64     2.51  
#> 2 credit_limit  -0.0189   0.0200
#> 3 intercept    352.     681.
# but confidence intervals are not centered at estimates

# remove hypothesize and generate bootstap samples instead
# as with other examples in Confidence intervals section
boot_dist <- credit_slim %>% 
  specify(debt~credit_limit+age) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  fit()

get_confidence_interval(boot_dist, point_estimate = obs_fit)
#> Using `level = 0.95` to compute confidence interval.
#> # A tibble: 3 × 3
#>   term         lower_ci upper_ci
#>   <chr>           <dbl>    <dbl>
#> 1 age            -3.55    -1.00 
#> 2 credit_limit    0.165    0.183
#> 3 intercept    -256.     -93.8
# now confidence intervals are centered at estimates

<sup>Created on 2023-09-23 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
simonpcouch commented 11 months ago

Thank you for the issue!

You're correct—our docs for CIs in multiple regression ought to remove the hypothesize step and use the bootstrap.

wmorgan485 commented 11 months ago

Great. Then is it appropriate to make a pull request with how to fix this documentation? It looks like a few tweaks starting at line 1361 of this page should work. (Apologies for the naive question but I have limited experience making pull requests.)

simonpcouch commented 11 months ago

Sure thing!

Instead of editing the html directly, though, you'll make the edits in the vignette source:

https://github.com/tidymodels/infer/blob/c610ea01d80c45acdf9e69994402f8412ccf73f4/vignettes/observed_stat_examples.Rmd#L1685-L1689

...and then the docs online will update themselves!

simonpcouch commented 10 months ago

Closed by #508.

github-actions[bot] commented 9 months ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.