r-causal / causal-inference-in-R

Causal Inference in R: A book!
https://www.r-causal.org/

Add description, prediction, and ci #164

Closed malcolmbarrett closed 10 months ago

malcolmbarrett commented 10 months ago

Closes #57

malcolmbarrett commented 10 months ago

@LucyMcGowan @tgerke ready for review. I'll leave it up until next week before merging if you have a chance to take a look

malcolmbarrett commented 10 months ago

Yeah let's hash this out to figure out the best way to clarify this. One caveat is that this is very early in the book, and we can't get too in the weeds. Maybe we should add "causal and predictive models revisited" in the DAGs chapter post quartets, because then we have a better framework for discussing the nitty gritty details + time ordering + the different DAGs that would produce such situations.

The first thing for me to clarify here is to make sure it's clear that I mean *sometimes*, and that it's not possible to tell which times with data alone. It's causality all the way down, so predictive models can be great causal models and vice versa.

> Like, the “best” causal model would include x and all the things that cause both x and y, but if it also included all the things that cause y, the worst that would happen is... increased precision?

That's true, so I need to change my wording. What I mean is that an unbiased causal model won't necessarily predict well, and that predictive power is not an indication of the unbiasedness of a causal effect.
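As an aside, here's a tiny simulation sketch of the precision point above (all names and numbers are made up for illustration): adjusting for a variable that only causes y leaves the estimate for x essentially unchanged but shrinks its standard error.

```r
library(tidyverse)
library(broom)

set.seed(123)
n <- 1000
z <- rnorm(n)                   # a cause of y only, not of x
x <- rnorm(n)                   # the exposure
y <- 1 * x + 2 * z + rnorm(n)   # true effect of x on y is 1

# both models recover an estimate near 1 for x;
# the adjusted model has a smaller standard error
bind_rows(
  unadjusted = tidy(lm(y ~ x)),
  adjusted   = tidy(lm(y ~ x + z)),
  .id = "model"
) |>
  filter(term == "x")
```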

> It wouldn’t be possible to build a “good” prediction model (where I’m defining good as also useful) that includes anything post-outcome, so colliders in the traditional sense couldn’t be in there (outside of strange circumstances like M-bias, but if you actually have the perfect prediction model, you have that U2 value, so M-bias is also not possible).
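For reference, here's a quick sketch of that M-bias structure (just an illustrative aside, assuming ggdag, which we already use in the book):

```r
library(ggdag)

# canonical M-bias DAG: x <- a -> m <- b -> y, where adjusting for the
# pre-exposure collider m opens a biasing path between x and y
ggdag_m_bias() +
  theme_dag()
```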

I think this is a better topic for later, but the fact of the matter is that people build lots of useless models with high predictive power. We probably should talk about time ordering later as it relates to this topic, but it's still possible that a time-ordered predictive model will not give you the correct estimates for the individual coefficients, and that even if it does, it still might not be the best model for prediction because of the bias-variance tradeoff.
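One concrete (hypothetical) way that can happen: a model built only from pre-outcome variables can predict well while adjusting for a mediator, so its coefficient for the exposure is the direct effect rather than the total effect we'd usually want. A quick sketch:

```r
library(tidyverse)
library(broom)

set.seed(42)
n <- 1000
x <- rnorm(n)             # exposure, measured first
m <- 0.8 * x + rnorm(n)   # mediator, measured before the outcome
y <- 0.5 * m + rnorm(n)   # outcome; total effect of x on y is 0.8 * 0.5 = 0.4

# the time-ordered predictive model predicts y better, but its coefficient
# for x is ~0 (the direct effect), not the total effect of ~0.4
tidy(lm(y ~ x + m))
tidy(lm(y ~ x))
```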

Shmueli makes the point that the better and larger the data, the closer these two types of models approach each other. That + time ordering does help reduce a lot of these issues, so it's worth discussing that with a principled model on quality data, at the very least it can point you in the direction of causes, even if you would still need a causal model to get the least biased answer (but that makes me think of the Hypothesis Machine article). After all, science manages to proceed despite an absurd amount of poor methodology

LucyMcGowan commented 10 months ago

Love the idea of discussing it later in the book, and yes! I think the big point is that with lots of data and all variables measured, these things might be the same, but that is a big if.

malcolmbarrett commented 10 months ago

Ok, I clarified things a bit and will plan on adding that section to the DAGs chapter (and incorporate some of your other thoughts there)

tgerke commented 10 months ago

Here's a minimal .qmd example of the causal model being bad at prediction. Apologies for not using fancy `augment()` functions and such; someday I'll join the 21st century!

---
format: html
---

```{r include=FALSE}
library(tidyverse)
```

## What we want to show

The true causal model is

$y = \beta_1x_1 + \beta_2x_2$

but a prediction model which is causally biased with $\hat\beta_2 = 0$ is more accurate. I.e.

$\hat y = \hat\beta_1x_1$

is better at prediction but has an incorrect causal specification.

This will be true when:

  1. The outcome is very noisy
  2. $\beta_2$ is very small
  3. $x_1$ and $x_2$ are highly correlated
  4. The sample size is small or the range of $x_2$ is small

We set up a simulation for a correct causal model $y = 10x_1 + x_2$ like so:

```{r}
set.seed(8675309)
n <- 100

# simulate the exposure variables
x1 <- 100 * rnorm(n)
x2 <- x1 / 100 + rnorm(n, sd = .1)

# simulate the outcome
y <- 10 * x1 + x2 + rnorm(n, sd = 100)

df_sim <- tibble(y = y, x1 = x1, x2 = x2)
```

We see that $x_1$ and $x_2$ are highly correlated, with $x_2$ having a small range.

```{r}
df_sim |>
  ggplot() +
  geom_point(aes(x = x1, y = x2))
```

Also, $y$ is very noisy:

```{r}
df_sim |>
  ggplot() +
  geom_histogram(aes(x = y))
```

If we use the true causal model, we get a prediction RMSE of:

```{r}
preds_causal <- 10 * df_sim$x1 + df_sim$x2
sqrt(mean((df_sim$y - preds_causal)^2))
```

With the biased prediction model, we get a smaller RMSE of:

```{r}
preds_biased <- 10 * df_sim$x1
sqrt(mean((df_sim$y - preds_biased)^2))
```
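If we'd rather fit the models instead of plugging in the true coefficients, a rough out-of-sample version could look like the sketch below (which model comes out ahead on a given draw isn't guaranteed; this is just to show the comparison):

```{r}
# sketch only: fit both models on df_sim and compare RMSE on a fresh
# test set drawn from the same data-generating process
fit_causal <- lm(y ~ x1 + x2, data = df_sim)
fit_biased <- lm(y ~ x1, data = df_sim)

x1_new <- 100 * rnorm(n)
x2_new <- x1_new / 100 + rnorm(n, sd = .1)
df_new <- tibble(
  y = 10 * x1_new + x2_new + rnorm(n, sd = 100),
  x1 = x1_new,
  x2 = x2_new
)

rmse <- function(fit, newdata) {
  sqrt(mean((newdata$y - predict(fit, newdata))^2))
}

rmse(fit_causal, df_new)
rmse(fit_biased, df_new)
```
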
malcolmbarrett commented 10 months ago

Thanks @tgerke! I'm going to merge this without the simulation so we can polish it outside the book and decide if we also want to do something else with it.