kosukeimai / MatchIt

R package MatchIt

Suggestion for Concrete Selection of Covariates #175

Closed Maxi54321 closed 1 year ago

Maxi54321 commented 1 year ago

Hello,

In the MatchIt guide and in other associated literature, it is suggested that one should use outcome predictors and true confounders (variables with an impact on both the treatment and the outcome variable) as covariates for propensity score matching.

Do you think, for the formulas (e.g., treat ~ …) in the matchit() calls used to assess the initial balance (m.out0), the matching itself (m.out1), etc., it is sufficient to first run a regression on the treatment and then on the outcome variable with all potential independent variables? The aim would be to see which variables have at least some statistically significant influence on the treatment and the outcome and to include only those as covariates in the formula. Or are other approaches more suitable than regression?

So if I have, for example, age, gender, and birthplace to explain one's wage: run two regressions, [treat ~ age + gender + birthplace] and [wage ~ age + gender + birthplace], and if, say, only age and gender have a statistically significant influence on both treat and wage, proceed with only these? [m.out0 <- matchit(treat ~ age + gender, data = data, method = NULL, distance = "glm")] and so forth?
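In code, the screening workflow being asked about might look like the following sketch (hypothetical data frame `data` with the variables named above; this illustrates the question, not a recommended practice):

```r
library(MatchIt)

# Hypothetical data: treat (0/1), wage, age, gender, birthplace.
# Step 1: screen covariates by significance in a treatment model
# and an outcome model.
fit_treat <- glm(treat ~ age + gender + birthplace,
                 data = data, family = binomial)
fit_out   <- lm(wage ~ age + gender + birthplace, data = data)
summary(fit_treat)  # inspect p-values in the treatment model
summary(fit_out)    # inspect p-values in the outcome model

# Step 2: keep only covariates "significant" in both models
# (here, suppose age and gender) and assess initial balance.
m.out0 <- matchit(treat ~ age + gender, data = data,
                  method = NULL, distance = "glm")
summary(m.out0)
```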

I would be very thankful for your advice; thanks in advance! Also, a big thanks for making the MatchIt package available; it will certainly be used in my master's thesis, without a doubt and with pleasure!

Kind regards, Max

ngreifer commented 1 year ago

Thank you for the suggestion, Max. This is not something I will add to the documentation, for several reasons. First is that I disagree with this practice.

Second, covariate selection is not a requirement unique to matching; it applies to all causal inference methods that rely on covariate adjustment, including weighting, doubly robust estimation, and regression. The MatchIt vignettes are not a general-purpose guide to causal inference; they exist only to show how to use the package correctly given what the user already knows about the system under study. In addition, matching has uses other than estimating causal effects, and those uses would not require this additional step.

Third, to assess whether a variable is related to the treatment, you only need to assess whether it is imbalanced or not prior to matching, which is already recommended as a first step. But having imbalance prior to matching is not a criterion for inclusion in the propensity score model; even balanced covariates can become imbalanced if they are not matched on. So you should not choose which variables to match on based on which happen to be imbalanced; you should match on all variables that are required to eliminate confounding. There are propensity score models that happen to do variable selection, like lasso, but the purpose of that is just to estimate a good propensity score. You still have to achieve balance on the variables that are omitted from the lasso solution, which is a reason not to use lasso unless you have too many variables to include in a logistic regression model.
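To illustrate the lasso point above, a hedged sketch using the glmnet package (the data frame `dat`, treatment `treat`, and covariates `x1`–`x3` are hypothetical; MatchIt accepts a numeric vector of propensity scores via its distance argument):

```r
library(glmnet)
library(MatchIt)

# Hypothetical data: dat contains treat plus many candidate covariates.
X <- model.matrix(treat ~ ., data = dat)[, -1]           # covariate matrix
cv_fit <- cv.glmnet(X, dat$treat, family = "binomial")   # lasso PS model
ps <- as.numeric(predict(cv_fit, newx = X,
                         s = "lambda.min", type = "response"))

# Supply the lasso propensity score to matchit(); the formula still
# names ALL covariates, because balance must be checked on every one,
# including those the lasso dropped from its solution.
m.out <- matchit(treat ~ x1 + x2 + x3, data = dat, distance = ps)
summary(m.out)
```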

Finally, covariate choice needs to be determined by theory, not by models. If a variable is a well-known confounder based on theory but happens not to be significant in your outcome model, your audience will not trust your result if you omit it from being matched on. It is better to unnecessarily balance a variable that might be a confounder than to omit from balancing a real confounder. That is, it is better to be safe than sorry.

If you don't care about maintaining the separation between design and analysis and are really set on using outcome information to balance covariates, there are methods that do variable selection in this way that maintain valid inference. These include collaborative TMLE and outcome-adaptive lasso. These force your audience to trust that you have correctly adjusted for all covariates, whereas matching allows you to prove to your audience that you did by displaying balance.

Maxi54321 commented 1 year ago

Thank you very much for this extensive and comprehensible answer. So, as a conclusion: I have to think for myself and look into theory to determine what would be a suitable covariate in my model, with the recommendation not to include one that only influences the treatment variable (a treatment-only predictor). If there are doubts, inclusion is better than exclusion ("It is better to unnecessarily balance a variable that might be a confounder than to omit from balancing a real confounder."). I hope I got that right. And thanks also for the explanation of why regression/correlation tests are not suitable for selecting the covariates in the model.

ngreifer commented 1 year ago

Yes, I think that is a good conclusion. One strategy I typically recommend is to rank your covariates from most important to the outcome to least important. Apply a strict balance criterion to the most important ones and a more relaxed criterion to the least important. That way, you won't lose as much precision by balancing an instrument-like variable, because you are allowing it to retain some imbalance. This is why plot(summary(.)) produces two balance thresholds by default.
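For instance, MatchIt's Love plot can display both thresholds (variable names are hypothetical; the values shown are, to my understanding, the function's defaults):

```r
library(MatchIt)

# Hypothetical matching analysis.
m.out <- matchit(treat ~ age + gender + birthplace, data = data)
s <- summary(m.out)

# Love plot with two vertical reference lines marking absolute
# standardized mean differences of 0.1 and 0.05: hold the most
# important covariates to the stricter 0.05 line and allow the
# less important ones up to the looser 0.1 line.
plot(s, threshold = c(.1, .05))
```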

You can also split your sample into two pieces, and run some exploratory analyses in one of those pieces, figuring out which variables are important to the outcome. It might be useful to run a random forest and then compute variable importance measures, since these aren't hypothesis tests, don't require strict linearity, and allow you to rank the importance of the covariates. You can see how these results line up with your causal theories. Then you can use those results in the other half of your sample, which you keep pure by running a standard matching analysis without involving the outcome. Inference on the second sample should remain valid. This is similar to an approach described by Aikens et al. (2021), which uses the prognostic score for matching by fitting the prognostic score model in a hold-out sample. You lose precision by decreasing the size of your sample but can gain precision by incorporating information about the outcome.
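A hedged sketch of the split-sample approach described above, using the randomForest package to rank covariates by importance to the outcome (all object and variable names are illustrative):

```r
library(randomForest)
library(MatchIt)

set.seed(123)
n <- nrow(dat)
explore_idx <- sample(n, n %/% 2)
explore <- dat[explore_idx, ]   # exploratory half: outcome may be used
confirm <- dat[-explore_idx, ]  # confirmatory half: kept outcome-free

# Rank covariates by their importance to the outcome in the
# exploratory half; these are not hypothesis tests and do not
# assume linearity.
rf <- randomForest(wage ~ age + gender + birthplace, data = explore)
importance(rf)  # higher values = more important to the outcome

# Use the ranking to prioritize balance in the confirmatory half,
# running a standard matching analysis that never touches the outcome.
m.out <- matchit(treat ~ age + gender + birthplace, data = confirm)
summary(m.out)
```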