kosukeimai / MatchIt

R package MatchIt

Cross-Validation for logistic regression propensity score predictions #200

Closed ibarshai closed 1 month ago

ibarshai commented 1 month ago

I'm trying to use MatchIt to conduct propensity score matching. I'm using logistic regression to produce the propensity scores.

As I understand it, the model is trained on all samples and predicts on all samples. I have a group with 900 control samples and 550 treatment samples. I'd like to run the model through a stratified k-fold cross-validation: train on 4 folds and predict on the remaining fold, so that I end up with a non-overlapping set of held-out predictions that I am hoping will be more robust and less prone to overfitting.

How would I go about doing this? As I understand it, this would be very useful functionality to have available in the package. In lieu of built-in functionality, should I produce my predictions outside of MatchIt and pass the probabilities of belonging to the treatment group as the distance in the matchit() call? Are there any major concerns with this sort of CV approach, or gotchas I should look out for?

Thank you!

ngreifer commented 1 month ago

Hi Ilya,

Cross-validation is a method of assessing the performance of a prediction model in a way that reduces the over-optimism of an assessment metric computed on the training sample alone. It is typically used to select a hyperparameter that yields predictions with good out-of-sample accuracy. Logistic regression doesn't have any hyperparameters, but lasso, ridge, and elastic net do. When you request propensity scores estimated using one of those three models, matchit() automatically performs cross-validation to select the optimal value of the hyperparameter that controls the degree of regularization, fits the model to the full sample using that value, and uses that model to generate predictions for the full sample, which are used as propensity scores.
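
As a minimal sketch of this (not part of the original thread, and using the lalonde data bundled with MatchIt rather than your own), a regularized propensity score can be requested through the distance argument; matchit() handles the cross-validation over the penalty internally, assuming the glmnet package is installed:

```r
library(MatchIt)
data("lalonde", package = "MatchIt")  # example data shipped with MatchIt

# Lasso-regularized propensity score; the penalty parameter is chosen by
# cross-validation internally before predictions are made for the full sample.
m.lasso <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                   data = lalonde, method = "nearest", distance = "lasso")
summary(m.lasso)
```

The "ridge" and "elasticnet" options work the same way.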

It sounds like what you are describing is "cross-fitting", a different method that is sometimes used in doubly robust estimation to reduce dependence on certain modeling assumptions. You estimate the propensity score model in one split of the sample, use that model to compute propensity scores in the other split, and then do the same with the splits reversed. That way, the propensity score for a given unit is never computed from a model that was trained on that unit, but all units still get a propensity score. This can be done with the outcome model, too. I have never seen this method used in the context of matching, and there are a few reasons for that.
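
For concreteness, here is a rough sketch of that two-split cross-fitting with a plain logistic regression, again using the bundled lalonde data as a stand-in; the same idea generalizes to the stratified 5-fold version described in the question:

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

set.seed(123)
# Randomly assign each unit to one of two splits
split <- sample(rep(1:2, length.out = nrow(lalonde)))

# Cross-fit: fit the model on one split, predict propensity scores on the other
ps <- numeric(nrow(lalonde))
for (k in 1:2) {
  fit <- glm(treat ~ age + educ + race + married + nodegree + re74 + re75,
             data = lalonde[split != k, ], family = binomial)
  ps[split == k] <- predict(fit, newdata = lalonde[split == k, ],
                            type = "response")
}
```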

First, the goal of matching is to achieve balance, not to have a well-fitting propensity score model. It is unlikely that cross-fitting the propensity score will improve balance over using a single model trained on the same sample for which balance is to be assessed, because that single model responds to the unique characteristics of that sample.

Second, cross-fitting is used to prevent overfitting, but overfitting is not a problem to be solved in propensity score matching: the goal is not to create a generalizable propensity score model with good predictive performance for out-of-sample observations, but to achieve balance in the same sample used to compute the propensity score. The problem of a poorly specified propensity score model (i.e., one that responds too strongly to the incidental features of the training sample) is mitigated by using covariate balance as the evaluation metric for the model, rather than the metric usually used to both estimate and evaluate models in a prediction context (i.e., accuracy or some related loss function). A highly predictive model, which would normally be suspect in a prediction context because of its potential for over-optimism, may or may not be a problem in a matching context, depending on how well it achieves balance.

Third, cross-fitting introduces a random element into the matching, which adds variability that is not accounted for by the uncertainty estimation and reduces the replicability of the analysis. Matching already struggles with arbitrary aspects of the data (e.g., many methods depend on how the data are sorted), so adding another one is not desirable unless the benefit is large, as it can be with a more flexible propensity score model like BART.

So, you are welcome to use whatever method you want to estimate the propensity scores (as long as no information about any post-treatment variable is included), and you can assess empirically which one performs best in your dataset by measuring balance after matching. I would bet that in most cases, a single logistic regression will yield better balance than a cross-fitted logistic regression like you are proposing. If you notice that your propensity score model is performing poorly (i.e., in the sense that balance is not achieved on the covariates even when it is achieved on the propensity score), you might try other methods of estimating the propensity score, including methods that regularize the logistic regression model or circumvent the need to specify the model parametrically, like GBM or BART. You can automate the process of finding a good propensity score model using functions in the cobalt package, which allow you to quickly assess balance; a guide and example for doing so are here and here.
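
As a hedged sketch of that workflow (again with the bundled lalonde data and two arbitrary candidate models; the gbm option additionally requires the gbm package), cobalt::bal.tab() can be used to compare balance after matching:

```r
library(MatchIt)
library(cobalt)
data("lalonde", package = "MatchIt")

f <- treat ~ age + educ + race + married + nodegree + re74 + re75

# Two candidate propensity score models
m.glm <- matchit(f, data = lalonde, method = "nearest", distance = "glm")
m.gbm <- matchit(f, data = lalonde, method = "nearest", distance = "gbm")

# Compare covariate balance before and after matching for each model
bal.tab(m.glm, un = TRUE, stats = c("mean.diffs", "ks.statistics"))
bal.tab(m.gbm, un = TRUE, stats = c("mean.diffs", "ks.statistics"))
```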

If you estimate propensity scores outside matchit(), you can indeed supply them to the distance argument. You can also supply an arbitrary distance matrix to this argument if you want to customize it; see help("distance", package = "MatchIt") for details.
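
For example (continuing the hypothetical ps vector from the cross-fitting sketch above), externally estimated propensity scores can be passed as a numeric vector; the formula is still needed so that matchit() knows which covariates to assess balance on:

```r
# `ps` is the vector of externally estimated propensity scores from above
m.ext <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                 data = lalonde, method = "nearest",
                 distance = ps)  # numeric vector of propensity scores
summary(m.ext)
```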

ibarshai commented 1 month ago

This is the single best reply I've ever received to anything I've ever posted on the internet. Thank you so much.

ngreifer commented 1 month ago

This comment made my day, thank you so much for saying that! Glad I could be helpful!