Closed waynelapierre closed 3 years ago
Hello,
Can you be more specific about which functions in `MatchIt` are running slowly? I can think of a few reasons why certain functions might run slowly and some ways to speed them up, but without further information there is little I can do.
If you are estimating propensity scores within `matchit()`, the estimation of the propensity scores could be slow due to the fixed effects. By default, propensity scores are estimated by `glm()`, so it may be that `glm()` is the bottleneck. One way to get around this is to estimate the propensity scores outside `matchit()` using a package specifically designed to handle fixed effects quickly, such as the `fixest` package, and then supply those propensity scores to `matchit()` with the `distance` argument. For example, if your fixed effect variable is called `cl` (i.e., for cluster), you could run the following:

```r
fefit <- fixest::feglm(treat ~ X1 + X2 | cl, data = data, family = binomial)
ps <- fefit$fitted
m.out <- matchit(treat ~ X1 + X2, data = data, distance = ps)
```
Other propensity score estimation methods, such as `distance = "cbps"`, may simply be unable to handle many fixed effects, so you should avoid them in this situation.
If you are performing Mahalanobis distance or genetic matching, `matchit()` may need to invert and multiply huge matrices if there are many fixed effects and many units. This cannot be avoided except by excluding the fixed effects from the calculation of the Mahalanobis distance.
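One way to sketch that workaround, reusing the placeholder names from the earlier example (the fixed-effect variable `cl` and covariates `X1`, `X2` are assumptions, not part of your data), is to compute the Mahalanobis distance on the substantive covariates only and handle the fixed effects through exact matching instead:

```r
# Hedged sketch: the Mahalanobis distance is computed from X1 and X2
# only; the fixed-effect variable cl is kept out of the distance
# calculation and handled by exact matching instead
m.out <- matchit(treat ~ X1 + X2, data = data,
                 distance = "mahalanobis", exact = ~cl)
```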
If `summary()` is running slowly after including fixed effects in the `matchit()` model formula, that is because `summary()` needs to compute balance on every fixed effect individually, which can take a long time. You can avoid this by using the first method I recommended, so that the fixed effects are included in the propensity score but not in the `matchit()` object, or by using `cobalt` to assess balance instead of `MatchIt`, since `cobalt` offers finer control over which covariates are included.
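As a hedged sketch of the `cobalt` approach, assuming the matched object `m.out` and placeholder covariates `X1` and `X2` from the example above, you could restrict the balance table to the substantive covariates via the formula interface:

```r
library(cobalt)

# Sketch: assess balance only on X1 and X2; the fixed-effect dummies
# are omitted because they do not appear in this formula
bal.tab(treat ~ X1 + X2, data = data,
        weights = m.out$weights, method = "matching")
```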
Let me know if any of this helped, or please provide more detail so I can better address the problem.
I would suggest using exact restrictions; that is, match within the groups that define the fixed effects. The idea of fixed effects is basically a within-group comparison, and matching exactly on the groups is usually a better strategy. See this paper and this one, which show the equivalence (or lack thereof) between fixed effects and matching. The first paper is about one-way fixed effects, while the other is about two-way fixed effects.
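A minimal sketch of this suggestion, reusing the placeholder names from the earlier example (`cl` as the grouping variable that defines the fixed effects):

```r
# Sketch: match within the groups that define the fixed effects by
# exact-matching on the grouping variable cl
m.out <- matchit(treat ~ X1 + X2, data = data, exact = ~cl)
```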
Thanks so much! The fixest method fixed my problem! I have a follow-up question: how can I specify that each treated observation's match must be in the same group (such as industry, year, etc.) and that the matching distance cannot be higher than 0.1? If some treated observations do not have a match that satisfies these requirements, they should be dropped from the treated group.
Use the `exact` argument to request exact matching on those characteristics, i.e., `exact = ~industry + year`. This ensures that each treated unit's match is within the same industry and year. Use the `caliper` argument to restrict the distance between matches. By default, the caliper is in standard deviation units of the distance measure (i.e., the propensity score); use the `std.caliper` argument to control whether the caliper should instead be in raw units. For example, `caliper = .1, std.caliper = FALSE` ensures that each treated unit's match has a propensity score within .1 of the treated unit's propensity score. You can also place calipers on individual covariates in addition to the propensity score. Any treated units that don't have matches satisfying the `exact` and `caliper` restrictions will be dropped.
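Putting both restrictions together, a hedged sketch (assuming the externally estimated propensity scores `ps` from the `fixest` example above, and `industry` and `year` columns in `data`):

```r
# Sketch: matches must share industry and year, and the externally
# estimated propensity scores (supplied via distance) must differ by
# no more than 0.1 on the raw scale; unmatched treated units are dropped
m.out <- matchit(treat ~ X1 + X2, data = data, distance = ps,
                 exact = ~industry + year,
                 caliper = 0.1, std.caliper = FALSE)
```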
Thanks so much. I just want to make sure that the variables supplied to the `exact` argument do not have to be among the variables used for matching. For example, `matchit(y ~ x, exact = ~z + h, data = data)` will work.
My understanding is that you aren't using any variables for matching except the propensity score, which is supplied to `distance`. The variables in the main formula are used solely for balance checking with `summary()` but will not affect the match if you provide already-estimated propensity scores to the `distance` argument (unless you're using genetic matching).

The variables in `exact` and `caliper` just need to be in the dataset supplied to `data` and don't need to be specified anywhere else, so the example you provided should work fine as long as `z` and `h` are in `data`.
OK. Thanks for the clarification.
It seems that when I supply `matchit()`'s `distance` with the fitted values from a `feglm` model, setting `caliper` to 0.1 and `std.caliper` to `FALSE` does not drop matched observations with a distance higher than 0.1. Is this a bug?
You need to provide more information for me to help you. Please provide your code and the results that you think are in error and I can try to assess.
My bad, I mistyped it. Thanks again for the great package and help!
The `MatchIt` package gets very slow when I add fixed effects. Is there any way to make this kind of operation faster?