Closed waynelapierre closed 3 years ago
Hello,
Can you be more specific about which functions in `MatchIt` are running slowly? I can think of a few reasons why certain functions might run slowly and some ways to speed them up, but without further information there is little I can do.
If you are estimating propensity scores within `matchit()`, the estimation of the propensity scores could be slow due to the fixed effects. By default, propensity scores are estimated by `glm()`, so it may be that `glm()` is the bottleneck. One way to get around this is to estimate the propensity scores outside `matchit()` using a package specifically designed to handle fixed effects quickly, such as the `fixest` package, and then supply those propensity scores to `matchit()` with the `distance` argument. For example, if your fixed effect variable is called `cl` (i.e., for cluster), you could run the following:

```r
fefit <- fixest::feglm(treat ~ X1 + X2 | cl, data = data, family = binomial)
ps <- fefit$fitted
m.out <- matchit(treat ~ X1 + X2, data = data, distance = ps)
```
Other propensity score estimation methods, such as `distance = "cbps"`, may simply be unable to handle many fixed effects, so you should avoid them in this situation.
If you are performing Mahalanobis distance or genetic matching, `matchit()` may need to invert and multiply huge matrices if there are many fixed effects and many units. This cannot be avoided except by excluding the fixed effects from the calculation of the Mahalanobis distance.
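One way to sketch that workaround, reusing the placeholder names from the earlier example (the fixed-effect variable `cl` and covariates `X1`, `X2` are assumptions, not part of your data), is to compute the Mahalanobis distance on the substantive covariates only and handle the fixed effects through exact matching instead:

```r
# Hedged sketch: the Mahalanobis distance is computed from X1 and X2
# only; the fixed-effect variable cl is kept out of the distance
# calculation and handled by exact matching instead
m.out <- matchit(treat ~ X1 + X2, data = data,
                 distance = "mahalanobis", exact = ~cl)
```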
If `summary()` is running slowly after including fixed effects in the `matchit()` model formula, that is because `summary()` needs to compute balance on every fixed effect individually, which can take a long time. You can avoid this by using the first method I recommended, so that the fixed effects are included in the propensity score but not in the `matchit()` object, or by using `cobalt` to assess balance instead of `MatchIt`, since `cobalt` offers finer control over which covariates are included.
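As a hedged sketch of the `cobalt` approach, assuming the matched object `m.out` and placeholder covariates `X1` and `X2` from the example above, you could restrict the balance table to the substantive covariates via the formula interface:

```r
library(cobalt)

# Sketch: assess balance only on X1 and X2; the fixed-effect dummies
# are omitted because they do not appear in this formula
bal.tab(treat ~ X1 + X2, data = data,
        weights = m.out$weights, method = "matching")
```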
Let me know if any of this helped, or please provide more detail so I can better address the problem.
I would suggest using exact restrictions; that is, match within the groups that define the fixed effects. The idea of fixed effects is basically a within-group comparison, and matching exactly on the groups is usually a better strategy. See this paper and this one, which show the equivalence (or lack thereof) between fixed effects and matching. The first paper is about one-way fixed effects, while the other is about two-way fixed effects.
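A minimal sketch of this suggestion, reusing the placeholder names from the earlier example (`cl` as the grouping variable that defines the fixed effects):

```r
# Sketch: match within the groups that define the fixed effects by
# exact-matching on the grouping variable cl
m.out <- matchit(treat ~ X1 + X2, data = data, exact = ~cl)
```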
Thanks so much! The fixest method fixed my problem! I have a follow-up question: how can I specify that each treated observation's match must be in the same group (such as industry, year, etc.) and that the matching distance cannot be higher than 0.1? If some treated observations do not have a match that satisfies these requirements, they should be dropped from the treated group.
Use the `exact` argument to request exact matching on those characteristics, i.e., `exact = ~industry + year`. This ensures that each treated unit's match is within the same industry and year. Use the `caliper` argument to restrict the distance between matches. By default, the caliper is in standard deviation units of the distance measure (i.e., the propensity score); use the `std.caliper` argument to control whether the caliper should instead be in raw units. For example, `caliper = .1, std.caliper = FALSE` ensures that each treated unit's match has a propensity score within .1 of the treated unit's propensity score. You can also place calipers on individual covariates in addition to the propensity score. Any treated units that don't have matches satisfying the `exact` and `caliper` restrictions will be dropped.
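Putting both restrictions together, a hedged sketch (assuming the externally estimated propensity scores `ps` from the `fixest` example above, and `industry` and `year` columns in `data`):

```r
# Sketch: matches must share industry and year, and the externally
# estimated propensity scores (supplied via distance) must differ by
# no more than 0.1 on the raw scale; unmatched treated units are dropped
m.out <- matchit(treat ~ X1 + X2, data = data, distance = ps,
                 exact = ~industry + year,
                 caliper = 0.1, std.caliper = FALSE)
```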
Thanks so much. I just want to make sure that the variables supplied to the `exact` argument do not have to be among the variables used for matching. For example, `matchit(y ~ x, exact = ~z + h, data = data)` will work.
My understanding is that you aren't using any variables for matching except the propensity score, which is supplied to `distance`. The variables in the main formula are used solely for balance checking with `summary()` but will not affect the match if you provide already-estimated propensity scores to the `distance` argument (unless you're using genetic matching).

The variables in `exact` and `caliper` just need to be in the dataset supplied to `data` and don't need to be specified anywhere else, so the example you provided should work fine as long as `z` and `h` are in `data`.
OK. Thanks for the clarification.
It seems that when I supply `matchit()`'s `distance` with the fitted values from a `feglm` model, setting `caliper` to 0.1 and `std.caliper` to `FALSE` does not drop matched observations with a distance higher than 0.1. Is this a bug?
You need to provide more information for me to help you. Please provide your code and the results that you think are in error and I can try to assess.
My bad, I mistyped it. Thanks again for the great package and help!
The `MatchIt` package gets very slow when I add fixed effects. Is there any way to make this kind of operation faster?