kosukeimai / MatchIt

R package MatchIt

Large Dataset with MatchIt #184

Closed ginnydang closed 10 months ago

ginnydang commented 10 months ago

I have a dataset with up to 400,000 entries, and I'm using MatchIt to match two groups. It's very slow; I tried different approaches (parallel processing, etc.) to speed it up, but I haven't been able to get a matched result yet. Could you suggest any way to speed up MatchIt on a large dataset? Are there particular parameters I should use for matching?

Thank you so much!

ngreifer commented 10 months ago

Hi Ginny,

The speed depends on which matching method you are using (e.g., nearest neighbor, optimal, CEM, etc.) and the distance measure you are using (e.g., propensity scores, Mahalanobis distance, etc.). Please provide this information and I can investigate this for you. The fastest matching algorithms are CEM, generalized full matching (method = "quick"), and subclassification. The slowest are usually optimal full matching (method = "full") or cardinality matching.

ginnydang commented 10 months ago

I'm starting with nearest neighbor matching right now, and it takes 30 minutes. Thank you so much for the prompt support! Much appreciated!

ngreifer commented 10 months ago

NN matching is a fundamentally slow method. There is almost nothing you can do to speed it up, sorry. One thing you might do is add exact matching constraints using the exact argument, which separates the matching problem into several smaller ones. With a dataset that large, I recommend instead using generalized full matching, which on my machine takes only about 4 seconds with a dataset of 400,000 units, or subclassification, which takes about 3 seconds. These methods tend to outperform nearest neighbor matching statistically as well.
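The options mentioned in this reply can be sketched as follows. This is a minimal illustration, not the code used in the thread: it assumes a recent MatchIt release, uses the small lalonde dataset bundled with MatchIt as a stand-in for the 400,000-row data, and picks the covariates and the exact-matching variable (married) arbitrarily. Generalized full matching (method = "quick") depends on the optional quickmatch package, so that call is guarded.

```r
library(MatchIt)

# lalonde ships with MatchIt; it stands in for the much larger dataset here.
data("lalonde", package = "MatchIt")

# Illustrative formula; the covariates are an assumption, not from the thread.
f <- treat ~ age + educ + married + nodegree + re74 + re75

# Subclassification on the propensity score.
m.sub <- matchit(f, data = lalonde, method = "subclass")

# Nearest neighbor matching, split into smaller problems by
# requiring exact matches on one covariate.
m.nn <- matchit(f, data = lalonde, method = "nearest", exact = ~ married)

# Generalized full matching; needs the optional quickmatch package.
if (requireNamespace("quickmatch", quietly = TRUE)) {
  m.quick <- matchit(f, data = lalonde, method = "quick")
}
```

Timings on lalonde say nothing about a 400,000-row dataset; wrapping each call in system.time() on the real data is the simplest way to compare the methods.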

ginnydang commented 10 months ago

That’s awesome! Let me try the ones you suggest and see how they work.

Thank you so much!

ginnydang commented 10 months ago

Generalized full matching works well for me. I would also like to keep my sample size intact, so I think that's the best option. I have one more quick question: when I tried optimal full matching, it returned "vector memory exhausted (limit reached?)". I tried parallel processing, and my R installation is 64-bit. Can you think of any reason why memory would be exhausted when I wasn't using that much? When I used nearest neighbor matching, it only took some time; it didn't return any memory-related error.

Much appreciated!

ngreifer commented 10 months ago

Optimal full matching is done by optmatch, and that error comes from that package. The reason for the error is that optmatch requires an N1 x N0 matrix, which could be up to 10 billion values for a dataset of that size. That is too big an object for R to store in memory. NN matching doesn't create that matrix specifically to avoid that issue.
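A back-of-the-envelope calculation shows why such a matrix cannot fit in memory. The sketch below assumes dense storage with 8 bytes per value (the size of an R double); the group sizes are hypothetical, since the thread does not say how the 400,000 units are split between treated and control.

```r
# Approximate size, in gigabytes, of a dense N1 x N0 matrix of doubles.
# bytes_per_value = 8 is an assumption (one R double per entry).
dist_matrix_gb <- function(n_treated, n_control, bytes_per_value = 8) {
  n_treated * n_control * bytes_per_value / 1e9
}

# Hypothetical splits of 400,000 units.
unbalanced_gb <- dist_matrix_gb(30000, 370000)   # about 11 billion entries
balanced_gb   <- dist_matrix_gb(200000, 200000)  # 40 billion entries

print(c(unbalanced = unbalanced_gb, balanced = balanced_gb))
```

Under these assumptions, even the unbalanced split needs far more memory than a typical workstation has, which is consistent with the error reported above.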

ginnydang commented 10 months ago

I got it. Thank you for your big help!

ginnydang commented 10 months ago

For generalized full matching, the matching itself takes only about 2 seconds, but when I try to view the summary table, it takes so long that I haven't seen the result yet. Is there any way around this so that I can see the table?

ngreifer commented 10 months ago

Yes, the slow part is calculating the pairwise differences. You can turn that off by setting pair.dist = FALSE. It will still take a few seconds.
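In code, the suggestion looks roughly like this. It is a sketch on the bundled lalonde data with an arbitrary set of covariates; nearest neighbor matching is used only because it needs no extra packages, and the pair.dist argument is passed to summary() in the same way for other matching methods.

```r
library(MatchIt)

data("lalonde", package = "MatchIt")

# Illustrative model; the covariates are an assumption, not from the thread.
m.out <- matchit(treat ~ age + educ + married + re74,
                 data = lalonde, method = "nearest")

# Skip the pairwise-distance computation, which the reply
# identifies as the slow part of summary().
s <- summary(m.out, pair.dist = FALSE)
print(s)
```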

ginnydang commented 10 months ago

You saved the day! It works beautifully!

ginnydang commented 10 months ago

I hope this is the last question. plot(summary) takes some time too. I think it runs into the same issue. Do you have any tips or tricks for this kind of plot? It is the best way to present the summary table. I tried the jitter plot and it works just fine.

ngreifer commented 10 months ago

Did you also set pair.dist = FALSE in the call to summary()? E.g., plot(summary(m.out, pair.dist = FALSE))

This should not take any longer than just creating the summary table. Making a jitter plot, etc., uses a different function (i.e., running plot() directly on a matchit object); I found this to be slow for QQ plots and ECDF plots but not for density plots.
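The two kinds of plot distinguished here can be sketched as below, again on the bundled lalonde data with arbitrarily chosen covariates. The sketch assumes a MatchIt version whose plot() method for matchit objects supports type = "density"; which.xs and interactive = FALSE are used only to limit the output to two covariates without prompting.

```r
library(MatchIt)

data("lalonde", package = "MatchIt")

# Illustrative model; the covariates are an assumption, not from the thread.
m.out <- matchit(treat ~ age + educ + married + re74,
                 data = lalonde, method = "nearest")

# Plot of the balance summary; pair.dist = FALSE keeps the
# underlying summary() call fast.
plot(summary(m.out, pair.dist = FALSE))

# Plotting the matchit object directly uses a different function.
# Density plots were reported above to be faster than QQ or ECDF plots.
plot(m.out, type = "density", interactive = FALSE,
     which.xs = c("age", "educ"))
```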

ginnydang commented 10 months ago

I tried it yesterday and it didn't work, but it works now. Thank you so much!