I have a dataset with up to 400,000 entries and I'm using MatchIt to match 2 groups. It's very slow, and I've tried different ways (parallel processing, etc.) to speed up the process, but I still haven't been able to see the matched result. Could you please suggest any ideas for speeding up MatchIt with a large dataset? Should I use certain parameters for matching as well?
Thank you so much!
Hi Ginny,
The speed depends on which matching method you are using (e.g., nearest neighbor, optimal, CEM, etc.) and the distance measure you are using (e.g., propensity scores, Mahalanobis distance, etc.). Please provide this information and I can investigate this for you. The fastest matching algorithms are CEM, generalized full matching (method = "quick"), and subclassification. The slowest are usually optimal full matching (method = "full") or cardinality matching.
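For concreteness, here is a minimal sketch of how the method and distance are specified in matchit(); the treatment, covariates, and data frame name are placeholders, not your actual variables:

```r
library(MatchIt)

# treat, age, sex, income, and dat are hypothetical names; substitute your own.
# Nearest-neighbor matching on a logistic-regression propensity score (the default distance):
m_nn <- matchit(treat ~ age + sex + income, data = dat,
                method = "nearest", distance = "glm")

# Generalized full matching, one of the fastest methods:
m_quick <- matchit(treat ~ age + sex + income, data = dat, method = "quick")
```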
I'm starting with nearest neighbor right now and it takes 30 minutes. Thank you so much for the prompt support! Much appreciated!
NN matching is a fundamentally slow method. There is almost nothing you can do to speed it up, sorry. One thing you might do is add exact matching constraints using the exact argument, which separates the matching problem into several smaller ones. With a dataset that large, I recommend instead using generalized full matching, which on my machine takes only about 4 seconds with a dataset of 400,000 units, or subclassification, which takes about 3 seconds. These methods tend to outperform nearest neighbor matching statistically as well.
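A rough sketch of both suggestions (the region variable used for exact matching, and the other names, are assumptions for illustration):

```r
# Exact matching constraints split NN matching into several smaller problems:
m_nn_exact <- matchit(treat ~ age + sex + income, data = dat,
                      method = "nearest", exact = ~region)

# Generalized full matching and subclassification scale far better on large data:
m_gfm <- matchit(treat ~ age + sex + income, data = dat, method = "quick")
m_sub <- matchit(treat ~ age + sex + income, data = dat,
                 method = "subclass", subclass = 20)
```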
That’s awesome! Let me try the ones you suggested and see how it works.
Thank you so much!
It works well for me with Generalized Full Matching. I also would like to keep my sample size intact, so I think that's the best option. Besides, I have one quick question: when I tried Optimal Full Matching, it returned "vector memory exhausted (limit reached?)". I tried parallel processing, and my R installation is 64-bit. Can you think of any reason why memory would run out when I didn't use that much? When I used Nearest, it only took some time; it didn't return any memory-related error.
Much appreciated!
Optimal full matching is done by optmatch, and that error comes from that package. The reason for the error is that optmatch requires an N1 x N0 matrix, which could be up to 10 billion values for a dataset of that size. That is too big an object for R to store in memory. NN matching doesn't create that matrix specifically to avoid that issue.
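As a rough back-of-the-envelope illustration (the 25,000/375,000 split is purely hypothetical), the memory such a matrix would require:

```r
# A dense N1 x N0 matrix of doubles costs 8 bytes per entry.
n1 <- 25000    # hypothetical number of treated units
n0 <- 375000   # hypothetical number of control units
n1 * n0            # roughly 9.4 billion entries
n1 * n0 * 8 / 1e9  # roughly 75 GB of RAM, far more than a typical machine has
```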
I got it. Thank you for your big help!
For Generalized Full Matching, the matching itself only takes about 2 seconds, but when I want to see the summary table, it takes so long that I haven't been able to see the result yet. Is there any way around that so I can see the table?
Yes, the slow part is calculating the pairwise differences. You can turn that off by setting pair.dist = FALSE. It will still take a few seconds.
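For example, with m.out standing in for your generalized full matching result:

```r
# Balance summary without the slow pairwise-distance calculation:
summary(m.out, pair.dist = FALSE)
```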
You saved the day! It works beautifully!
I hope this is the last question. The plot(summary()) call takes some time too. I think it encounters the same issue. Do you have any tips or tricks for this kind of plot? This plot is the best way to present the summary table. I tried the jitter plot and it works just fine.
Did you also set pair.dist = FALSE in the call to summary()? E.g., plot(summary(m.out, pair.dist = FALSE)). This should not take any longer than just creating the summary table. Making a jitter plot, etc., uses a different function (i.e., running plot() directly on a matchit object); I found this to be slow for QQ plots and ECDF plots but not for density plots.
I tried it yesterday and it didn't work, but it works now. Thank you so much!