ngreifer / cobalt

Covariate Balance Tables and Plots - An R package for assessing covariate balance
https://ngreifer.github.io/cobalt/

bal.tab slow on large dataset (MatchIt) #44

Closed · willjobs closed 4 years ago

willjobs commented 4 years ago

Hi there. I'm running bal.tab on the results of a MatchIt run on a dataset of about 120,000 rows. The MatchIt process took about 2 hours to run, producing a matchit object about 229 MB in size. I tried running bal.tab as follows:

baltab <- bal.tab(m.out1, m.threshold = 0.1, binary = "std")

and it's taking a long time (still running, currently at over an hour). I was able to run a practice example from the documentation on the lalonde dataset, and that worked fine. I was also able to run matchit on a sample of 2,000 rows and run bal.tab on that (which took about 5 seconds). So I'm confused about why this is taking so long.

I am using R 3.5.1, MatchIt version 3.0.2, and cobalt version 3.6.1.

Thank you!

Edit: I killed the R session after > 4 hours of running, it never seemed to finish.

ngreifer commented 4 years ago

Hi, sorry for not getting back to you earlier. I'm definitely looking into speeding up cobalt, perhaps with C. That said, would you mind updating to the newest version of cobalt and trying again? The version you were using is over a year and a half old, and I've made many performance upgrades since then.

willjobs commented 4 years ago

Hi Noah, no problem. I did update cobalt and it appeared to help. The biggest help, though, came from working around what I think is a bug in MatchIt. When I ran the matchit function, I was giving it the full dataset with a few hundred columns, but I had already calculated the propensity score on my own, so matchit didn't need all the covariates; it only needed the propensity score and some kind of ID column. Once matchit finished running, the resulting MatchIt object was much smaller than before (something like 120 MB --> 20 MB). Then I could run cobalt's bal.tab function on this matchit object and pass in the other covariates with the addl parameter, and this was able to complete.
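If I remember right, it looked roughly like this (the data, model, and variable names here are made up, and newer versions of MatchIt accept a precomputed propensity score through the distance argument):

library(MatchIt)
library(cobalt)

## Propensity score estimated separately, before calling matchit()
ps_model <- glm(treat ~ x1 + x2 + x3, data = dat, family = binomial)
dat$pscore <- predict(ps_model, type = "response")

## Give matchit() only the columns it needs; supplying the precomputed
## score as the distance keeps the returned object small
m.out <- matchit(treat ~ pscore,
                 data = dat[, c("id", "treat", "pscore")],
                 distance = dat$pscore)

## Assess balance on the full covariate set through addl
bal.tab(m.out, addl = dat[, c("x1", "x2", "x3")], binary = "std")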

I may be forgetting some details (or have misspoken about some); this was a project I was working on back in July and no longer have access to. Hope this all helps!

ngreifer commented 4 years ago

Hi Will, glad to hear you were able to solve it. I don't know if you know, but I'm in the process of a massive update of MatchIt (rewriting it basically from scratch), so any other feedback you have on it would be useful. Working with large datasets is definitely one of its weaknesses. If there are any features you would like to see in it, please let me know.

willjobs commented 4 years ago

Oh that's really interesting! Are you working with the original authors?

One feature I ended up coding myself was a way to summarize a categorical variable using the median standardized difference across all of its levels. That made the resulting love plot a lot easier to interpret.
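It was something along these lines (reconstructed from memory; the Diff.Adj column and the variable_level rownames are how cobalt's bal.tab output looked to me, but the grouping regex is my own, approximate approach):

library(cobalt)

b <- bal.tab(m.out, binary = "std")
bal <- b$Balance

## cobalt expands a factor like race into rows named "race_black",
## "race_hispan", ...; collapse those back to one value per variable
## by taking the median absolute standardized difference across levels
var_group <- sub("_[^_]*$", "", rownames(bal))
median_smd <- tapply(abs(bal$Diff.Adj), var_group, median)

I then plotted those medians instead of a point for every level.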

ngreifer commented 4 years ago

I am. I'm a postdoc for Liz Stuart. She got permission from the other authors for me to update it. You can check it out on my GitHub if you want to play around with it while I'm developing it. It's basically done; I'm just working on vignettes and trying to implement Rcpp.

That's an interesting idea. I agree that summarizing balance on categorical variables with many levels can be cumbersome. My arch nemesis Kazuki Yoshida (just kidding, I respect him greatly) uses a single-value statistic to summarize balance on categorical variables in his tableone package, which is very similar to cobalt (the tables it produces are actually much nicer). I chose to leave a balance statistic for each category because the bias in the effect estimate depends on the imbalance in each category.
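As a small illustration with the lalonde data included in cobalt, a factor gets one row per level rather than one collapsed number:

library(cobalt)
data("lalonde", package = "cobalt")

## race is a factor; bal.tab() reports a separate difference for each
## level (race_black, race_hispan, race_white) instead of a single
## summary statistic for the variable
bal.tab(treat ~ age + educ + race, data = lalonde)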