MatchIt on Big Data - Horizontal data split on exact variables

kosukeimai / MatchIt

R package MatchIt

MatchIt on Big Data - Horizontal data split on exact variables #66

Closed nchemine closed 3 years ago

nchemine commented 3 years ago

I am using MatchIt on big data (2 million records), and it does not run in one go. I need to split my dataset into subsets (based on the values of my exact variables) and run MatchIt on each subset to make it work. It would be really great if MatchIt could do that automatically (or let users request it); that would be an immense improvement in computing efficiency!

Also, I am looking for a way to re-combine all my different MatchIt outputs (each run on a subset of my data) and do not know how to do that - any help would be greatly appreciated!!

Thank you for your great work!

ngreifer commented 3 years ago

So funny you asked: both of these features are available in the development version of MatchIt on my GitHub. With nearest neighbor matching, when an argument to exact is specified, matching takes place separately within each level of the exact matching variables, which will speed up execution. I have plans to allow this to be parallelized, too, but I haven't implemented that yet.

Also, if you match separately within subgroups (i.e., using different calls to matchit()), you can now use a special rbind() method to bind the several match.data() outputs together into a single output for effect estimation. You do still have to assess balance within each matchit object separately, though, as these cannot be combined. You can be clever and use the rbind() output with cobalt if you retain the unmatched units in the match.data() calls.
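As a rough illustration of the subgroup workflow described above, here is a minimal sketch. It assumes an installed MatchIt version that provides the rbind() method for match.data() outputs, and it uses the lalonde example data shipped with the package rather than the original poster's data. The split variable (married) and the covariates are arbitrary choices made for illustration only.

```r
library(MatchIt)

# Example data shipped with MatchIt (a stand-in for the poster's data)
data("lalonde", package = "MatchIt")

# Split the data on the exact-matching variable (here: married, chosen arbitrarily)
subsets <- split(lalonde, lalonde$married)

# Run a separate nearest neighbor match within each subset
fits <- lapply(subsets, function(d) {
  matchit(treat ~ age + educ + re74 + re75, data = d, method = "nearest")
})

# Extract the matched data from each fit, passing the data explicitly.
# Adding drop.unmatched = FALSE here would retain the unmatched units.
md_list <- Map(function(fit, d) match.data(fit, data = d), fits, subsets)

# Bind the per-subset outputs into one data set for effect estimation
md_all <- do.call(rbind, unname(md_list))

# Matched sample sizes by treatment group and subgroup
table(treat = md_all$treat, married = md_all$married)
```

Balance would still be assessed on each element of fits separately (for example with summary()), as noted above.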

To install the development version from my GitHub, you can run devtools::install_github("ngreifer/MatchIt").

nchemine commented 3 years ago

Hello Noah,

This is amazing!! Thank you so much! I tried it this morning and it worked brilliantly. I am just wondering what the implications of using this development version are. I am doing professional biomedical data analysis and need to be sure the results are correct. Is there anything I should be aware of? When do you plan to release this feature in the "publicly available" version?

Thank you very much again and kind regards,
Nathalie

(Email reply to the message Noah Greifer wrote on Friday 30 April 2021 at 05:04:10 CEST, quoted above.)

ChristelSwift commented 3 years ago

Hi Noah, is this new feature implemented as of MatchIt 4.2.0? I'm trying optimal pair matching with exact matching on some selected variables and Mahalanobis distance on other variables.

I can't get it to work with the full dataset, which has about 500k rows (of which about 150k are in the treated group); I keep running out of memory. Do I have to use a specific syntax for the matching to take place separately in each level of the exact matching variables? I'm currently using something like this:

match1 <- matchit(
  experiment_group ~ .,
  method = "optimal",
  mahvars = ~ age + children + comedy + drama + ents + factual + learning + music + news + sport,
  exact = c("gender", "acorn_category", "hf"),
  data = db
)

ngreifer commented 3 years ago

Yes, that is a feature in 4.2.0, but only for nearest neighbor matching. Exact matching with optimal matching is handled by the optmatch package, which may not be able to handle such a large dataset. The ability to handle large datasets is an advantage of nearest neighbor over optimal matching. By setting verbose = TRUE with method = "nearest" you can also track the progress within each category of the exact variables. In general, nearest neighbor and optimal matching yield similar results, so you aren't losing anything by using nearest neighbor.
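For reference, a nearest neighbor call along the lines suggested here might look like the sketch below. It assumes MatchIt 4.2.0 or later and uses the lalonde example data shipped with the package. The covariates, the mahvars set, and the exact variables are illustrative stand-ins, not the variables from the question.

```r
library(MatchIt)

# Example data shipped with MatchIt (a stand-in for the poster's data)
data("lalonde", package = "MatchIt")

# Nearest neighbor matching on Mahalanobis distance (mahvars), with exact
# matching on married and nodegree. verbose = TRUE prints progress while
# the matching runs.
m_nn <- matchit(
  treat ~ age + educ + race + married + nodegree + re74 + re75,
  data = lalonde,
  method = "nearest",
  mahvars = ~ age + educ + re74 + re75,
  exact = ~ married + nodegree,
  verbose = TRUE
)

# Balance summary for the matched sample
summary(m_nn)
```

If some exact-matching cells contain fewer control than treated units, not every treated unit will receive a match, so it is worth checking the matched sample sizes in the summary.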

ChristelSwift commented 3 years ago

Thank you so much for such a prompt reply, I'll try nearest!