insongkim / PanelMatch


Using PanelMatch with large DFs #89

Closed LuMesserschmidt closed 2 years ago

LuMesserschmidt commented 2 years ago

Dear colleagues,

thank you for providing such an innovative public good to the community. I am researching how FDI projects affect local nighttime light development. I have seen your answers on issues #53 and #46. Working with >15 million rows, I am running into memory issues and system abort errors (even though I am working on a 500 GB RAM cloud with 20 nodes), and I hope that you can help me overcome them:

Let me provide a bit more background on the data: I have divided the world into raster cells (~900k), and for each cell I have 17 years of observations (2002-2018): how the light pollution developed ("lights"), whether the cell was treated in a given year ("treatment"), and how much FDI it received ("fdi_volume"). Moreover, I control for the population size ("hyde") in each raster cell. Many cells have never been treated, and the distribution of FDI projects is extremely uneven.

Here is a small reproducible example:

```r
library(tidyverse)

set.seed(1000)  # set the seed before any random draws for reproducibility
year        <- as.numeric(2002:2018)
country     <- c("AFG", "ALB", "Country")
project_num <- 1:5
treatment   <- sample(c(0, 1), 255, replace = TRUE)
lights      <- runif(255, 1, 63)
hyde        <- runif(255, 1000, 200000)
fdi_volume  <- runif(255, 1, 200)

# cross-join year x country x project, then attach the simulated columns
dt <- merge(year, country) %>% dplyr::rename(year = x, country = y)
dt <- merge(dt, project_num) %>%
  dplyr::rename(project_num = y) %>%
  mutate(id = paste(country, project_num, sep = "-"))
dt <- cbind(dt, treatment, lights, hyde, fdi_volume)
```

What solutions have you discovered for working with large datasets? I found that Mahalanobis matching worked under specific circumstances, while propensity score matching and weighting always failed. I tried to find workarounds by splitting the sample or writing a loop, but I haven't yet come up with a satisfactory solution (I read your wiki on Matched Set Objects).

Alternatives: In case a loop is not feasible, there might be another workaround. So far, I am including the country as a covariate. One idea would be to divide the dataset by country and run PanelMatch separately for each one. But here I have doubts:
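The country-by-country idea could be sketched roughly as follows. This is a minimal sketch, not a tuned analysis: the column names follow the toy example above, and the `lag`, `lead`, `size.match`, and `covs.formula` values passed to `PanelMatch()` are illustrative placeholders.

```r
library(PanelMatch)

# Run PanelMatch separately within each country so each matching problem
# stays small. All tuning arguments below are illustrative placeholders.
results <- list()
for (ctry in unique(dt$country)) {
  sub <- dt[dt$country == ctry, ]
  sub$year <- as.integer(sub$year)        # PanelMatch expects integer time ids
  sub$unit <- as.integer(factor(sub$id))  # ...and integer unit ids
  pm <- PanelMatch(lag = 3, time.id = "year", unit.id = "unit",
                   treatment = "treatment",
                   refinement.method = "mahalanobis",
                   data = sub, match.missing = TRUE,
                   covs.formula = ~ hyde + fdi_volume,
                   size.match = 5, qoi = "att",
                   outcome.var = "lights", lead = 0:2,
                   forbid.treatment.reversal = FALSE)
  results[[ctry]] <- PanelEstimate(sets = pm, data = sub)
}
```

One consequence of this split, which the doubts below touch on, is that units can then only ever be matched to control units within the same country.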

If you allow me, let me post a few more questions here instead of starting new issues:

LuMesserschmidt commented 2 years ago

To give a brief follow-up on my case:

I looped the PanelMatch function by country (as described above) and calculated the treatment effect for every country. I then calculated the pooled mean and variance (https://www.ncbi.nlm.nih.gov/books/NBK56512/). This partially inflated the standard errors, but the effect estimates are nearly the same. Do you have any opinions on whether this looping violates any of your model assumptions?
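For reference, the pooling step can be sketched in base R. This is an assumed reading of the linked reference's formula for combining group means and variances (sizes `n`, means `m`, variances `v` per country are hypothetical inputs, not values from my data):

```r
# Pooled mean and variance across K subgroups, combining each group's
# size (n), mean (m), and variance (v):
#   pooled mean  = sum(n * m) / N
#   pooled var   = (within-group SS + between-group SS) / (N - 1)
pool <- function(n, m, v) {
  N    <- sum(n)
  mbar <- sum(n * m) / N
  vbar <- (sum((n - 1) * v) + sum(n * (m - mbar)^2)) / (N - 1)
  c(mean = mbar, var = vbar)
}

# sanity check: pooling two halves of one sample recovers its overall moments
x <- c(1, 2, 3, 4, 5, 6)
a <- x[1:3]; b <- x[4:6]
p <- pool(n = c(3, 3), m = c(mean(a), mean(b)), v = c(var(a), var(b)))
# p["mean"] equals mean(x) and p["var"] equals var(x)
```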

Thanks!