insongkim / PanelMatch


Using PanelMatch with large DFs #89

Closed LuMesserschmidt closed 2 years ago

LuMesserschmidt commented 2 years ago

Dear colleagues,

thank you for providing such an innovative public good to the community. I am researching how FDI projects affect local nighttime light development. I have seen your answers on issues #53 and #46. Working with >15 million rows, I am running into memory issues and system abort errors (even though I am working on a 500 GB RAM cloud with 20 nodes), and I hope that you can help me overcome them:

Let me provide a bit more background on the data: I have divided the world into raster cells (~900k), and for each cell I have 17 years of observations (2002-2018): how the light pollution developed ("lights"), whether the cell was treated in a given year ("treatment"), and how much FDI it received ("fdi_volume"). Moreover, I control for the population size ("hyde") in each raster cell. Many cells have never been treated, and the distribution of FDI projects is extremely uneven.

Here is a small reproducible example:

```r
library(tidyverse)

set.seed(1000)  # set the seed before any random draws for reproducibility
year        <- as.numeric(2002:2018)
country     <- c("AFG", "ALB", "Country")
project_num <- 1:5
treatment   <- sample(c(0, 1), 255, replace = TRUE)
lights      <- runif(255, 1, 63)
hyde        <- runif(255, 1000, 200000)
fdi_volume  <- runif(255, 1, 200)

# cross-join year x country x project, then attach the simulated columns
dt <- merge(year, country) %>% dplyr::rename(year = x, country = y)
dt <- merge(dt, project_num) %>%
  dplyr::rename(project_num = y) %>%
  mutate(id = paste(country, project_num, sep = "-"))
dt <- cbind(dt, treatment, lights, hyde, fdi_volume)
```

What solutions have you discovered for working with large datasets? I found that Mahalanobis matching worked under specific circumstances, while propensity score matching and weighting always failed. I tried to find workarounds by splitting the sample or writing a loop, but I haven't yet come up with a satisfactory solution (I read your wiki on Matched Set Objects).

Alternatives: In case a loop is not feasible, there might be another workaround. So far, I am including the country as a covariate. One idea would be to divide the dataset by country and run PanelMatch separately for each one. But here I have doubts:
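The country-by-country idea could be sketched roughly as follows. This is a minimal sketch, not a tuned analysis: the column names follow the toy example above, and the `lag`, `lead`, `size.match`, and `covs.formula` values passed to `PanelMatch()` are illustrative placeholders.

```r
library(PanelMatch)

# Run PanelMatch separately within each country so each matching problem
# stays small. All tuning arguments below are illustrative placeholders.
results <- list()
for (ctry in unique(dt$country)) {
  sub <- dt[dt$country == ctry, ]
  sub$year <- as.integer(sub$year)        # PanelMatch expects integer time ids
  sub$unit <- as.integer(factor(sub$id))  # ...and integer unit ids
  pm <- PanelMatch(lag = 3, time.id = "year", unit.id = "unit",
                   treatment = "treatment",
                   refinement.method = "mahalanobis",
                   data = sub, match.missing = TRUE,
                   covs.formula = ~ hyde + fdi_volume,
                   size.match = 5, qoi = "att",
                   outcome.var = "lights", lead = 0:2,
                   forbid.treatment.reversal = FALSE)
  results[[ctry]] <- PanelEstimate(sets = pm, data = sub)
}
```

One consequence of this split, which the doubts below touch on, is that units can then only ever be matched to control units within the same country.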

If you allow me, let me post a few more questions here instead of starting new issues:

LuMesserschmidt commented 2 years ago

To give a brief follow-up on my case:

I looped the PanelMatch function by country (as described above) and calculated the treatment effect for every country. I then calculated the pooled mean and variance (https://www.ncbi.nlm.nih.gov/books/NBK56512/). This partially inflated the standard errors, but the effect estimates are nearly the same. Do you have any opinions on whether this looping violates any of your model assumptions?
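For reference, the pooling step can be sketched in base R. This is an assumed reading of the linked reference's formula for combining group means and variances (sizes `n`, means `m`, variances `v` per country are hypothetical inputs, not values from my data):

```r
# Pooled mean and variance across K subgroups, combining each group's
# size (n), mean (m), and variance (v):
#   pooled mean  = sum(n * m) / N
#   pooled var   = (within-group SS + between-group SS) / (N - 1)
pool <- function(n, m, v) {
  N    <- sum(n)
  mbar <- sum(n * m) / N
  vbar <- (sum((n - 1) * v) + sum(n * (m - mbar)^2)) / (N - 1)
  c(mean = mbar, var = vbar)
}

# sanity check: pooling two halves of one sample recovers its overall moments
x <- c(1, 2, 3, 4, 5, 6)
a <- x[1:3]; b <- x[4:6]
p <- pool(n = c(3, 3), m = c(mean(a), mean(b)), v = c(var(a), var(b)))
# p["mean"] equals mean(x) and p["var"] equals var(x)
```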

Thanks!