insongkim / PanelMatch


Issues with get_covariate_balance #44

Closed Lila333 closed 4 years ago

Lila333 commented 5 years ago

I have a question/suggestion regarding get_covariate_balance. For some variables in my dataset, the function does not plot a line in the graph, and the results are returned as "Inf", "-Inf", or "NaN". For all other variables, it works fine. A close inspection of the data indicates that this only affects dummy variables with very little or no pre-treatment variation. Mathematically and substantively, this makes sense to me. Do you agree? Perhaps this is something that should be flagged to users. Thanks.

adamrauh commented 5 years ago

It sounds like there might be something where we end up dividing by zero somewhere accidentally if there's no variance. Based on a quick look at the code, I think that's a distinct possibility. Do you have an example? That's definitely something we should flag if so.

Lila333 commented 5 years ago

An example would be a dummy variable which, for all specified pre-treatment years, equals zero in all matched pairs for the treatment and control units. This seems a clear case. Another example seems to be where only one matched pair has a (control or treatment) unit-year which equals one (everything else equals zero).
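The two cases described above can be illustrated with a minimal base R sketch. It assumes the balance statistic is a standardized mean difference whose denominator is the standard deviation of the covariate among treated units; that form matches the divide-by-zero hypothesis in this thread, but it has not been checked against the package source. The helper `std_diff` and the toy vectors are hypothetical.

```r
# Hypothetical helper: standardized mean difference for one covariate at one
# pre-treatment period. Using the SD among treated units as the denominator
# is an assumption, not PanelMatch's confirmed formula.
std_diff <- function(treated, control) {
  (mean(treated) - mean(control)) / sd(treated)
}

# Case 1: the dummy is zero for every unit, so the result is 0 / 0
all_zero <- std_diff(treated = c(0, 0, 0, 0), control = c(0, 0, 0, 0))

# Case 2: a single control unit equals one, so a negative numerator is
# divided by a zero SD
one_control <- std_diff(treated = c(0, 0, 0, 0), control = c(0, 0, 1, 0))

# Case 3: the treated units vary, so the denominator is positive
varied <- std_diff(treated = c(0, 1, 0, 1), control = c(0, 0, 1, 0))

print(c(all_zero = all_zero, one_control = one_control, varied = varied))
```

Under this assumption, the first case gives NaN, the second gives -Inf, and only the third gives a finite value.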

RobbyMax1999 commented 5 years ago

Are there any updates on this? Thanks.

adamrauh commented 5 years ago

If someone could provide a bit of code that recreates the error exactly, that would be helpful for diagnosis.

RobbyMax1999 commented 5 years ago

An example would be this (with prior matching on missings; dummy3 being the variable with little to no variation between treated and controls):

get_covariate_balance(PM.results$att, dataset, covariates = c("dummy1", "dummy2", "cont1", "dummy3"), verbose = T, plot = F)

     dummy1  dummy2    cont1 dummy3
t_1 1.48450 0.12896 -0.25891   -Inf
t_0 1.60396 0.36792 -0.27894    NaN

RobbyMax1999 commented 5 years ago

Was this example helpful? Thanks again.

adamrauh commented 5 years ago

Could you set up the example so that I can just copy/paste the code to recreate the error? Sorry. That would help a lot.

RobbyMax1999 commented 5 years ago

Ok there might be a misunderstanding. The problem is not the code, but that it is unclear how the covariate balance is calculated if there is very limited variation in binary variables. So whether you can reproduce the error (if this is indeed an error) or not depends on the data. But if you just want to copy paste the code, you can copy the below and plug in your variables/data:

get_covariate_balance(PM.results$att, dataset, covariates = c("dummy1", "dummy2", "cont1", "dummy3"), verbose = T, plot = F)

If this still does not help, can you be more specific re what exactly it is that you need (e.g., more of the overall code)? Thanks again.

RobbyMax1999 commented 5 years ago

Perhaps to summarize more concisely, the question is whether results like NaN make sense or not? Adam suggested above that "there might be something where we end up dividing by zero somewhere accidentally if there's no variance."

adamrauh commented 5 years ago

Perhaps I've overcomplicated things here :-)

I was hoping you could provide me an easily reproducible example, including the data, so that I don't have to spend time generating some data to recreate the error you're seeing. If you can provide me with data and code that triggers this error, this ensures that I am able to see the same error you see, plus it would make it easier for me to diagnose. Does that make sense?

I'm pretty sure the fix will be simple once I find the problem -- I just need to add in a special case to make sure we aren't dividing by 0 or NA.

RobbyMax1999 commented 4 years ago

I can't share data here, but this should be super quick to create and diagnose with the following steps: 1) create a new dummy variable with very limited variation and add it to an existing dataset; 2) include this new variable in the covariate list used to refine your matched sets. Once you have done this, you can check the results after running get_covariate_balance. If there is no variation in the matched sets (e.g., only 0s or only 1s in all treated and control units in all relevant years), you should get NaN for this variable; if there is very limited variation (e.g., only one unit deviates for one year), you should get +/-Inf. (Or at least that is what seems to be going on.) This should help to quickly reproduce the results reported in this thread and to diagnose the issue. Let me know if this works.
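Step 1 above could look roughly like the following base R sketch. The toy panel, the column names (`unit`, `year`, `cont1`, `dummy_const`, `dummy_rare`), and the seed are all hypothetical. The matching and balance calls are deliberately left out, since their arguments depend on the installed PanelMatch version.

```r
set.seed(1)  # arbitrary seed, used only so the toy data is reproducible

# Hypothetical toy panel: 20 units observed over 10 years
toy <- expand.grid(unit = 1:20, year = 2001:2010)
toy$cont1 <- rnorm(nrow(toy))

# A dummy with no variation at all, and one where a single unit-year deviates
toy$dummy_const <- 0
toy$dummy_rare <- 0
toy$dummy_rare[toy$unit == 3 & toy$year == 2004] <- 1

# Summarize how much variation each covariate has before using it in matching
covs <- c("cont1", "dummy_const", "dummy_rare")
variation <- sapply(toy[covs], function(x) {
  c(n_unique = length(unique(x)), sd = sd(x))
})
print(variation)
```

A covariate with a single unique value, or with only one deviating observation, is a likely candidate for the NaN or +/-Inf results discussed here once the data is restricted to the matched sets.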

Lila333 commented 4 years ago

As outlined at the beginning of this thread, these results make sense to me, but it would be good to know for sure from the program developers, esp. since this seems to be a common occurrence.

adamrauh commented 4 years ago

I'll be attempting to address this today @Lila333 @RobbyMax1999

adamrauh commented 4 years ago

I believe I just pushed up a patch for this. Instead of showing NA/NaN/Inf values, those variables will just be dropped, and it should display a warning to users saying which variables were removed due to the lack of variation. Try updating and let me know if this solves the problem. @Lila333 @RobbyMax1999
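The behavior described in this comment could be sketched in base R as follows. This is not the actual patch: `drop_nonfinite` is a hypothetical helper, and the matrix simply reuses the balance values reported earlier in this thread.

```r
# Hypothetical helper, not PanelMatch's actual code: drop covariates whose
# balance values are not all finite and warn about which ones were removed.
drop_nonfinite <- function(balance) {
  bad <- colnames(balance)[!apply(is.finite(balance), 2, all)]
  if (length(bad) > 0) {
    warning("Dropped covariates with no usable variation: ",
            paste(bad, collapse = ", "))
  }
  balance[, setdiff(colnames(balance), bad), drop = FALSE]
}

# Balance values reported earlier in this thread
balance <- matrix(
  c(1.48450, 0.12896, -0.25891, -Inf,
    1.60396, 0.36792, -0.27894, NaN),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("t_1", "t_0"), c("dummy1", "dummy2", "cont1", "dummy3"))
)

# Emits a warning naming dummy3 and returns the remaining three columns
cleaned <- drop_nonfinite(balance)
print(cleaned)
```

Filtering on non-finite results rather than on the raw data keeps the sketch independent of how the balance statistic itself is computed.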

Lila333 commented 4 years ago

This works for me! Thanks!