lrberge / fixest

Fixed-effects estimations
https://lrberge.github.io/fixest/

[feature request] Add informative error message/crash on nonidentification in twfe #469

Closed CetiAlphaFive closed 4 months ago

CetiAlphaFive commented 4 months ago

Is it possible to add an informative flag, or crash, in the circumstances described in this paper (Kropko and Kubinec 2020) (associated blog)? It seems this is a general problem with R FE implementations that's worth addressing.

lrberge commented 4 months ago

Hi, thanks for raising the issue but I disagree strongly with the authors' point. There is no issue at all in R. They are the ones making quite bold, and obviously wrong, statements. Let's give an example.

This comment in the blog about automatic removal of collinear variables is wild:

At this point, you should be concerned. One of our frustrations with this piece is that although the working paper has been in circulation for years, no one seems to have cared about this apparently quite important problem.

And yes, they talk about their working paper solving this almighty problem. Why didn't R developers, after reading their paper, immediately change their software? I can only wonder.

Statistical tools don't have to solve specification problems

To put it simply, the authors blame the statistical tools for not detecting when the user's econometric model is misspecified. This is equivalent to blaming the hammer when you hit your hand, or blaming the racket when you play poor tennis.

Why it is not desirable to error on collinearity problems

The user is in charge of making sure their variable of interest is not in a collinear system... that's the minimum! If this job is done, then the automatic removal of collinear variables is a desirable feature. Simply put, most of the time it only affects the controls (in general multiple overlapping factor levels), and it has no impact whatsoever on the variable of interest. In practice, erroring on collinearity can make the model impossible to estimate, because teasing out collinearity from your controls beforehand can be challenging (imagine manually merging the levels of a factor variable...).

The only case where an error would be useful is when the user hasn't done their job and checked beforehand that the model wasn't misspecified. Ideally we would like:

  1. variable of interest in a collinear system => error or explicit message
  2. collinearity not affecting the variable of interest => OK

The problem is that the software cannot know if we're in 1) or 2). Always assuming 1) would penalize users who know what they are doing, for the benefit of those who don't.
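To make case 2) concrete, here is a minimal sketch using base R's lm() as a stand-in (fixest is not called here, and the simulated data and variable names are made up for illustration). A redundant control is dropped, while the estimate for the variable of interest is unaffected:

```r
# Illustration with base R's lm(); the data and names are hypothetical.
set.seed(1)
n <- 200
x <- rnorm(n)          # variable of interest
ctrl <- rnorm(n)       # a control
ctrl_dup <- 2 * ctrl   # a second control, perfectly collinear with the first
y <- 1 + 0.5 * x + ctrl + rnorm(n)

# lm() drops the redundant control and reports its coefficient as NA
fit <- lm(y ~ x + ctrl + ctrl_dup)
coef(fit)

# The estimate for x matches the model without the redundant control
fit_clean <- lm(y ~ x + ctrl)
all.equal(unname(coef(fit)["x"]), unname(coef(fit_clean)["x"]))
```

Here the removal is harmless because the collinear system only involves controls. The software has no way to tell this situation apart from one where x itself is part of the collinear system, unless x is the only variable left.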

What happens in fixest

Had you tried their example (from the blog) in fixest, this is what you would have obtained:

feols(y ~ x | case + time, gen_data$data)
#> Error: in feols(y ~ x | case + time, gen_data$data): 
#> The only variable 'x' is collinear with the fixed effects. In
#> such circumstances, the estimation is void.

Hence it's covered here.

Note on the tone

I'm sorry for the stinging tone, but reading the finger pointing at software tools, combined with the brazenness, almost choked me to death.

CetiAlphaFive commented 4 months ago

Thanks for getting back on this. I have to say, I don't appreciate the tone of this reply despite your note.

A few things: 1) It's not a working paper. 2) I did try the blog examples using fixest and saw the collinearity error. The authors would claim (as they did to me directly!) the error isn't informative here, since the issue isn't really the collinearity but that the matrix is less than full rank.

I'm a fan of this package, was just trying to improve it. (Last time I'll make that mistake, sheesh.) I'm not saying R packages need to address all identification concerns. Making error messages clearer for end users seems like a weird thing to have such a strong reaction to.

lrberge commented 4 months ago

Thanks for replying, and please accept my sincere apologies if my words hurt you.

Offense there was

Something I have to clarify is that I felt deeply offended (really) by the works, and words, you directed me to. Basically what the authors say is that software developers are morons. I simply couldn't accept that silently (this was too much).

What I tried to explain in my reply is that software is used by a large community of persons with different backgrounds and different objectives. There are always tradeoffs, in favor of specific needs and against others. The authors disregarded these tradeoffs completely.

OSS

On contributions to OSS, of course I don't want to discourage you!!!

But if you allow me, I would give a single piece of advice: Something you must absolutely keep in mind is that written communication is rife with misunderstanding. The only way to make inferences on intentions is through what is written.

For example, from what is written in the original post, in all likelihood it looks like you were someone who had seen a blog, found it interesting, and then opened an issue. So, from my perspective, what I see is someone who took 5min to write an issue which gives me over 2h of work (well above 3 now...). Because to do it well, I would need to go through all the material, find out exactly what you had in mind, implement it, motivate my choices of implementation, etc.

Further, how could I possibly know that you had run the code and found the error message inadequate? The critique in the blog was about the possibility of making mistakes without noticing... and now an error, which puts the problem front and center, isn't enough? The only inference I could draw was that you hadn't even tried it in the software. Here I was wrong, but that was the best inference from the information I had.

Alternative original post

Here is an example of an alternative writing of the original post:

Hello, I have seen some authors argue about how to report model misspecification in the context of two way fixed-effects models (see, e.g. here and there). Differently from the original critique, which was about the possibility to obtain wrong results without noticing, fixest does report an error. However, it seems to me that error/warnings could be more informative.

In the context of TWFE, take the example of an estimation where the variable of interest, say x, is perfectly explained by the individual and time fixed-effects.

If I estimate the model in fixest the variable gets removed and I obtain an error message:

# remotes::install_github("saudiwin/panelsim")
require(panelsim)
gen_data = tw_data(N = 50, T = 50, case.eff.mean = 1, cross.eff.mean = -3,
                   cross.eff.sd = 0, case.eff.sd = 0, noise.sd=.25)

feols(y ~ x | case + time, gen_data$data)
#> Error: in feols(y ~ x | case + time, gen_data$data): 
#> The only variable 'x' is collinear with the fixed effects. In
#> such circumstances, the estimation is void.

I think that this message might not be clear enough, what about: "place your message here"

What do you think, are there any drawbacks to it?

Back on the original post

To be honest I don't see the problem with the error message.

PS: Note that collinearity and rank deficiency are one and the same thing, so I don't understand your point 2. There may be different vocabularies...
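As a rough illustration of why the two terms describe the same situation, here is a base R sketch (not fixest code; the simulated panel and all names are made up). When x is exactly a case effect plus a time effect, adding it to the fixed-effects dummies does not increase the rank of the design matrix:

```r
# Hypothetical balanced panel: N cases observed over TT periods
set.seed(1)
N <- 10
TT <- 8
d <- expand.grid(case = factor(1:N), time = factor(1:TT))
case_eff <- rnorm(N)
time_eff <- rnorm(TT)

# x is exactly a case effect plus a time effect
d$x <- case_eff[as.integer(d$case)] + time_eff[as.integer(d$time)]

X_fe <- model.matrix(~ case + time, d)   # intercept + case and time dummies
X_all <- cbind(x = d$x, X_fe)            # same matrix with x added

qr(X_fe)$rank    # N + TT - 1
qr(X_all)$rank   # unchanged: x adds no new information
ncol(X_all)      # one more column than the rank, so X_all is rank deficient
```

Saying that x is collinear with the fixed effects and saying that the design matrix is less than full rank are two descriptions of this same fact.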

grantmcdermott commented 4 months ago

Against my better judgement, I feel compelled to weigh in here as a third party to this conversation...

@CetiAlphaFive sorry that you felt put out and I hope it doesn't put you off asking questions and contributing to the community in the future. That would be a bad outcome for all parties. At the same time I genuinely don't think @lrberge was being unfair in his response, especially when you consider the time investment taken to respond to these sorts of outside complaints. (It's hard to gauge the time burden until you've maintained a large/complex OSS project.) Personally, I read any ornery tone in Laurent's initial reply as being directed at the original post and, frankly, that's fair game given the aggressiveness of the post in question.

Two more points:

CetiAlphaFive commented 4 months ago

To close the loop here, I certainly regret that you were offended and for causing such a kerfuffle. This was not my intention with the original post. I'm also sorry you spent so much time on the reply. I'm not entirely sure what I could've done about that but nonetheless. If it had turned out you agreed that this issue should be addressed I suspect you would not feel the time would have been a waste, but maybe I'm wrong. I'm also happy to stipulate I could've included more information in the initial issue.

I was put out by the reply because I submitted the issue in earnest. Of course you have every right to react however you want to the blog/paper. I'm not the author, however, I think your indignation might be more rightly directed at him/them. But whatever, it's water under the bridge.

On the substantive issue, I agree the aggressiveness of the blog (i.e., blaming the software not the modeler) isn't helpful. But I'm less convinced as a general matter that packages can't be useful in highlighting specification issues. For example, grf has loads of specification/id related error messages that seem blandly helpful. Obviously the software can't always bail users out, but sometimes it can. In this specific case, I'm going to resist the temptation to reopen the can of worms. If you want to pursue this further, the proofs on 16-17 of the article go further into it (tldr the issue in their view is that the multicollinearity checks can actually mask fixed or near fixed within-case + within-time slopes).

Anyway, thanks for the replies.