sergiocorreia / reghdfe

Linear, IV and GMM Regressions With Any Number of Fixed Effects
http://scorreia.com/software/reghdfe/
MIT License

bug saving the fixed effects #52

Open olidess opened 8 years ago

olidess commented 8 years ago

Dear Sergio,

I have been using the two following syntaxes to estimate and save the FE coefficients

A) reghdfe r, a(i.id i.id#c.rmrf id#c.smb id#c.hml, savefe)

B) reghdfe r, a(i.id##c.(rmrf smb hml),savefe)

Here id is a categorical variable, and all the others are continuous variables.

If my understanding is correct, the two syntaxes should be equivalent, i.e. the FE estimates should be the same. But I find different alphas. The slope estimates are the same ( _hdfe2_slop1 using syntax A = _hdfe1_slop1 using syntax B), but the alpha estimates are not ( hdfe1 using syntax A <> hdfe1 using syntax B).

Note that the correct alpha seems to be obtained with syntax A only. When I do

reghdfe r if id==1, a(i.id i.id#c.rmrf id#c.smb id#c.hml, savefe)

and

reg r rmrf smb hml if id==1

I find that hdfe1 = _cons, which is what I expected. However, when I do

reghdfe r if id==1, a(i.id##c.(rmrf smb hml),savefe)

I find that hdfe1 differs from _cons.

Is this a bug in the command, or am I missing something?

Thank you so much for your help !

Best,

OD

sergiocorreia commented 8 years ago

Hi Oli,

Both syntax A and B should give the same answer for the betas (e.g. if you run "reghdfe y x, a(..)", then the estimate for x should be unchanged). However, there is no guarantee that the individual intercepts and slopes (the alphas) will be the same between syntaxes. The reason for this is that often they can't actually be recovered, as the parameters are not identified.

The working assumption that reghdfe makes for the intercepts is that it returns a variable hdfe1 with mean zero. There is some discussion here: http://scorreia.com/software/reghdfe/faq.html#where-is-the-constant.

For a simple example of why we do this, think of a regression like "reghdfe y x, a(fe1 fe2)". In this case, the two sets of fixed effects are collinear. The usual solution is to drop some of the dummies, but we can't do that because we are demeaning. The areg solution is to add back the constant, but since I can have more than one set of FEs, the question would be which of the two FEs receives the constant.
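The collinearity argument above can be written out explicitly. This is only a sketch: the symbols \alpha_i (level i of fe1), \gamma_j (level j of fe2) and the constant c are notation introduced here for illustration and do not appear in the thread.

```latex
\begin{aligned}
y_{ij} &= x_{ij}\beta + \alpha_i + \gamma_j + \varepsilon_{ij} \\
       &= x_{ij}\beta + (\alpha_i + c) + (\gamma_j - c) + \varepsilon_{ij}
       \qquad \text{for any constant } c .
\end{aligned}
```

Both lines produce the same fitted values and the same \beta, so the data cannot tell the two sets of intercepts apart. Only a normalization (such as a mean-zero intercept variable) pins down c, and different normalizations give different saved FEs.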

All in all, I am including the option to save the FEs because many people seem to use it, but there are a lot of nuances in how to use and interpret them (although your case is quite straightforward).

Perhaps I should allow an option to give the same alphas as regress in a case like yours?

olidess commented 8 years ago

Hi Sergio,

Thank you for your answer (And thank you for the reghdfe command as well ☺, it is great … I love it !!).

I see the problem. But I still find it a bit confusing that the two syntaxes yield exactly the same betas and the same individual slopes (the individual slopes are really identical) but very different individual intercepts. Note that the two syntaxes also lead to the same residuals when using the option residuals as an alternative to savefe. If the residuals, the betas, and the individual slopes are the same, then by construction I would expect the individual intercepts to be the same as well.

Let me elaborate more on this issue. Suppose you want to run an event study with many events and that expected stock returns are generated by the following model:

R = Alpha + Beta1 x Var1 + Beta2 x Var2 + Beta3 x Var3

The traditional approach to computing abnormal returns is to estimate the parameters alpha, beta1, beta2, beta3 by running separate regressions for every event. In Stata, this can be done with the following command:

statsby _b, by(event) : reg R Var1 Var2 Var3

But if you have many events (more than 10000 as is the case for me), this takes forever!

With your command, we can directly estimate all parameters (Alpha, Beta1, Beta2, Beta3) for every event by running

reghdfe R , a( i.event##c.( Var1 Var2 Var3), savefe)

Beta1 for every event is recorded in _hdfe1_slop1, Beta2 in _hdfe1_slop2, and Beta3 in _hdfe1_slop3. I have compared the two approaches and the Betas are indeed the same. The only problem is that the values recorded in __hdfe1 (i.e. the individual intercepts for every event) are not the Alpha estimates obtained with the traditional approach.

Note that this is solved by using the alternative syntax

reghdfe R , a( i.event i.event#c.Var1 i.event#c.Var2 i.event#c.Var3, savefe)

In this case, the individual intercepts for every event recorded in __hdfe1 correspond to the Alpha estimates using the traditional approach.

The problem is that this alternative syntax is not as fast as the first one. The first one is much, much faster (it takes 10 seconds).

One way to get the correct Alpha using the first syntax is to do

reghdfe R , a( i.event##c.( Var1 Var2 Var3), savefe)

reghdfe R , a( i.event##c.( Var1 Var2 Var3)) residuals(resid)

The correct Alpha is then obtained as: R - Resid - Beta1 * Var1 - Beta2 * Var2 - Beta3 * Var3
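The identity behind this workaround can be checked on a single group with base Stata only. The sketch below is an illustration under stated assumptions, not the reghdfe workflow itself: it uses Stata's bundled auto dataset as stand-in data, with price in the role of R, weight and length in the role of Var1 and Var2, and foreign == 1 in the role of one event.

```stata
* Stand-in data (an assumption for illustration): price plays the role of R,
* weight and length the role of Var1 and Var2, foreign == 1 is one "event".
sysuse auto, clear

* The separate regression that the traditional approach runs for this group.
regress price weight length if foreign == 1

* Residuals for the estimation sample only.
predict double resid if e(sample), residuals

* Back out the intercept row by row: R - Resid - Beta1*Var1 - Beta2*Var2.
generate double alpha = price - resid - _b[weight]*weight - _b[length]*length if e(sample)

* alpha is constant within the group and matches _cons up to rounding.
summarize alpha
display _b[_cons]
assert reldif(alpha, _b[_cons]) < 1e-8 if e(sample)
```

Because fitted values equal R minus the residuals, subtracting the slope terms leaves exactly the group's intercept. The same arithmetic applies per event once the slopes and residuals from the two reghdfe runs are in memory.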

I see the problem you have when there are multiple FEs, but is there a way to get the same individual intercepts estimates across all syntaxes when there is only one set of FE?

Thank you so much !

Olivier

PS: Again, your command is great. I really like it.

sergiocorreia commented 8 years ago

I agree that it is confusing. All in all, in this case the entire difference lies in the constant, but adding it back by default is usually quite messy so I chose not to (although in previous versions of reghdfe there was a reported _cons coefficient).
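Taking the statement above at face value, the puzzle of identical residuals and slopes but different intercepts can be sketched as follows. The symbols \tilde{\alpha}_i (the saved, normalized intercept) and c (the constant that is not reported) are notation introduced here for illustration, not output names from reghdfe.

```latex
\begin{aligned}
R_{it} &= \alpha_i + \sum_{k=1}^{3} \beta_{ik}\,\mathit{Var}_{k,it} + e_{it} \\
       &= c + \tilde{\alpha}_i + \sum_{k=1}^{3} \beta_{ik}\,\mathit{Var}_{k,it} + e_{it},
       \qquad \tilde{\alpha}_i = \alpha_i - c .
\end{aligned}
```

The slopes \beta_{ik} and the residuals e_{it} are the same in both lines. The saved intercepts differ from the separate-regression alphas only by the single constant c, which is why R - Resid - Beta1 * Var1 - Beta2 * Var2 - Beta3 * Var3 recovers \alpha_i = c + \tilde{\alpha}_i directly.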

As a solution, perhaps we could have a suboption that adds back the constant to the first set of intercepts?

Something like reghdfe R , a(i.event##c.( Var1 Var2 Var3), savefe keepconstant)

It would basically do what you are currently doing with the residuals() option but behind the scenes...

Let me know if your current workflow works with the residuals() workaround, and if so I'll try to add it for the next version of reghdfe (~ 1 month or so).

olidess commented 8 years ago

Ok ! Thanks!

Olivier
