sergiocorreia / reghdfe

Linear, IV and GMM Regressions With Any Number of Fixed Effects
http://scorreia.com/software/reghdfe/
MIT License

[BUG] In some cases, baselines for indicators are ignored by reghdfe #210

Closed felixpoege closed 4 years ago

felixpoege commented 4 years ago

Hi Sergio,

here is a little bit of strange behavior in the latest version of reghdfe. There seems to be a workaround, so this is probably low-priority, but I still wanted to file a bug report.

use http://www.stata-press.com/data/r15/nlswork, clear
bys idcode: egen any_union = max(union)
replace any_union = 0 if missing(any_union)

// -> They (wrongly) both have the same output, 88 is omitted.
reghdfe ln_wage ib68.year#c.any_union, a(i.year idcode)
reghdfe ln_wage ib88.year#c.any_union, a(i.year idcode)

// -> Now, the correct baseline is chosen, but too many unnecessary collinearity checks are done
reghdfe ln_wage ib68.year##c.any_union, a(i.year idcode) noomitted

Best, Felix

sergiocorreia commented 4 years ago

Hi Felix,

reghdfe uses Stata internal commands to expand factor interactions (else it would be way too complicated and inconsistent with the rest of Stata) so I was a bit puzzled with your example.

I replicated it with areg and it seems reghdfe works exactly as areg here:

use http://www.stata-press.com/data/r15/nlswork, clear
bys idcode: egen any_union = max(union)
replace any_union = 0 if missing(any_union)
* Reduce number of unique idcodes so areg can actually run
keep if idcode <= 100

areg ln_wage i.idcode ib68.year#c.any_union, a(year) noomitted
areg ln_wage i.idcode ib68.year##c.any_union, a(year) noomitted

reghdfe ln_wage ib68.year#c.any_union, a(year idcode) noomitted v(-1)
reghdfe ln_wage ib68.year##c.any_union, a(year idcode) noomitted v(-1)

This suggests that whatever is going on is not something specific to reghdfe, but part of Stata.

Digging a bit deeper:

1) Single interaction (#)

Let's see how Stata expands ib68.year#c.any_union:

. fvexpand ib68.year#c.any_union
. di "`r(varlist)'"
68b.year#c.any_union 69.year#c.any_union 70.year#c.any_union 71.year#c.any_union 72.year#c.any_union 73.year#c.any_union 75.year#c.any_union 77.year#c.any_union 78.year#c.any_union 80.year#c.any_union 82.year#c.any_union 83.year#c.any_union 85.year#c.any_union 87.year#c.any_union 88.year#c.any_union

Stata does not drop any of the expanded variables (a dropped variable would show an "o" before the dot). This is reasonable: because any_union is treated as continuous, you wouldn't expect these variables to be collinear.

Now, the problem is that due to other variables (either other regressors or fixed effects), the regressor matrix is perfectly collinear. Thus we have to drop collinear variables, but we just drop the first one instead of using the base variable, as by that point reghdfe is just dealing with matrices and not keeping track of the names and details of what is in each column (e.g. what column represents a base variable).
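A minimal sketch of where the collinearity in this particular example comes from, assuming year is never missing in the estimation sample. any_union is built as a within-idcode maximum, so it is constant within idcode and therefore spanned by the idcode fixed effects, while the year-specific interaction columns add up to any_union itself. The variable name inter_sum is only illustrative.

```stata
use http://www.stata-press.com/data/r15/nlswork, clear
bys idcode: egen any_union = max(union)
replace any_union = 0 if missing(any_union)

* any_union is constant within idcode by construction,
* so it is absorbed by the idcode fixed effects
bys idcode (year): assert any_union == any_union[1]

* Each observation belongs to exactly one year, so the
* year-specific interaction columns sum to any_union
generate double inter_sum = 0
levelsof year, local(years)
foreach y of local years {
    replace inter_sum = inter_sum + (year == `y') * any_union
}
assert inter_sum == any_union if !missing(year)
```

If both assertions pass, the full set of interaction columns is linearly dependent once idcode is absorbed, so one of them has to be dropped regardless of which level is labelled as the base.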

2) Double interaction

First note that the varlist ib68.year##c.any_union is equivalent to ib68.year c.any_union ib68.year#c.any_union. Thus, you should see an omitted variable due to perfect collinearity, as the sum of the variables ib68.year#c.any_union is equal to c.any_union.

. fvexpand ib68.year##c.any_union
. di "`r(varlist)'"
68b.year 69.year 70.year 71.year 72.year 73.year 75.year 77.year 78.year 80.year 82.year 83.year 85.year 87.year 88.year any_union 68b.year#co.any_union 69.year#c.any_union 70.year#c.any_union 71.year#c.any_union 72.year#c.any_union 73.year#c.any_union 75.year#c.any_union 77.year#c.any_union 78.year#c.any_union 80.year#c.any_union 82.year#c.any_union 83.year#c.any_union 85.year#c.any_union 87.year#c.any_union 88.year#c.any_union

That is indeed the case, as noted by 68b.year#co.any_union (note the "co.").

Thus, what reghdfe receives is all the variables except 68b.year#co.any_union.

3) Wrapping up

Stata and reghdfe drop variables in two steps. First, they use the base levels to drop variables that are known ex ante to be redundant. Then, once the variables have been partialled out and the linear system is being solved, if any perfectly collinear variables remain, the first of those is dropped.

4) Solution?

If the collinearity is between c.any_union and the interactions, maybe you can hint this to Stata by adding any_union by itself, instead of using double interactions? E.g.:

reghdfe ln_wage any_union ib68.year#c.any_union, a(year idcode) noomitted

Here, Stata will a) drop the base level 68.year#c.any_union, as it knows it will be collinear, and then b) drop any_union at the end due to perfect collinearity.
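One way to compare the two specifications side by side is to store both sets of estimates and tabulate them. This is only a sketch: the names single and hinted are illustrative, and it assumes reghdfe is installed and that the estimates commands behave with it as they do with other estimation commands.

```stata
use http://www.stata-press.com/data/r15/nlswork, clear
bys idcode: egen any_union = max(union)
replace any_union = 0 if missing(any_union)

* Single interaction: the omitted term is picked by the collinearity check
reghdfe ln_wage ib68.year#c.any_union, a(year idcode)
estimates store single

* Adding any_union by itself as a hint
reghdfe ln_wage any_union ib68.year#c.any_union, a(year idcode)
estimates store hinted

* Coefficients side by side, plus sample size and R-squared
estimates table single hinted, b(%9.4f) stats(N r2)
```

Since both specifications span the same set of regressors, one would expect the overall fit to match, with only the reference year for the interaction coefficients differing.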

Hope this helps, and it wasn't too confusing (it was to me, factor variables are hard to understand/implement!)

meadover commented 4 years ago

Factor variables are tricky but oh so powerful. One of the best reasons to use Stata.

Is the moral of the story that one should, as a general rule, avoid all double interactions like i.x##i.y or i.x##c.y? Or, perhaps, that one should always experiment with the expanded version to see if it makes a difference?

Mead


sergiocorreia commented 4 years ago

> Factor variables are tricky but oh so powerful. One of the best reasons to use Stata.

Agreed!

> Is the moral of the story that one should, as a general rule, avoid all double interactions like i.x##i.y or i.x##c.y? Or, perhaps, that one should always experiment with the expanded version to see if it makes a difference?

Experiment, perhaps. Factor variables can't detect collinearity, and the commands that drop collinear variables are not aware of what the base factors are. Until the day this happens (if it happens), let's just remember that Stata will just drop the first perfectly collinear variable.

Also, this might be because c.var is actually i.var, as it only takes the values 0 and 1. Otherwise, it would be unlikely for a continuous variable to be perfectly collinear with the fixed effects.
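A small sketch of what such an experiment could look like before running any regression: expand the factor-variable list with fvexpand and print only the terms that carry a base ("b") or omitted ("o") flag before the dot. The loop and the local macro name terms are illustrative, not part of reghdfe.

```stata
use http://www.stata-press.com/data/r15/nlswork, clear
bys idcode: egen any_union = max(union)
replace any_union = 0 if missing(any_union)

* Expand the varlist and keep the result in a local macro
fvexpand ib68.year##c.any_union
local terms `r(varlist)'

* Print only the terms flagged as base (b.) or omitted (o.)
foreach t of local terms {
    if strpos("`t'", "b.") | strpos("`t'", "o.") {
        display "`t'"
    }
}
```

Repeating this with the single-interaction varlist (ib68.year#c.any_union) shows which terms Stata marks ex ante. Anything dropped beyond that comes from the later collinearity check.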

felixpoege commented 4 years ago

Sergio, thank you for your effort in clearing this up - much clearer now! I see how it is a general difficulty and unrelated to reghdfe as such and therefore definitely 'case closed'.