meadover commented 4 years ago

With Stata I am estimating a structural equation model (see help sem) with 200,000 observations. I would like to include fixed effects. When I include categorical variables on the right-hand side to estimate those fixed effects, convergence becomes a problem. I find I can reach convergence by first "demeaning" the variables using Fernando Rios-Avila's Stata program -itercenter- (see Stata Journal, volume 15, number 3: st0409). However, -itercenter- and Rios-Avila's -regxfe-, which uses it, seem to be quite a bit slower than -reghdfe-.

Is it possible to use -reghdfe- to demean the right-hand-side variables and then save them for use in another linear estimation such as sem?

sergiocorreia commented 4 years ago

Hi Mead,

You can do it in Mata easily, which is essentially what reghdfe does (via the fixed_effects() class):

cls
sysuse auto, clear
mata: HDFE = fixed_effects("turn trunk")
mata: data = HDFE.partial_out("price weight gear_ratio")
mata: st_store(HDFE.sample, st_addvar("double", tokens("R_price R_weight R_gear_ratio")), data)

* Now we can compare the results with doing it directly. Point values are the same but of course SEs are not.
reg R_*
reghdfe price weight gear_ratio, a(turn trunk)

For more info, check help reghdfe_mata

meadover commented 4 years ago

Wow, I had not appreciated how elegantly you had used Mata to implement -reghdfe-.

To get the correct standard errors, I suppose I could:

1. demean the data as you suggest
2. estimate my SEM model on the demeaned data
3. store the coefficient estimates as a Stata matrix with mat def eb = e(b)
4. estimate the SEM model on the raw data, adding categorical variables for the fixed effects and using SEM's option -from(eb, skip)- so that maximization starts from the correct coefficients

sergiocorreia commented 4 years ago

That strategy makes sense, but I'm not sure how -sem- will deal with the initial values for the dummies. If you only have one level of fixed effects then it should be easy (e.g. in OLS you know that the fixed effects are y-xb-resid, so you can back them out).

Alternatively, maybe you can run a SEM equivalent of bootstrap, which would avoid the issue altogether?

meadover commented 4 years ago

Thanks for the bootstrapping suggestion, Sergio.

By estimating the coefficients on the demeaned data, I will have the "correct" starting values for the coefficients, which should be close to the global maximum in each bootstrap sample. In this situation, provided the data are not too ill-conditioned, perhaps bootstrapping will not be too time-consuming.

If I wanted to code an analytical solution for the standard errors, could I use part of your code? Can you recommend the best exposition of the math for that?

Mead

sergiocorreia commented 4 years ago

> If I wanted to code an analytical solution for the standard errors, could I use part of your code? Can you recommend the best exposition of the math for that?

What's the formula for SEs in SEM?

The part that computes degrees-of-freedom lost is also modular, as well as the one that computes standard/robust/cluster standard errors. But I'm not an expert at SEM, so I don't know if there are other formulas used.

meadover commented 4 years ago

Thanks, Sergio. I'll look at the formulas for the SEs in SEM.

The SEM I'm estimating is a particularly simple member of the SEM family called a MIMIC model. It's just a linear factor analysis model connected through a single continuous latent variable to an OLS regression model. The Stata syntax is:

sem (y1-yM <-L) (L <- x1-xK d1-dJ)

where L is the latent variable, y1-yM are the M indicators, x1-xK are the K independent variables and d1-dJ are the J categorical variables which define the fixed effects.
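For reference, the model behind this syntax can be written out as a sketch. The symbols below (intercepts \alpha_m, loadings \lambda_m, coefficients \beta_k and \delta_j, errors \varepsilon_m and \zeta) are notation assumed here for illustration; they do not appear in the post, and identification constraints are left out.

```latex
\begin{aligned}
y_m &= \alpha_m + \lambda_m L + \varepsilon_m, \qquad m = 1, \dots, M \\
L   &= \sum_{k=1}^{K} \beta_k x_k + \sum_{j=1}^{J} \delta_j d_j + \zeta
\end{aligned}
```

The first line is the factor-analysis (measurement) part; the second is the regression of the latent variable on the observed causes and the fixed-effect dummies.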

Using your favorite example data set, auto.dta, perhaps we could think of causes and indicators of a latent variable called "luxury". Perhaps there are unobservable variables affecting luxury that are associated with the repair record, rep78, and whether the car is foreign or domestic. So a MIMIC model capturing these ideas, and using your program -hdfe- from SSC, might be estimated as follows:

sysuse auto, clear
sem (weight length headroom <- L) (L <- gear_ratio turn mpg price)
est store wofe
hdfe weight length headroom gear_ratio turn mpg price, clear absorb(rep78 foreign)
sem (weight length headroom <- L) (L <- gear_ratio turn mpg price), noconstant
est store demeaned
est tab wofe demeaned, t

I understand that the second set of estimates, which I name -demeaned-, has incorrect SEs because the demeaning has "used up" 6 degrees of freedom. If -reghdfe- essentially inflates the SEs estimated on the demeaned data by the ratio of the "true" DF to the "apparent" DF, I conjecture that this might work for the MIMIC model. (For example, the SEs of the -demeaned- model might all be inflated by the ratio 11/5.) I could try it by comparing a bootstrapped estimate of the SEs with SEs that have been inflated in this way. Not an analytic proof, of course.
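For plain OLS with a single absorbed fixed effect, the degrees-of-freedom adjustment can be checked directly. The sketch below uses only built-in commands (-egen-, -regress-, -areg-) rather than -hdfe- or -reghdfe-, and the variable and scalar names (mean_*, dm_*, se_naive, se_fe, se_rescaled) are made up for illustration. In this OLS case the naive SE is scaled by the square root of the ratio of apparent to true residual degrees of freedom. Whether the same scaling carries over to a MIMIC model estimated by maximum likelihood is the open conjecture, and this sketch does not test it.

```stata
sysuse auto, clear
keep if !missing(rep78)          // rep78 has missing values in auto.dta

* Demean price and weight within rep78 by hand (one fixed effect only)
foreach v of varlist price weight {
    egen double mean_`v' = mean(`v'), by(rep78)
    generate double dm_`v' = `v' - mean_`v'
}

* Naive fit on the demeaned data: its residual df ignores the absorbed group means
regress dm_price dm_weight, noconstant
scalar se_naive = _se[dm_weight]
scalar df_naive = e(df_r)

* Reference fit that accounts for the absorbed fixed effect
areg price weight, absorb(rep78)
scalar se_fe = _se[weight]
scalar df_fe = e(df_r)

* Rescale the naive SE by the square root of the df ratio and compare
scalar se_rescaled = se_naive * sqrt(df_naive / df_fe)
display "naive SE:    " se_naive
display "rescaled SE: " se_rescaled
display "areg SE:     " se_fe
```

The point estimates of the two fits coincide; only the residual degrees of freedom, and hence the SEs, differ.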

By the way, what would be the advantages of my using your suggested Mata code over using your previously contributed Stata program, -hdfe-, as I do above? I kind of like the idea of using your SSC program so I can cite it and the paper it's based on, which I find here http://scorreia.com/research/hdfe.pdf, rather than citing a subroutine you have embedded in -reghdfe- (if that's what the Mata program HDFE is).

Mead

meadover commented 4 years ago

Sergio,

What would be the advantages of my using your suggested Mata code over using your previously contributed Stata program, -hdfe-, as I do above? I kind of like the idea of using your SSC program -hdfe- so I can cite it and the paper it's based on, which I find here http://scorreia.com/research/hdfe.pdf, rather than citing a subroutine you have embedded in -reghdfe- (if that's what the Mata program HDFE is).

But is -hdfe- up to date?

Mead

sergiocorreia commented 4 years ago

Hi Mead,

Sorry for the long delay. I haven't really updated hdfe in a long time (it's from the pre-Mata days).

That said, it should be relatively straightforward to write an updated version based on the Mata code above. What would you need?

A way to partial out variables and replace them?
A way to compute degrees-of-freedom lost due to the fixed effects? (is this useful or do you plan on running a bootstrap?)

Because you will be partialling out a lot, the usual approach done by hdfe.ado is not as good (as it recreates the HDFE object every time), so something better would be:

hdfe init <fixed effect variables>
hdfe partial <varlist>
sem ...
hdfe partial <varlist>
...

Does this make sense?

meadover commented 4 years ago

Thanks for getting back, Sergio. I am continuing to work on this problem and therefore continuing to read the FE and VA literature. Fascinating stuff. And I am beginning to appreciate the difficulties involved in recovering the values of fixed effects and their standard errors after estimation on partialed data. Clearly the bootstrapping strategy is the surest and most defensible way to get the SEs for an estimator other than OLS. But it then becomes essential that each replication be as fast as possible, which always brings me back to -hdfe- for partialing out the means.

Your proposed revision of -hdfe-

Since the current version of -hdfe- does not have subcommands called -init- and -partial-, I think you are proposing to update your -hdfe- for this purpose. That would be really super and I think would help others who would like the power of -hdfe- as a stand-alone capability. If you are going to modify -hdfe- in this way, may I suggest you also consider adding the option(s) to add back the global means to the demeaned variables. I can imagine it might be useful for the user to have the option to add back the global mean of only the dependent variable(s), only the explanatory variables, or both. This is trivial to do outside your program, but I think it would be faster as well as more convenient if these options were part of your -hdfe-. (I believe that this is the purpose of Fernando Rios-Avila’s -mean- option in his Stata program -itercenter-.) Providing “a way to compute degrees-of-freedom lost due to the fixed effects” would also be really nice, in case of need.

Would options -init- and -partial- help to bootstrap -sem-?

With respect to my own application, in which I would like to bootstrap an -sem- estimation on partialed data, alternative strategies are:

(a) first partial out the means using -hdfe-, then bootstrap a small -ado- program consisting of (i) a call to -sem- to estimate the coefficients on the partialed data, followed by (ii) recovering the point estimates of the FE coefficients.
(b) bootstrap a small -ado- program consisting of (i) a call to -hdfe- to partial the data, followed by (ii) a call to -sem- to estimate the coefficients on the partialed data and then by (iii) recovering the point estimates of the FE coefficients.

Since strategy (a) maintains the global sample means for all bootstrap-generated randomly drawn sub-samples, I suspect that strategy (a) would produce inconsistent standard errors, which could be shown to be biased towards zero (i.e. too small). My intuition is that, by preventing re-computation of the partialed data on each replication, FE coefficient estimates from strategy (a) would vary less over replications than they would using strategy (b). So strategy (a) would be a way to inflate t-statistics unjustifiably. Do you agree?

Thus, I suspect that the desirable asymptotic properties of bootstrapping would only apply to strategy (b). Am I correct?
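The two strategies can be set up side by side as a rough sketch. To keep it runnable with built-in commands only, -regress- stands in for -sem- and a by-hand demeaning within a single fixed effect (rep78) stands in for -hdfe-; both substitutions are assumptions made here, as are the program name demean_fit and the other names. The sketch only shows the mechanics of each design and does not settle which one yields valid standard errors.

```stata
sysuse auto, clear
keep if !missing(rep78)

* Strategy (a): partial out once, then resample the already-demeaned data
foreach v of varlist price weight {
    egen double mean_`v' = mean(`v'), by(rep78)
    generate double dm_`v' = `v' - mean_`v'
}
bootstrap b_a = _b[dm_weight], reps(200) seed(12345) nodots: ///
    regress dm_price dm_weight, noconstant
scalar se_a = _se[b_a]

* Strategy (b): re-demean inside every replication
* (demean_fit is a hypothetical helper; -regress- is a stand-in for -sem-)
capture program drop demean_fit
program define demean_fit, rclass
    tempvar my mx dy dx
    egen double `my' = mean(price), by(rep78)
    egen double `mx' = mean(weight), by(rep78)
    generate double `dy' = price - `my'
    generate double `dx' = weight - `mx'
    regress `dy' `dx', noconstant
    return scalar b = _b[`dx']
end
bootstrap b_b = r(b), reps(200) seed(12345) nodots: demean_fit
scalar se_b = _se[b_b]

display "bootstrap SE, strategy (a): " se_a
display "bootstrap SE, strategy (b): " se_b
```

In strategy (b) the group means are recomputed on each resampled data set, which is the step strategy (a) skips.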

I’m not sure what your proposed new subcommands for -hdfe-, -init- and -partial-, would do. Would they help with strategy (b)? Or perhaps only with strategy (a)?

Mead

meadover commented 2 years ago

Sergio, do you have any interest in producing an updated version of your stand-alone -hdfe- package such as you suggested in your post above on July 22, 2020? I think it would find interest as a general tool.

sergiocorreia commented 2 years ago

Actually, this is now supported within reghdfe v6, from earlier this year!

First, see help reghdfe_programming (which is still a WIP).

Moreover, the following code should work as a proof-of-concept:

clear all
cls

sysuse auto
qui include "reghdfe.mata", adopath
reghdfe price weight length, absorb(turn trunk) noregress
mata: HDFE.solution.data

meadover commented 2 years ago

I’m now absorbing your -help reghdfe_programming- page. Thanks for that page. Very helpful.

sergiocorreia / reghdfe

Option to save the demeaned variables #208
