reghdfe 5.2.2 versus 3.2.9

tatyanaderyugina commented 6 years ago

I was trying to run a test to see how much faster reghdfe 5.2.2 is than 3.2.9 (the latest SSC version), but discovered that I'm unable to run 5.2.2 at all because of what looks like a memory issue. Here's the error:

     Factor::sort():  3900  unable to allocate real <tmp>[2284668,376]
FixedEffects::project_one_fe():     -  function returned error
transform_sym_kaczmarz():     -  function returned error
     accelerate_cg():     -  function returned error
FixedEffects::_partial_out():     -  function returned error
FixedEffects::partial_out():     -  function returned error
             <istmt>:     -  function returned error

The regression specification looks like this:

    reghdfe PM25_conc i.poll_cluster#ib0.ang_range, ///
        absorb(i.weather i.F1weather i.F2weather i.county_fips i.month#i.year i.state_fips#i.month ///
            i.poll_cluster#i.L1ang_range i.poll_cluster#i.L2ang_range ///
            i.poll_cluster#i.F1ang_range i.poll_cluster#i.F2ang_range) ///
        vce(cluster county_fips)

So there are many dimensions of fixed effects and some are many-dimensional. poll_cluster has almost 100 distinct values and ang_range has 3. Here's the relevant output from 3.2.9 for the fixed effects:

Absorbed degrees of freedom:
-------------------------------------------------------------------+
Absorbed FE               | Num. Coefs. = Categories - Redundant   |
--------------------------+----------------------------------------|
weather                   |        9525         9525           0   |
F1weather                 |        9492         9493           1   |
F2weather                 |        9525         9526           1 ? |
county_fips               |           0          797         797 * |
month#year                |         179          180           1 ? |
state_fips#month          |         576          588          12 ? |
poll_cluster#L1ang_range  |         375          376           1 ? |
poll_cluster#L2ang_range  |         282          376          94 ? |
poll_cluster#F1ang_range  |         282          376          94 ? |
poll_cluster#F2ang_range  |         282          376          94 ? |
-------------------------------------------------------------------+

The regression has about 2.3 million observations. What I'm wondering is why 3.2.9 can handle this regression but 5.2.2 does not seem to be able to.

sergiocorreia commented 6 years ago

Hi Tatyana,

My guess is that you are running this on a laptop without enough memory. The error occurred when creating the matrix of regressors in Mata (a 2284668 x 376 matrix, as you noted). Stored as 8-byte doubles, that takes 2284668 * 376 * 8 / 2^30 = 6.4 GB, which is not that much but will still fail on most laptops.

Now, the problem is that at one point you will have one copy of the dataset in Stata and another copy in Mata, so you would be doubling the space used.
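
The arithmetic above can be checked directly in Stata. This is only a rough sketch: it assumes every entry is an 8-byte double, ignores all other overhead, and treats the "doubling" as a second copy of the same size, which is an approximation rather than a measurement.

```stata
* Size in GiB of a 2284668 x 376 matrix of 8-byte doubles
display 2284668 * 376 * 8 / 2^30

* Rough upper bound if a second copy of the same size exists at the same time
display 2 * 2284668 * 376 * 8 / 2^30
```

The two results come out at about 6.4 and 12.8, so a machine with 8 GB of RAM would fail on the first allocation, and one with 16 GB could still be close to its limit.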

The fix to this is from a commit a few weeks old: https://github.com/sergiocorreia/reghdfe/commit/c1c3fb1ab7798eb37ba4649f704fdef34d5b3cc1

Namely, try to run the latest Stata with something like reghdfe ... , absorb(...) compact poolsize(1) and it should work.

However, poolsize(1) is going to be quite slow, because the 376 variables will be loaded one at a time. So I would try to run poolsize(20) or something similar, and compare against that.
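
One way to compare pool sizes is with Stata's timers. The sketch below assumes a reghdfe version that already includes the compact and poolsize() options, and it uses the built-in auto dataset purely as a runnable stand-in. On data this small the timings will not differ in any meaningful way; substitute your own specification to see the actual trade-off.

```stata
* Stand-in data; replace with your own dataset and specification
sysuse auto, clear

timer clear

* Lowest memory use: variables are processed one at a time
timer on 1
reghdfe price weight length, absorb(rep78) compact poolsize(1)
timer off 1

* Larger pool: more memory, fewer passes
timer on 2
reghdfe price weight length, absorb(rep78) compact poolsize(20)
timer off 2

* Show elapsed seconds for each run
timer list
```

If the larger pool still fits in memory, it should be the faster of the two; if it fails with the same 3900 error, step the pool size down until it runs.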

Best, S

tatyanaderyugina commented 6 years ago

Thanks, Sergio, I will try that. But I am running this on a desktop for the record!

tatyanaderyugina commented 6 years ago

Hi Sergio,

The compact option resolved my problem with reghdfe. But I am also having the same problem with ivreghdfe, which we use on the same data, and it says "option compact not allowed". Any chance you will incorporate the compact option into ivreghdfe in the near future?

Tatyana

sergiocorreia commented 6 years ago

Interesting. If you type "set trace on" and "set tracedepth 2" and then run ivreghdfe, at what point does it give the error?
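
A minimal sketch of that tracing session, written to a log so the failing line can be searched for afterwards. The log file name is a hypothetical placeholder, and the ivreghdfe call itself is left for you to fill in.

```stata
* Hypothetical log file name; change as needed
log using ivreghdfe_trace.log, text replace

set trace on
set tracedepth 2

* Run the failing ivreghdfe command here, exactly as before

set trace off
log close
```

Searching the log for the error text then shows which line of which program raised it.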

tatyanaderyugina commented 6 years ago

Looks like it's right after a "syntax" line. Here's the log file (if you search for "option compact not allowed", you'll see the line).

Reghdfe_compact_error.log

sergiocorreia commented 6 years ago

Since ivreghdfe is just ivreg2 with an absorb option, it does not have the compact option.

That said, if it is possible with your dataset to run the same command with ivreg2 (except for the absorb() part), then there might be a way to add compact without too many problems. If instead ivreg2 also runs out of memory, then the issue would be way harder to fix.

Danferno commented 3 years ago

> Hi Tatyana,
>
> My guess is that you are running this on a laptop without enough memory. The error occurred when creating the matrix of regressors in Mata (a 2284668 x 376 matrix, as you noted). Stored as 8-byte doubles, that takes 2284668 * 376 * 8 / 2^30 = 6.4 GB, which is not that much but will still fail on most laptops.
>
> Now, the problem is that at one point you will have one copy of the dataset in Stata and another copy in Mata, so you would be doubling the space used.
>
> The fix to this is from a commit a few weeks old: c1c3fb1
>
> Namely, try to run the latest Stata with something like reghdfe ... , absorb(...) compact poolsize(1) and it should work.
>
> However, poolsize(1) is going to be quite slow, because the 376 variables will be loaded one at a time. So I would try to run poolsize(20) or something similar, and compare against that.
>
> Best, S

Might be worth mentioning this (more) explicitly in the help file. Using compact turned a too-big-for-memory regression into an easily managed one for me. Maybe also add to the 3900 error message?

sergiocorreia / reghdfe

reghdfe 5.2.2 versus 3.2.9 #139