[Open] tatyanaderyugina opened this issue 6 years ago
Hi Tatyana,
My guess is that you are running this on a laptop with not enough memory. The error occurred when creating the matrix of regressors in Mata (a 2284668 x 376 matrix, as you noted). That takes 2284668 × 376 × 8 / 2^30 ≈ 6.4 GiB, which is not that much, but will still fail on most laptops.
Now, the problem is that at one point you will have one copy of the dataset in Stata and another copy in Mata, so you would be doubling the space used.
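For reference, the back-of-the-envelope calculation above can be reproduced in a few lines. This is just an illustrative Python sketch of the arithmetic, not code from reghdfe itself:

```python
# Approximate memory needed to hold the regressor matrix in Mata,
# assuming 8-byte (double) storage per cell.
def matrix_gib(rows, cols, bytes_per_cell=8):
    return rows * cols * bytes_per_cell / 2**30

mem = matrix_gib(2_284_668, 376)
print(f"{mem:.1f} GiB")  # roughly 6.4 GiB for the regressor matrix alone
```

And since the data is temporarily held both in Stata and in Mata, peak usage is roughly double that (~12.8 GiB), which is what `compact` avoids.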
The fix to this is from a commit a few weeks old: https://github.com/sergiocorreia/reghdfe/commit/c1c3fb1ab7798eb37ba4649f704fdef34d5b3cc1
Namely, try running the latest version with something like `reghdfe ... , absorb(...) compact poolsize(1)` and it should work. However, `poolsize(1)` is going to be quite slow, because the 376 variables will be loaded one at a time. So I would try to run `poolsize(20)` or something similar, and compare against that.
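To get a feel for the speed/memory trade-off that `poolsize()` controls, here is a rough sketch of the extra Mata memory needed per pool when loading k variables at a time. This is my own approximation for illustration, not reghdfe's actual memory accounting:

```python
# Extra memory for one pool of `poolsize` variables held in Mata at once,
# assuming 8-byte doubles. Smaller pools use less memory but need more passes.
def pool_gib(n_obs, poolsize, bytes_per_cell=8):
    return n_obs * poolsize * bytes_per_cell / 2**30

n = 2_284_668  # observations in the regression above
for k in (1, 20, 376):
    print(f"poolsize({k}): ~{pool_gib(n, k):.2f} GiB per pool")
```

With ~2.3 million observations, `poolsize(20)` keeps each pool well under half a GiB while cutting the number of passes from 376 to about 19.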
Best, S
Thanks, Sergio, I will try that. But for the record, I am running this on a desktop!
Hi Sergio,
The compact option resolved my problem with reghdfe. But I am also having the same problem with ivreghdfe, which we use on the same data, and it says "option compact not allowed". Any chance you will incorporate the compact option into ivreghdfe in the near future?
Tatyana
Interesting. If you type `set trace on` and `set tracedepth 2`, and then run ivreghdfe, at what point does it give the error?
Looks like it's right after a "syntax" line. Here's the log file (if you search for "option compact not allowed", you'll see the line).
Since ivreghdfe is just ivreg2 with an absorb option, it does not have the `compact` option.
That said, if it is possible with your dataset to run the same command with `ivreg2` (except for the absorb() part), then there might be a way to add `compact` without too many problems. If instead `ivreg2` also runs out of memory, then the issue would be way harder to fix.
Might be worth mentioning this (more) explicitly in the help file. Using `compact` turned a too-big-for-memory regression into an easily managed one for me. Maybe also add it to the 3900 error message?
I was trying to run a test to see how much faster reghdfe 5.2.2 is than 3.2.9 (the latest SSC version), but discovered that I'm unable to run 5.2.2 at all because of what looks like a memory issue. Here's the error:
The regression specification looks like this:

```stata
reghdfe PM25_conc i.poll_cluster#ib0.ang_range, ///
    absorb(i.weather i.F1weather i.F2weather i.county_fips i.month#i.year ///
        i.state_fips#i.month i.poll_cluster#i.L1ang_range i.poll_cluster#i.L2ang_range ///
        i.poll_cluster#i.F1ang_range i.poll_cluster#i.F2ang_range) ///
    vce(cluster county_fips)
```
So there are many fixed-effect dimensions, and some of them are high-dimensional: poll_cluster has almost 100 distinct values and ang_range has 3. Here's the relevant output from 3.2.9 for the fixed effects:
```
Absorbed degrees of freedom:
---------------------------------------------------------------------------+
 Absorbed FE              |  Num. Coefs.  =  Categories  -  Redundant      |
--------------------------+------------------------------------------------|
 weather                  |         9525          9525            0        |
 F1weather                |         9492          9493            1      ? |
 F2weather                |         9525          9526            1      ? |
 county_fips              |            0           797          797      * |
 month#year               |          179           180            1      ? |
 state_fips#month         |          576           588           12      ? |
 poll_cluster#L1ang_range |          375           376            1      ? |
 poll_cluster#L2ang_range |          282           376           94      ? |
 poll_cluster#F1ang_range |          282           376           94      ? |
 poll_cluster#F2ang_range |          282           376           94      ? |
---------------------------------------------------------------------------+
```
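As a quick sanity check on the table above, the absorbed coefficients can be tallied up. This is just arithmetic on the reported Num. Coefs. column, written out as a Python sketch:

```python
# Num. Coefs. per absorbed fixed effect, copied from the 3.2.9 output above.
num_coefs = {
    "weather": 9525, "F1weather": 9492, "F2weather": 9525,
    "county_fips": 0, "month#year": 179, "state_fips#month": 576,
    "poll_cluster#L1ang_range": 375, "poll_cluster#L2ang_range": 282,
    "poll_cluster#F1ang_range": 282, "poll_cluster#F2ang_range": 282,
}
print(sum(num_coefs.values()))  # 30518 absorbed coefficients in total
```

So the solver is absorbing over 30,000 coefficients on top of the regressors of interest, which is why the Mata matrices get so large.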
The regression has about 2.3 million observations. What I'm wondering is why 3.2.9 can handle this regression but 5.2.2 does not seem to be able to.