lrberge / fixest

Fixed-effects estimations
https://lrberge.github.io/fixest/
Unexpected `call_env` / `summary_flags` behaviour in model objects when called within a function. #514

Open aaronrudkin opened 1 week ago

aaronrudkin commented 1 week ago

I am running models on very large data sets (20-odd million observations, 1000-odd variables counting FEs). We want to store the fitted model objects so that we can re-generate various outputs (etable, coefplot, etc.) on demand. We run the models with clustered SEs and store the lean versions of the models in rds files for use on other machines.

I am running into an unusual case where the fitted model objects end up enormous when the models are estimated inside a function call.

Here's an example. It's not quite a reprex, because you need the data, but I think it at least gives an intuition:

# big_dat is a ~300MB subset of my data
# big_formula is a DV ~ var1 + var2 + var3 | fevar1 + fevar2 style formula
# clustvar is the clustering variable

# Correct behavior:
test_obj <- feols(
  big_formula,
  lean = TRUE,
  cluster = ~ clustvar,
  data = big_dat
)

# Incorrect behavior:
test_func <- function(x) {
  feols(
    big_formula,
    lean = TRUE,
    cluster = ~ clustvar,
    data = x
  )
}
test_obj2 <- test_func(big_dat)

# Result
pryr::object_size(test_obj)
# 123.02 kB
pryr::object_size(test_obj2)
# 334.78 MB

This is not a misfire from pryr; attempts to serialize the resulting objects to a file reflect the same size disparity. I introspected the objects to figure out where the disconnect is:

for(obj in ls(test_obj)) {
  size1 <- pryr::object_size(test_obj[[obj]])
  size2 <- pryr::object_size(test_obj2[[obj]])
  if(size1 != size2) {
    print(paste0(obj, ": ", size1, " / ", size2))
  }
}
# [1] "call: 5264 / 4984"
# [1] "call_env: 336 / 334654480"
# [1] "summary_flags: 840 / 334654872"

The discrepancy in call is clearly just due to the data argument being shorter (x instead of big_dat). Let's not worry about that. But as you can see, summary_flags and call_env are both carrying around the full environment of the call, including the data, even with lean = TRUE given as an argument.
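The size difference can be reproduced in base R without fixest. The sketch below assumes (this is a guess at the cause, not something confirmed from the fixest source) that call_env is the frame feols was called from, and that the cluster formula stored in summary_flags points at that same frame. The function name capture_env is made up for illustration.

```r
# Base-R illustration: an environment captured inside a function keeps
# every local variable reachable, including the data argument.
capture_env <- function(x) {
  force(x)       # evaluate the argument, as a model fit would
  environment()  # the function's own frame, which now holds `x`
}

big_vec <- numeric(1e6)  # roughly 8 MB of doubles
env_inside <- capture_env(big_vec)

# Serialising the function frame writes out its contents as well.
size_inside <- length(serialize(env_inside, NULL))

# The global environment is written as a reference marker only.
size_global <- length(serialize(globalenv(), NULL))

size_inside > 8e6   # the vector travels with the environment
size_global < 1000  # only a small marker is written
```

This would explain why the top-level call stays small: there the calling frame is the global environment, which is never serialised in full.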

I can solve the problem by NULLing out these two elements before serializing, and doing so doesn't seem to cause any unexpected downstream issues. I assume this is an oversight.
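For reference, the workaround looks roughly like the sketch below. strip_envs and make_fake_model are hypothetical names, and the fake model is only a stand-in list that mimics the two offending elements so the snippet runs without fixest or my data. On a real fit the same helper would be applied to the feols result before saveRDS.

```r
# Hypothetical helper: remove the elements that drag the calling frame along.
strip_envs <- function(model, fields = c("call_env", "summary_flags")) {
  for (f in fields) {
    model[[f]] <- NULL  # no-op if the element does not exist
  }
  model
}

# Stand-in for a model fitted inside a function (not a real fixest object).
make_fake_model <- function(x) {
  force(x)
  structure(
    list(
      coefficients = c(a = 1),
      call_env = environment(),                    # frame holding `x`
      summary_flags = list(cluster = ~ clustvar)   # formula bound to that frame
    ),
    class = "fake_model"
  )
}

fake <- make_fake_model(numeric(1e6))
size_before <- length(serialize(fake, NULL))
size_after <- length(serialize(strip_envs(fake), NULL))

size_before > 8e6  # the data is carried along
size_after < 1e4   # only the small elements remain
```

The helper returns a modified copy, so the in-memory object is left untouched and only the stripped copy is written to disk.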

Suggested fix: have lean = TRUE drop the environment from the result object.