I am running models on enormous amounts of data (20-odd million observations, 1000-odd variables counting FEs). We want to store fit model objects so that we can dynamically re-run various outputs (etable, coefplot, etc.). We run the models with clustered SEs and store the lean versions of the models in rds files for use on other machines.
I am running into an unusual case where the fit model objects end up enormous when the models are assembled inside function calls.
Here's an example. It's not quite a reprex, because you need the data, but I think it at least gives an intuition:
library(fixest)

# big_dat is a ~300MB subset of my data
# big_formula is a DV ~ var1 + var2 + var3 | fevar1 + fevar2 style formula
# clustvar is the clustering variable
# Correct behavior:
test_obj <- feols(
  big_formula,
  lean = TRUE,
  cluster = ~ clustvar,
  data = big_dat
)
# Incorrect behavior:
test_func <- function(x) {
  feols(
    big_formula,
    lean = TRUE,
    cluster = ~ clustvar,
    data = x
  )
}
test_obj2 <- test_func(big_dat)
# Result
pryr::object_size(test_obj)
# 123.02 kB
pryr::object_size(test_obj2)
# 334.78 MB
This is not a misfire from pryr; attempts to serialize the resulting objects to a file reflect the same size disparity. I introspected the objects to figure out where the disconnect is:
The discrepancy in call is clearly due to the length of the data argument being shrunk. Let's not worry about that. But as you can see, the summary_flags and the call_env are both carrying around the full environment of the call, even with lean = TRUE given as an argument.
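For anyone without the data, here is a minimal base-R sketch of the mechanism I suspect is at work. This is an assumption on my part, not something confirmed from the fixest source: a formula such as ~ clustvar created inside a function keeps that function's frame as its environment, and the frame holds the data argument. The names make_formula, big_vec, f_global and f_local are made up for illustration, and the script is meant to be run at top level.

```r
# Base R only; no fixest needed. All names here are illustrative.
make_formula <- function(x) {
  force(x)      # evaluate the argument so the data really sits in this frame
  ~ clustvar    # the formula's environment is this function's frame
}

big_vec <- rnorm(1e6)            # about 8 MB of doubles

f_global <- ~ clustvar           # created at top level
f_local  <- make_formula(big_vec)

# Size in bytes once serialized (what saveRDS() writes, before compression)
size_global <- length(serialize(f_global, NULL))
size_local  <- length(serialize(f_local, NULL))

size_global   # small: the global environment is stored as a reference
size_local    # large: the function frame, including x, is written out
```

If the same thing happens to whatever is stored in summary_flags and call_env, it would explain why the model built at top level stays small while the one built inside test_func carries a copy of the data.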
I can solve the problem by NULLing out these objects before serializing, and it doesn't seem to cause any downstream issues I wouldn't expect. I assume this is an oversight.
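Roughly, the workaround looks like the sketch below. The helper name strip_model_env is hypothetical, the field names are the two reported above, and mock_fit is a plain list standing in for a fitted model so the snippet runs without fixest or my data. I have not checked which fixest methods, if any, rely on these fields, so treat it as a stopgap rather than a recommended practice.

```r
# Hypothetical helper; the field names are the ones reported above.
strip_model_env <- function(model, fields = c("call_env", "summary_flags")) {
  for (f in intersect(fields, names(model))) {
    model[[f]] <- NULL   # remove the component; other elements and the class are kept
  }
  model
}

# Mock stand-in for a fitted model (not a real fixest object)
mock_fit <- structure(
  list(
    coefficients = c(var1 = 0.5),
    call_env = local({ payload <- rnorm(1e6); environment() }),
    summary_flags = list(cluster = ~ clustvar)
  ),
  class = "mock_fit"
)

slim_fit <- strip_model_env(mock_fit)

# Serialize the slimmed object as one would for use on another machine
path <- tempfile(fileext = ".rds")
saveRDS(slim_fit, path)
file.size(path)
```

With a real model the call would be along the lines of saveRDS(strip_model_env(test_obj2), path).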
Suggested fix: have lean = TRUE drop the environment from the result object.