Closed: ThomasKraft closed this issue 2 years ago.
This may be related to https://github.com/statnet/ergm-private/issues/332 .
My best guess is that this is caused by the use of an offset term that @chad-klumb helped me with as an alternative to using bd().
Is the size substantially lower without the offset, or with the offset specified differently (such as a named vertex attribute rather than a formula)?
@chad-klumb It appears that the method of specifying the edgecov variables is driving major size differences, particularly writing the formula using `edgecov(variable)` versus `edgecov("variable")`, with the latter case stored as a network attribute. At first I was tempted to think that this behavior was a result of the size of the network object itself, but running `pryr::object_size(fit$constraints)` suggests that it is still the constraint that is large.

I am not sure how to specify the offset as a named attribute while still maintaining the current functionality. Do you mean using a nodematch term within offset?
Here is a working example for the size issue:
```r
library(EpiModel)
library(pryr)

num <- 1000
toy <- network_initialize(n = num, directed = FALSE)
toy <- set_vertex_attribute(toy, "com_id", c(rep("1", num/4), rep("2", num/4), rep("3", num/2)))
toy <- set_vertex_attribute(toy, "fid", sample(1:40, num, replace = TRUE))

df <- data.frame(pid = toy %v% "vertex.names",
                 com_id = toy %v% "com_id",
                 fid = toy %v% "fid")

# edge attributes
kin <- matrix(sample(seq(0, .5, .1), num^2, replace = TRUE), num)
kin[lower.tri(kin)] <- t(kin)[lower.tri(kin)]
diag(kin) <- 0

dist <- matrix(sample(1:10000, num^2, replace = TRUE), num)
dist[lower.tri(dist)] <- t(dist)[lower.tri(dist)]
diag(dist) <- 0
dist <- log(dist + 1)

for (i in unique(df$fid)) {
  for (j in unique(df$fid)) {
    dist[which(df$fid == i), which(df$fid == j)] <- rnorm(1, 8, sd = 2)
  }
}
```
```r
fit_test1 <- netest(toy,
  formation = ~edges + nodematch("fid") + edgecov(kin) + edgecov(dist) +
    offset(absdiff(~as.integer(com_id))),
  target.stats = c(30*num/20, 30*num/40, 30*.2*num/20, 7.5*num),
  coef.form = c(-Inf),
  coef.diss = dissolution_coefs(dissolution = ~offset(edges), duration = 3),
  edapprox = TRUE)

toy2 <- toy
toy2 %n% "dist" <- dist
toy2 %n% "kin" <- kin

fit_test2 <- netest(toy2,
  formation = ~edges + nodematch("fid") + edgecov("kin") + edgecov("dist") +
    offset(absdiff(~as.integer(com_id))),
  target.stats = c(30*num/20, 30*num/40, 30*.2*num/20, 7.5*num),
  coef.form = c(-Inf),
  coef.diss = dissolution_coefs(dissolution = ~offset(edges), duration = 3),
  edapprox = TRUE)

object_size(fit_test1$constraints)
object_size(fit_test2$constraints)
```
I forgot to mention: dropping the offset term altogether doesn't seem to reduce the size of the netest object, or the size of `object$constraints`. I find that rather confusing, since the constraints should be empty in that case, no?
Can you please drill down deeper? I.e., `sapply(object$constraints, object_size)`, `sapply(object$constraints$<BIGGEST>, object_size)`, etc.?
The `constraints` should be "trivial" (i.e. `~.`) with or without the offset, because you aren't specifying a `constraints` argument. Even trivial formulas have environments, though, and those environments could have various large objects in them.
It would be interesting to know how the size from `netest` compares with the size from a direct call to `ergm`. There are some environment modifications in `netest` (specifically lines 189 and 211) that affect the `formula` and `constraints` of the `ergm` fit: `environment()` (with no arguments) returns the current evaluation environment.
That is correct: looking within `constraints` simply returns `~.` (as well as the environment `<environment: 0x7fdc40796cc0>`). As such, `sapply(object$constraints, object_size)` doesn't return much information; in this case "56 56", but there aren't really any objects there.
But I think @chad-klumb nailed it regarding an issue with the environment. I've been trying to figure out why I'm getting such variable output sizes of `netest` objects when trying multiple times, and I had almost managed to convince myself that the `control.ergm` parameters were making a big difference. But if you run the example code above (sorry, I fixed the typos there; it runs now) with a clean workspace, `fit_test1` is ~4 MB, whereas `fit_test2` is ~75 MB. `netest` therefore does seem to be storing some of my global environment within `constraints`, but it is not at all obvious why that differs between these cases (which differ only in how the edgecov variables are specified).
Is there a clear best way to reduce my `netest` size? For example, does it make sense to clear my global environment of anything extraneous before running `netest()`, or can I just clear the environment within the `netest` object?
You might also be interested to know that this can cause major issues going forward when using `netsim()`: the memory usage from an oversized `netest` object spirals out of control when running simulations from it.
My bad. I had mixed up `constraint` with `constrained` (which is a list).
To elaborate on what @chad-klumb wrote, a formula created inside a function will carry its environment around. One that is created outside a function will have `globalenv()` as its environment and will not carry it around.
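A minimal base-R sketch of that distinction (the names `make_formula` and `big` are just for illustration):

```r
# Created at top level: the formula's environment is globalenv(),
# which is serialized by reference, not by content.
f_global <- y ~ x

# Created inside a function: the formula captures the function's frame,
# including everything else defined there.
make_formula <- function() {
  big <- rnorm(1e6)  # ~8 MB vector that will ride along with the formula
  y ~ x
}
f_local <- make_formula()

ls(environment(f_local))           # "big" -- the large vector is captured

# The formula looks tiny either way, but serializing it (which is what
# saveRDS() does) drags the captured environment in:
length(serialize(f_global, NULL))  # small
length(serialize(f_local, NULL))   # roughly 8 MB
```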
`statnet.common` has a function `trim_env()`, which replaces an object's environment with one that is empty except for a selected set of variables, and `nonsimp_update.formula()` similarly tries to be more efficient with environments. I've made changes in the statnet/EpiModel@trim_env branch: give them a try.
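For anyone following along, here is a small sketch of what `trim_env()` does (the variable names `big` and `thresh` are made up for illustration):

```r
library(statnet.common)

# A formula that has captured a local environment holding both a value the
# formula actually needs ("thresh") and an unrelated large object ("big"):
f <- local({
  big <- rnorm(1e6)
  thresh <- 5
  y ~ x + offset(thresh)
})
ls(environment(f))                  # "big" "thresh"

# trim_env() swaps in a minimal environment keeping only what we name:
f_trimmed <- trim_env(f, keep = "thresh")
ls(environment(f_trimmed))          # "thresh"

# The trimmed formula no longer drags the big vector through serialization:
length(serialize(f, NULL))          # roughly 8 MB
length(serialize(f_trimmed, NULL))  # small
```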
Thanks @krivit. In running checks, I'm realizing that the example code I posted isn't doing a good job of reproducing the issue. In the example, the network is being stored 3 times in the model fit (`model$fit$network`, `model$fit$newnetwork`, `model$fit$newnetworks`), and the size difference is accounted for by the size of the matrices included as network attributes, multiplied by 3.

In my empirical example that originally generated the question, it is very different (see image below). You can see >10 GB stored under both `formula` and `constraints`. Is there a way that I can see what objects are actually stored within those environments?
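(For reference, one way to list and size the contents of such an environment using only base R; `inspect_formula_env` is just a throwaway helper, not an EpiModel function:)

```r
# List every object captured in a formula's environment, largest first:
inspect_formula_env <- function(f) {
  env <- environment(f)
  vars <- ls(env, all.names = TRUE)
  sizes <- vapply(vars,
                  function(v) as.numeric(object.size(get(v, envir = env))),
                  numeric(1))
  sort(sizes, decreasing = TRUE)
}

# e.g. inspect_formula_env(model$fit$formula) on a real fit; a toy check:
f <- local({ big <- rnorm(1e5); y ~ x })
names(inspect_formula_env(f))  # "big"
```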
Once I'm done making sure I can reproduce the issue I'll re-run with the @trim_env branch.
@krivit unfortunately using the @trim_env branch does not appear to change the outcomes I'm experiencing.
However, I have discovered that part of the object size issue derives from size expansion during saving and loading a `netest` object (using `saveRDS` and `readRDS`). It is very simple to reproduce: running the example above, if you then do the following:
```r
object_size(fit_test2)
saveRDS(fit_test2, file = "fit_test2.Rds")
rm(list = ls())
fit_test2 <- readRDS(file = "fit_test2.Rds")
object_size(fit_test2)
```
You will see that saving and loading increases the size of the object in the environment by ~4x! This is not a compression issue as far as I can tell. Because my code saves the `netest` object for use again later, this seems to lie at the heart of my problem. Is there an obvious explanation for this?
Another clue: re-loading a `netest` object and then re-setting the environments of `constraints` (and `formula`) to the global environment brings the object back down to the expected size. `netsim` appears to work fine on the resulting object. This is obviously a terrible hack, but might it work as a solution?
```r
saved_mod <- readRDS("fit_test2.Rds")
object_size(saved_mod) # ~200 MB
environment(saved_mod$fit$constraints) <- .GlobalEnv
environment(saved_mod$fit$formula) <- .GlobalEnv
object_size(saved_mod) # ~50 MB
sim <- netsim(saved_mod, param, init, control)
```
Using `trim_env` manually on a completed `netest` also seems to effectively bring the size down to the expected value, so it is hard to see why @krivit's changes in statnet/EpiModel@trim_env (commit bca5d83) don't solve this issue.

Perhaps it is noteworthy that `trim_env` has differential effects on size when used on `model$fit$constraints` versus `model$constraints`.
Can you try the by-element `object_size` analysis for the trim_env branch? `$constraints` should be much smaller; and I suspect that the reason `$formula` is so big is that its environment isn't trimmed, but rather now contains the LHS network. In fact, it might not actually take up space over and above the rest of the fit, which also contains the network, since it's the same network.

There's still an assignment of `constraints`'s environment after the model fit:
https://github.com/statnet/EpiModel/blob/bca5d83795755849add9591b6f67434b04d1cc8b/R/netest.R#L210
The by-element object_size analysis still suggests that both the constraint and formula are large.
Ah... I think it might be worth auditing `EpiModel` for its environment hygiene.
@ThomasKraft, how about now?
@krivit this has fixed the size issue with both `$constraints` and `$fit$constraints`! However, the issue with `formula` remains unresolved (very large, expanding in size during saving/loading). Regarding your earlier comment, `$formula` does appear to be taking up quite a lot of space over and above the rest of the fit (not sure if this is because the LHS network is stored in that environment). Perhaps the environment for `formula` needs to be modified within `nonsimp_update.formula` (https://github.com/statnet/EpiModel/blob/bca5d83795755849add9591b6f67434b04d1cc8b/R/netest.R#L188)?
@ThomasKraft yes, the LHS network is stored in the formula environment. As Pavel once said, "formulas are devious".
So, the places networks are stored in a fit object are:
I've actually made use of (4) in our egocentric modeling, to ensure nodeset consistency when we run multilayer network simulations. I don't actually use it in any EpiModel code; it's all done in pre-processing steps, so if we want to strip this from the `netest` object, this wouldn't be a problem (for me at least).
Thank you for this very clarifying breakdown. It seems that stripping (4) from the `netest` object might cause problems with models including `edgecov` variables. I've tried modifying the `$fit$formula` environment of the `netest` object directly, and `netsim` subsequently throws an error that it doesn't know where to look for the `edgecov` object (which was defined in the global environment at the time of the `netest` run). But it seems like this should be surmountable by looking either in the current global environment or within the saved network object at runtime (I think).
- `fit$newnetworks` -- a list of the final networks at the end of the MCMC simulation, one for each thread (I think only for parallel estimation, but @krivit will need to confirm that).
If it's single-threaded, then it's just a list of length 1. (It's the same network as `fit$newnetwork`.)
- environment(fit$ergm.formula)$popnw -- the sneaky one. @krivit is this network identical to (1) above?
I am not sure where it's coming from. `ergm.ego` constructs a network by that name, but `EpiModel` doesn't use `ergm.ego` (yet).
I notice that when looking within the `netest` object, running `ls(environment(model$fit$formula))` returns only "TARGET_STATS", which appears to be just a static network object that is `identical()` to what @martinamorris listed as (1). This would suggest that it's redundant (at least in my case), but as noted above, `netsim` seems to look there for `edgecov` variables specified in the formula.
@krivit, the current EpiModel uses a call to `ergm` with a `target.stats` argument. What does `ergm` return in the `fit$formula` environment in that case?
It's an environment containing `TARGET_STATS`, with the formula's original environment as its parent environment.
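If I'm reading that right, the layout can be mimicked by hand like so (a sketch of the environment nesting, not actual `ergm` internals):

```r
# The formula starts out with some original environment:
orig_env <- new.env()
f <- y ~ x
environment(f) <- orig_env

# With target.stats, the formula gets rewrapped: a new environment holding
# TARGET_STATS, whose parent is the formula's original environment.
wrapper <- new.env(parent = orig_env)
wrapper$TARGET_STATS <- "the LHS network would live here"
environment(f) <- wrapper

ls(environment(f))                               # "TARGET_STATS"
identical(parent.env(environment(f)), orig_env)  # TRUE
```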
Can you help me understand why `object_size(fit_test2$fit$formula)` is exactly twice as large as `object_size(environment(fit_test2$fit$formula)$TARGET_STATS)`?
This appears to be at the heart of the memory duplication that is occurring, and I suspect it is because the network is being stored in the formula's environment. If this is the case, then I believe it creates my original issue when saving and re-loading the `netest` object, because the relationship between environments becomes unlinked.
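If that's right, the mechanism may be that R's serializer deduplicates reference objects (environments) but not ordinary vectors and lists, so two in-memory references to one object come back from `readRDS()` as two independent copies. A base-R illustration (this is my assumption about the mechanism, not a confirmed diagnosis of `netest`):

```r
x <- rnorm(1e6)                      # ~8 MB of doubles
lst <- list(a = x, b = x)            # two references to ONE copy in memory
as.numeric(pryr::object_size(lst))   # ~8e6: pryr detects the sharing

# Serialization writes the vector once per reference...
raw <- serialize(lst, NULL)
length(raw)                          # ~1.6e7 bytes: written twice

# ...so the round-tripped object holds two separate copies:
lst2 <- unserialize(raw)
as.numeric(pryr::object_size(lst2))  # ~1.6e7
```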
To help with diagnosis, here is a minimal example for the remaining memory management problem:
```r
library(EpiModel)
library(pryr)

toy <- network_initialize(n = 1000, directed = FALSE)
ft <- netest(toy,
             formation = ~edges,
             target.stats = c(1500),
             coef.diss = dissolution_coefs(dissolution = ~offset(edges), duration = 10),
             edapprox = TRUE)
object_size(ft) # 1.94 MB

save.image(file = "envir.RData")
rm(list = ls())
load("envir.RData")
object_size(ft) # 13.5 MB
```
I can't explain it, but I see it (though my values for the saved object are a bit bigger than yours). A more complete reprex:
```r
library(EpiModel)
library(pryr)

toy <- network_initialize(n = 1000, directed = FALSE)
object_size(toy) # 418 KB, so this is tiny

ft <- netest(toy,
             formation = ~edges,
             target.stats = c(1500),
             coef.diss = dissolution_coefs(dissolution = ~offset(edges),
                                           duration = 10),
             edapprox = TRUE)
object_size(ft) # 1.94 MB
object_size(environment(ft$fit$formula)) # 1.94 MB -- the same as the entire fit object?
object_size(environment(ft$fit$formula)$TARGET_STATS) # 1.27 MB

save.image(file = "envir.RData")
rm(list = ls())
load("envir.RData")

object_size(ft) # 18.5 MB
object_size(environment(ft$fit$formula)) # 15.8 MB
object_size(ft$fit$newnetwork) # 1.27 MB
object_size(environment(ft$fit$formula)$TARGET_STATS) # 1.27 MB
```
One factor that would help ameliorate memory issues with large networks would be the ability to automatically discard saved networkDynamic objects at the end of each run. As far as I know this is no longer possible now that the `save.network` argument is defunct. @smjenness, is there an easy way to throw out the saved network object within each simulation at the end of `nsteps`?
tergmLite has been disbanded (its functionality has been folded into EpiModel itself). If there are any relevant open sub-issues here, please open new issues on the EpiModel or statnet package repositories. Thanks!
Using `netest` to fit a relatively dense network (~8k nodes, 22k edges), I am encountering extremely high memory demands from unexpected parts of the object (`formula` and `constraints` within `$fit`). For example:

- `pryr::object_size(model_fit$fit$formula)` yields 10.3 GB
- `pryr::object_size(model_fit$fit$constraints)` yields 10.3 GB

Given that the network object is only 1.3 GB or so, this seems surprisingly high. My best guess is that this is caused by the use of an offset term that @chad-klumb helped me with as an alternative to using `bd()`. Is it possible that the large size is affected by using high values for the MCMC burnin and interval? The code I'm fitting is as follows: