statnet / tergmLite

A fast, simplified version of TERGM-based simulation for epidemic modeling

netest memory usage for "formula" and "constraints" #48

Closed ThomasKraft closed 2 years ago

ThomasKraft commented 3 years ago

Using netest to fit a relatively dense network (~8k nodes, 22k edges), I am encountering extremely high memory demands from unexpected parts of the object ("formula" and "constraints" within $fit). For example:

pryr::object_size(model_fit$fit$formula) yields 10.3 GB

pryr::object_size(model_fit$fit$constraints) yields 10.3 GB

Given that the network object is only ~1.3 GB, this seems surprisingly high. My best guess is that this is caused by the use of an offset term that @chad-klumb helped me with as an alternative to using bd(). Is it possible that the large size is affected by using high values for the MCMC burnin and interval? The code I'm fitting is as follows:

model_fit <- netest(nw,
                    formation = ~edges + nodefactor("male") +
                      nodematch("male") +
                      nodemix("age_cat", levels2 = c(2:15)) +
                      nodematch("fid") +
                      edgecov("kc") +
                      edgecov("akmc") +
                      edgecov("ldm") +
                      offset(absdiff(~as.integer(com_id))),
                    coef.form = c(-Inf),
                    target.stats = c(edge_tar, male_tar,
                                     edge_tar * sex_homophily_tar,
                                     age_mixing_tar,
                                     household_tar * edge_tar,
                                     gen_r_tar * edge_tar,
                                     affinal_r_tar * edge_tar,
                                     dist_tar * edge_tar),
                    coef.diss = dissolution_coefs(dissolution = ~offset(edges), duration = 3),
                    set.control.ergm = control.ergm(MCMLE.maxit = 40,
                                                    MPLE.samplesize = 1e8,
                                                    MCMC.burnin = 6e7,
                                                    MCMC.interval = 7e5,
                                                    SAN.control = control.san(SAN.nsteps = 1e8)),
                    edapprox = TRUE)
krivit commented 3 years ago

This may be related to https://github.com/statnet/ergm-private/issues/332 .

chad-klumb commented 3 years ago

My best guess is that this is caused by the use of an offset term that @chad-klumb helped me with as an alternative to using bd().

Is the size substantially lower without the offset, or with the offset specified differently (such as a named vertex attribute rather than a formula)?

ThomasKraft commented 3 years ago

@chad-klumb It appears that the way the edgecov variables are specified drives the major size differences: writing the formula with edgecov(variable) versus edgecov("variable"), where in the latter case the variable is stored as a network attribute. At first I was tempted to think that this behavior was a result of the size of the network object itself, but running pryr::object_size(fit$constraints) suggests that it is still the constraint that is large.

I am not sure how to specify the offset as a named attribute but still maintain the current functionality. Do you mean using a nodematch term within offset?

Here is a working example for the size issue:

library(EpiModel)
library(pryr)

num <- 1000
toy <- network_initialize(n = num, directed = FALSE)
toy <- set_vertex_attribute(toy, "com_id", c(rep("1", num/4), rep("2", num/4), rep("3", num/2)))
toy <- set_vertex_attribute(toy, "fid", sample(1:40, num, replace = TRUE))

df <- data.frame(pid = toy %v% "vertex.names",
                 com_id = toy %v% "com_id",
                 fid = toy %v% "fid")

# edge attributes
kin <- matrix(sample(seq(0, .5, .1), num^2, replace = TRUE), num)
kin[lower.tri(kin)] <- t(kin)[lower.tri(kin)]
diag(kin) <- 0

dist <- matrix(sample(1:10000, num^2, replace = TRUE), num)
dist[lower.tri(dist)] <- t(dist)[lower.tri(dist)]
diag(dist) <- 0
dist <- log(dist + 1)

for(i in unique(df$fid)){
  for(j in unique(df$fid)){
    dist[which(df$fid == i), which(df$fid == j)] <- rnorm(1, 8, sd=2)
  }
}

fit_test1 <- netest(toy,
                    formation = ~edges + nodematch("fid") + edgecov(kin) + edgecov(dist) + offset(absdiff(~as.integer(com_id))),
                    target.stats = c(30*num/20, 30*num/40, 30*.2*num/20, 7.5*num),
                    coef.form = c(-Inf),
                    coef.diss = dissolution_coefs(dissolution = ~offset(edges), duration = 3),
                    edapprox = TRUE)

toy2 <- toy
toy2 %n% "dist" <- dist
toy2 %n% "kin" <- kin

fit_test2 <- netest(toy2,
                    formation = ~edges + nodematch("fid") + edgecov("kin") + edgecov("dist") + offset(absdiff(~as.integer(com_id))),
                    target.stats = c(30*num/20, 30*num/40, 30*.2*num/20, 7.5*num),
                    coef.form = c(-Inf),
                    coef.diss = dissolution_coefs(dissolution = ~offset(edges), duration = 3),
                    edapprox = TRUE)

object_size(fit_test1$constraints)
object_size(fit_test2$constraints)
ThomasKraft commented 3 years ago

I forgot to mention, dropping the offset term altogether doesn't seem to reduce the size increase of the netest object, or the size of the object$constraints. I find that rather confusing since the constraints should be empty in that case, no?

krivit commented 3 years ago

Can you please drill down deeper? I.e., sapply(object$constraints, object_size), sapply(object$constraints$<BIGGEST>, object_size), etc.?
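A small helper along these lines can make the drill-down less tedious (a sketch, not part of the thread; `model_fit` is assumed to be the fitted object from above):

```r
# Sketch: rank the elements of a list-like fit component by size, then
# recurse into the biggest one. pryr::object_size (used elsewhere in this
# thread) accounts for sharing; utils::object.size works if pryr is absent.
sizes <- function(x, size = utils::object.size) {
  sort(sapply(x, size), decreasing = TRUE)
}

# Usage (hypothetical, with a fitted netest object):
# sizes(model_fit$fit)                            # biggest element first
# sizes(model_fit$fit$constraints)                # then drill into it
# sizes(model_fit$fit, size = pryr::object_size)  # sharing-aware sizes
```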

chad-klumb commented 3 years ago

The constraints should be "trivial" (i.e. ~.) with or without the offset, because you aren't specifying a constraints argument. Even trivial formulas have environments, though, and those environments could have various large objects in them.

It would be interesting to know how the size from netest compares with the size from a direct call to ergm. There are some environment modifications in netest (specifically lines 189 and 211) that affect the formula and constraints of the ergm fit:

https://github.com/statnet/EpiModel/blob/7bf5abcc3a376d1e1ecc288e77d9f94d7acfe12a/R/netest.R#L188-L211

environment() (with no arguments) returns the current evaluation environment

ThomasKraft commented 3 years ago

That is correct, looking within constraints simply returns: ~. (as well as the environment <environment: 0x7fdc40796cc0>). As such, sapply(object$constraints, object_size) doesn't return much information. In this case "56 56", but there aren't really any objects there.

But I think @chad-klumb nailed it about an issue with the environment. I've been trying to figure out why I'm getting such variable output sizes of netest objects when trying multiple times, and I had almost managed to convince myself that the control.ergm parameters were making a big difference. But if you run the example code above (sorry I fixed the typos there, it runs now) with a clean workspace, fit_test1 is ~4 MB, whereas fit_test2 is ~75 MB. netest therefore does seem to be storing some of my global environment within constraints, but it is not at all obvious why that differs between these cases (which depends on how the edgecov variables are specified).

Is there a clear best way to reduce my netest size? For example, does it make sense to clear my global environment of anything extraneous before running netest(), or can I just clear the environment within the netest object?

You might also be interested to know that this can cause major issues going forward when using netsim() -- the memory usage from an oversized netest object spirals out of control when running simulations from it.

krivit commented 3 years ago

My bad. I had mixed up constraint with constrained (which is a list).

To elaborate on what @chad-klumb wrote, a formula created inside a function will carry its environment around. One that is created outside a function will have globalenv() as its environment and will not carry it around.

statnet.common has a function trim_env(), which replaces an object's environment with one that is empty except for a selected set of variables, and nonsimp_update.formula() similarly tries to be more efficient with environments. I've made changes in the statnet/EpiModel@trim_env branch: give them a try.
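A base-R sketch of the mechanism being described (the trim_env() call at the end is shown commented out, since it needs statnet.common installed):

```r
# A formula created inside a function captures that function's evaluation
# environment, including any large local objects; serializing the formula
# then drags those objects along. A formula created at top level has
# globalenv() as its environment, which is never serialized.
f_inside <- local({
  big <- rnorm(1e6)   # ~8 MB object living in the formula's environment
  ~edges
})
f_outside <- ~edges

length(serialize(f_inside, NULL))   # millions of bytes: `big` rides along
length(serialize(f_outside, NULL))  # a few hundred bytes

# statnet.common::trim_env() swaps in an empty environment (optionally
# keeping named variables), shedding the baggage:
# f_trim <- statnet.common::trim_env(f_inside)
# length(serialize(f_trim, NULL))   # small again
```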

ThomasKraft commented 3 years ago

Thanks @krivit. In running checks, I'm realizing that the example code I posted isn't doing a good job of reproducing the issue. In the example, the network is being stored 3 times in the model fit (model$fit$network, model$fit$newnetwork, model$fit$newnetworks), and the size difference is accounted for by the size of the matrices included as network attributes multiplied by 3.

In my empirical example that originally generated the question, it is very different (see image below). You can see >10 GB stored under both formula and constraints. Is there a way that I can see what objects are actually stored within those environments?

Once I'm done making sure I can reproduce the issue I'll re-run with the @trim_env branch.

[Screenshot: by-element object_size output showing >10 GB under both $formula and $constraints]

ThomasKraft commented 3 years ago

@krivit unfortunately using the @trim_env branch does not appear to change the outcomes I'm experiencing.

However, I have discovered that part of the object size issue derives from size expansion during saving and loading of a netest object (using saveRDS and readRDS). It is very simple to reproduce: after running the example above, do the following:

object_size(fit_test2)
saveRDS(fit_test2, file = "fit_test2.Rds")
rm(list = ls())

fit_test2 <- readRDS(file = "fit_test2.Rds")
object_size(fit_test2)

You will see that saving and loading increases the size of the object in the environment by ~4x! This is not a compression issue as far as I can tell. Because my code saves the netest object for use again later, this seems to lie at the heart of my problem. Is there an obvious explanation for this?
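One plausible mechanism (a base-R sketch, not a diagnosis of netest specifically): R's serialization preserves the identity of environments but writes plain vectors out once per reference, so memory that was shared by copy-on-write before saving comes back as separate copies after a round trip:

```r
# Assumes pryr is installed, as used elsewhere in this thread.
big <- rnorm(1e6)
lst <- list(a = big, b = big)   # both elements point at one allocation

pryr::object_size(lst)          # ~8 MB: the shared vector is counted once

tmp <- tempfile(fileext = ".Rds")
saveRDS(lst, tmp)
lst2 <- readRDS(tmp)

pryr::object_size(lst2)         # ~16 MB: each element now has its own copy
```

If a fit object holds the same network in several slots (and in formula environments), this duplication compounds quickly.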

ThomasKraft commented 3 years ago

Another clue: re-loading a netest object and then re-setting the environment of constraints (and formula) to the global environment brings the object size back down to the expected value. netsim appears to work fine on the resulting object. This is obviously a terrible hack, but might it work as a solution?

saved_mod <- readRDS("fit_test2.Rds")
object_size(saved_mod)  # ~200 MB

environment(saved_mod$fit$constraints) <- .GlobalEnv
environment(saved_mod$fit$formula) <- .GlobalEnv
object_size(saved_mod)  # ~50 MB

sim <- netsim(saved_mod, param, init, control)
ThomasKraft commented 3 years ago

Using trim_env manually on a completed netest also seems to bring the size down to the expected value, so it is hard to see why @krivit's changes in statnet/EpiModel@trim_env (commit bca5d83) don't solve this issue.

Perhaps it is noteworthy that trim_env has differential effects on size when used on model$fit$constraints versus model$constraints.

krivit commented 3 years ago

Can you try the by-element object_size analysis for the trim_env branch? $constraints should be much smaller; and I suspect that the reason $formula is so big is that its environment isn't trimmed but rather now contains the LHS network. In fact, it might not actually take up space over and above the rest of the fit, which also contains the network, since it's the same network.

chad-klumb commented 3 years ago

There's still an assignment to the constraints' environment after the model fit:

https://github.com/statnet/EpiModel/blob/bca5d83795755849add9591b6f67434b04d1cc8b/R/netest.R#L210

ThomasKraft commented 3 years ago

The by-element object_size analysis still suggests that both the constraint and formula are large.

krivit commented 3 years ago

https://github.com/statnet/EpiModel/blob/bca5d83795755849add9591b6f67434b04d1cc8b/R/netest.R#L210

Ah... I think it might be worth auditing EpiModel for its environment hygiene.

krivit commented 3 years ago

@ThomasKraft, how about now?

ThomasKraft commented 3 years ago

@krivit this has fixed the size issue with both $constraints and $fit$constraints!

However, the issue with formula remains unresolved (very large, and expanding in size during saving/loading). Regarding your earlier comment, $formula does appear to take up quite a lot of space over and above the rest of the fit (not sure if this is because the LHS network is stored in that environment). Perhaps the environment for formula needs to be modified within nonsimp_update.formula (https://github.com/statnet/EpiModel/blob/bca5d83795755849add9591b6f67434b04d1cc8b/R/netest.R#L188)?

martinamorris commented 3 years ago

@ThomasKraft yes, the LHS network is stored in the formula environment. As Pavel once said, "formulas are devious".

So, the places networks are stored in a fit object are:

  1. fit$network -- the original network, or, if working with egocentric data, the network produced by SAN
  2. fit$newnetwork -- the final sampled network from the estimation MCMC chain
  3. fit$newnetworks -- a list of the final networks at the end of the MCMC simulation, one for each thread (I think only for parallel estimation, but @krivit will need to confirm that)
  4. environment(fit$ergm.formula)$popnw -- the sneaky one. @krivit is this network identical to (1) above?

I've actually made use of (4) in our egocentric modeling, to ensure nodeset consistency when we run multilayer network simulations. I don't actually use it in any EpiModel code, it's all done in pre-processing steps, so if we want to strip this from the netest object, this wouldn't be a problem (for me at least).

ThomasKraft commented 3 years ago

Thank you for this very clarifying breakdown. It seems that stripping (4) from the netest object might cause problems with models including edgecov variables. I've tried modifying the $fit$formula environment directly of the netest object, and then netsim subsequently throws an error that it doesn't know where to look for the edgecov object (defined in the global environment at time of netest run). But it seems like this should be surmountable by either looking at the current global environment or within the saved network object at runtime (I think).
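A hedged workaround sketch for that lookup failure (untested against netsim internals; uses the kin and dist matrices and the saved_mod object from the examples above): instead of pointing the formula at the global environment, give it a minimal environment containing only the objects the formula mentions.

```r
# Replace the formula's bulky environment with a small one holding only
# the edgecov objects the formula refers to (here `kin` and `dist`).
e <- new.env(parent = globalenv())
e$kin <- kin
e$dist <- dist
environment(saved_mod$fit$formula) <- e
# netsim() should then be able to resolve `kin`/`dist` via the formula's
# environment, without dragging the rest of the workspace along.
```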

krivit commented 3 years ago
  3. fit$newnetworks -- a list of the final networks at the end of the MCMC simulation, one for each thread (I think only for parallel estimation, but @krivit will need to confirm that)

If it's single-threaded, then it's just a list of length 1. (It's the same network as fit$newnetwork.)

  4. environment(fit$ergm.formula)$popnw -- the sneaky one. @krivit is this network identical to (1) above?

I am not sure where it's coming from. ergm.ego constructs a network by that name, but EpiModel doesn't use ergm.ego (yet).

ThomasKraft commented 3 years ago

I notice that when looking within the netest object, running ls(environment(model$fit$formula)) returns only "TARGET_STATS", which appears to hold just a static network object that is identical() to what @martinamorris listed as (1). This would suggest that it's redundant (at least in my case), but as noted above netsim seems to look there for edgecov variables specified in the formula.
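For anyone repeating this check, the inspection boils down to (a sketch; `model` stands in for a fitted netest object):

```r
env <- environment(model$fit$formula)
ls(env)                                          # "TARGET_STATS"
identical(env$TARGET_STATS, model$fit$network)   # TRUE in the case above
```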

martinamorris commented 3 years ago

@krivit, the current EpiModel uses a call to ergm with target.stats argument. what does ergm return in the fit$formula environment in that case?

krivit commented 3 years ago

It's an environment containing TARGET_STATS, with the formula's original environment as its parent environment.

ThomasKraft commented 3 years ago

Can you help me understand why object_size(fit_test2$fit$formula) is exactly twice as large as object_size(environment(fit_test2$fit$formula)$TARGET_STATS)?

This appears to be at the heart of the memory duplication: I suspect the network is being stored in the formula's environment. If so, this would explain my original issue, because when the netest object is saved and re-loaded, the memory sharing between the formula's environment and the rest of the fit is lost.

ThomasKraft commented 3 years ago

To help with diagnosis, here is a minimal example for the remaining memory management problem:

library(EpiModel)
library(pryr)
toy <- network_initialize(n = 1000, directed = FALSE)
ft <- netest(toy, formation = ~edges,
             target.stats = c(1500),
             coef.diss = dissolution_coefs(dissolution = ~offset(edges), duration = 10),
             edapprox = TRUE)
object_size(ft)  #1.94 MB

save.image(file = "envir.RData")
rm(list=ls())
load("envir.RData")

object_size(ft)  #13.5 MB
martinamorris commented 3 years ago

I can't explain it, but I see it. (though my values for the saved object are a bit bigger than yours). A more complete reprex:

library(EpiModel)
library(pryr)
toy <- network_initialize(n = 1000, directed = FALSE)
object_size(toy)    # 418 KB, so this is tiny
ft <- netest(toy, formation = ~edges,
             target.stats = c(1500),
             coef.diss = dissolution_coefs(dissolution = ~offset(edges),
                                           duration = 10),
             edapprox = TRUE)
object_size(ft)   # 1.94 MB
object_size(environment(ft$fit$formula))  # 1.94 MB the same as the entire fit object?
environment(ft$fit$formula)$TARGET_STATS
object_size(environment(ft$fit$formula)$TARGET_STATS) # 1.27 MB

save.image(file = "envir.RData")
rm(list=ls())
load("envir.RData")

object_size(ft)     # 18.5 MB
object_size(environment(ft$fit$formula))   # 15.8 MB
object_size(ft$fit$newnetwork)  #  1.27 MB
environment(ft$fit$formula)$TARGET_STATS
object_size(environment(ft$fit$formula)$TARGET_STATS) # 1.27 MB
ThomasKraft commented 3 years ago

One factor that would help ameliorate memory issues with large networks would be the ability to automatically discard the saved networkDynamic objects at the end of each run. As far as I know this is no longer possible now that the save.network argument is defunct. @smjenness is there an easy way to throw out the saved network object within each simulation at the end of nsteps?

smjenness commented 2 years ago

tergmLite has been disbanded (its functionality has been folded into EpiModel itself). If there are any relevant open sub-issues here, please open new issues on the EpiModel or statnet package repositories. Thanks!