statnet / ergm

Fit, Simulate and Diagnose Exponential-Family Models for Networks
Other
94 stars 36 forks source link

In ergm(), rename TARGET_STATS to something more aesthetic and less likely to be used as a variable name? #476

Open krivit opened 1 year ago

krivit commented 1 year ago

Right now, if ergm() is given target.stats=, it runs san() to obtain the LHS network, calls it TARGET_STATS, then replaces the LHS of the model formula with TARGET_STATS (the name), augmenting its environment to contain said variable.

There are two reasons to change this:

  1. If TARGET_STATS is a variable used outside of ergm(), that could create a subtle bug, since the formula would no longer be able to see it.
  2. R variable names can technically be any strings, if quoted. For example,
    > `Hello World!` <- 2+2
    > ls()
    [1] "Hello World!"
    > `Hello World!`
    [1] 4

    We can therefore use an expression that is more aesthetically pleasing (say, for summary() output) or is more clearly meta-syntactic.

And so, I have two questions:

  1. Is anyone, particularly @statnet/epimodel developers, currently relying on TARGET_STATS?
  2. What names would be good? Some ideas so far:
    • <target stats> ~ edges
    • [target stats] ~ edges
smjenness commented 1 year ago

@krivit : we are not using TARGET_STATS nor the two proposed alternatives. I agree the alternatives may be safer.

mbojan commented 1 year ago

@mbojan Do you mean in the original ergm call? A network is needed on the LHS in order to provide all of the info about the meta-information about the network and, most importantly, all of the attribute values. SInce the latter are of arbitrary number and type it would be quite messy (and redudnant) to recreate all of that info in another format.

Right. Disregard my comment. Got carried away by a quest to overload the current interface...

martinamorris commented 1 year ago

Hmm. Calling a network "target stats" (in any format) seems a bit odd to me. Needless to say I hadn't realized we were already doing that. Would it make sense to use something like "target net" instead?

krivit commented 1 year ago

Hmm. Calling a network "target stats" (in any format) seems a bit odd to me. Needless to say I hadn't realized we were already doing that. Would it make sense to use something like "target net" instead?

It's because we are fitting to statistic, not to the network. At least, that's the reasoning.

mbojan commented 1 year ago

Is this really a problem? It will happen only if the symbol TARGET_STATS is used by the user in the ergm formula with the intention of referring to an object with that name in, say, global environment. That sounds rather unlikely to me. If we think it is likely enough we can throw an error/warning if symbol TARGET_STATS has been used in the formula by the user. For reference, the source code of interest is here:

https://github.com/statnet/ergm/blob/554b40d09fa64e217af0a0352266b7df25761900/R/ergm.R#L581-L603

A separate issue is about the output: what should summary(ergmfit) show for a model fit to vector of sufficient statistics rather than a network object. At this moment I believe you can't tell one from the other.

krivit commented 1 year ago

I think that in practice, the benefits will be largely aesthetic. In particular, summary.ergm() would show [target stats] ~ edges rather than TARGET_STATS ~ edges and similar.

mbojan commented 1 year ago

In the output IMHO [target stats] or <target stats> would look most "meta-" in the sense of not suggesting it is an R object to be interacted with, thus preferable.

CarterButts commented 1 year ago

I think one would want to avoid including whitespace in the name, as this can complicate parsing.  Is the intent for this to be user-exposed?  If so, we should let them use their own name and detect the object type after parsing.  Otherwise, the current scheme seems OK. TARGET_STATS_LHS is uglier, but would be pretty collision-proof, if that is the concern.  Whole thing seems a bit in the not broke/don't fix category.

moxboxwa commented 1 year ago

Hmm. Calling a network "target stats" (in any format) seems a bit odd to me. Needless to say I hadn't realized we were already doing that. Would it make sense to use something like "target net" instead?

It's because we are fitting to statistic, not to the network. At least, that's the reasoning.

I get that the input is target stats. But really, what we're fitting to is a network with those target stats. If we could fit to the target stats directly we wouldn't need the SAN step.

My suggestion preserves the "target" language and clarifies that a network is still on the LHS.

chad-klumb commented 1 year ago

Why not consistently allow functions that pull a network from the LHS to accept a basis argument (overriding the LHS if present), as already done for simulate() and summary()? That would eliminate the need to modify the formula or its environment (and hence to worry about the name we give the san network). If done for ergm(), it would also allow for the removal of this line in EpiModel

https://github.com/EpiModel/EpiModel/blob/7a3dc83197c5398044a2688c2182be2733a5d4d3/R/netest.R#L225

which currently uses the much-more-likely-to-be-a-problem name nw.

chad-klumb commented 1 year ago

Actually, it looks like ergm() had this functionality added already in

https://github.com/statnet/ergm/commit/8e01faed16f8d96d3def8740929b0910aacf66d1.

chad-klumb commented 1 year ago

As far as print.summary.ergm with print.formula = TRUE goes, it seems we have some inconsistency in how target.stats and basis are handled. With ergm@master, the following script

require(ergm)
nw1 <- network(10, directed = FALSE)
nw2 <- network(10, directed = FALSE)

e1 <- ergm(nw1 ~ edges)
s1 <- summary(e1)
print(s1, print.formula = TRUE)

e2 <- ergm(nw1 ~ edges, target.stats = 10)
s2 <- summary(e2)
print(s2, print.formula = TRUE)

e3 <- ergm(nw1 ~ edges, basis = nw2)
s3 <- summary(e3)
print(s3, print.formula = TRUE)

produces (in part) the respective formula printouts

Formula:
nw1 ~ edges

Formula:
TARGET_STATS ~ edges

Formula:
nw1 ~ edges

So, even though nw2 is used as the basis for e3, the formula printed for s3 is nw1 ~ edges.

If the desire is to have print.summary.ergm with print.formula = TRUE print some representation of the "effective" ergm() formula (rather than the thing actually passed to ergm() as the formula argument), then we should probably change the basis printing behavior.

For consistency and simplicity, we could do the following:

Note that this does not require modifying formula or its environment.

(Another option, which also does not require modifying formula or its environment, would be to have print.formula = TRUE always print the original formula argument to ergm(), understanding that basis and/or target.stats can be printed as part of the call.)


As I see it, modifying the formula as in

https://github.com/statnet/ergm/blob/61ebeea9e83cf906a39f60c253ede05f77da03b6/R/ergm.R#L603

and then subsequently calling ergm_model on it (as in, e.g., a later logLik() or simulate() call) is fundamentally a bug (whether TARGET_STATS or some non-syntactic alternative is used as the network name) that should be avoided if possible, and it does seem that it should be avoidable here.

krivit commented 1 year ago

Thanks! In terms of programming, having an "illegal" name is not a problem, since R variable names can be anything.

For example, here I changed the implementation to use target stats (with a space), which passes the check and prudces the following (truncated) output:

library(ergm)
nw1 <- network(10, directed = FALSE)
nw2 <- network(10, directed = FALSE)

e1 <- ergm(nw1 ~ edges)
s1 <- summary(e1)
print(s1, print.formula = TRUE)
#> nw1 ~ edges

e2 <- ergm(nw1 ~ edges, target.stats = 10)
s2 <- summary(e2)
print(s2, print.formula = TRUE)
#> `target stats` ~ edges

e3 <- ergm(nw1 ~ edges, basis = nw2)
s3 <- summary(e3)
#> nw1 ~ edges

All that having been said, I suppose that aesthetics are not that important, since by default we print the call, not the formula. (I think we used to print the formula.)