Closed mbojan closed 3 years ago
Need to fix statnet/ergm#172 first.
Done.
Thank you!!!
Ok, still one problem, but i think we're 99% of the way there.
The offset in the first position is the ppop scaling adjustment.
in ergm.ego
, theta(ppop adj) + theta(edges) = in ergm
theta(edges).
So, we don't want to just delete that offset, we want to add it back to the edges theta first, then delete it.
Without this, the probabilities are not scaled correctly in the predict function:
> head(pnet)
tail head p
1 3497 7317 0.6867611
2 7272 8277 0.7497954
3 7069 11762 0.4357120
4 435 8813 0.5929972
5 1275 8594 0.0000000
6 10672 10806 0.5315793
those probabilities should be much, much lower in this network with 15K nodes and 7K edges.
i'm tagging both @chad-klumb and @mbojan here, not sure who can get to it first.
A possible alternative would be to use the ergm.formula
rather than the formula
attached to the ergm.ego
object; then no removal or modification should be necessary.
cf. PR statnet/ergm.ego#48
you can remotes::install_github("chad-klumb/ergm.ego@master")
to test (until it's merged)
thx for the quick response! will try when i'm free (6:30ish)
I was actually wondering if it should be in fact included...
We could also modify ergmMPLE()
to return foffset
element computed by ergm.pl()
which is an already computed combined "effect" of the offset terms (sum of products of offset terms and their coefficients).
@chad-klumb , can you verify for a model with more than one offset term (plus net netsize adjustment) if the order of the parameter vector is the same as the columns of the "model matrix" (as returned by ergmMPLE()
)? I recall that for some reason if ergmMPLE()
was fed with ergmegofit$ergm.formula
the netsize adjustment offset was moved from the first place in the formula.
I will be able to get back to it in 8-10hrs.
It looks like the problem is actually worse than that; it appears to move offsets (not just netsize.adj
) to the last columns.
That will need to be changed or compensated for.
Right. I did not manage yet to pin down why it happens. I think we need a reliable way of mapping formula terms to model matrix columns and to parameters. It will be invaluable in other contexts too.
All in all that means the problem affects ergm too, not only ergm.ego.
The rearrangement happens here:
https://github.com/statnet/ergm/blob/38533f42b4e302f05e905145174bf105270f0cbc/R/predict.ergm.R#L103
nonsimp_update.formula
may be the solution
With both chad-klumb/ergm@master
and chad-klumb/ergm.ego@master
, the following script
library(ergm.ego)
data("faux.magnolia.high")
fmhego <- as.egodata(faux.magnolia.high)
fit <- ergm.ego(fmhego ~ edges
+ nodefactor("Grade")
+ offset(nodemix("Grade", levels2=1))
+ nodematch("Grade")
+ offset(nodematch("Sex")),
offset.coef = c(-1,-2))
debug(predict)
pnet <- predict(fit)
shows that predmat
is structured as
Browse[5]> head(predmat)
tail head offset(netsize.adj) edges nodefactor.Grade.8 nodefactor.Grade.9 nodefactor.Grade.10 nodefactor.Grade.11 nodefactor.Grade.12 offset(mix.Grade.7.7) nodematch.Grade
[1,] 116 267 1 1 0 1 0 1 0 0 0
[2,] 57 1433 1 1 0 0 0 1 1 0 0
[3,] 610 1167 1 1 0 0 1 0 0 0 0
[4,] 38 553 1 1 0 1 0 0 0 0 0
[5,] 1043 1081 1 1 0 0 1 1 0 0 0
[6,] 625 1317 1 1 1 1 0 0 0 0 0
offset(nodematch.Sex)
[1,] 0
[2,] 0
[3,] 1
[4,] 0
[5,] 0
[6,] 1
which I believe is the desired result, as
Browse[5]> fit$ergm.formula
popnw ~ offset(netsize.adj) + (edges + nodefactor("Grade") +
offset(nodemix("Grade", levels2 = 1)) + nodematch("Grade")) +
offset(nodematch("Sex"))
<environment: 0x000000000833e530>
Browse[5]> fit$coef
offset(netsize.adj) edges nodefactor.Grade.8 nodefactor.Grade.9 nodefactor.Grade.10 nodefactor.Grade.11 nodefactor.Grade.12 offset(mix.Grade.7.7)
-7.2868764 0.1513628 -0.3380101 -0.7668909 -0.6618398 -0.5076787 -0.5141862 -1.0000000
nodematch.Grade offset(nodematch.Sex)
3.4419170 -2.0000000
aaaaaand. we're back. this time with a head-scratcher. i believe @chad-klumb 's code is working fine. and predict()
without offsets is working fine. and predict()
with offsets respects the offsets (i.e., p(tie) for -Inf offset values is 0).
but offsets (esp. -Inf offsets) change the number of possible edges from the number of dyads to something less. in the example below, no same-sex ties will be allowed. this reduces the possible edges to 1/2 of the total dyads (only M-F, and undirected in this case.
i don't think predict
takes this into account when calculating the p(tie). it does the right thing in terms of preventing same-sex ties. but doesn't calculate the probability of a tie for the possible ties correctly.
reprex below
data("faux.magnolia.high")
# first without offsets
fit <- ergm(faux.magnolia.high ~ edges
+ nodefactor("Grade")
+ nodematch("Grade")
+ nodematch("Sex"))
pnet <- predict(fit) # data frame
vnet <- bind_rows(tibble(vertex.names=pnet[,1], p=pnet[,3]),
tibble(vertex.names=pnet[,2], p=pnet[,3]))
deg <- vnet %>%
group_by(vertex.names) %>%
summarize(deg = sum(p))
egos <- ergm.ego::as.egodata(faux.magnolia.high)$egos %>%
mutate(vertex.names = as.numeric(vertex.names)) %>%
right_join(deg)
## roughly equal: stubs/2 = edges
sum(egos$deg)/2
summary(faux.magnolia.high ~ edges)
# now with offset - no samesex ties
fit <- ergm(faux.magnolia.high ~ edges
+ nodefactor("Grade")
+ nodematch("Grade")
+ offset(nodematch("Sex")),
offset.coef = (-Inf))
pnet <- predict(fit) # data frame
vnet <- bind_rows(tibble(vertex.names=pnet[,1], p=pnet[,3]),
tibble(vertex.names=pnet[,2], p=pnet[,3]))
deg <- vnet %>%
group_by(vertex.names) %>%
summarize(deg = sum(p))
egos <- ergm.ego::as.egodata(faux.magnolia.high)$egos %>%
mutate(vertex.names = as.numeric(vertex.names)) %>%
right_join(deg)
## not even close: stubs/2 != edges
sum(egos$deg)/2
summary(faux.magnolia.high ~ edges)
## let's look at the expected number of same and opp sex ties
pnet_sex <- fmhego$egos %>%
select(vertex.names, Sex) %>%
mutate(vertex.names = as.numeric(vertex.names)) %>%
right_join(pnet, by = c("vertex.names" = "tail" )) %>%
mutate(sex_tail = Sex) %>%
select (-Sex)
pnet_sex <- fmhego$egos %>%
select(vertex.names, Sex) %>%
mutate(vertex.names = as.numeric(vertex.names)) %>%
right_join(pnet_sex, by = c("vertex.names" = "head" )) %>%
mutate(sex_head = Sex) %>%
select (-Sex)
## prediction respects offsets in p(tie) = 0 for samesex ties
## but p(tie) not adjusted accordingly for possible ties
pnet_sex %>%
group_by(sex_tail, sex_head) %>%
summarize(count = n(),
pred = sum(p))
output of that last bit is:
# A tibble: 4 x 4
# Groups: sex_tail [2]
sex_tail sex_head count pred
<chr> <chr> <int> <dbl>
1 F F 276228 0
2 F M 240949 127.
3 M F 258065 140.
4 M M 224814 0
predict
shouldn't need to be adjusted to handle this
The problem is likely that the model is inconsistent with the data and only the MPLE is run by default since the model is dyad-independent.
> summary(faux.magnolia.high ~ nodemix("Sex"))
mix.Sex.F.F mix.Sex.F.M mix.Sex.M.M
430 285 259
while with the offset
> sum(egos$deg)/2
[1] 267.1421
which is consistent.
Forcing the main method
fit <- ergm(faux.magnolia.high ~ edges
+ nodefactor("Grade")
+ nodematch("Grade")
+ offset(nodematch("Sex")),
offset.coef = (-Inf),
control=control.ergm(force.main=TRUE))
pnet <- predict(fit) # data frame
vnet <- bind_rows(tibble(vertex.names=pnet[,1], p=pnet[,3]),
tibble(vertex.names=pnet[,2], p=pnet[,3]))
deg <- vnet %>%
group_by(vertex.names) %>%
summarize(deg = sum(p))
egos <- ergm.ego::as.egodata(faux.magnolia.high)$egos %>%
mutate(vertex.names = as.numeric(vertex.names)) %>%
right_join(deg)
## not even close: stubs/2 != edges
sum(egos$deg)/2
summary(faux.magnolia.high ~ edges)
also gives
> sum(egos$deg)/2
[1] 914.3843
> summary(faux.magnolia.high ~ edges)
edges
974
which again is consistent.
thx @chad-klumb. a couple of questions:
offsets are often used in glms, and these are typically dyad-independent models that rely on the estimation algorithm we use for MPLE. i thought that estimate was, in fact, the MLE in the glm setting. so why doesn't the MPLE work properly? shouldn't it still try to match the edge and nodefactor counts by redistributing the forbidden ties to allowed dyads?
i'm getting a similar result -- (sum(predicted edges) != sum(observed edges) -- in my predictions from my MCMLE ergm fits in the WHAMP context. do i understand correctly from your comment above that the MCMLE fits should produce the right predicted degree? if so, then i still have another problem to dx.
predict
may need to set maximum number of dyads or dyad types appropriately to ensure all dyads are represented in the result; this may explain your 2. if the return value from predict
has fewer rows than your network's dyadcount; this would generally be an issue for larger networks (but has nothing to do with offsets)
dyads with infinite offset effects are omitted from the MPLE:
https://github.com/statnet/ergm/blob/8b00975987aac64f4009279a88aaf51305e40bdb/R/ergm.pl.R#L143-L148
Thx. It looks like even in the faux.magnolia.high example, the total dyads returned by predict()
is less than n choose 2. Is that because predict maxes out at some fixed value? The # returned dyads is 1,000,056, 1461 choose 2 = 1,066,530. Maybe this explains why the sum(pred edge) < obs(edgecount).
And the number of dyads with n=15K (my WHAMP case) is much larger.
So, do we want to:
Sample from that huge space to get estimates of the predicted tie values. My concern is that the space is large, and some of our subgroups are small, so we'll miss them unless we stratify the sample, and that starts getting complicated.
Bump up the MPLE to sample all possible dyads. I'm not even sure this is possible.
If neither of these options is workable, then we're at a dead end, yes?
--
Since predict
propagates the ...
you can set the dyad types option as follows:
pnet <- predict(fit, control = control.ergm(MPLE.max.dyad.types=1.2e8))
The default is 1e6
. IIRC your network is 15k undirected so you've got under 1.2e8
dyads and hence the number of unique dyad types shouldn't need to exceed that.
If that runs without the warning
Warning message:
In ergm.pl(nw, fd, model, verbose = verbose, control = control, :
Too many unique dyads. MPLE is approximate, and MPLE standard errors are suspect.
then you should be good as far as getting all the dyads goes.
If results still don't match expectations after that, then there could be yet another problem.
I've updated the PR on ergm
to take all informative dyads automatically; once you upgrade to that version you should not need to set the control
argument anymore.
I would also add that since predict
adds indices
to the formula each (informative) dyad will have its own unique type, so the number of unique dyad types we want to keep is the same as the number of informative dyads.
Thx @chad-klumb. I'm guessing the number of indices is small when we're working with factor level covariates, and rises quickly with continuous covariates. Downloading from your repo now, will let you know what happens.
@chad-klumb, still getting the wrong edgecount with offsets from your latest ergm commit, though the MPLE warning is gone. but i can see now what it's doing:
> mixingmatrix(faux.magnolia.high, "Sex")
Note: Marginal totals can be misleading
for undirected mixing matrices.
F M
F 430 285
M 285 259
when you use the offset to prohibit samesex ties, sum(predicted tie) = 285.
> fit <- ergm(faux.magnolia.high ~ edges
+ + nodefactor("Grade")
+ + nodematch("Grade")
+ + offset(nodematch("Sex")),
+ offset.coef = (-Inf))
Starting maximum pseudolikelihood estimation (MPLE):
Evaluating the predictor and response matrix.
Maximizing the pseudolikelihood.
Finished MPLE.
Stopping at the initial estimate.
Evaluating log-likelihood at the estimate.
> pnet <- predict(fit) # data frame
> vnet <- bind_rows(tibble(vertex.names=pnet[,1], p=pnet[,3]),
+ tibble(vertex.names=pnet[,2], p=pnet[,3]))
> deg <- vnet %>%
+ group_by(vertex.names) %>%
+ summarize(deg = sum(p))
`summarise()` ungrouping output (override with `.groups` argument)
> egos <- ergm.ego::as.egodata(faux.magnolia.high)$egos %>%
+ mutate(vertex.names = as.numeric(vertex.names)) %>%
+ right_join(deg)
Joining, by = "vertex.names"
> sum(egos$deg)/2
[1] 285
> summary(faux.magnolia.high ~ edges)
edges
974
So, is it deleting the forbidden dyads at the start, leaving an edgecount of 285 in the residual dyads for the model fit? If so, this presumably would only delete observed edges if they existed in the forbidden dyads. They do in faux.magnolia. But they don't in my WHAMP networks -- will try those next.
on the WHAMP network: the object appears to be too big at some point along the calculation
> pnet <- predict(fit_casl) # data frame
Error in ergm.pl(nw, fd, model, verbose = verbose, control = control, :
long vectors (argument 13) are not supported in .C
It exceeds a limitation of the C API used by master
.
It may be possible to get it working with dev
. Other issues with exceeding 32 bit limits may also arise.
I have made a minimal set of changes to a dev
branch and they superficially appear to be working for a toy model.
The install stack is
remotes::install_github(c("statnet/rle",
"statnet/statnet.common",
"statnet/network",
"statnet/ergm-private@predict_dev_test",
"chad-klumb/ergm.ego@master"), dependencies=FALSE, force=TRUE)
An example usage is
require(ergm)
netsize <- 15000
nw <- network.initialize(netsize,dir=F)
nw <- san(nw ~ edges, target.stats=netsize)
nw %v% "attr" <- sample(letters[1:7], netsize, TRUE)
fit <- ergm(nw ~ edges + nodemix("attr", levels2=-1))
pnet <- predict(fit)
No control argument should be needed.
It will take a fair amount of memory and at least several minutes of runtime so I would use the Rstudio servers.
The faux.magnolia
example should not be expected to have predicted total # edges = observed total # edges
when the MPLE is used as the coefficient estimate for the reasons explained above; the WHAMP stuff presumably should not have the same problem.
The issue is essentially about ERGMs with offset with ergm.ego-s as a special case of interest. I'm moving the issue to ergm
.
I've updated the PR on
ergm
to take all informative dyads automatically; once you upgrade to that version you should not need to set thecontrol
argument anymore.
@chad-klumb has it been merged by Pavel? I don't think I see it in the history.
i don't think
predict
takes this into account when calculating the p(tie). it does the right thing in terms of preventing same-sex ties. but doesn't calculate the probability of a tie for the possible ties correctly.
@martinamorris I'm trying to understand what should be the expected behavior in case of models with offsets and -Inf coefs. A small-scale example so that probs can be worked-out by hand:
# 4 nodes; 6 dyads
net <- network.initialize(4, directed=FALSE)
net %v% "a" <- c(1,1,2,2)
# Within 1: log(20/12); P=20/32=5/8
# Within 2: log(20/12) + 2*log(1/2) = log(20/12 * 1/4) = log(5/12); P=5/17
# Between: log(20/12) + log(1/2) = log(5/6); P=5/11
d1 <- predict(
net ~ edges + nodefactor("a"),
theta = c(log(20/12), log(1/2))
)
d1 # as computed above
## tail head p
## 1 2 3 0.4545455
## 2 1 4 0.4545455
## 3 3 4 0.2941176
## 4 1 2 0.6250000
## 5 1 3 0.4545455
## 6 2 4 0.4545455
d2 <- predict(
net ~ edges + nodefactor("a") + nodematch("a"),
theta = c(log(20/12), log(1/2), -Inf)
)
d2 # nodematch offset does not affect between group dyads
## tail head p
## 1 1 2 0.0000000
## 2 1 3 0.4545455
## 3 1 4 0.4545455
## 4 2 3 0.4545455
## 5 2 4 0.4545455
## 6 3 4 0.0000000
# identical to previous
d3 <- predict(
net ~ edges + nodefactor("a") + offset(nodematch("a")),
theta = c(log(20/12), log(1/2), -Inf)
)
# All together now:
dplyr::full_join(d1, d2, by=c("head", "tail")) %>%
dplyr::full_join(d3, by=c("head", "tail"))
## tail head p.x p.y p
## 1 2 3 0.4545455 0.4545455 0.4545455
## 2 1 4 0.4545455 0.4545455 0.4545455
## 3 3 4 0.2941176 0.0000000 0.0000000
## 4 1 2 0.6250000 0.0000000 0.0000000
## 5 1 3 0.4545455 0.4545455 0.4545455
## 6 2 4 0.4545455 0.4545455 0.4545455
So the -Inf nodematch offset zeroes the within-group tie probabilities but does not affect anything else because nodematch is 0 for between-group dyads and 0 * -Inf = 0 (on the log scale). Or am I misunderstanding something?
I've updated the PR on
ergm
to take all informative dyads automatically; once you upgrade to that version you should not need to set thecontrol
argument anymore.@chad-klumb has it been merged by Pavel? I don't think I see it in the history.
I believe it was merged, yes.
predict
may need to set maximum number of dyads or dyad types appropriately to ensure all dyads are represented in the result; this may explain your 2. if the return value frompredict
has fewer rows than your network's dyadcount; this would generally be an issue for larger networks (but has nothing to do with offsets)dyads with infinite offset effects are omitted from the MPLE:
https://github.com/statnet/ergm/blob/8b00975987aac64f4009279a88aaf51305e40bdb/R/ergm.pl.R#L143-L148
predict()
uses xmat.full
so the result always has all the dyads
If MPLE.max.dyad.types
is exceeded then the sampling stops, and you won't get all the (free) dyads; it doesn't matter which part of the return value is used.
I assume (and correct me if I'm wrong) that the intent was to predict tie probabilities for all free dyads.
Since indices
is a term in the formula passed to the MPLE, every free dyad has its own unique type, so the number of unique dyad types required is the same as the number of free dyads, which can be larger than the default MPLE.max.dyad.types
of 1e6
.
If
MPLE.max.dyad.types
is exceeded then the sampling stops, and you won't get all the (free) dyads; it doesn't matter which part of the return value is used.
Ah, right. You are correct.
I assume (and correct me if I'm wrong) that the intent was to predict tie probabilities for all free dyads.
Yes, the intention is to have all by default.
Since
indices
is a term in the formula passed to the MPLE, every free dyad has its own unique type, so the number of unique dyad types required is the same as the number of free dyads, which can be larger than the defaultMPLE.max.dyad.types
of1e6
.
Yep.
Also (and maybe this was a point of confusion), the two parts of my response to Martina that you quoted above were responding to different points she had raised; the thing about dyads with infinite offset effects being omitted from the MPLE was addressing her 1. below, while the thing about setting maximum number of dyad types was addressing her 2. below.
thx @chad-klumb. a couple of questions:
- offsets are often used in glms, and these are typically dyad-independent models that rely on the estimation algorithm we use for MPLE. i thought that estimate was, in fact, the MLE in the glm setting. so why doesn't the MPLE work properly? shouldn't it still try to match the edge and nodefactor counts by redistributing the forbidden ties to allowed dyads?
- i'm getting a similar result -- (sum(predicted edges) != sum(observed edges) -- in my predictions from my MCMLE ergm fits in the WHAMP context. do i understand correctly from your comment above that the MCMLE fits should produce the right predicted degree? if so, then i still have another problem to dx.
Yes, apparently... Sorry!
No worries; I just wanted to be sure I wasn't confusing anyone (including myself!).
So to summarise where we stand:
ergm.pl()
or ergmMPLE()
as predict.formula()
basically digests their outputpredict.formula()
too look at dyadcount of the supplied network and possibly adjust the MPLE.max.dyad.types
control option. Or perhaps more preferably: just check one against the other and give a warning/suggestion to the user to set it manually.There are still some points to consider regarding predict
. I'm putting them here since this is the running predict
issue atm. We don't have to do all of this for 3.11
, though.
Currently predict
does not accept or propagate (from the fits) the constraints
; this may lead to incorrect predictions for dyads that are fixed by the constraints
formula but not (also) by infinite offsets in the model formula.
Beyond this there is the issue of dyad-dependent constraints; I believe we do not currently have a dyad-dependent "conditionally free dyads" API, so even if constraints
was handled by predict
, if someone had e.g. ~bd(maxout=1)
but no corresponding offset in the generative model, they may wind up with incorrect predictions for "conditionally structural zeroes".
As currently implemented I don't think there's a way for the user to override the (new) default of getting all the free dyads; we may wish to modify the implementation so the user can pass a different number of dyads to sample, if desired (with all free dyads remaining the default).
In master
, limitations of the .C
API mean that we can't have (# free dyads) x (# changestats)
exceed 32 bit limits. We may wish to check for this before the .C
call and give the user an informative error if it's going to occur (since the long vector
error from R
isn't optimally clear).
In dev
(which uses .Call
instead of .C
), we can exceed this limit, but we will run into problems with 32 bit variables in the MPLE
C code. If the number of free dyads itself exceeds 32 bit limits, we will also need to increase MPLE.samplesize
, which defaults to .Machine$integer.max
(32 bit max), and change how it's handled by the C code (since it can't be a 32 bit integer type anymore). This is part of a more general problem of some things not working beyond 32 bit limits, even if we may quite reasonably want to exceed those (but fixing that is a 4.0
or beyond issue, not a 3.11
issue).
The MPLE
hash table machinery bogs down considerably as the number of unique sampled dyad types approaches the maximum number of unique dyad types. Currently, the maximum number of unique dyads types is capped at the number of free dyads in ergm.pl
(note that d
is the number of free dyads):
https://github.com/statnet/ergm/blob/749563012d464899d017c4142b9c90541ff69992/R/ergm.pl.R#L44-L45
So, in typical predict
usage, we will always completely fill the hash table (just as we finish sampling the free dyads), since indices
gives each dyad its own unique type. When the number of free dyads is large, things become extremely slow as the final small fraction of free dyads is sampled. Increasing the maximum number of unique dyad types (i.e. the number of rows in the hash table) beyond what is actually needed can avoid this significant slowdown; in a simple test case with 15k nodes and just edges
as the generative model, current master
had not finished the predict
call in 2 hours, while a test version changing only min(...,d)
to min(...,2*d)
in the quoted code above finished predict
in ~ 1 minute. Doubling the number of rows may be more than is needed, but it demonstrates the point.
From my (admittedly fast) read of this thread, the original issue in ergm
master
has been resolved, so this is no longer a 3.11 release blocker. Is that right?
I believe the original issue has been resolved.
I would consider the linked PRs for inclusion in 3.11; they are small and basically independent of each other, so should be easy to review quickly, and I think they significantly improve performance/user friendliness.
With or without those PRs, I would say this is no longer blocking.
I think so too.
@chad-klumb perhaps it is worthwhile to put:
There are still some points to consider regarding
predict
. I'm putting them here since this is the runningpredict
issue atm. We don't have to do all of this for3.11
, though. (...)
in separate issues for further reference down the road?
Okay, I think this issue has run its course so I'm closing it now.
I will post remaining todos elsewhere (possibly in dev
) at a later date.
Implementpredict.ergm.ego()
which will compute predicted probabilities for the pseudo-population network.