Collinearity diagnostics for ergms

martinamorris commented 5 years ago

Putting this here for now:

Date: Fri, 19 Apr 2019 17:42:16 +0000 From: "Duxbury, Scott W." duxbury.5@buckeyemail.osu.edu To: "morrism@uw.edu" morrism@uw.edu Subject: Collinearity diagnostic for ERGM Hi Martina,

It was great to meet you today. Thanks again for the talk--it was very interesting. Attached is my paper developing a collinearity diagnostic for ERGM, and linked at the end of this email is my GitHub repository with R code for implementing the measure.

Best,

Scott Duxbury PhD Candidate Department of Sociology The Ohio State University

https://github.com/sduxbury/vif-ergm/blob/master/Collinearity%20test%20for%20ERGM.R R function to detect multicollinearity in ERGM. Contribute to sduxbury/vif-ergm development by creating an account on GitHub.

Michal's remaining TODO items

[x] Do not assume there is edges in the model
[x] Do not assume the edges is a first term in the formula
[x] Implement GVIFs
[ ] Add examples
[ ] Documentation
[ ] Now source= (defaults to "model") is passed to vcov.ergm(). Let it optionally accept a positive integer so that the variance-covariance matrix is based on a sample sufficient statistics from that many networks (as in fact suggested in Duxbury 2018)
[ ] Iron-out print.vif_ergm().

mbojan commented 4 years ago

@martinamorris , i'm on it.

krivit commented 4 years ago

Not quite VIF, but it's a start. I still need to implement tests for it.

mbojan commented 4 years ago

I started working on this on mbojan/ergm@i88-vif but had to pause because the end of semester cought up with me.

I do have a question though @krivit @martinamorris . To calculate VIFs Scott's prototype code uses fit@vcov but the paper advertises rather using the correlation matrix based on simulating networks from the model. These are not the same. What's actually the proper approach here?

martinamorris commented 4 years ago

@krivit will need to answer this one :)

krivit commented 4 years ago

Are they inverses of each other (exact or approximate), by any chance?

krivit commented 3 years ago

For now, I think we need to resolve several issues before incorporating the code, if we want to do that in the first place.

Package car has a vif generic and a vif.default() method that works for any object that can return a model.matrix. However, it makes an assumption that the intercept coefficient is named "(Intercept)", which doesn't apply in our case. The current implementation of vif_ergm makes an assumption that the intercept effect is the first effect in the model. This is not necessarily the case.

We can autodetect the intercept (or the set of statistics that are collectively the intercept) on the MPLE predictor matrix, by finding those columns for which a linear combination exists that produces a vector of 1s. (I am surprised that car::vif.default() doesn't do that.) I am not sure how we would do that with the MCMC sample, without attaching an edge count/sum statistic to the returned change statistics (which may not be that bad an idea).

Given that, one can use vcov(ergm.fit, sources="model") to get the variance-covariance matrix of the parameters.

Anyway, I don't think it's something we need to incorporate into 3.11.

mbojan commented 3 years ago

Pulling car as a dependency just for the generic seems like an overkill to me, it is a rather big package with dependencies unrelated to ergm (e.g. lme4, maptools, quantreg and so on). Moreover, IMHO the car:::vif.default() method is not "default" enough for us to mold an ergm object to take advantage of it: we do not have model.matrix() method, there are no terms in a sense compatible with stats::glm() and so on. It would be nice to have a generic generic vif(object, ...) UseMethod("vif") in stats or generics (the latter is not promising...). At this moment I am opting for a dedicated vif_ergm() (or other name if anybody has suggestions).

Starting with Scotts code I ended-up rewriting the code from scratch in order to:

do not assume that there is edges term in the formula
do not assume that the edges term is a first term in the formula
implement generalized VIFs that are calculated per term -- a single value for, say, nodefactor() term even if it has several parameters for different levels.

The usefulness and sensibility of the VIFs in the ERGM context is still to be proven, especially vis a vis endogeneous terms, diverse model specifications and so on. I have decided to give the user as much control as possible over how the VCov matrix is subsetted before actual VIF calculation. That involves, among other things, removing rows/cols corresponding to parameters of provided term(s). I think we need a canonical "term labeler" (hence #220) because it is needed here, and most likely in other places, to:

Produce pretty term (not coef) labels for output
The way for a user to refer to terms of a fitted model

I have implemented a naive labeler which basically returns the deparsed call from the model formula.

Comments, suggestions, alternative ideas are more than welcome, @krivit @martinamorris @drh20drh20 @CarterButts and whomever it may concern.

This is not yet merged and with several rough edges, but see it in action in the examples below:

data("faux.mesa.high")
# Largish model formula with more than one term of the same type
frm <- faux.mesa.high ~ nodefactor("Grade") + nodefactor("Race") +
  nodefactor("Sex") + nodematch("Grade",diff=TRUE) +
  nodematch("Race",diff=TRUE, levels=-c(1,4)) + nodematch("Sex",diff=FALSE) + edges
fit <- ergm(frm)

and now (print.vif_ergm() still to be developed):

No terms are dropped

vif_ergm_new(fit)
## $vif_coef
##                     coef        vif df  vif_sqrt
## 1     nodefactor.Grade.8  62.119735  1  7.881607
## 2     nodefactor.Grade.9  44.932046  1  6.703137
## 3    nodefactor.Grade.10  22.078901  1  4.698819
## 4    nodefactor.Grade.11  32.952449  1  5.740422
## 5    nodefactor.Grade.12  15.890923  1  3.986342
## 6   nodefactor.Race.Hisp  22.637335  1  4.757871
## 7  nodefactor.Race.NatAm  17.970439  1  4.239155
## 8  nodefactor.Race.Other   1.066925  1  1.032921
## 9  nodefactor.Race.White   4.130202  1  2.032290
## 10      nodefactor.Sex.M   2.390419  1  1.546098
## 11     nodematch.Grade.7  97.139873  1  9.855956
## 12     nodematch.Grade.8  15.733629  1  3.966564
## 13     nodematch.Grade.9   6.726168  1  2.593486
## 14    nodematch.Grade.10   3.335544  1  1.826347
## 15    nodematch.Grade.11   5.797696  1  2.407840
## 16    nodematch.Grade.12   2.578454  1  1.605756
## 17   nodematch.Race.Hisp   6.151197  1  2.480161
## 18  nodematch.Race.NatAm   5.384731  1  2.320502
## 19  nodematch.Race.White   1.487202  1  1.219509
## 20         nodematch.Sex   2.864719  1  1.692548
## 21                 edges 303.297061  1 17.415426
## 
## $vif_term
##                                                term          vif df  vif_sqrt
## 1                               nodefactor("Grade") 2.465512e+05  5  3.460913
## 2                                nodefactor("Race") 6.849846e+02  4  2.261831
## 3                                 nodefactor("Sex") 2.390419e+00  1  1.546098
## 4                   nodematch("Grade", diff = TRUE) 4.345004e+05  6  2.950071
## 5 nodematch("Race", diff = TRUE, levels = -c(1, 4)) 3.817385e+01  3  1.834964
## 6                    nodematch("Sex", diff = FALSE) 2.864719e+00  1  1.692548
## 7                                             edges 3.032971e+02  1 17.415426
## 
## attr(,"class")
## [1] "ergm_vif"

Drop `edges` term

(probably what one would do for this model)

vif_ergm_new(fit, drop_terms = "edges")
## $vif_coef
##                     coef       vif df vif_sqrt
## 1     nodefactor.Grade.8 49.952154  1 7.067684
## 2     nodefactor.Grade.9 36.315347  1 6.026222
## 3    nodefactor.Grade.10 19.389272  1 4.403325
## 4    nodefactor.Grade.11 28.278523  1 5.317755
## 5    nodefactor.Grade.12 14.339479  1 3.786750
## 6   nodefactor.Race.Hisp  9.978772  1 3.158920
## 7  nodefactor.Race.NatAm  9.322576  1 3.053289
## 8  nodefactor.Race.Other  1.061432  1 1.030258
## 9  nodefactor.Race.White  3.351882  1 1.830815
## 10      nodefactor.Sex.M  1.100157  1 1.048884
## 11     nodematch.Grade.7 61.414244  1 7.836724
## 12     nodematch.Grade.8 13.215291  1 3.635284
## 13     nodematch.Grade.9  5.959511  1 2.441211
## 14    nodematch.Grade.10  3.186973  1 1.785210
## 15    nodematch.Grade.11  5.332741  1 2.309273
## 16    nodematch.Grade.12  2.507388  1 1.583473
## 17   nodematch.Race.Hisp  4.518622  1 2.125705
## 18  nodematch.Race.NatAm  4.203501  1 2.050244
## 19  nodematch.Race.White  1.459541  1 1.208115
## 20         nodematch.Sex  1.021836  1 1.010859
## 
## $vif_term
##                                                term          vif df vif_sqrt
## 1                               nodefactor("Grade") 91630.055413  5 3.134756
## 2                                nodefactor("Race")    21.798404  4 1.469952
## 3                                 nodefactor("Sex")     1.100157  1 1.048884
## 4                   nodematch("Grade", diff = TRUE) 89456.359111  6 2.586034
## 5 nodematch("Race", diff = TRUE, levels = -c(1, 4))    18.958127  3 1.632924
## 6                    nodematch("Sex", diff = FALSE)     1.021836  1 1.010859
## 
## attr(,"class")
## [1] "ergm_vif"

Drop terms `edges` and `nodematch` for `Grade`

vif_ergm_new(fit, drop_terms = c("edges", 'nodematch("Grade", diff = TRUE)'))
## $vif_coef
##                     coef      vif df vif_sqrt
## 1     nodefactor.Grade.8 3.264765  1 1.806866
## 2     nodefactor.Grade.9 4.165662  1 2.040995
## 3    nodefactor.Grade.10 4.155404  1 2.038481
## 4    nodefactor.Grade.11 3.995176  1 1.998794
## 5    nodefactor.Grade.12 4.079305  1 2.019729
## 6   nodefactor.Race.Hisp 9.642242  1 3.105196
## 7  nodefactor.Race.NatAm 9.152473  1 3.025305
## 8  nodefactor.Race.Other 1.047986  1 1.023712
## 9  nodefactor.Race.White 3.173251  1 1.781362
## 10      nodefactor.Sex.M 1.040865  1 1.020228
## 11   nodematch.Race.Hisp 4.402948  1 2.098320
## 12  nodematch.Race.NatAm 4.149227  1 2.036965
## 13  nodematch.Race.White 1.425276  1 1.193849
## 14         nodematch.Sex 1.019944  1 1.009923
## 
## $vif_term
##                                                term       vif df vif_sqrt
## 1                               nodefactor("Grade")  1.033272  5 1.003278
## 2                                nodefactor("Race") 18.288827  4 1.438047
## 3                                 nodefactor("Sex")  1.040865  1 1.020228
## 4 nodematch("Race", diff = TRUE, levels = -c(1, 4)) 17.887558  3 1.617181
## 5                    nodematch("Sex", diff = FALSE)  1.019944  1 1.009923
## 
## attr(,"class")
## [1] "ergm_vif"

mbojan commented 3 years ago

We can autodetect the intercept (or the set of statistics that are collectively the intercept) on the MPLE predictor matrix, by finding those columns for which a linear combination exists that produces a vector of 1s. (I am surprised that car::vif.default() doesn't do that.) I am not sure how we would do that with the MCMC sample, without attaching an edge count/sum statistic to the returned change statistics (which may not be that bad an idea).

I like that a lot! It will be easy to integrate as I'd only need the vector of coef/term names of these columns.

krivit commented 3 years ago

Technically, we can Enhances car---though the user would need to install it to get the VIF capability, for no clear reason.

No real comments from me. If people think it's useful, and it doesn't take up too much additional space or complicate the code excessively, it may as well go in. We really should do the whole modularisation (#186) sooner than later, though, and then we'll have a relatively lightweight package into which we can incorporate all the diagnostics utilities we can imagine.

mbojan commented 3 years ago

Technically, we can Enhances car---though the user would need to install it to get the VIF capability, for no clear reason.

Yeah... so not really a solution. Not to mention inflated R CMD check time because of all additional deps...

No real comments from me. If people think it's useful, and it doesn't take up too much additional space or complicate the code excessively, it may as well go in. We really should do the whole modularisation (#186) sooner than later, though, and then we'll have a relatively lightweight package into which we can incorporate all the diagnostics utilities we can imagine.

It is a rather isolated piece with no extra dependencies so I guess there will be no problem to move it around when the time comes.

An alternative to drop_terms expecting a character vector is a formula, just like gof(GOF=) argument.

statnet / ergm