statnet / lolog

Latent Order Logistic (LOLOG) Graph Models

time-consuming for the large network #3

Closed shangyuan232 closed 6 years ago

shangyuan232 commented 6 years ago

My data contains 455 edges. When I run nega.03 <- lolog(nega ~ edges + mutual() + triangles(), verbose=TRUE), it has already been running for more than an hour and hasn't finished. So I'm wondering whether LOLOG is simply time-consuming, even though it avoids some of the degeneracy problems present in ERGM. It gives no indication of how long the fit will take.

Initializing Using Variational Fit
Initial Theta: -8.244272 6.009072 2.414755
Iteration 1
Drawing 1000 Monte Carlo Samples:
|=================================================================================================================================| 100%
Objective: 1330.48
Hotelling's T2 p-value: < 1e-05

...

Iteration 45
Drawing 1000 Monte Carlo Samples:
|=================================================================================================================================| 100%
Objective: 3.633848
Half Step Back
Iteration 46
Drawing 1000 Monte Carlo Samples:
|==============================================================| 71%

ifellows commented 6 years ago

Thanks for your feedback. The package should fit a model like this in under a few minutes for a network of ~1,000 nodes. We are in early days yet, with the package and paper just released. Bear with us while we tackle the computational and modeling challenges of diverse datasets.

From your objective function, it looks like the algorithm has very nearly converged, but may be having difficulty with the last inch.

Can you please run the following so that I can better diagnose the problem?

print(nega)
calculateStatistics(nega ~ edges() + mutual() + triangles() + star(2:4,"in") + star(2:4,"out") )
as.BinaryNet(nega)$size()

shangyuan232 commented 6 years ago

Hi, please see below.

print(nega)
Network attributes:
  vertices = 759
  directed = TRUE
  hyper = FALSE
  loops = TRUE
  multiple = TRUE
  bipartite = FALSE
  total edges= 455
    missing edges= 0
    non-missing edges= 455

Vertex attribute names: betweencentra clustercoef followers jtd pagerank tweets vertex.names

No edge attributes

calculateStatistics(nega ~ edges() + mutual() + triangles() + star(2:4,"in") + star(2:4,"out") )
     edges     mutual  triangles  in-star.2  in-star.3  in-star.4 out-star.2 out-star.3 out-star.4
       183         20         76        302       1062       3558        171        181        157
as.BinaryNet(nega)$size()
[1] 759

shangyuan232 commented 6 years ago

And it has now finished, with an error:

Iteration 80
Drawing 1000 Monte Carlo Samples:
|=================================================================================================================================| 100%
Objective: 3.328825
Error in solve.default(var(transformedDiffs)/nrow(transformedDiffs)) : system is computationally singular: reciprocal condition number = 1.47691e-16

ifellows commented 6 years ago

Hmm, something is very wrong there. lolog thinks the network only has 183 edges, and there are almost as many triangles as are possible given the number of edges.

Ah, your network has loops and multiple edges between the same nodes. These are not allowed in lolog modeling (or ergm for that matter). We should be checking for these and throwing an error.

Does your network actually have these features, or are these data processing artifacts?
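
For readers hitting the same problem, here is a minimal base R sketch of how loops and repeated edges could be spotted and dropped before fitting. It assumes the data are available as a two-column edge list; the data frame `el` and its values are made up for illustration and are not the tweet network from this issue.

```r
# Hypothetical directed edge list (sender, receiver); not the data from this issue.
el <- data.frame(
  from = c(1, 1, 2, 3, 3, 4),
  to   = c(2, 2, 1, 3, 4, 1)
)

# A self-loop is an edge whose sender and receiver are the same node (here 3 -> 3).
is_loop <- el$from == el$to
n_loops <- sum(is_loop)

# A multi-edge is a directed pair that appears more than once (here 1 -> 2 twice).
# Note that 1 -> 2 and 2 -> 1 are different directed edges, so they are not duplicates.
n_dup <- sum(duplicated(el))

# Simple directed network: drop loops, then keep one copy of each directed pair.
simple_el <- unique(el[!is_loop, ])

cat("loops:", n_loops, " repeated edges:", n_dup,
    " edges kept:", nrow(simple_el), "\n")
```

If both counts are zero, the edge list is already a simple directed graph. Collapsing repeated edges throws away how often two accounts interacted, so whether that is acceptable depends on the research question.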

shangyuan232 commented 6 years ago

Yes, my data is a tweet conversation network, which actually has these features. I have now removed the loops, but it is still very slow, and after half an hour it throws the following error.

Iteration 17
Drawing 1000 Monte Carlo Samples:
|=================================================================================================================================| 100%
Objective: 2.366014
Error in solve.default(var(transformedDiffs)/nrow(transformedDiffs)) : Lapack routine dgesv: system is exactly singular: U[3,3] = 0

What does this mean?

The details you asked for earlier are now as follows.

print(nega)
Network attributes:
  vertices = 759
  directed = TRUE
  hyper = FALSE
  loops = FALSE
  multiple = FALSE
  bipartite = FALSE
  total edges= 183
    missing edges= 0
    non-missing edges= 183

Vertex attribute names: betweencentra clustercoef followers jtd pagerank tweets vertex.names

No edge attributes

calculateStatistics(nega ~ edges() + mutual() + triangles() + star(2:4,"in") + star(2:4,"out") )
     edges     mutual  triangles  in-star.2  in-star.3  in-star.4 out-star.2 out-star.3 out-star.4
       183         20         76        302       1062       3558        171        181        157
as.BinaryNet(nega)$size()
[1] 759

ifellows commented 6 years ago

Thank you for sharing. This network has some very strong features. The mean degree is very small (0.25), so most of the nodes are isolates. Despite this, the two-star terms are very large, which might indicate that the network is dominated by a few high-degree nodes and/or is scale-free.
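
As a rough check on these numbers, the base R sketch below recomputes the mean degree from the counts posted above and compares the observed in-2-star count with what a uniformly random directed graph of the same size and density would give. The random-graph baseline is an assumption added here for illustration; it is not something lolog computes or that was used in this thread.

```r
# Counts reported earlier in this thread.
n_nodes     <- 759
n_edges     <- 183
in_two_star <- 302

# Mean out-degree (equal to mean in-degree) of a directed network.
mean_out_degree <- n_edges / n_nodes

# Density of a directed network without loops.
density <- n_edges / (n_nodes * (n_nodes - 1))

# Assumed baseline: every possible edge present independently with probability `density`.
# Each node then has choose(n_nodes - 1, 2) possible pairs of senders, and each pair
# forms an in-2-star with probability density^2.
expected_in2 <- n_nodes * choose(n_nodes - 1, 2) * density^2

cat("mean out-degree:", round(mean_out_degree, 3), "\n")
cat("expected in-2-stars under the baseline:", round(expected_in2, 1), "\n")
cat("observed / expected:", round(in_two_star / expected_in2, 1), "\n")
```

Under this baseline the expected in-2-star count is far below the observed 302, which is in line with the reading that a few nodes receive a large share of the edges.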

What I'd like to recommend for this dataset is to use the preferentialAttachment() and sharedNbrs() terms, as in the last example of the paper. The purpose of these terms is to better model heavy-tailed degree and ESP (edgewise shared partner) distributions. However, these are only available for undirected networks as of now.

I was able to obtain a model fit by including an in-2-star term:

# make an empty placeholder network
m <- matrix(,ncol=2,nrow=0)
net <- new(DirectedNet,m, 759)

# fit the model using the observed nega statistics as targets
cl <- parallel::makeCluster(5)
fit <- lolog(net ~ edges + mutual() + star(2,"in"),
             theta = c(-8.28905038440779, 7.06554724165195, 0.579869722773685),
             targetStats = c(183, 20, 302),
             verbose = 3, nsamp = 4000, cluster = cl)

Here is the summary table and the goodness of fit on triangles:

> summary(fit)
          observed_statistics      theta        se pvalue
edges                       0 -8.2890504 0.1454981 0.0000
mutual                      0  7.0655472 0.3036939 0.0000
in-star.2                   0  0.5798697 0.2566009 0.0238
> gofit(fit, net ~  triangles(), nsim = 1000)
          obs min mean max    pvalue
triangles   0   0 0.21  40 0.9234049

However, it looks like the in-star term is highly skewed, so I'd recommend against its use here.

pairs(fit$stats,pch=".")

[image: output of the pairs() call above]

I'd appreciate it if you could share this network with me to help me ensure that lolog can be applied to basically any network (ian@fellstat.com).

ifellows commented 6 years ago

Checks for multi-edges and loops have been added (#16). Preferential attachment has been extended to directed networks (#16), leading to a good fit for this network.