statnet / ergm

Fit, Simulate and Diagnose Exponential-Family Models for Networks

as.network.numeric() should avoid allocating an n*n sociomatrix. #210

Closed krivit closed 1 year ago

krivit commented 3 years ago
library(ergm) # Any current version.
dummy <- network.initialize(1e6) # Works.
dummy <- as.network(1e6, density=0) # Memory error due to trying to allocate a sociomatrix.
#> Error: cannot allocate vector of size 7450.6 Gb

Created on 2020-12-24 by the reprex package (v0.3.0)

krivit commented 1 year ago

@AryaKarami, in terms of options, one was to "outsource" to sna::rgraph() if sna were installed (testing via requireNamespace()). The other was to roll our own, using some of these approaches:

  1. Generate a vector of "gaps" between successive edge indices using rgeom(), then use cumsum() to compute the dyad indices and translate them to tails and heads.
  2. Draw the total number of edges from a binomial distribution (or use the number passed in numedges), then select that many dyad indices using sample.int() and translate them to tails and heads.
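To make the two approaches concrete, here is a minimal sketch for a directed network without loops. The function names (dyads_to_edgelist, sample_dyads_gaps, sample_dyads_binom) are hypothetical and not part of ergm, network, or sna; the dyad numbering is just one convenient choice.

```r
# Sketch only: all function names here are hypothetical, not ergm/network API.
# Dyads of a directed, loopless network on n nodes are numbered 1..n*(n-1).

# Translate dyad indices into (tail, head) pairs, skipping self-loops.
dyads_to_edgelist <- function(d, n) {
  k <- d - 1                    # zero-based dyad index
  tail <- k %/% (n - 1) + 1     # each tail owns n - 1 consecutive dyads
  h <- k %% (n - 1) + 1         # position among that tail's possible heads
  head <- h + (h >= tail)       # shift past the tail itself to avoid a loop
  cbind(tail = tail, head = head)
}

# Approach 1: geometric "gaps" between successive edge indices, then cumsum().
sample_dyads_gaps <- function(N, p) {
  if (p <= 0) return(numeric(0))
  idx <- numeric(0)
  last <- 0
  repeat {
    # Draw a batch somewhat larger than the expected number of remaining edges.
    m <- ceiling((N - last) * p + 5 * sqrt((N - last) * p) + 10)
    pos <- last + cumsum(rgeom(m, p) + 1)
    idx <- c(idx, pos[pos <= N])
    last <- pos[m]
    if (last > N) break         # ran past the final dyad, so we are done
  }
  idx
}

# Approach 2: binomial edge count (or a fixed numedges), then sample.int().
sample_dyads_binom <- function(N, p, numedges = NULL) {
  m <- if (is.null(numedges)) rbinom(1, N, p) else numedges
  sample.int(N, m)              # sampling without replacement
}

n <- 1000
N <- n * (n - 1)
el <- dyads_to_edgelist(sample_dyads_binom(N, 0.001), n)
```

Neither function allocates anything of size N; memory use is proportional to the number of edges drawn.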

I didn't mention Approach 2 during our meeting, but it occurred to me after.

mbojan commented 1 year ago

Oh yes please. I hit that wall myself in the past.

Does it make more sense to have it in network rather than ergm?

krivit commented 1 year ago

@AryaKarami, I think Approach 2 will probably work better, because it reduces the density case to the numedges case.

Also, since a double can apparently represent every integer exactly only up to 2^53, the code needs to check that the number of dyads (potential edges) in the network does not exceed that number, and produce an error otherwise.
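A rough sketch of such a check follows. The function name count_dyads and its arguments are hypothetical; here bipartite is assumed to be the size of the first mode (0 for a unipartite network), and bipartite networks are assumed undirected.

```r
# Sketch only: count_dyads() is a hypothetical helper, not an ergm function.
count_dyads <- function(n, directed = TRUE, bipartite = 0) {
  n <- as.numeric(n)            # avoid integer overflow in n * (n - 1)
  N <- if (bipartite > 0) {
    bipartite * (n - bipartite) # one dyad per cross-mode pair
  } else if (directed) {
    n * (n - 1)
  } else {
    n * (n - 1) / 2
  }
  if (N > 2^53)
    stop("Too many dyads to index exactly with a double.")
  N
}

count_dyads(1e6)                # 1e12 - 1e6 dyads: fine
```

With this check in place, a directed network on 1e9 nodes (about 1e18 dyads) would be rejected up front rather than silently sampling from inexact indices.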

krivit commented 1 year ago

> Does it make more sense to have it in network rather than ergm?

Possibly. At the same time, I think to include it in network, we should have it work for all possible situations that network will support, including hypergraphs and multigraphs, whereas in ergm we are justified in only supporting the cases ergm() handles. @CarterButts?

mbojan commented 1 year ago

> > Does it make more sense to have it in network rather than ergm?

> Possibly. At the same time, I think to include it in network, we should have it work for all possible situations that network will support, including hypergraphs and multigraphs, whereas in ergm we are justified in only supporting the cases ergm() handles. @CarterButts?

Well, in the long term, maybe.

CarterButts commented 1 year ago

We already have efficient and flexible Bernoulli graph production in sna (rgraph), which makes use of the usual fast Bernoulli graph tricks (and uses edgelists). Duplicating that in network or other statnet packages doesn't make a lot of sense. (To be honest, network really shouldn't even have that as.network.numeric() method, in my opinion. Random graph generation is a modeling feature, and should not be in the base data type package.) @krivit is right that we should ideally have support for non-dyadic and other special cases for network functions, though as noted this is functionality that was sort of stuck in there in the first place. I am loath to remove it because it's been there forever, and some folks presumably find it handy for testing or demo purposes. On the other hand, I'm also not so thrilled at the prospect of making more investment in it - someone who wants to generate random graphs in any serious capacity ought (IMHO) to be using sna or ergm, and not relying on the as.network.numeric() hack. Given that statnet already has tools for the purpose, adding better random graph generation to network looks like mission creep. Is there an affirmative reason to do it? (I.e., a real use case where someone can't use the other statnet tools, and really needs to produce huge Bernoulli graphs in network per se?)

krivit commented 1 year ago

@CarterButts, as.network.numeric() is a part of ergm. The reasons we aren't using sna outright are that 1) it would create a hard dependency, and 2) it doesn't handle bipartite networks, as far as I can tell.

CarterButts commented 1 year ago

@krivit you are right - my brain is apparently broken, and I had confabulated that it had been placed in network at some point. Setting aside implications for my mental state, I obviously don't think that moving this to network is a good idea. I can see the concern about multiplying dependencies. One idea is to port over the backend from rgraph, which is pretty simple; however, it does make use of some memory structure utilities that are in the sna backend and may not fit the ergm backend idiom. Would probably be easy enough to adapt.

AryaKarami commented 1 year ago

Generating a directed bipartite network?

Thank you all for the ideas, especially Approach 2 (the binomial distribution), which was interesting to implement. Now another question arises: is it necessary to generate directed bipartite graphs in ergm? In fact, in as.network(), when the user specifies the graph as bipartite, the function treats it as an undirected bipartite network and sets directed <- FALSE by default.

For example, what we can do is first generate the undirected one, and then randomly set the direction of each edge between the partitions (this is the first approach that came to my mind, and I am not sure whether it is efficient).
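As a rough illustration of that idea, the sketch below takes an undirected bipartite edgelist (one row per edge, first column in the first partition) and reverses each row with probability 1/2. The function name orient_randomly is hypothetical. Note one consequence of this scheme: each connected pair gets exactly one direction, so mutual ties never occur, which differs from drawing each directed dyad independently.

```r
# Sketch only: orient_randomly() is a hypothetical helper, not an ergm function.
orient_randomly <- function(el) {
  flip <- runif(nrow(el)) < 0.5           # choose which edges to reverse
  out <- el
  out[flip, ] <- el[flip, 2:1, drop = FALSE]  # swap tail and head for those
  out
}

# Toy undirected bipartite edgelist: nodes 1-3 in one partition, 4-5 in the other.
el <- cbind(c(1, 2, 3), c(4, 5, 4))
directed_el <- orient_randomly(el)
```

The cost is linear in the number of edges, so it should add little on top of generating the undirected edgelist.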