statnet / ergm

Fit, Simulate and Diagnose Exponential-Family Models for Networks

Implement a generalisation for nodematch() that tests if the two nodes have any attributes or properties in common or counts them? #481

Open krivit opened 2 years ago

krivit commented 2 years ago

Term description

This stems from this question on Stack Overflow: generalising it, suppose that each node i has some set A[i] of properties (I am avoiding "attributes", since we use that term elsewhere). We wish to specify a dyadic predictor that, in pseudocode, can be represented as x[i,j] = length(intersect(A[i], A[j])) (the number of properties i and j have in common) or as x[i,j] = length(intersect(A[i], A[j])) > 0 (whether i and j have any properties in common).

Some examples:

This seems like something that can be useful in a variety of circumstances.

A further generalisation of this concept is to make A[i] a mapping that maps property k to some value (e.g., proficiency in a language) so that, e.g., x[i,j] = max[k](min(A[i][k], A[j][k])) (or some other "interacting" and "combining" functions in place of min() and max[k](), respectively). In the language example, this predictor represents the proficiency of the less-proficient actor in the two actors' best common language (where "best common language" is the language in which the less-proficient actor has the highest proficiency).

In all cases, this would be a dyad-independent term, so in principle representable with edgecov().

Questions

  1. How broadly useful would this be? I suspect @CarterButts and @mbojan might have some applications I hadn't thought about.
  2. Would the generalisation to a mapping be useful? What "interacting" and "combining" functions would be useful?
  3. What would be an efficient way to implement these?
  4. What kind of a user interface (required data format and syntax) would we want for this term?
mbojan commented 2 years ago

Nice idea. It sounds useful to me, especially if there were some freedom of choice with respect to the "similarity" function f() in

f( A[i], A[j] )

where f() could include, in addition to your examples:

Implementation-wise, does such a term essentially have to construct a matrix for edgecov() at the R level, or are there computational "shortcuts" to exploit at the lower level?

krivit commented 2 years ago

> Implementation-wise, does such a term essentially have to construct a matrix for edgecov() at the R level, or are there computational "shortcuts" to exploit at the lower level?

That remains to be seen. If the operation is on a set rather than a mapping, there are a number of ways to represent it, with different advantages and disadvantages:

  1. Each int can encode set membership for up to 32 properties. Then, intersections can be calculated with bitwise AND (&) and unions with bitwise OR (|). If there are more than 32 properties, one can use multiple ints per node, though the storage and computational costs grow linearly in the number of properties.
  2. If there are no more than 2^32 distinct properties, then each node can store a sorted array of its property IDs; an algorithm can then iterate through both nodes' properties simultaneously, testing for common members à la the merge step of merge sort. This method is not sensitive to the total number of distinct properties but is sensitive to the average number of properties a node has.

There are probably others.

For a mapping, Method 2 can be used, with the array of property IDs serving as keys and a parallel array holding the values. (One can also store the values in a vector with one element per property for each node, analogously to Method 1, but then one loses both the compactness of the one-bit-per-property representation and the speed of bitwise operations.)