statnet / ergm

Fit, Simulate and Diagnose Exponential-Family Models for Networks

Implement a generalisation for nodematch() that tests if the two nodes have any attributes or properties in common or counts them? #481

Open krivit opened 2 years ago

krivit commented 2 years ago

Term description

This stems from this question on Stack Overflow: generalising it, suppose that each node i has some set A[i] of properties (I am avoiding "attributes", since we use that term elsewhere). We wish to specify a dyadic predictor that, in pseudocode, can be represented as x[i,j] = length(intersect(A[i], A[j])) (the number of properties i and j have in common) or as x[i,j] = length(intersect(A[i], A[j])) > 0 (whether i and j have any properties in common).

Some examples:

This seems like something that can be useful in a variety of circumstances.

A further generalisation of this concept is to make A[i] a mapping that maps property k to some value (e.g., proficiency in a language) so that, e.g., x[i,j] = max[k](min(A[i][k], A[j][k])) (or some other "interacting" and "combining" functions in place of min() and max[k](), respectively). In the language example, this predictor represents the proficiency of the less-proficient actor in the two actors' best common language (where "best common language" is the language in which the less-proficient actor has the highest proficiency).

In all cases, this would be a dyad-independent term, so in principle representable with edgecov().

Questions

  1. How broadly useful would this be? I suspect @CarterButts and @mbojan might have some applications I hadn't thought about.
  2. Would the generalisation to a mapping be useful? What "interacting" and "combining" functions would be useful?
  3. What would be an efficient way to implement these?
  4. What kind of a user interface (required data format and syntax) would we want for this term?
mbojan commented 2 years ago

Nice idea. It sounds useful to me, especially if there were some freedom of choice with respect to the "similarity" function f() in

f( A[i], A[j] )

where f() could include, in addition to your examples:

Implementation-wise, does such a term essentially have to construct a matrix for edgecov() at the R level, or are there computational "shortcuts" to exploit at the lower level?

krivit commented 2 years ago

> Implementation-wise, does such a term essentially have to construct a matrix for edgecov() at the R level, or are there computational "shortcuts" to exploit at the lower level?

That remains to be seen. If the operation is on a set rather than a mapping, there are a number of ways to represent it, with different advantages and disadvantages:

  1. Each int can encode set membership for up to 32 properties. Then, intersections can be calculated with bitwise AND (&) and unions with bitwise OR (|). If there are more than 32 properties, one can use multiple ints per node, though the storage and computational costs grow linearly in the number of properties.
  2. If there are no more than 2^32 distinct properties, then each node can store a sorted array of its property IDs; an algorithm can then iterate through both nodes' properties simultaneously, testing for common members à la the merge step of merge sort. This method is not sensitive to the total number of distinct properties but is sensitive to the average number of properties a node has.

There are probably others.

For a mapping, Method 2 can be used, with the array of property IDs serving as keys and a parallel array holding the values. (One can also store the values in a vector with one element per property for each node, analogously to Method 1, but then one loses both the compactness of the one-bit-per-property representation and the speed of bitwise operations.)