graspologic-org / graspologic

Python package for graph statistics
https://graspologic-org.github.io/graspologic/
MIT License
804 stars 143 forks

implement spectral vertex nomination #160

Closed jovo closed 3 years ago

jovo commented 5 years ago

https://github.com/neurodata/primitives-interfaces/tree/master/jhu_primitives/sgvn

bdpedigo commented 5 years ago

@hhelm10 can you add a link to the associated paper for this?

hhelm10 commented 5 years ago

https://arxiv.org/pdf/1905.01776.pdf

dfrancisco1998 commented 4 years ago

I'd like to work on this!

spencer-loggia commented 4 years ago

Also interested!

dtborders commented 4 years ago

Also interested!

bdpedigo commented 4 years ago

@hhelm10 think you could give us a quick rundown of spectral (single graph) VN? My original thought was a simple interface that eats a graph or embedding and does nearest neighbor queries... but I see now that there are more complicated schemes even before the ILP stuff.

hhelm10 commented 4 years ago

sure --

will give a short and sweet description of a set of methods..

i) start with a graph G on n vertices and a vertex of interest v*;
ii) pick your favorite embedding method to get n points in R^{d} (each d-dimensional vector is to be interpreted as a representation of a vertex in G);
iii) then, depending on the embedding method you use, pick a distance / dissimilarity on R^{d} (for instance, if i suspect my graph is generated from an SBM and i embed the graph using ASE / LSE, i may choose either Euclidean distance for simplicity or Mahalanobis distance to exploit the assumed structure of the points in R^{d});
iv) calculate the distances between the vector corresponding to v* and the vectors corresponding to the other n - 1 vertices;
v) return the k vertices closest to v* according to your selected distance.

i can see an interface where you initialize the object with an embedding method and a distance, the "fit" method taking G, and the "predict" method taking v* and (optionally) k.

does that make sense?
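The five steps above can be sketched in a few lines of numpy. This is illustrative, not the package API: a plain truncated eigendecomposition of the adjacency matrix stands in for graspologic's ASE, the distance is Euclidean, and the top-d eigenvalues are assumed nonnegative; the function name is hypothetical.

```python
# Minimal sketch of steps i)-v): embed, compute distances, return top k.
# Assumptions: undirected graph, nonnegative top-d eigenvalues, Euclidean
# distance; a plain eigendecomposition stands in for graspologic's ASE.
import numpy as np

def spectral_nominate(A, v_star, k=3, d=2):
    """Nominate the k vertices closest to v_star in the embedding space."""
    vals, vecs = np.linalg.eigh(A)          # eigenvalues in ascending order
    X = vecs[:, -d:] * np.sqrt(vals[-d:])   # top-d scaled eigenvectors
    dists = np.linalg.norm(X - X[v_star], axis=1)
    order = np.argsort(dists)
    return order[order != v_star][:k]       # exclude v_star itself

# toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge (2,3)
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

print(spectral_nominate(A, v_star=0, k=2))  # → [1 2], vertex 0's block-mates
```

With a near-block-structured graph like this, the top eigenvectors separate the two triangles, so the nominees for vertex 0 are its own block-mates.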

spencer-loggia commented 4 years ago

Thanks @hhelm10 , that's very helpful. I've been reading some papers on this, and have been a bit confused by the wide variety of similar, but not identical, methodologies.

hhelm10 commented 4 years ago

yea -- a lot of the methodologies that people come up with are pretty application / domain specific, so there are small tweaks here and there that try to optimize performance for the particular class of graph they have access to (e.g., use LSE if the graph is particularly sparse, somehow incorporate node / edge attributes, etc.). but for graphs larger than ~50 nodes the typical procedure is to embed, apply / learn a distance, and then rank the rest of the nodes.

a small aside: some of our recent work tries to avoid having to select an embedding method / distance by combining a bunch of them using supervisory information.

hhelm10 commented 4 years ago

Here would be my first draft of a VertexRanker class. Note that embedding_method is an instantiated embedding method and that pairwise_distance is not defined:

class VertexRanker:
    def __init__(self, embedding_method=ASE, distance='Euclidean',
                 distance_kwargs=None):
        self.embedding_method = embedding_method
        self.distance = distance
        self.distance_kwargs = distance_kwargs

    def fit(self, G):
        X_hat = self.embedding_method.fit_transform(G)
        # pairwise_distance is a placeholder and not defined yet
        self.dist_matrix = pairwise_distance(X_hat, distance=self.distance,
                                             distance_kwargs=self.distance_kwargs)
        self.fitted = True

    def predict(self, index, k=1):
        # position 0 of the argsort is `index` itself (distance 0), so skip it
        return np.argsort(self.dist_matrix[index])[1:k + 1]

hhelm10 commented 4 years ago

@bdpedigo then to include something like the ILP you could let embedding_method, distance, and distance_kwargs be dictionaries and just make the fit function a little more complicated
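One way to read the dictionary idea (my interpretation; the thread does not pin down the design): fit over several (embedding, distance) pairs, then have predict aggregate the resulting rankings, for example by mean rank. A hedged sketch, with hypothetical names and scipy standing in for the undefined pairwise_distance:

```python
# Sketch of combining several (embedding, distance) pairs for nomination.
# Mean-rank aggregation is an illustrative choice, not a settled design.
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import rankdata

def combined_nomination(embeddings, distances, index, k=1):
    """Aggregate nominations over several (embedding, distance) pairs.

    embeddings: dict name -> (n, d) array of embedded vertices
    distances:  dict name -> metric string understood by scipy's cdist
    """
    n = next(iter(embeddings.values())).shape[0]
    rank_sum = np.zeros(n)
    for name, X in embeddings.items():
        # distances from the query vertex to all vertices in this embedding
        dists = cdist(X[index:index + 1], X, metric=distances[name])[0]
        rank_sum += rankdata(dists)
    order = np.argsort(rank_sum)
    return order[order != index][:k]    # exclude the query vertex itself

# two toy embeddings of the same 5 vertices, with different metrics
rng = np.random.default_rng(0)
embeddings = {"ase": rng.normal(size=(5, 2)), "lse": rng.normal(size=(5, 2))}
distances = {"ase": "euclidean", "lse": "cityblock"}
print(combined_nomination(embeddings, distances, index=0, k=2))
```

The fit/predict split from the draft class above would carry over directly; this just shows the aggregation step.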

spencer-loggia commented 4 years ago

DoD for this issue:

Minimum Features

Checks

Ideal Features (planning to do pretty much all of these, but they go beyond the basic functionality)

bdpedigo commented 4 years ago

> can take an adj. matrix

maybe also an embedding

> for each class of interest ordered by probability.

You may have already read up on VN more than me, which is great - I get the "class nomination" idea, but I believe there should also be a completely unsupervised notion (see @hhelm10's example implementation, which has no notion of class)

> should run very quickly, even for graphs with millions of vertices.

Would be awesome, but the first goal is just to get something that works. also, "very quickly" is vague

> Should have the option to choose different distance metrics to be used in clustering

For something like GMM and maybe even k-means, just note that this is a non-trivial problem.

> Should be able to optionally infer K (number of blocks)

also a non-trivial problem

other notes:
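On the "infer K" point: one standard heuristic (non-trivial, as noted, and not the only option) is to fit Gaussian mixtures over a range of K on the embedded vertices and pick the K minimizing BIC. A sketch assuming scikit-learn, with an illustrative function name:

```python
# Infer the number of blocks K by minimum BIC over GMM fits on the
# embedding. One common heuristic; names here are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

def infer_k(X, k_max=6, random_state=0):
    """Return the K in 1..k_max whose GMM fit on X minimizes BIC."""
    bics = []
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, random_state=random_state).fit(X)
        bics.append(gmm.bic(X))
    return int(np.argmin(bics)) + 1

# two well-separated clusters of embedded vertices -> K should come out as 2
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (30, 2)), rng.normal(5, 0.1, (30, 2))])
print(infer_k(X))  # → 2
```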

jovo commented 4 years ago

I think this is the oldest/simplest paper/alg: https://arxiv.org/pdf/1201.4118.pdf