graspologic-org / graspologic

Python package for graph statistics
https://graspologic-org.github.io/graspologic/
MIT License
804 stars 143 forks

implement spectral vertex nomination #160

Closed jovo closed 3 years ago

jovo commented 5 years ago

https://github.com/neurodata/primitives-interfaces/tree/master/jhu_primitives/sgvn

bdpedigo commented 5 years ago

@hhelm10 can you add a link to the associated paper for this?

hhelm10 commented 5 years ago

https://arxiv.org/pdf/1905.01776.pdf

dfrancisco1998 commented 4 years ago

I'd like to work on this!

spencer-loggia commented 4 years ago

Also interested!

dtborders commented 4 years ago

Also interested!

bdpedigo commented 4 years ago

@hhelm10 think you could give us a quick rundown of spectral (single graph) VN? My original thought was a simple interface that eats a graph or embedding and does nearest neighbor queries... but I see now that there are more complicated schemes even before the ILP stuff.

hhelm10 commented 4 years ago

sure --

will give a short and sweet description of a set of methods..

i) start with a graph G on n vertices and a vertex of interest v*;
ii) pick your favorite embedding method to get n points in R^{d} (each d-dimensional vector is to be interpreted as a representation of a vertex in G);
iii) then, depending on the embedding method you use, pick a distance / dissimilarity on R^{d} (for instance, if i suspect my graph is generated from an SBM and i embed the graph using ASE / LSE, i may choose either Euclidean distance for simplicity or Mahalanobis distance to exploit the assumed structure of the points in R^{d});
iv) calculate the distances between the vector corresponding to v* and the vectors corresponding to the other n - 1 vertices;
v) return the k vertices closest to v* according to your selected distance.

i can see an interface where you initialize the object with an embedding method and a distance, the "fit" method taking G, and the "predict" method taking v* and (optionally) k.

does that make sense?
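The five steps above can be sketched in a few lines of numpy. This is illustrative, not the package API: a plain truncated eigendecomposition of the adjacency matrix stands in for graspologic's ASE, the distance is Euclidean, and the top-d eigenvalues are assumed nonnegative; the function name is hypothetical.

```python
# Minimal sketch of steps i)-v): embed, compute distances, return top k.
# Assumptions: undirected graph, nonnegative top-d eigenvalues, Euclidean
# distance; a plain eigendecomposition stands in for graspologic's ASE.
import numpy as np

def spectral_nominate(A, v_star, k=3, d=2):
    """Nominate the k vertices closest to v_star in the embedding space."""
    vals, vecs = np.linalg.eigh(A)          # eigenvalues in ascending order
    X = vecs[:, -d:] * np.sqrt(vals[-d:])   # top-d scaled eigenvectors
    dists = np.linalg.norm(X - X[v_star], axis=1)
    order = np.argsort(dists)
    return order[order != v_star][:k]       # exclude v_star itself

# toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge (2,3)
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

print(spectral_nominate(A, v_star=0, k=2))  # → [1 2], vertex 0's block-mates
```

With a near-block-structured graph like this, the top eigenvectors separate the two triangles, so the nominees for vertex 0 are its own block-mates.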

spencer-loggia commented 4 years ago

Thanks @hhelm10 , that's very helpful. I've been reading some papers on this, and have been a bit confused by the wide variety of similar, but not identical, methodologies.

hhelm10 commented 4 years ago

yea -- a lot of the methodologies that people come up with are pretty application / domain specific, so there are small tweaks here and there that try to optimize performance for the particular class of graph they have access to (e.g., use LSE if the graph is particularly sparse, somehow incorporate node / edge attributes, etc.). but for graphs larger than ~50 nodes the typical procedure is to embed, apply / learn a distance, and then rank the rest of the nodes.

a small aside: some of our recent work tries to avoid having to select an embedding method / distance by combining a bunch of them using supervisory information.

hhelm10 commented 4 years ago

Here would be my first draft of a VertexRanker class. Note that embedding_method is an instantiated embedding method and that pairwise_distance is not defined:

class VertexRanker:
    def __init__(self, embedding_method=ASE, distance='Euclidean',
                 distance_kwargs=None):
        self.embedding_method = embedding_method
        self.distance = distance
        self.distance_kwargs = distance_kwargs

    def fit(self, G):
        X_hat = self.embedding_method.fit_transform(G)
        # pairwise_distance is a placeholder and not defined yet
        self.dist_matrix = pairwise_distance(X_hat, distance=self.distance,
                                             distance_kwargs=self.distance_kwargs)
        self.fitted = True

    def predict(self, index, k=1):
        # position 0 of the argsort is `index` itself (distance 0), so skip it
        return np.argsort(self.dist_matrix[index])[1:k + 1]

hhelm10 commented 4 years ago

@bdpedigo then to include something like the ILP you could let embedding_method, distance, and distance_kwargs be dictionaries and just make the fit function a little more complicated
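One way to read the dictionary idea (my interpretation; the thread does not pin down the design): fit over several (embedding, distance) pairs, then have predict aggregate the resulting rankings, for example by mean rank. A hedged sketch, with hypothetical names and scipy standing in for the undefined pairwise_distance:

```python
# Sketch of combining several (embedding, distance) pairs for nomination.
# Mean-rank aggregation is an illustrative choice, not a settled design.
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import rankdata

def combined_nomination(embeddings, distances, index, k=1):
    """Aggregate nominations over several (embedding, distance) pairs.

    embeddings: dict name -> (n, d) array of embedded vertices
    distances:  dict name -> metric string understood by scipy's cdist
    """
    n = next(iter(embeddings.values())).shape[0]
    rank_sum = np.zeros(n)
    for name, X in embeddings.items():
        # distances from the query vertex to all vertices in this embedding
        dists = cdist(X[index:index + 1], X, metric=distances[name])[0]
        rank_sum += rankdata(dists)
    order = np.argsort(rank_sum)
    return order[order != index][:k]    # exclude the query vertex itself

# two toy embeddings of the same 5 vertices, with different metrics
rng = np.random.default_rng(0)
embeddings = {"ase": rng.normal(size=(5, 2)), "lse": rng.normal(size=(5, 2))}
distances = {"ase": "euclidean", "lse": "cityblock"}
print(combined_nomination(embeddings, distances, index=0, k=2))
```

The fit/predict split from the draft class above would carry over directly; this just shows the aggregation step.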

spencer-loggia commented 4 years ago

DoD for this issue:

Minimum Features

Checks

Ideal Features (planning to do pretty much all of these, but they go beyond the basic functionality)

bdpedigo commented 4 years ago

> can take an adj. matrix

maybe also an embedding

> for each class of interest ordered by probability.

You may have already read up on VN more than me, which is great - I get the "class nomination" idea, but I believe there should also be a completely unsupervised notion (see @hhelm10's example implementation, which has no notion of class)

> should run very quickly, even for graphs with millions of vertices.

Would be awesome, but the first goal is just to get something that works. also, "very quickly" is vague

> Should have the option to choose different distance metrics to be used in clustering

For something like GMM and maybe even k-means, just note that this is a non-trivial problem.

> Should be able to optionally infer K (number of blocks)

also a non-trivial problem

other notes:
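On the "infer K" point: one standard heuristic (non-trivial, as noted, and not the only option) is to fit Gaussian mixtures over a range of K on the embedded vertices and pick the K minimizing BIC. A sketch assuming scikit-learn, with an illustrative function name:

```python
# Infer the number of blocks K by minimum BIC over GMM fits on the
# embedding. One common heuristic; names here are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

def infer_k(X, k_max=6, random_state=0):
    """Return the K in 1..k_max whose GMM fit on X minimizes BIC."""
    bics = []
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, random_state=random_state).fit(X)
        bics.append(gmm.bic(X))
    return int(np.argmin(bics)) + 1

# two well-separated clusters of embedded vertices -> K should come out as 2
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (30, 2)), rng.normal(5, 0.1, (30, 2))])
print(infer_k(X))  # → 2
```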

jovo commented 4 years ago

I think this is the oldest/simplest paper/alg: https://arxiv.org/pdf/1201.4118.pdf