bdpedigo commented 4 years ago

Would like to have a page (in documentation and/or readme, but leaning towards just documentation) that basically explains each module at an extremely high level and why it might be useful. Could have very simple infographics too.

Would probably reorganize the below to match module structure better, make full sentences, etc.

Specialized graph algorithms

Single graph embedding
- Learn a low-dimensional representation of a network where each node is a point
Multiple graph embedding (time series or otherwise)
- Jointly learn low-dimensional representations of multiple networks
Two-sample graph testing
- Test for differences between two networks
Graph matching
- Learn a mapping between the nodes in one graph and those in another
Statistical network model fitting and sampling
- Fit statistical models given an observed network
- Sample networks from a statistical model and its parameters
Signal subgraph identification
- Learn which part of a graph is most predictive of a graph-wise signal
(soon) Fast community detection
- Find modular components of a network
(soon) Time series of graph anomaly detection
- Find anomalous vertices or graphs from a time series
(soon) Vertex nomination
- Predicting similar nodes within the same graph or in another similar graph
  Useful but not graph specific
Automatic clustering
Plotting
Alignment

Thoughts welcome.

rajpratyush commented 3 years ago

@bdpedigo can I assist you with this?

bdpedigo commented 3 years ago

not at the moment @rajpratyush, but thanks. i think i mostly need to decide what I want for this page right now.

rajpratyush commented 3 years ago

I think how about a detailed summary of the docs website of graspologic @bdpedigo like going through each page and making important notes from them and then compiling them together

bdpedigo commented 3 years ago

graspologic at a glance [WIP]

A network, or graph, is a convenient mathematical way to represent relational data. Networks represent objects (termed nodes or vertices) and the relationships between them (called edges or sometimes links). graspologic contains algorithms for understanding and drawing inferences from network data.

Embed

Given the power of modern machine learning methods for operating on vector data, we often wish to leverage these tools by converting some aspects of our networks to vectors. The embed module contains a variety of tools for estimating vector/matrix representations of networks:

AdjacencySpectralEmbed, LaplacianSpectralEmbed, and node2vec all represent each node in a single network as a vector (or, for the spectral methods, as a pair of vectors for a directed network).
OmnibusEmbed takes in a collection of node-matched networks, and finds a vector for each node in each graph. Node-matched means that the first node in network 1 represents the same object as the first node in network 2, and so on.
MultipleASE takes in a collection of node-matched networks, and learns a joint vector for each node across all of the networks, as well as a unique representation for each input network.
mug2vec (short for multigraph-to-vector) takes in a collection of node-matched networks, and finds a vector to represent each network.

Note that many of these algorithms also can be thought of as inferring the parameters of statistical network models, which can be useful for statistical inference (described below).

Align

The single-graph methods in the embed module are useful for learning a representation of a single network if one is only interested in doing analyses on that network alone. However, if we just embed two networks separately, the vector representations do not live in the same space - the vectors have meaning with regard to other vectors for the same network, but have no meaning with regard to those from another. In order to meaningfully compare the representations from two single-graph embeddings, we need to align the vectors that represent each node. The align module contains several methods for doing this:

SignFlips is a very simple heuristic that may be useful if speed is important, or one wishes to find an initialization for another algorithm.
OrthogonalProcrustes is a well-known method for finding an alignment of a collection of vectors if we know a correspondence between the two collections for each vector. For networks, this will be the case if we have node-matched networks as described in embed.
SeedlessProcrustes is a method for aligning when the correspondence between the collections of vectors is unknown, or even when the number of vectors is not the same.

Cluster

Clustering is a fundamental unsupervised data analysis problem, in which we seek to find groups of objects which are similar according to some definition. For networks, we often which to cluster the vector representations learned by the embed module. The cluster module contains several ways of clustering without having to specify the number of clusters or other parameters in advance.

TODO: description of specific algos

Partition

A classic network analysis technique is to uncover communities or modules from a network. An assortative community is one in which nodes within a community are more likely to have edges to other members of that community than they are to other communities. The partition module contains algorithms for partitioning a network into these assortative communities and evaluating their modularity, a metric of how assortative a network and its partition are.

TODO: explain the specific tools

Simulations

Simulating networks (that is, sampling a new network from some distribution) has a variety of applications: it can be used to test algorithms, to compare observed network properties to some null distribution, or to study the properties of these network distributions, to name a few. The simulations module provides functions for sampling new networks from a variety of distributions from the statistics literature. These functions all require the user to know and provide the parameters of the distribution they wish to sample from.

TODO: explain the simulations we have

Models

Given an observed network, we may wish to estimate the parameters of one of the network distributions described in simulations. The tools in the models module allow the user to pass in a network, and get back estimated parameters of these distributions. It also contains tools for assessing the fit of these models to the observed data.

TODO explain the models we have

Inference

If we observe two networks, a natural question is often to ask whether they are "the same" in some sense. However, in real data situations, it is unlikely that we will observe two networks which are exactly the same. To make an analogy to classical statistics, we could flip a fair coin 10 times and count the number of heads (experiment 1), then do the same with another coin (experiment 2). We wish to infer whether the two coins have the same probability of coming up heads - but due to the randomness in this experiment, we wouldn't just conclude that the two coins have different probabilities just because the counts are different. This problem is known as the two-sample testing problem. The inference module contains algorithms for performing principled two-sample tests on pairs of networks:

TODO explain tests we have

Match

For two networks, we sometimes don't know the correspondence or matching between a node in graph 1 and a node in graph 2. Often, knowing this correspondence is of interest in practical applications, or simply is useful for some downstream inference task. The match module allows the user to input two graphs, and get an estimate of a matching between the nodes of two networks. These tools also are more general than to just networks and can be used in general to find alignments of two matrices.

TODO explain the one tool in this module

Layouts

Plotting a network in a sensible way is notoriously difficult as the number of nodes and edges grows. The layouts module contains methods for finding a reasonable 2D plot representation of the nodes and edges of a network, with tools for automatically coloring the nodes of a network by predicted community and other automated visual tweaks.

TODO explain what we have

Plot

TODO do

Nominate

TODO do

Utils

TODO do

bdpedigo commented 3 years ago

@dwaynepryce inspired by our conversation today I wrote down a rough draft of a high level overview of the whole package that I've been meaning to do for a while. Thought about adding pretty pictures to it but we'll see if I have time, I think the text is probably most important. I envisioned this either being the landing page in the docs or at least being the "introduction" that comes right after the landing page. Perhaps this is too verbose but I also feel like if you are interested in a module, reading 5-6 sentences isn't that big of an ask?

Regardless, feedback welcome (general or specific). For specific comments, may be easier once I make an actual PR, tho.

graspologic-org / graspologic

Create "graspologic at a glance" page #565

Specialized graph algorithms

Useful but not graph specific

graspologic at a glance [WIP]

Embed

Align

Cluster

Partition

Simulations

Models

Inference

Match

Layouts

Plot

Nominate

Utils