Open bdpedigo opened 4 years ago
@bdpedigo can I assist you with this?
not at the moment @rajpratyush, but thanks. i think i mostly need to decide what I want for this page right now.
I think how about a detailed summary of the docs website of graspologic @bdpedigo like going through each page and making important notes from them and then compiling them together
A network, or graph, is a convenient mathematical way to represent relational data. Networks represent objects (termed nodes or vertices) and the relationships between them (called edges or sometimes links). graspologic
contains algorithms for understanding and drawing inferences from network data.
Given the power of modern machine learning methods for operating on vector data, we often wish to leverage these tools by converting some aspects of our networks to vectors. The embed
module contains a variety of tools for estimating vector/matrix
representations of networks:
AdjacencySpectralEmbed
, LaplacianSpectralEmbed
, and node2vec
all represent each node in a single network as a vector (or, for the spectral methods, as a pair of vectors for a directed network).OmnibusEmbed
takes in a collection of node-matched networks, and finds a vector for each node in each graph. Node-matched means that the first node in network 1 represents the same object as the first node in network 2, and so on.MultipleASE
takes in a collection of node-matched networks, and learns a joint vector for each node across all of the networks, as well as a unique representation for each input network.mug2vec
(short for multigraph-to-vector) takes in a collection of node-matched networks, and finds a vector to represent each network.Note that many of these algorithms also can be thought of as inferring the parameters of statistical network models, which can be useful for statistical inference (described below).
The single-graph methods in the embed
module are useful for learning a representation of a single network if one is only interested in doing analyses on that network alone. However, if we just embed two networks separately, the vector representations do not live in the same space - the vectors have meaning with regard to other vectors for the same network, but have no meaning with regard to those from another. In order to meaningfully compare the representations from two single-graph embeddings, we need to align the vectors that represent each node. The align
module contains several methods for doing this:
SignFlips
is a very simple heuristic that may be useful if speed is important, or one wishes to find an initialization for another algorithm.OrthogonalProcrustes
is a well-known method for finding an alignment of a collection of vectors if we know a correspondence between the two collections for each vector. For networks, this will be the case if we have node-matched networks as described in embed
.SeedlessProcrustes
is a method for aligning when the correspondence between the collections of vectors is unknown, or even when the number of vectors is not the same.Clustering is a fundamental unsupervised data analysis problem, in which we seek to find groups of objects which are similar according to some definition. For networks, we often which to cluster the vector representations learned by the embed
module. The cluster
module contains several ways of clustering without having to specify the number of clusters or other parameters in advance.
A classic network analysis technique is to uncover communities or modules from a network. An assortative community is one in which nodes within a community are more likely to have edges to other members of that community than they are to other communities. The partition
module contains algorithms for partitioning a network into these assortative communities and evaluating their modularity, a metric of how assortative a network and its partition are.
Simulating networks (that is, sampling a new network from some distribution) has a variety of applications: it can be used to test algorithms, to compare observed network properties to some null distribution, or to study the properties of these network distributions, to name a few. The simulations
module provides functions for sampling new networks from a variety of distributions from the statistics literature. These functions all require the user to know and provide the parameters of the distribution they wish to sample from.
Given an observed network, we may wish to estimate the parameters of one of the network distributions described in simulations
. The tools in the models
module allow the user to pass in a network, and get back estimated parameters of these distributions. It also contains tools for assessing the fit of these models to the observed data.
If we observe two networks, a natural question is often to ask whether they are "the same" in some sense. However, in real data situations, it is unlikely that we will observe two networks which are exactly the same. To make an analogy to classical statistics, we could flip a fair coin 10 times and count the number of heads (experiment 1), then do the same with another coin (experiment 2). We wish to infer whether the two coins have the same probability of coming up heads - but due to the randomness in this experiment, we wouldn't just conclude that the two coins have different probabilities just because the counts are different. This problem is known as the two-sample testing problem. The inference
module contains algorithms for performing principled two-sample tests on pairs of networks:
For two networks, we sometimes don't know the correspondence or matching between a node in graph 1 and a node in graph 2. Often, knowing this correspondence is of interest in practical applications, or simply is useful for some downstream inference task. The match
module allows the user to input two graphs, and get an estimate of a matching between the nodes of two networks. These tools also are more general than to just networks and can be used in general to find alignments of two matrices.
Plotting a network in a sensible way is notoriously difficult as the number of nodes and edges grows. The layouts
module contains methods for finding a reasonable 2D plot representation of the nodes and edges of a network, with tools for automatically coloring the nodes of a network by predicted community and other automated visual tweaks.
@dwaynepryce inspired by our conversation today I wrote down a rough draft of a high level overview of the whole package that I've been meaning to do for a while. Thought about adding pretty pictures to it but we'll see if I have time, I think the text is probably most important. I envisioned this either being the landing page in the docs or at least being the "introduction" that comes right after the landing page. Perhaps this is too verbose but I also feel like if you are interested in a module, reading 5-6 sentences isn't that big of an ask?
Regardless, feedback welcome (general or specific). For specific comments, may be easier once I make an actual PR, tho.
Would like to have a page (in documentation and/or readme, but leaning towards just documentation) that basically explains each module at an extremely high level and why it might be useful. Could have very simple infographics too.
Would probably reorganize the below to match module structure better, make full sentences, etc.
Specialized graph algorithms
Useful but not graph specific
Thoughts welcome.