graspologic-org / graspologic

Python package for graph statistics
https://graspologic-org.github.io/graspologic/
MIT License

Create "graspologic at a glance" page #565

Open bdpedigo opened 4 years ago

bdpedigo commented 4 years ago

Would like to have a page (in documentation and/or readme, but leaning towards just documentation) that basically explains each module at an extremely high level and why it might be useful. Could have very simple infographics too.

Would probably reorganize the below to match module structure better, make full sentences, etc.

Specialized graph algorithms

Thoughts welcome.

rajpratyush commented 3 years ago

@bdpedigo can I assist you with this?

bdpedigo commented 3 years ago

not at the moment @rajpratyush, but thanks. i think i mostly need to decide what I want for this page right now.

rajpratyush commented 3 years ago

@bdpedigo how about a detailed summary of the graspologic docs website? We could go through each page, note the important points, and then compile them together.

bdpedigo commented 3 years ago

graspologic at a glance [WIP]

A network, or graph, is a convenient mathematical way to represent relational data. Networks represent objects (termed nodes or vertices) and the relationships between them (called edges or sometimes links). graspologic contains algorithms for understanding and drawing inferences from network data.

Embed

Given the power of modern machine learning methods for operating on vector data, we often wish to leverage these tools by converting some aspects of our networks to vectors. The embed module contains a variety of tools for estimating vector/matrix representations of networks:

Note that many of these algorithms also can be thought of as inferring the parameters of statistical network models, which can be useful for statistical inference (described below).
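As a rough illustration of what "converting a network to vectors" can mean, here is a minimal from-scratch sketch of a spectral embedding of an adjacency matrix using NumPy. It is not graspologic's implementation; the function name `spectral_embed` and the toy graph are assumptions made for this example only.

```python
import numpy as np


def spectral_embed(adjacency, n_components):
    """Embed the nodes of an undirected graph into n_components dimensions.

    Hypothetical helper for illustration: keeps the eigenvectors whose
    eigenvalues are largest in magnitude, scaled by the square root of
    that magnitude.
    """
    eigenvalues, eigenvectors = np.linalg.eigh(adjacency)
    # Indices of the eigenvalues with the largest absolute value
    top = np.argsort(np.abs(eigenvalues))[::-1][:n_components]
    return eigenvectors[:, top] * np.sqrt(np.abs(eigenvalues[top]))


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3
adjacency = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adjacency[i, j] = adjacency[j, i] = 1.0

embedding = spectral_embed(adjacency, n_components=2)
print(embedding.shape)  # one 2-dimensional vector per node
```

Each row of `embedding` is a vector for one node, which can then be handed to ordinary vector-based tools.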

Align

The single-graph methods in the embed module are useful for learning a representation of a single network if one is only interested in doing analyses on that network alone. However, if we just embed two networks separately, the vector representations do not live in the same space - the vectors have meaning with regard to other vectors for the same network, but have no meaning with regard to those from another. In order to meaningfully compare the representations from two single-graph embeddings, we need to align the vectors that represent each node. The align module contains several methods for doing this:
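To make the alignment problem concrete, the sketch below uses the classical orthogonal Procrustes solution: given two embeddings whose rows correspond to the same nodes, find the orthogonal matrix that best maps one onto the other. This is an illustration with NumPy under the assumption that the node correspondence is known. It is not a description of any particular method in the align module, and `procrustes_align` is a hypothetical name.

```python
import numpy as np


def procrustes_align(X, Y):
    """Return (X @ Q, Q), where Q is the orthogonal matrix minimizing ||X @ Q - Y||_F.

    Assumes row i of X and row i of Y represent the same node.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    Q = U @ Vt
    return X @ Q, Q


rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 2))  # "embedding" of network 2

# Network 1's embedding: the same point configuration in a rotated coordinate system
theta = 0.7
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])
X = Y @ rotation

X_aligned, Q = procrustes_align(X, Y)
print(np.allclose(X, Y), np.allclose(X_aligned, Y))
```

Before alignment the two sets of vectors disagree even though they describe the same configuration; after alignment they can be compared row by row.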

Cluster

Clustering is a fundamental unsupervised data analysis problem, in which we seek to find groups of objects which are similar according to some definition. For networks, we often wish to cluster the vector representations learned by the embed module. The cluster module contains several ways of clustering without having to specify the number of clusters or other parameters in advance.

Partition

A classic network analysis technique is to uncover communities or modules from a network. An assortative community is one in which nodes are more likely to have edges to other members of that community than to nodes in other communities. The partition module contains algorithms for partitioning a network into these assortative communities and evaluating their modularity, a metric of how assortative a network is under a given partition.
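For readers unfamiliar with modularity, here is a small pure-Python sketch of the standard (Newman) modularity of an undirected, unweighted graph under a given partition. The function name and the toy graph are assumptions for illustration; this is not the partition module's code.

```python
def modularity(edges, community):
    """Newman modularity of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping each node to its community label.
    """
    m = len(edges)
    internal = {}    # number of edges with both ends in the community
    degree_sum = {}  # total degree of the nodes in the community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Two triangles (0-1-2 and 3-4-5) joined by the edge 2-3
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
by_triangle = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
alternating = {0: "a", 1: "b", 2: "a", 3: "b", 4: "a", 5: "b"}

print(modularity(edges, by_triangle))
print(modularity(edges, alternating))
```

The partition that follows the two triangles scores higher than the alternating one, matching the intuition that it is the more assortative grouping.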

Simulations

Simulating networks (that is, sampling a new network from some distribution) has a variety of applications: it can be used to test algorithms, to compare observed network properties to some null distribution, or to study the properties of these network distributions, to name a few. The simulations module provides functions for sampling new networks from a variety of distributions from the statistics literature. These functions all require the user to know and provide the parameters of the distribution they wish to sample from.
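The text above does not list which distributions are supported, so as one familiar example, the sketch below samples an undirected stochastic block model using only the standard library. The function name, arguments, and parameter values are assumptions chosen for illustration, not the simulations module's API.

```python
import random


def sample_sbm(block_sizes, block_probs, seed=None):
    """Sample an undirected, loopless stochastic block model.

    block_sizes: number of nodes in each block.
    block_probs: block_probs[a][b] is the edge probability between
                 a node in block a and a node in block b.
    Returns (adjacency matrix as a list of lists, list of block labels).
    """
    rng = random.Random(seed)
    labels = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(labels)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < block_probs[labels[i]][labels[j]]:
                adjacency[i][j] = adjacency[j][i] = 1
    return adjacency, labels


# The user must supply the distribution's parameters up front
adjacency, labels = sample_sbm(
    block_sizes=[20, 20],
    block_probs=[[0.5, 0.05], [0.05, 0.5]],
    seed=0,
)
print(len(adjacency), sum(map(sum, adjacency)) // 2)  # nodes, edges
```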

Models

Given an observed network, we may wish to estimate the parameters of one of the network distributions described in simulations. The tools in the models module allow the user to pass in a network, and get back estimated parameters of these distributions. It also contains tools for assessing the fit of these models to the observed data.
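As a minimal sketch of the "network in, parameters out" idea, the example below samples an Erdős–Rényi graph with a known edge probability and then recovers that probability from the observed graph as the fraction of possible edges that are present. Both helper names are hypothetical and only the standard library is used; the models module's own estimators are not shown here.

```python
import random


def sample_er(n, p, seed=None):
    """Sample an undirected, loopless Erdős–Rényi graph as an adjacency matrix."""
    rng = random.Random(seed)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adjacency[i][j] = adjacency[j][i] = 1
    return adjacency


def estimate_er_p(adjacency):
    """Estimate the edge probability as observed edges / possible edges."""
    n = len(adjacency)
    n_edges = sum(adjacency[i][j] for i in range(n) for j in range(i + 1, n))
    return n_edges / (n * (n - 1) / 2)


observed = sample_er(n=200, p=0.3, seed=0)
p_hat = estimate_er_p(observed)
print(p_hat)  # should be close to the true value of 0.3
```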

Inference

If we observe two networks, a natural question is whether they are "the same" in some sense. However, in real data, it is unlikely that we will observe two networks which are exactly the same. To make an analogy to classical statistics, we could flip one coin 10 times and count the number of heads (experiment 1), then do the same with another coin (experiment 2). We wish to infer whether the two coins have the same probability of coming up heads, but because of the randomness in this experiment, we would not conclude that the two coins have different probabilities merely because the counts differ. This problem is known as the two-sample testing problem. The inference module contains algorithms for performing principled two-sample tests on pairs of networks:
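The coin analogy can be worked through with a simple permutation test, sketched below using only the standard library. This illustrates two-sample testing for coins only; it is not one of the network tests in the inference module, and the function name and example counts are assumptions.

```python
import random


def permutation_test(flips_a, flips_b, n_permutations=10_000, seed=0):
    """Two-sample permutation test on the difference in proportion of heads.

    flips_a, flips_b: sequences of 0/1 outcomes (1 = heads).
    Returns an approximate p-value for the null hypothesis that both
    coins have the same probability of heads.
    """
    rng = random.Random(seed)
    n_a, n_b = len(flips_a), len(flips_b)
    observed = abs(sum(flips_a) / n_a - sum(flips_b) / n_b)
    pooled = list(flips_a) + list(flips_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null, the coin labels are exchangeable, so shuffle them
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float ties
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Experiment 1: 6 heads in 10 flips; experiment 2: 4 heads in 10 flips
coin_1 = [1] * 6 + [0] * 4
coin_2 = [1] * 4 + [0] * 6
p_value = permutation_test(coin_1, coin_2)
print(p_value)
```

The counts differ (6 versus 4), but the p-value is large, so this data gives no real evidence that the two coins behave differently.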

Match

For two networks, we sometimes don't know the correspondence or matching between a node in graph 1 and a node in graph 2. Often, knowing this correspondence is of interest in practical applications, or is simply useful for some downstream inference task. The match module allows the user to input two graphs and get back an estimate of a matching between the nodes of the two networks. These tools are not specific to networks and can be used more generally to find alignments between two matrices.
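To show what a matching is, the sketch below solves the problem exhaustively for a tiny graph: it tries every permutation of the nodes and keeps the one with the fewest edge disagreements. Exhaustive search is factorial in the number of nodes, so it is only an illustration and practical tools rely on approximate optimization instead. The text above does not say which algorithm the match module uses, and `brute_force_match` is a hypothetical name.

```python
from itertools import permutations


def brute_force_match(A, B):
    """Find the node matching between A and B with the fewest edge disagreements.

    A, B: square adjacency matrices (lists of lists) of the same size.
    Returns (perm, disagreements), where node i of A is matched to
    node perm[i] of B. Only feasible for very small graphs.
    """
    n = len(A)
    best_perm, best_disagreements = None, None
    for perm in permutations(range(n)):
        disagreements = sum(
            A[i][j] != B[perm[i]][perm[j]]
            for i in range(n)
            for j in range(n)
        )
        if best_disagreements is None or disagreements < best_disagreements:
            best_perm, best_disagreements = perm, disagreements
    return best_perm, best_disagreements


# Graph 1: the path 0-1-2-3
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]

# Graph 2: the same path with its node labels shuffled
shuffle = [2, 0, 3, 1]
n = len(A)
B = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        B[shuffle[i]][shuffle[j]] = A[i][j]

matching, disagreements = brute_force_match(A, B)
print(matching, disagreements)
```

Because a path can be read in either direction, more than one matching achieves zero disagreements, so the recovered matching need not equal `shuffle` exactly.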

Layouts

Plotting a network in a sensible way is notoriously difficult as the number of nodes and edges grows. The layouts module contains methods for finding a reasonable 2D plot representation of the nodes and edges of a network, with tools for automatically coloring the nodes of a network by predicted community and other automated visual tweaks.

Plot

Nominate

Utils

bdpedigo commented 3 years ago

@dwaynepryce inspired by our conversation today I wrote down a rough draft of a high level overview of the whole package that I've been meaning to do for a while. Thought about adding pretty pictures to it but we'll see if I have time, I think the text is probably most important. I envisioned this either being the landing page in the docs or at least being the "introduction" that comes right after the landing page. Perhaps this is too verbose but I also feel like if you are interested in a module, reading 5-6 sentences isn't that big of an ask?

Regardless, feedback welcome (general or specific). For specific comments, may be easier once I make an actual PR, tho.