scikit-tda / kepler-mapper

Kepler Mapper: A flexible Python implementation of the Mapper algorithm.

https://kepler-mapper.scikit-tda.org

MIT License

623 stars 180 forks source link

[Discussion] Future Research & Use Cases #35

Closed MLWave closed 4 years ago

MLWave commented 6 years ago

Research directions

[ ] Multiscale mapper.
- [ ] persistent homology of mapper - include visualization of barcode
- [ ] choose the most stable mapper based on barcode
[ ] Implement parameter tuning.
[ ] Simplicial Complex Feature extraction.
[x] Embeddings, Word2Vec or otherwise.
[x] Self-Guessing / Self-Dissimilarity.
[ ] Random Topological Forests.

Business Cases

[x] Transactional Fraud & AML. PII data.
[x] Acquisition fraud. PII data.
[ ] Access control. Business sensitive data.
[ ] Firewall monitoring. Business sensitive data.
[x] Credit risk & Reject Inference. PII data.
[ ] Customer management (churn, high spenders). PII data.

sauln commented 6 years ago

I suggest we have the top level comment on this issue be a list of action items or general directions with check boxes. We can then expound on each in comments below. We can rank these and continue to add to one list all of the ideas that have cropped up throughout all the issues.

sauln commented 6 years ago

I'm very interested in using embeddings as a lens to explore different data sets. I've been trying to explore word2vec embeddings using mapper, but haven't gotten any interesting results yet. I'm hoping to use mapper to represent/find bias in datasets along the lines of this recent paper: https://arxiv.org/abs/1607.06520

Do you see the embedding itself as the lens? I think we'd still need to reduce the lens down to ~3 dimensions before building the mapper.

sauln commented 6 years ago

Most obvious things to work on from my perspective are listed below. These are spurred by recent advances in theoretical work rather than applications. (I'm looking to gain more experience applying these methods to real data, rather than just the theoretical aspects.)

[ ] Multiscale mapper.
- Compute persistent homology and visualize barcodes.
- Compute 'the best' mapper given a lens. Essentially do hyperparameter optimization using persistence.
[ ] Implement parameter tuning.
- Incorporate ideas from Oudot's new parameter tuning paper: https://arxiv.org/abs/1706.00204
[ ] Feature extraction.
- I think this self-guessing concept would require the ability to automatically figure out how many legs (or flares) there are in the mapper. I have not seen any ways to automate this except maybe what my advisor is working on (http://jointmathematicsmeetings.org/amsmtgs/2197_abstracts/1135-55-1825.pdf).

MLWave commented 6 years ago

Business cases

At work (a fintech with credit card and bank account products), I use anomaly and credit score lenses to provide analysts with overviews of todays, last months, and all customers. Tooltips link to the back-end user profiles. I'm gathering feedback, to make TDA more accessible to people without knowledge of abstract algebra (but a lot of domain knowledge).

We are implementing dashboards for:

Transactional fraud & AML. Lens [Isolation Forest, L2Norm]. Color [prior label information]. Tooltips [Links to backend profiles]
Acquisition fraud (Identity fraud & Friends&Family fraud). Lens [Isolation Forest, L2Norm]. Color [prior label information]. Tooltips [Links to backend profiles]
Access control. Lens [Isolation Forest, GLOSH]. Color [GLOSH outlier score]. Tooltips [Names and squads/departments of employees]
Firewall monitoring. Lens [Isolation Forest, GLOSH]. Color [GLOSH outlier score]. Tooltips [IP address].
Credit risk / reject inference model doctoring. Lens [Credit Score Prediction/distance to rejected population], Color [prior default information], Tooltips [Links to backend profiles]
Customer management case studies. Lens and color [anything interesting: default, spend, churn ...], Tooltips [Cluster ID assignments].

The [Isolation Forest, L2Norm]-lens combo works very well on acquisition fraud. The flares showed interesting features for different types of fraudsters. We also found an interesting set of informative features, which, in hindsight, make a ton of sense.

MLWave commented 6 years ago

Embeddings

https://www.quora.com/Is-it-possible-to-apply-topological-data-analysis-to-text-mining-or-natural-language-processing-problems

http://mlwave.github.io/tda/quartettree-lotr.html (Hierarchical clustering of sentence co-occurence vectors)

Embeddings, like those from FaceNet, GLOVE, Doc2Vec, or Word2Vec (see FastText, GenSim and pre-trained models from companies like Google and Facebook), can be reduced down to 2 or 3 dimensions with PCA, LLE, or t-SNE.

Just works: Use the original (128 or 150 dimensional) embeddings or LSA (TFIDF2grams->SVD100d)as the inverse image to cluster on. Also works: Bag of words encode a document for the inverse image. Works better (but maybe cheating): tfidf scaling, 2-3 grams/chargrams, removing stopwords, stemming, etc. I mean... you have to represent the original data somehow right, can't feed ascii into a cluster algorithm? Or am I thinking too hard: Mapper accepts any distance matrix, as long as I properly state whether this distance satisfies triangle inequality.

This library by Facebook Research allows us to create embeddings of all the things (part of LeCun's object2vec vision): https://github.com/facebookresearch/StarSpace I think using embeddings of text documents or social graphs as lenses is an interesting venue for further research.

For Bias representation:

The word2vec distance between "man" and "professor" is smaller than the distance between "woman" and "professor", so:

Inverse X = 300D pre-trained Word2Vec vectors for set of words (IE "List of over 12,000 Careers"). Projected X = For each career vector: (cosine?) distance to vector for "male" and "female" (or other combinations, like "man", "woman") Clusterer = Kmeans 2 euclidean. Color function = Sentiment analysis score on career token. (or distance between career vector and vector for tokens like "good", "skillful", "empathy") Tooltip = Text representation.

You can view which neurons or embedding vectors are active for certain clusters. For instance, is Dimension-81 of GoogleNews-vectors-negative300.bin active for technicians and mechanics related words?

Do you see the embedding itself as the lens? I think we'd still need to reduce the lens down to ~3 dimensions before building the mapper.

Mapping on selected subsets of dimensions can answer my own question (and yours).

MLWave commented 6 years ago

Self-Guessing

http://jointmathematicsmeetings.org/amsmtgs/2197_abstracts/1135-55-678.pdf

Self-Guessing introduced in A Mathematical Theory of Generalization series, D. Wolpert (1990).

Intuitively, self-guessing is the requirement that, using the generalizer in question, the learning set must be self-consistent. If a subset of the learning set is fed to the generalizer, the generalizer must correctly guess the rest of the learning set. [...] Finally, statisticians have experimented with simplified versions of (weak) self-guessers under the name of "cross-validation".

Perform a cross-mapping, treating sets of lenses as generalizers in the Self-Guessing framework?

Self-Dissimilarity:

For many systems characterized as “complex” the patterns exhibited on different scales differ markedly from one another. [...] Accordingly, the degrees of self-dissimilarity between the patterns of a system at various scales constitute a complexity “signature” of that system. Whatever distance measure is chosen, our quantification of self-dissimilarity can be measured for many kinds of real-world data.

With Normalized Compression Distance you can use data compressors as lenses: http://mlwave.github.io/tda/mapper_visualization_ncd.html

Experiment using [BZ2 Compression ratio, t-SNE 1-D of Normalized Compression^Snappy Distance Matrix] on binary files, clustering on the projection.

Optimal High Level Descriptions of Dynamical Systems

Often these high-level descriptions are constructed without considering the ultimate reason for needing them in the first place. Here, we formalize and quantify one such purpose: the need to predict observables of interest concerning the high-dimensional system with as high accuracy as possible, while minimizing the computational cost of doing so. [...] We also discuss some other possible applications of SSC beyond the goal of accurate prediction. These include SSC as a measure of the complexity of a dynamical system, and as a way to quantify information flow between the scales of a system.

"""
   x
y [ 0 0 0 0 ]
  [ 0 0 0 0 ]
  [ 1 1 1 ? ]
  [ 0 0 0 ? ]

lenses set = ["distance_x_axis", "distance_y_axis"]
nr_cubes = 4
overlap_perc = 0.
"""

y_train = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
X_train = [[1,1], [1,2], [1,3], [1,4],
           [2,1], [2,2], [2,3], [2,4],
           [3,1], [3,2], [3,3], 
           [4,1], [4,2], [4,3]        ]

X_test = [[3,4], [4,4]]
y_test = [1, 0]

from sklearn import tree
model = tree.DecisionTreeClassifier(random_state=0, 
                                    max_depth=2, 
                                    criterion="entropy")
model.fit(X_train, y_train)
p = model.predict(X_test)
print p
print model

"""
   x
y [ 0 0 0 0 ]
  [ 0 0 0 0 ]
  [ 1 1 1 1 ]
  [ 0 0 0 0 ]

def tree(distance_x_axis, distance_y_axis):
  if distance_x_axis <= 2.5:
    return [[ 8.  0.]]
  else:  # if distance_x_axis > 2.5
    if distance_x_axis <= 3.5:
      return [[ 0.  3.]]
    else:  # if distance_x_axis > 3.5
      return [[ 3.  0.]]
"""

In foolish terms: use Projected_X as features to predict Inverse_X, Lens set selection is a function of accuracy (cross-mapping validation on learning set) and complexity (avoid needless computational cost/Occam's Razor).

Lenses that perform well on the learning set, but generalize poorly to held-out data describe the location of increased object-part complexity. The program length/complexity of the resulting trees models may say something about the object's complexity (the computation required to create the object). Can be extended to probability predictions (generate a point with random probability).

   x
y [ 0 0 0 0 ]
  [ 0 0 0 0 ]
  [ 1 1 1 0 ]
  [ 0 0 0 1 ]

Unsupervised Auto-Encoders can now be used on tabular structured data with a new technique of random-noise-swap as shown by Michael Jahrer latest Kaggle win. Before they were mainly used on images. Auto-encoders have some nice properties for creating efficient compressed representations and from these, accurately reconstructing the inverse data.

MLWave commented 6 years ago

Dionysus has Python bindings, but I haven't tried it out yet.

yuzuhikorunrun commented 6 years ago

I do wonder if you guys are ever planning to enable the feature so that we canselect and download groups of nodes. (for data validation purpose mostly).

Also, for the analysis part, we need to be able to find out the "driver"for each subgroups so that we know what drives separation (maybe some nonparametric tests with p values and scores)....Let me know if that make sense...

thanks.

MLWave commented 6 years ago

I do wonder if you guys are ever planning to enable the feature so that we canselect and download groups of nodes. (for data validation purpose mostly).

I guess this would require a local web server. A local web server would enable lots of interesting features and is the ideal interactivity we are working towards. But no ETA on this, just finishing up first update of the visualization. I also want to explore supporting a Jupyter notebook flow (maybe as a transitory way).

Also, for the analysis part, we need to be able to find out the "driver"for each subgroups so that we know what drives separation (maybe some nonparametric tests with p values and scores)....Let me know if that make sense...

There is 1-2-3 STD rule (feature std diff from inverse_X mean), there is also training whitebox models with the cluster samples labeled as 1, and a random sampling of all the samples labeled as 0.

HDBSCAN has interesting outlier and density statistics (we can use for either entire inverse_X or on the cluster data and present statistics and use for color functions).

If you want to contribute specifications or code: Consider you have an array of the inverse_X of each cluster, and the complete inverse_X. For every dimension you have a list of natural language description strings.

sauln commented 6 years ago

@yuzuhikorunrun, incorporating both of those ideas are definitely on the horizon. I think the Jupyter notebook as @MLWave mentioned would be a great first step. There is a function data_from_cluster_id that does want you want, just not in the loop.

If you'd like to work on this, pull requests would be very welcome! Let me know if you need any help.

MLWave commented 6 years ago

Working on right-click square selections, this would show the selected cluster ids in a list. Also, drawing graphs could be a function that can be called multiple times (painting multiple complexes on the screen).

http://bl.ocks.org/lgersman/5310854

So imagine some flow like this:

graph_all = map(inverse_X, cover=Cubical(5))

> mouse-select a subset from the graph as subset_ids

subset_X = data_from_subset(subset_ids)

graph_subset = map(subset_X, cover=Cubical(20))

visualize([graph_all, graph_subset])

Found out to do a lot with SVG that I used to do with CSS. But it is rather slow. So you can switch between different modes:

Fancy Christmas Theme (design over function).
Somewhat fast Display mode for exploration
Print mode for printing

by pressing keys or setting visualization parameters.

I'm also starting to get interested in visualizing the color-function on the graph with simple rules like:

if node color value closer to max value than min value, draw an arrow edge to the node with a value closer to the max value.
if node color value closer to min value than max value, draw an arrow edge to the node with a value closer to the min value.

Color function distribution histograms can look close to crude barcodes:

make_circles, lens = [0,1]

visualize(graph, 
          custom_tooltips=y, 
          color_function=mapper.fit_transform(
            inverse_X, 
            projection="dist_mean"))

MLWave commented 6 years ago

Funny aside: I wonder if with multi-scale mapper, if KeplerMapper is then Turing complete? You can already use KeplerMapper to calculate if a digit is even or uneven (by representing the digit 7 as [1,1,1,1,1,1,1] and setting n_cubes=2). Then the node sizes of the graph are a perfect model for a digit being even or uneven. With multi-scale coverings you could show if a digit is a prime number.

sauln commented 6 years ago

I wonder if with multi-scale mapper, if KeplerMapper is then Turing complete

That would be hilarious :rofl: Maybe someone will better theoretical CS chops than I try proving this.

I don't follow how km could show even or odd? Would 6 ([1,1,1,1,1,1]) and 7 ([1,1,1,1,1,1,1]) show up in different clusters?

MLWave commented 6 years ago

I don't follow how km could show even or odd? Would 6 ([1,1,1,1,1,1]) and 7 ([1,1,1,1,1,1,1]) show up in different clusters?

Every number is its own dataset and produces its own graph. 6 ([1,1,1,1,1,1]) with 2 cubes and 0. overlap would produce nodes: [1,1,1] [1,1,1] which would be the same size (and have the same color if the color function looks at cluster member size). 7 ([1,1,1,1,1,1,1]) with the same settings would produce [1,1,1,1] [1,1,1], which graph would look different from even numbers.

You could probably also map numbers within a single dataset. But then you'd need to zero-fill/NaN fill to make the rows of equal size: 6 [1,1,1,1,1,1,0,0,0,0] 7 [1,1,1,1,1,1,1,0,0,0]. Something similar could be done with binary representations.

All very nerdy though. But if Excel is Turing-complete why not KeplerMapper?

MLWave commented 6 years ago

The latest and greatest from Deep Learning is Deep Graph and Deep Manifold Learning. I've always likened mapper as a multi-dimensional convnet with local pooling on arbitrary data, and I wonder if there is some overlap/potential.

NIPS 2017 tutorial: Geometric Deep Learning on Graphs and Manifolds

Michael Bronstein · Joan Bruna · Arthur Szlam · Xavier Bresson · Yann LeCun The purpose of the proposed tutorial is to introduce the emerging field of geometric deep learning on graphs and manifolds, overview existing solutions and applications for this class of problems, as well as key difficulties and future research directions.

Geometric deep learning on graphs and manifolds using mixture model CNNs

Most of deep learning research has so far focused on dealing with 1D, 2D, or 3D Euclidean-structured data such as acoustic signals, images, or videos. Recently, there has been an increasing interest in geometric deep learning, attempting to generalize deep learning methods to non-Euclidean structured data such as graphs and manifolds, with a variety of applications from the domains of network analysis, computational social science, or computer graphics. In this paper, we propose a unified framework allowing to generalize CNN architectures to non-Euclidean domains (graphs and manifolds) and learn local, stationary, and compositional task-specific features. We show that various non-Euclidean CNN methods previously proposed in the literature can be considered as particular instances of our framework. We test the proposed method on standard tasks from the realms of image-, graph- and 3D shape analysis and show that it consistently outperforms previous approaches.

This paper does 3D shape classification, but I can see no comparison to using the mapper approach with Memoli's Gromov-Hausdorff method. I bet, a proper TDA benchmark for 3D shape comparison on current 3D and 2D datasets would already be valuable for this field of research. Going beyond their, rather simplistic, nearest neighbor mapping convolutions could be a major contribution (topological deep learning? Many mathematicians would rejoice.).

Deep Learning with Topological Signatures

Inferring topological and geometrical information from data can offer an alternative perspective on machine learning problems. Methods from topological data analysis, e.g., persistent homology, enable us to obtain such information, typically in the form of summary representations of topological features. However, such topological signatures often come with an unusual structure (e.g., multisets of intervals) that is highly impractical for most machine learning techniques. While many strategies have been proposed to map these topological signatures into machine learning compatible representations, they suffer from being agnostic to the target learning task. In contrast, we propose a technique that enables us to input topological signatures to deep neural networks and learn a task-optimal representation during training. Our approach is realized as a novel input layer with favorable theoretical properties. Classification experiments on 2D object shapes and social network graphs demonstrate the versatility of the approach and, in case of the latter, we even outperform the state-of-the-art by a large margin.

MLWave commented 6 years ago

Been experimenting with Connectomics and TDA. I have a dataset of a 1000 neurons of C. Elegans, with around 180000 neural activity timeseries, and for the 1 million neuron combinations whether they are connected or not.

Correlation distance matrices cluster many connected neurons together. L2norm lens over the distance matrices creates flares with, what I assume, are different types of neurons.

There is this call-for-research: https://ai-on.org/2017/11/20/discovering-manifold-psychiatric/

Psychiatric disorders are amongst the most difficult to accurately diagnose and design a treatment plan for. Imaging the structural and functional properties of an individual’s brain is the key to solving this challenge. Current machine learning approaches fail to utilize the information in these brain scans. Broadly, we propose finding a manifold of structural and/or functional brain scans in an embedding space that clusters mental disease states into clinically recognized classes. Specially, we propose a flavour of the adversarial autoencoder to accomplish this task. The goal is wide and ambitious enough to accommodate several other alternatives to solve this challenge.

And I think TDA should be able to help here too. I'll look if I can get reasonable access to their datasets (or existing canonical ones).

MLWave commented 6 years ago

Map a 2-D color image with overlapping intervals and different number of intervals/resolution. Shuffle the pixels inside the interval and reclassify with a pretrained Neural net (ResNet 50 trained on ImageNet): If classification probability goes down: image block is important, if classification probability goes up: image block is not important.

feature importance

We get results similar to LIME and extend "permutation feature importance" to unstructured data.