scikit-tda / kepler-mapper

Kepler Mapper: A flexible Python implementation of the Mapper algorithm.
https://kepler-mapper.scikit-tda.org
MIT License

What traits/aspects of the cancer data led you to choose the lenses? #134

karinsasaki closed this issue 5 years ago

karinsasaki commented 5 years ago

I am looking at the cancer Python notebook and, although I understand how the lenses project the data, I am still wondering how the two lenses (isolation forest and l2norm) were chosen to analyse this specific data?

More precisely, I want to know if there were some specific aspects of the data that made you think those two lenses were appropriate (e.g. do these lenses have some biological significance? or do you, a priori, know something about the "shape" of the data that tells you these lenses are appropriate?); or did you do a sort of "grid search" over the choice of lenses and parameters for the mapper.map function (specifically the nr_cubes, overlap_perc, and the clusterer and its parameters)? If you did do a grid search, did you choose the "best performing" combination qualitatively (by looking at the topological graphs generated) or did you use some kind of quantitative measure (and what is it)?

Thank you so much for any guidance!

cmottac commented 5 years ago

This is a great question, I would really like to read an answer!

karinsasaki commented 5 years ago

Looking at the different effects that different lenses have on the shape of the final topological graph, I have concluded that the reasoning behind choosing a combination of lenses for a particular dataset is:

  1. Lenses that make biological sense; in other words, lenses that highlight special features in my data that I know about. I imagine that in the case of the cancer data, using an anomaly score (in this case calculated using the IsolationForest from sklearn) makes biological sense.
  2. Lenses that somehow disperse the data, as opposed to clustering many points together (see below).
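In case it helps future readers, here is a minimal sketch of how these two lenses can be computed (based on my reading of the notebook; X is assumed to be the breast-cancer feature matrix and the random_state is arbitrary):

import numpy as np
import kmapper as km
from sklearn.ensemble import IsolationForest

mapper = km.KeplerMapper()

# Lens 1: anomaly score from an Isolation Forest
model = IsolationForest(random_state=1729)
model.fit(X)
lens1 = model.decision_function(X).reshape((X.shape[0], 1))

# Lens 2: L2 norm of each data point
lens2 = mapper.fit_transform(X, projection="l2norm")

# Stack them into a single 2-D lens
lens = np.c_[lens1, lens2]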

Let's look at the cancer data:

This is the image of lens1 (the anomaly score using the IsolationForest): Image of lens1

(obtained using:

import numpy as np
import matplotlib.pyplot as plt

# plot the 1-D lens values along a horizontal line
plt.scatter(lens1, np.zeros(len(lens1)))
plt.title('Image of lens1')

)

This is the image of lens2 (the l2norm): Image of lens2

This is the scatterplot of the two lenses against each other, color coded by the diagnosis: lens1 vs lens2

Using PCA I can also create another lens as follows:

import kmapper as km
from sklearn.decomposition import PCA

# mapper is a KeplerMapper instance, as in the notebook
mapper = km.KeplerMapper()
lens3 = mapper.fit_transform(X, projection=PCA(n_components=3), scaler=None)

This is the scatterplot of the first two PCA eigenvectors against each other: lens3[:,0] vs lens3[:,1]

and this is the scatter plot between lens1 (the anomaly score using IsolationForest) and lens3[:,0] (the first eigenvector of the PCA): lens1 vs lens3[:,0]

Notice it looks very similar to (possibly the same as) lens1 vs lens2.

And if we generate the topological graphs, this is what we get with the different lenses:

lens = np.c_[lens1, lens2]: lens1, lens2

lens = np.c_[lens1, lens3[:,0]]: lens1, lens3

lens = np.c_[lens3[:,0], lens3[:,1]]: PCA lenses

Again, using lens = np.c_[lens1, lens2] and lens = np.c_[lens1, lens3[:,0]] generates (what visually seems to be the same) topological graph. [As a side note, it would be nice to be able to compare topological graphs quantitatively and not just visually, e.g. by calculating homology or some other measure; any thoughts on this?]

This is what we get if we generate the topological graph using only lens 1: lens1

So for me, the conclusion for the cancer dataset is that the anomaly score gives the global shape of the data, and adding another lens, such as the l2norm or the first PCA eigenvector, gives a less granular view of the different regions of the topological graph.

(Note, the rest of the parameter values are the same for all topological graphs (e.g. number of hypercubes, clusterer, percentage overlap); I have only changed the lens used.)


Two questions arise for me after understanding the choice of lenses:

  1. What does the graph tell me about the data?
  2. Why is the color code of the nodes the x-coordinate distance to the min value of lens1 instead of, for example, the values of one of the features? That option, I think, would highlight features that are decisive in malignant vs non-malignant.

But perhaps to answer these one needs to know more about the biology of this particular dataset.

sauln commented 5 years ago

This is a great write-up! What do you think about incorporating it into the documentation as a tutorial page?

karinsasaki commented 5 years ago

@sauln Yes, I'd be happy to! I'll have a look at the other tutorials to get a better idea on structure.

Can you tell me if I need to make a PR or send it over in some other way?

cmottac commented 5 years ago

@karinsasaki thank you for your nice explanation.

  • What does the graph tell me about the data? Something that may be useful: if you color the nodes based on your target value y, you may notice hotspot regions characterized by a larger presence of observations with a positive target. For instance, look at the right flare in the picture below.
[screenshot: Mapper graph colored by the target value]
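In case someone wants to reproduce this colouring, a minimal sketch, assuming y holds the 0/1 diagnosis labels and a kmapper version whose visualize exposes a color_function argument (newer releases call it color_values):

import kmapper as km

mapper = km.KeplerMapper()
graph = mapper.map(lens, X)

# Each node is coloured by aggregating the target values of its member points
mapper.visualize(graph, path_html="cancer_by_target.html",
                 color_function=y)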
sauln commented 5 years ago

Most of the other tutorials were implemented in Jupyter notebooks and then converted to docs pages during the build process. That's fine if you'd like to go with that; writing in RST is also fine.

A PR would be preferable. Once you submit it, I can do a code review/copy edit pass.

karinsasaki commented 5 years ago

@sauln sorry it took me a while! I needed to finish other stuff first. I've sent over a PR - let me know if you think of any changes or improvements!

@carlomotta yeah, I also thought about colouring by the labels (y)! I suppose quantifying the points on that flare would make sense.

cmottac commented 5 years ago

@karinsasaki I believe it would be highly beneficial to have an automated process to:

karinsasaki commented 5 years ago

@carlomotta do you have a paper (or something else) that showcases these three things being done clearly for a dataset (whichever dataset, it doesn't matter to me)? I would love to see a clear example not only of how to use the Mapper algorithm, but also of how to draw clear conclusions from the topological graph that are relevant for the data.

Regarding quantifying the nodes, I believe you can do this partly with kmapper in an automated way if you specify certain parameters in the mapper.visualize function (see the documentation for the visualize function), e.g.:

...
X: numpy arraylike
    If supplied, compute statistics information about the original data source with respect to each node.
...
lens: numpy arraylike
    If supplied, compute statistics of each node based on the projection/lens.
...
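For example, a hypothetical call passing both, if your version also accepts the X_names/lens_names parameters (feature_names stands in for whatever column names you have for X, and the lens labels are my own choice):

mapper.visualize(graph,
                 path_html="cancer.html",
                 X=X, X_names=list(feature_names),
                 lens=lens, lens_names=["IsolationForest score", "L2 norm"])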

What do you mean by "qualifying those records"?

Would you mind also sharing here the lenses and other parameter values that you used to generate the beautiful graph above, which also seems to consist of only one connected component? (Note that the topological graph from the example consists of more than one connected component.)

cmottac commented 5 years ago

@karinsasaki please check out this paper.

By qualifying a group in a topological network, I mean assessing which features in the original multidimensional space are most "responsible" for the data belonging to that group. For instance, you can perform statistical tests (e.g. the Kolmogorov–Smirnov test) to compare the distribution of each feature within a group against its distribution over all the complementary nodes.
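A minimal sketch of that test with scipy, assuming X is the data matrix, j a feature index, and members the row indices of one group (e.g. one node from graph["nodes"]):

import numpy as np
from scipy.stats import ks_2samp

in_group = X[members, j]               # distribution of feature j inside the group
rest = np.delete(X[:, j], members)     # the same feature on all other points
stat, p_value = ks_2samp(in_group, rest)  # a small p-value suggests feature j separates the group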

cmottac commented 5 years ago

@karinsasaki please also have a look at this paper.

MLWave commented 5 years ago

I am the author of that example, and not a domain expert on cancer analysis. There must be better lenses and approaches available, perhaps in: [0] [1]

This is the original write-up: http://mlwave.github.io/tda/breast-cancer-writeup.html

I treated the problem as unsupervised anomaly detection. Isolation Forest is one of the best anomaly detectors I know of, and for L2-norm I reasoned that it would give anomalous points a different norm than typical points. 2-dimensional lenses are a bit richer.

I think the original Lum et al. used the L2-norm over a distance matrix, not over the raw vectors. I'd probably try that now (i.e. use distance_matrix with l2norm, or knn_distance_5), together with sklearn.cluster.AgglomerativeClustering.
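Something like the following sketch, assuming kmapper's distance_matrix argument to fit_transform, with X as the raw feature matrix (the clustering parameters are arbitrary):

import kmapper as km
from sklearn.cluster import AgglomerativeClustering

mapper = km.KeplerMapper()

# L2-norm computed over a pairwise distance matrix rather than over the raw vectors
lens = mapper.fit_transform(X, projection="l2norm",
                            distance_matrix="euclidean", scaler=None)

graph = mapper.map(lens, X, clusterer=AgglomerativeClustering(n_clusters=3))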

> Kolmogorov–Smirnov test

You could also use the label distribution, whitebox classifier accuracy, or entropy of the clusters/nodes to get a fitness score for the graph. Then you could automatically pick from all available anomaly detectors. Though individually, lenses may give a different view altogether.
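To illustrate, a hypothetical helper that scores a graph by the average binary-label entropy of its nodes (graph["nodes"] maps node ids to member row indices; y is a numpy array of 0/1 labels; lower means purer nodes):

import numpy as np

def mean_node_entropy(graph, y):
    # Average label entropy across nodes; 0 means every node is pure.
    entropies = []
    for members in graph["nodes"].values():
        p = np.mean(y[np.asarray(members)])  # fraction of positive labels in the node
        if 0.0 < p < 1.0:
            entropies.append(-p * np.log2(p) - (1 - p) * np.log2(1 - p))
        else:
            entropies.append(0.0)
    return float(np.mean(entropies))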

For demonstrations like this, I am looking for richly connected graphs with good division.

MLWave commented 5 years ago

The original approach from Lum et al. is:

lens1: the target variable (survival or relapse)
lens2: L-infinity centrality

Lens1 then causes the layout to separate on the target. Lens2 gives a centrality measure. You can then find the different types of survival or relapse, and flares should form more clearly (each signifying a different type of departure from the centrality).
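For reference, a sketch of how that lens pair could be built outside kmapper (X is the feature matrix, y the survival/relapse target; L-infinity centrality is not a built-in projection as far as I know):

import numpy as np
from sklearn.metrics import pairwise_distances

# L-infinity centrality: each point's distance to the point farthest from it.
# Small values lie near the "centre" of the data, large values on the periphery.
dist = pairwise_distances(X)
linf = dist.max(axis=1).reshape(-1, 1)

lens = np.c_[np.asarray(y).reshape(-1, 1), linf]  # lens1 = target, lens2 = centrality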

Instead of the target variable, you may also use the predictive error of an ML model. Then you can study different types of failure, a la "Fibres of Failure".

karinsasaki commented 5 years ago

@carlomotta Thank you for the links! I'll check them out.

@MLWave Thanks so much for expanding on this example! It's great to know the thinking process.