scikit-tda / kepler-mapper

Kepler Mapper: A flexible Python implementation of the Mapper algorithm.
https://kepler-mapper.scikit-tda.org
MIT License

Question: Metric selection is not available? #3

Closed: yuzuhikorunrun closed this issue 7 years ago

yuzuhikorunrun commented 7 years ago

Hi all,

I am switching from Ayasdi to open source Mapper and am wondering whether Kepler-mapper has a metric selection function, similar to the "projection"/lens selection function in this class. As far as I know, the original Python Mapper implementation offers the following list of metric options:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html

Can anyone help on this please?

-thanks.

MLWave commented 7 years ago

You could use a custom projection/lens and feed it to mapper.map().

An example of this is in examples/breast-cancer, where I create a custom lens with the IsolationForest:

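# We create a custom 1-D lens with Isolation Forest
# (X is the data matrix, one row per sample, loaded earlier in the example)
from sklearn import ensemble

model = ensemble.IsolationForest(random_state=1729)
model.fit(X)
# decision_function returns one score per sample; reshape it into a column
lens1 = model.decision_function(X).reshape((X.shape[0], 1))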

Note the reshaping, which turns the 1-D output into a column of the form [[val1], [val2], ...].
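The reshaping matters because the lens is passed around as a 2-D array with one row per sample. A minimal sketch of just that step, using NumPy only; the `scores` values are made up and stand in for what `model.decision_function(X)` would return:

```python
import numpy as np

# Hypothetical 1-D scores, one per sample (stand-in for decision_function output).
scores = np.array([0.12, -0.05, 0.30])

# Turn shape (n,) into shape (n, 1): one row per sample, one lens dimension.
lens = scores.reshape(-1, 1)

print(scores.shape)  # (3,)
print(lens.shape)    # (3, 1)
```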

But I could fairly easily add those SciPy distances into Kepler-Mapper itself. SciPy is already required by scikit-learn, so it would not create another dependency.

Let me have a look at this and thanks for the suggestion!

MLWave commented 7 years ago

Ahh wait, I think those are the distance metrics used by the clusterer?

In that case you can set the metric yourself on a clusterer and pass that clusterer to the clusterer parameter.

Sklearn has Agglomerative Clustering which allows for distance metric selection. Here you set it with l2-distance:

graph = mapper.map(lens, 
                   X, 
                   nr_cubes=15, 
                   overlap_perc=0.7, 
                   clusterer=km.cluster.AgglomerativeClustering(n_clusters=2,
                                                                affinity="l2",
                                                                linkage="average"))

Note that you also have to set the linkage parameter to something other than "ward", or else it only accepts euclidean distance.

affinity : string or callable, default: “euclidean”
Metric used to compute the linkage. Can be “euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed”. If linkage is “ward”, only “euclidean” is accepted.

Since affinity can be a callable, you can probably even set it with those SciPy metrics. Otherwise, you have all of the metrics listed at: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html

More:
http://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_clustering_metrics.html
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html
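One way to use an arbitrary scipy.spatial.distance.pdist metric with AgglomerativeClustering is to precompute the distance matrix and tell the clusterer the input is "precomputed". A rough sketch, assuming SciPy and scikit-learn are installed. The toy data is made up, and the keyword lookup is only there because newer scikit-learn releases name this parameter `metric` rather than `affinity`:

```python
import inspect

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AgglomerativeClustering

# Made-up toy data: two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Any pdist metric name can go here; "cityblock" is just an example.
D = squareform(pdist(X, metric="cityblock"))

# Pick whichever keyword the installed scikit-learn version understands.
params = inspect.signature(AgglomerativeClustering.__init__).parameters
metric_kw = "metric" if "metric" in params else "affinity"

clusterer = AgglomerativeClustering(
    n_clusters=2,
    linkage="average",  # anything other than "ward", as noted above
    **{metric_kw: "precomputed"},
)
labels = clusterer.fit_predict(D)
print(labels)
```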

MLWave commented 7 years ago

DBSCAN (and HDBSCAN) also allow for metric selection:

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

With these you get all of the metrics in:

http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.pairwise
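As a small illustration, here is a sketch of choosing the metric on DBSCAN directly. It assumes scikit-learn is installed; the toy data and the `eps` / `min_samples` values are made up for the example and are not recommendations:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up toy data: two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# The metric is selected on the clusterer itself.
clusterer = DBSCAN(eps=0.5, min_samples=2, metric="manhattan")
labels = clusterer.fit_predict(X)

# DBSCAN marks noise points with the label -1; exclude it from the count.
n_clusters = len(set(labels.tolist()) - {-1})
print(n_clusters)  # 2
```

A clusterer object configured like this is the kind of thing that would be handed to the clusterer parameter of mapper.map().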

MLWave commented 7 years ago

(In principle, there could also be filter functions which require vector data and do not work on a dissimilarity matrix. No such filter function is currently present in the module.) [Python Mapper]

I must admit I never used Ayasdi software. I think you are talking about turning the data into a (dis)similarity matrix. I added this functionality to Kepler-Mapper.

You can use any scipy.spatial.distance.pdist metric to create this pairwise distance matrix, before any filter/projection/lens function is applied. Pass distance_matrix="euclidean" to fit_transform to get a pairwise distance matrix using Euclidean distance.
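For reference, this is roughly what building such a matrix with SciPy looks like. It is a sketch of the general technique with made-up data; whether fit_transform does exactly this internally is an assumption:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Made-up data: m = 3 samples with n = 2 dimensions.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# pdist returns the condensed form: one entry per unordered pair of rows.
condensed = pdist(X, metric="euclidean")

# squareform expands it into the full symmetric m x m matrix.
D = squareform(condensed)

print(D.shape)  # (3, 3)
print(D[0, 1])  # 5.0
```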

I also added knn_distance_n to the selection of projections/lenses/filters. It calculates the sum of the distances to the n nearest neighbors. If you pass distance_matrix, it uses the distance matrix to find neighbors; otherwise it fits nearest neighbors on the vector data.
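To make the idea concrete, here is a sketch of a "sum of distances to the n nearest neighbors" lens computed from a distance matrix with NumPy. The helper name `knn_distance_lens` is hypothetical and this is not the library's code. In particular, leaving the point itself out of its own neighbors is an assumption of the sketch:

```python
import numpy as np

def knn_distance_lens(D, n):
    """Hypothetical helper: sum of distances to the n nearest neighbors.

    D is a square pairwise distance matrix. Each point's zero distance
    to itself is skipped (an assumption made for this sketch).
    """
    D = np.asarray(D, dtype=float)
    sorted_rows = np.sort(D, axis=1)
    # Column 0 of each sorted row is the self-distance 0, so start at column 1.
    return sorted_rows[:, 1:n + 1].sum(axis=1).reshape(-1, 1)

# Made-up 1-D points; pairwise distances are absolute differences.
points = np.array([0.0, 1.0, 3.0, 7.0])
D = np.abs(points[:, None] - points[None, :])

lens = knn_distance_lens(D, n=2)
print(lens.ravel().tolist())  # [4.0, 3.0, 5.0, 10.0]
```

Isolated points get large values under such a lens, so it behaves like a rough density or outlier score.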

I am not sure whether the clustering should be performed on the distance matrix or on the original vectors. If you know this detail, please let me know.

yuzuhikorunrun commented 7 years ago

Thanks for the fast reply. I am indeed talking about applying the scipy.spatial.distance.pdist metrics before any lens function is applied (the same as the standard procedure for model creation within Ayasdi).

But I am not sure what you mean by (dis)similarity matrix. What I am trying to do is to feed an m*n dimension dataset into this algorithm and let it find out the clusters for me.

Another question of interest is: does the lens/filter/projection parameter option include the columns (variables) in the dataset (so that we can build a supervised model)?

Thanks.

MLWave commented 7 years ago

But I am not sure what you mean by (dis)similarity matrix.

I looked at making Kepler-Mapper closer to http://danifold.net/mapper/filters.html

If the array data is two-dimensional of shape (n,d), the rows are interpreted as n data points in a d-dimensional vector space, and the pairwise distances are generated from the vector data and the metricpar parameter. See the function scipy.spatial.distance.pdist for possible entries in the metricpar dictionary.

This you can now do in Kepler-Mapper by feeding distance_matrix="euclidean".

(In principle, there could also be filter functions which require vector data and do not work on a dissimilarity matrix. No such filter function is currently present in the module.)

Requiring vector data is the standard in Kepler-Mapper, but now it also lets you create distance matrices first.

What I am trying to do is to feed an m*n dimension dataset into this algorithm and let it find out the clusters for me.

You can do this right now. Feed m samples with n dimensions and specify a projection type. If you specify a distance matrix metric, your data gets turned into an m*m matrix of pairwise distances under the chosen metric, before any specified lens is applied.
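The order of operations can be sketched with NumPy alone. The data is random, and the row-sum lens is just one example of a projection applied to the rows of the m*m matrix; none of this is Kepler-Mapper's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # m = 5 samples, n = 3 dimensions

# Step 1: m x n data -> m x m Euclidean distance matrix (via broadcasting).
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

# Step 2: the lens is computed on the rows of D, not on the original vectors.
lens = D.sum(axis=1).reshape(-1, 1)

print(X.shape, D.shape, lens.shape)  # (5, 3) (5, 5) (5, 1)
```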

Another question of interest is: does the lens/filter/projection parameter option include the columns (variables) in the dataset (so that we can build a supervised model)?

It does not support this right now. I guess you could manually add the target column to the data first. You can also do something semi-supervised-y by passing the target column y to custom_tooltips=y and setting color_function="average_signal_cluster".
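Conceptually, coloring by the target amounts to averaging y over the samples in each node. A sketch of that averaging follows. The `nodes` dictionary layout and the node names are assumptions made for illustration, not a documented output format, and this is not claimed to be what color_function="average_signal_cluster" computes internally:

```python
import numpy as np

# Hypothetical mapper output: node id -> indices of the samples in that node.
# Overlapping cubes mean a sample (here index 2) can appear in several nodes.
nodes = {
    "cube0_cluster0": [0, 1, 2],
    "cube1_cluster0": [2, 3],
    "cube2_cluster0": [4],
}

# Made-up binary target column, one value per sample.
y = np.array([0, 0, 1, 1, 1])

# Average target value per node; this is the quantity one would color by.
node_mean = {node_id: float(y[members].mean())
             for node_id, members in nodes.items()}

for node_id, value in node_mean.items():
    print(node_id, round(value, 3))
```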

If you explain in more detail how the model itself is built, I can look into working that in.

Otherwise, I have an update planned where you can use Kepler-Mapper node IDs as random forest leaves of sorts.

yuzuhikorunrun commented 7 years ago

Hi,

Thanks for the reply. The detailed "supervised model as in Ayasdi" is as below:

1) Given m samples with n dimensions, one (or more) of the n dimensions will be the outcome variable(s).
2) Choose the metric before the lens, since metric selection largely depends on the nature of the data (e.g., if the data is continuous and measures comparable items, then use cosine, correlation, angle, or euclidean).
3) Choose the lens based on the goal of the network (whether it is to identify anomalies or to magnify differences between groups).
4) Choose the outcome column(s) as the lens to make the model a supervised one. In this case, if the outcome column is binary (Yes/No), the clusters will have 2 colors (but not necessarily only 2 clusters). We can also choose to color the nodes/edges by a different column of the dataset (instead of the outcome column); an example would be to color them using a continuous variable (say SaO2, arterial oxygen saturation).
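In array terms, step 4 could look something like the sketch below. The data and the column index are made up. Whether the outcome column is also removed from the features that get clustered is not specified above, so dropping it here is an assumption:

```python
import numpy as np

# Made-up dataset: m = 4 samples, n = 3 columns; the last column is the outcome.
data = np.array([[5.1, 0.2, 1.0],
                 [4.9, 0.4, 0.0],
                 [6.3, 1.8, 1.0],
                 [5.8, 1.2, 0.0]])
outcome_col = 2  # hypothetical index of the outcome variable

# The outcome column becomes the lens, kept 2-D with shape (m, 1).
lens = data[:, [outcome_col]]

# The remaining columns are the features to be clustered (assumption).
X = np.delete(data, outcome_col, axis=1)

print(lens.shape, X.shape)  # (4, 1) (4, 2)
```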

Supervised ones are not used often, since this actually leaks the outcome information before mapping, but on some occasions they are quite informative.

Does the above example help?

Besides this, I wonder whether, in your next update, we could see which node contains which IDs, and whether we could download the data points in each cluster (this is a feature in Ayasdi)?