scikit-tda / kepler-mapper

Kepler Mapper: A flexible Python implementation of the Mapper algorithm.
https://kepler-mapper.scikit-tda.org
MIT License

Filter out nodes in graph based on number of elements #185

Open torlarse opened 4 years ago

torlarse commented 4 years ago

Is your feature request related to a problem? Please describe. I have a point cloud with approximately 2000 data points. The clustering is based on a precomputed distance matrix. The graph produced by Mapper has a lot of nodes with between 1 and 5 elements, cluttering the visualization of the graph/complex.

Describe the solution you'd like I would like to specify a minimum number of elements in a node of the graph for visualization.

Describe alternatives you've considered Filter out minor nodes from the dictionary produced by mapper.map with a dict comprehension (roughly as in the sketch below). I have looked at the scikit-learn docs for a way to filter out clusters with a small number of elements, but have not found such a parameter when using a precomputed metric. Increasing distance_threshold helps up to a certain extent.

Additional context I would like to discuss how to remove nodes from graph without affecting the topology of the dataset.
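
For concreteness, the dict-comprehension filter mentioned above might look like the sketch below. This assumes the graph returned by mapper.map keeps its usual "nodes" and "links" keys; filter_small_nodes and min_node_size are hypothetical names, not kmapper API.

# Post-hoc filter: drop nodes with fewer than min_node_size members and
# prune links that point at dropped nodes (hypothetical helper, not kmapper API).
def filter_small_nodes(graph, min_node_size=5):
    kept = {
        node_id: members
        for node_id, members in graph["nodes"].items()
        if len(members) >= min_node_size
    }
    links = {
        node_id: [t for t in targets if t in kept]
        for node_id, targets in graph["links"].items()
        if node_id in kept
    }
    return {**graph, "nodes": kept, "links": links}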

torlarse commented 4 years ago

For the record, I tried manipulating the

if precomputed:
    min_cluster_samples = 2

condition in kmapper.py. To my understanding this defines the minimal number of samples in each cube of the covering. This certainly works as a reduction method, but it is difficult to know a priori what the number of samples should be. Pushing the minimum samples higher results in different topologies.

I guess it is safer to manipulate the graph after both clustering and Mapper have run?

deargle commented 4 years ago

It looks like you were the one who authored the commit that added the fixed min_cluster_samples = 2 for precomputed? Why was fixing that value necessary? I haven't used kepler-mapper recently, but I was using a precomputed matrix before without a fixed number. I wrote the initial precomputed logic flag, which pulls (used to pull, before your commit) min_cluster_samples from the clusterer, as documented here: https://github.com/scikit-tda/kepler-mapper/blob/2ccce2d891165eead157f320fbc0a052c54c5540/kmapper/kmapper.py#L396-L401

deargle commented 4 years ago

If I loosely understand your commit message https://github.com/scikit-tda/kepler-mapper/commit/2ccce2d891165eead157f320fbc0a052c54c5540, then it's possible that the clusterer does not have n_clusters set. The logic here:

min_cluster_samples = cluster_params.get(
    "n_clusters",
    cluster_params.get(
        "min_cluster_size", cluster_params.get("min_samples", 1)
    ),
)

is intended to first try n_clusters; failing that, try min_cluster_size; failing that, try min_samples. What were you running where the clusterer had none of those three set, which was leading to a crash?
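
As an aside, here is a quick way to see how that chain resolves for a couple of clusterers, assuming cluster_params comes from the clusterer's get_params() as it does in kmapper (an illustrative script, not kmapper code):

from sklearn.cluster import DBSCAN, AgglomerativeClustering

for clusterer in (DBSCAN(min_samples=4), AgglomerativeClustering(n_clusters=None)):
    params = clusterer.get_params()
    resolved = params.get(
        "n_clusters", params.get("min_cluster_size", params.get("min_samples", 1))
    )
    # DBSCAN resolves to 4; AgglomerativeClustering resolves to None, because
    # get_params() still contains an n_clusters key, just with the value None.
    print(type(clusterer).__name__, resolved)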

torlarse commented 4 years ago

Yes, I issued the pull request you are mentioning. I seem to recall that I got a scikit-learn crash if affinity='precomputed' and n_clusters was not None. Does that make sense?

I cannot justify the fixed number 2; it probably came from the default n_clusters in the scikit-learn docs.

I must admit I did not notice the logic you are referring to; I was naively focused on fixing the specific error mentioned in the pull request. Sorry!

deargle commented 4 years ago

Want to issue a new PR that still works for your use case, which was breaking, and which re-adopts that more flexible logic?

torlarse commented 4 years ago

I certainly can, if I am able to follow the logic :)

1. By using affinity='precomputed', doesn't scikit-learn require distance_threshold to be set?
2. How "safe" is it to manipulate min_cluster_samples a priori, compared to adjusting after Mapper has returned the complex?

deargle commented 4 years ago

By using affinity='precomputed', doesn't scikit-learn require distance_threshold to be set?

Which scikit-learn clusterer are you referring to? DBSCAN doesn't -- I have used DBSCAN thusly: clusterer=km.cluster.DBSCAN(eps=.5, min_samples=4, metric='precomputed')

How "safe" is it to manipulate min_cluster_samples a priori, compared to adjusting after Mapper has returned the complex?

That would be a tad hairy to do afterwards, because the complex could include links to clusters which might be dropped during the adjustment, but in my naive opinion it's fine to do it either before or after.

torlarse commented 4 years ago

Ah, sorry for not specifying: I have used sklearn.cluster.AgglomerativeClustering with single linkage. I am trying to reproduce the work of Carlsson and Gabrielsson.

torlarse commented 4 years ago

Here is my full clustering section:

import numpy as np
from scipy import spatial
from sklearn.cluster import AgglomerativeClustering as cl

# Normalize each column by its variance, then build the condensed and
# full (redundant) pairwise distance matrices.
point_cloud_column_variance = np.var(point_cloud, axis=0)
point_cloud_normalized = point_cloud / point_cloud_column_variance
condensed_distance_matrix = spatial.distance.pdist(point_cloud_normalized)
redundant_distance_matrix = spatial.distance.squareform(condensed_distance_matrix)

# Single-linkage clustering on the precomputed distances, cutting the
# dendrogram at a fixed distance threshold instead of a cluster count.
model_vne = cl(n_clusters=None,
               affinity='precomputed',
               compute_full_tree=True,
               linkage='single',
               distance_threshold=15.0)
model_vne.fit(redundant_distance_matrix)

EDIT: added clustering import statement
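
For context, a hedged sketch of how this clusterer plugs into kmapper: it assumes map() accepts precomputed=True (so each hypercube's row/column submatrix of the distance matrix is handed to the clusterer), and the lens choice is illustrative only.

import kmapper as km

mapper = km.KeplerMapper(verbose=1)
# Lens computed on the normalized point cloud; here simply its first coordinate.
lens = mapper.fit_transform(point_cloud_normalized, projection=[0])
# X is the full distance matrix; precomputed=True tells kmapper to slice it
# per hypercube before calling the clusterer.
graph = mapper.map(
    lens,
    X=redundant_distance_matrix,
    clusterer=model_vne,
    precomputed=True,
)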

deargle commented 4 years ago

I re-read your earlier comment:

To my understanding this defines the minimal number of samples in each cube of the covering. This certainly works as a reduction method, but it is difficult to know a priori what the number of samples should be. Pushing the minimum samples higher results in different topologies.

It is important to specify here, in case you think otherwise -- hypercubes are always of a fixed interval for a given mapping run. If a given hypercube doesn't contain min_samples points, it is skipped -- it is not extended until it has min_samples within it. The logic is that if the hypercube doesn't have min_samples points, then the clusterer run on that hypercube could never find a cluster with at least min_samples points.
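
A toy illustration of that skip-don't-expand behaviour (a simplification for exposition, not kmapper's actual internals; cluster_per_cube and cube_members are made-up names):

from sklearn.cluster import DBSCAN

def cluster_per_cube(cube_members, min_cluster_samples=4):
    # cube_members maps a cube id to the array of points falling inside it.
    clusters = {}
    for cube_id, points in cube_members.items():
        if len(points) < min_cluster_samples:
            continue  # skip the cube outright; its interval is never widened
        clusters[cube_id] = DBSCAN(
            eps=0.5, min_samples=min_cluster_samples
        ).fit_predict(points)
    return clusters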

Because it skips, rather than expands the window, there shouldn't be any difference in mapper output whether clusters are filtered out before or after the graph is created.

... except, hmm, in your case, where min_samples isn't used, I see that AgglomerativeClustering breaks when not fed at least two samples: https://github.com/scikit-learn/scikit-learn/blob/e5698bde9a8b719514bf39e6e5d58f90cfe5bc01/sklearn/cluster/_agglomerative.py#L796 I forgot to add an "otherwise" condition to this logic:

min_cluster_samples = cluster_params.get(
    "n_clusters",
    cluster_params.get(
        "min_cluster_size", cluster_params.get("min_samples", 1)
    ),
)

... failing the first three lookups, a min_cluster_samples of 1 is used.

I wonder (1) whether some people would be interested in retaining one-element clusters (I bet the answer is "yes"), and if so, (2) how best to tell whether a clusterer that requires at least two samples is being used.

deargle commented 4 years ago

Perhaps the default value of n_clusters for the constructor could be used as a fallback. For AgglomerativeClustering, that's 2. Python's standard inspect module, for example, can be used to get a function's default argument values.
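
For instance, inspect can read that default off the constructor (a sketch of the fallback idea, not kmapper code):

import inspect
from sklearn.cluster import AgglomerativeClustering

sig = inspect.signature(AgglomerativeClustering.__init__)
print(sig.parameters["n_clusters"].default)  # -> 2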

deargle commented 4 years ago

Or maybe, if distance_threshold is set in cluster_params, then default to a minimum of 2. I don't know how well that would generalize across scikit-learn clusterers, though.

deargle commented 4 years ago

@sauln what do you think about the following: allow min_cluster_samples to be passed to .map(), which would take precedence over the value inferred from the clusterer's params?

I don't see an easier way to handle clusterers such as AgglomerativeClustering which, as in @torlarse's case, must have at least two samples per clustering attempt but don't necessarily have any of n_clusters, min_cluster_size, or min_samples set. My above idea of checking constructor defaults for n_clusters doesn't generalize well to clusterers such as Birch, and otherwise there is no reliable kmapper way to specify a minimum number of samples per cluster.
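
Sketched out, the proposal might look like this (a hypothetical helper, not an actual kmapper function; the final fallback of 1 matches the logic quoted above):

def resolve_min_cluster_samples(cluster_params, min_cluster_samples=None):
    # An explicit value passed to .map() would win outright.
    if min_cluster_samples is not None:
        return min_cluster_samples
    # Otherwise fall back to the existing inference chain.
    return cluster_params.get(
        "n_clusters",
        cluster_params.get(
            "min_cluster_size", cluster_params.get("min_samples", 1)
        ),
    )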

torlarse commented 4 years ago

It is important to specify here, in case you think otherwise -- hypercubes are always of a fixed interval for a given mapping run. If a given hypercube doesn't contain min_samples points, it is skipped -- it is not extended until it has min_samples within it.

Yes, thanks, I hope I have this understanding of hypercubes; pardon my imprecise English. Setting min_cluster_samples=20 on my data caused Mapper to return a "broken", non-circular topology. That makes sense, since by inspection I can see that the nodes around the holes have between 10 and 15 elements.

Anyway, I made things work through a combination of setting min_cluster_samples=10 in kmapper.py and distance_threshold=15 in the clustering function call. I don't know if this is a problem others will experience.

deargle commented 4 years ago

I think agglomerative clustering is common enough that ideally kmapper should accommodate it without users having to modify kmapper.py -- glad you found a workaround though.

deargle commented 4 years ago

@sauln I'm going to make the final fallback a minimum of 2, not 1, and take out the hard-coded 2 for precomputed matrices.
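
In sketch form (not the exact committed diff), the change would amount to:

# Final fallback becomes 2 instead of 1 ...
min_cluster_samples = cluster_params.get(
    "n_clusters",
    cluster_params.get(
        "min_cluster_size", cluster_params.get("min_samples", 2)
    ),
)
# ... and the precomputed branch no longer hard-codes min_cluster_samples = 2.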

sauln commented 4 years ago

Seems reasonable to me 👍

deargle commented 3 years ago

I just found this again because I have @torlarse's original problem -- I want to filter out nodes that are too small for my liking, and min_samples doesn't do that.

@sauln I'm going to add an argument to .map that does what was originally asked -- specify a min number of points that a cluster must have in order to be retained as a node. "min_node_samples"?

I'll also add a filter to the vis, because hey why not