Closed dvaler01 closed 4 years ago
Hi Dimitris,
I set the number of dimensions of the manifold to be the number of clusters as, heuristically, it seemed reasonable to me. It may well be that there is a better approach here.
For visualisation I set the manifold dimensions to 2. Reapplying the manifold learner with 2 dimensions to the k-dimensional manifold (where k is the number of clusters) was something I considered; however, I wasn't too sure what it would mean to reapply the manifold learner to an already learned manifold.
I'm curious what your thoughts are w.r.t. reapplying the manifold learner to the existing learned manifold?
Best, Ryan
Hello Ryan,
About the number of dimensions: intuitively it makes sense to me, so I will follow your approach. But maybe it also depends on the dataset and/or other components. What I want to say is: if I have a dataset of k clusters, but I know that 2 clusters are close to each other, maybe I can use k-1 dimensions (just an example).
About reapplying the manifold learner, my thought was to apply UMAP with k dimensions, and then apply Isomap to take it down to 2D while retaining the global structure. At the end of the day it doesn't matter for this case, but I was just thinking through the alternatives.
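A minimal sketch of that two-stage idea (chaining one manifold learner after another), not from the thread: it uses scikit-learn's Isomap for both stages purely as a stand-in, since the mechanics are the same whichever learner you pick; `umap.UMAP(n_components=k).fit_transform(...)` would plug into stage 1 identically.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# toy data on a connected manifold (standing in for an autoencoder embedding)
X, _ = make_swiss_roll(n_samples=500, random_state=0)

k = 3  # e.g. the number of clusters, used as the first manifold dimension

# stage 1: learn a k-dimensional manifold (UMAP with n_components=k in the thread)
Z_k = Isomap(n_components=k, n_neighbors=10).fit_transform(X)

# stage 2: reapply a manifold learner to the already learned manifold, down to 2D
Z_2 = Isomap(n_components=2, n_neighbors=10).fit_transform(Z_k)

print(Z_k.shape)  # (500, 3)
print(Z_2.shape)  # (500, 2)
```

Whether the second stage actually preserves the global structure of the first is the open question Dimitris raises; the sketch only shows the mechanics.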
P.S. In my case the number of clusters is not "clearly" known, and I was thinking of applying a density-based approach at the end to explore the clusters further (just so it makes sense to you why I am saying the above).
Dimitris.
Hi Dimitris
Yes, I think if you have some more knowledge about the clusters, then using it when setting the manifold size makes sense to me. However, I believe the method should be fairly robust for sensible values, so I think it will be OK in cases where you don't know the number of clusters.
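As an illustrative sanity check of that robustness claim (an assumption for demonstration, not from the thread): on well-separated synthetic data, clustering quality is fairly insensitive to the embedding dimension chosen. This sketch uses PCA as a simple, deterministic stand-in for the manifold learner, and adjusted Rand index as the quality score:

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# 4 well-separated clusters in 20 dimensions
X, y = make_blobs(n_samples=400, n_features=20, centers=4, random_state=1)

# try several embedding dimensions around the "dimensions = clusters" heuristic
scores = {}
for dim in (2, 3, 4, 6):
    Z = PCA(n_components=dim, random_state=0).fit_transform(X)
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Z)
    scores[dim] = adjusted_rand_score(y, labels)

print(scores)  # close to 1.0 for every dimension on this easy data
```

Real, noisy embeddings will be less forgiving, but the same loop is an easy way to probe the sensitivity on your own data.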
The density-based clustering is a great idea. @josephsdavid has been working on a library-like implementation here [1]. I'm not too familiar with the current state of the library, but it may be easier to extend with a density-based approach than my code.
Let me know if you've any other questions, thoughts or comments. I'm happy to help if I can.
[1] https://github.com/josephsdavid/N2D
Best, Ryan
Hi Dimitris! I actually have an example of density-based clustering I wrote as scratch work when I was first starting the project! Currently it needs to be updated: (A) to match the current state of the library and likely be properly packaged, and (B) the library needs to be made compatible with TF 2 now that it's not so slow (should be easy).
As far as intuition w.r.t. the number of dimensions vs. the number of clusters goes, I have none at this point, but one thing I have noticed is that if your loss with the autoencoder is high (say you do early stopping or something), using a higher number of neighbors in UMAP (20-30 instead of 10-20) tends to produce better results.
old example of density based clustering
simple guide to put a new clustering algorithm in the library
@dvaler01 Just finishing up a big refactor of the code, hopefully this will be useful for you:
```python
import os
import n2d
import random as rn
import numpy as np
import n2d.datasets as data
import hdbscan
import umap

# load up mnist example
x, y = data.load_mnist()

# autoencoder can just be passed normally, see the other examples for extending it
ae = n2d.AutoEncoder(input_dim=x.shape[-1], output_dim=20)

# arguments for the clusterer go in a dict
hdbscan_args = {"min_samples": 10, "min_cluster_size": 500, "prediction_data": True}

# arguments for the manifold learner go in a dict
umap_args = {"metric": "euclidean", "n_components": 2, "n_neighbors": 30, "min_dist": 0}

# pass the classes and dicts into the generator:
# manifold class, manifold args, cluster class, cluster args
db = n2d.manifold_cluster_generator(umap.UMAP, umap_args, hdbscan.HDBSCAN, hdbscan_args)

# pass the manifold-cluster tool and the autoencoder into the n2d class
db_clust = n2d.n2d(db, ae)

# fit
db_clust.fit(x, epochs=10)

# the clusterer is a normal hdbscan object
print(db_clust.clusterer.probabilities_)
print(db_clust.clusterer.labels_)

# access the manifold learner at
print(db_clust.manifolder)

# if the parent classes have a method you can likely use it (make an issue if not)
db_clust.fit_predict(x, epochs=10)

# however this would error, because hdbscan doesn't have that method:
# db_clust.predict(x)

# predict on new data with approximate prediction
x_test, y_test = data.load_mnist_test()

# access the parts of the autoencoder within n2d or outside of it
test_embedding = ae.encoder.predict(x_test)
test_n2d_embedding = db_clust.encoder.predict(x_test)
test_embedding - test_n2d_embedding  # all zeros

test_labels, strengths = hdbscan.approximate_predict(
    db_clust.clusterer, db_clust.manifolder.transform(test_embedding)
)
print(test_labels)
print(strengths)
```
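Since the number of clusters isn't known up front in Dimitris's case, one generic follow-up (not part of n2d, just the usual HDBSCAN/DBSCAN convention) is to read the discovered cluster count off the clusterer's `labels_`, where -1 marks noise points. A sketch with a hypothetical labels array standing in for `db_clust.clusterer.labels_`:

```python
import numpy as np

# hypothetical labels_ array from a density-based clusterer;
# by HDBSCAN/DBSCAN convention, -1 marks noise points
labels = np.array([0, 0, 1, 1, 1, -1, 2, 2, -1, 0])

# count distinct non-noise labels and the noise points separately
n_clusters = len(set(labels[labels >= 0]))
n_noise = int(np.sum(labels == -1))

print(n_clusters)  # 3
print(n_noise)     # 2
```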
Will have the much more easily extensible refactor (i.e. the source of this code) up within the next 24 hours.
Hello both,
Thank you both for your messages. I checked the library and the corresponding documentation, and I think it will be very useful and handy. I will start with it as my baseline and will contact you if I have any more questions. Great work :)
Dimitris.
Should have new version up and documented within an hour! Cheers!
Hello,
I read the paper a couple of times, and everything was clear to me (for now) except for 2 points.
First, why do you set the number of dimensions to the number of clusters (when you are using the manifold learning algorithm, UMAP etc.)? Second, for the visualization, do you change the number of dimensions to 2, or do you add one more manifold learning algorithm with the number of dimensions set to 2?
Dimitris.