oist-ncbc / spykesim

Extended edit similarity measurement for high-dimensional discrete-time series signals (e.g., multi-unit spike trains).
https://pypi.org/project/spykesim
MIT License

Clustering on similarity matrix #22

Closed: rcojocaru closed this issue 4 years ago

rcojocaru commented 4 years ago

Hi.

In spykesim/editsim.pyx you perform the clustering directly on the similarity matrix, like this:

```python
def clustering(self, min_cluster_size=5):
    """
    Perform HDBSCAN clustering algorithm on the similarity matrix calculated by gensimmat
    """
    self.clusterer = HDBSCAN(min_cluster_size=min_cluster_size)
    self.cluster_labels = self.clusterer.fit_predict(self.simmat)
```

Given that self.simmat is a similarity matrix, I think it should first be converted to a distance matrix, and the clustering should then be performed using the metric='precomputed' option of HDBSCAN. Something like this (assuming self.distmat is the distance matrix obtained from self.simmat):

```python
self.clusterer = HDBSCAN(min_cluster_size=min_cluster_size, metric='precomputed')
self.cluster_labels = self.clusterer.fit_predict(self.distmat)
```
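
For concreteness, here is a minimal self-contained sketch of what I mean. The helper name cluster_precomputed and the max-minus-similarity conversion are assumptions on my side, not spykesim code; any monotone similarity-to-distance transform would do:

```python
import numpy as np
from hdbscan import HDBSCAN

def cluster_precomputed(simmat, min_cluster_size=5):
    # Convert similarity to distance; subtracting from the maximum is one
    # common choice (assumes simmat is symmetric and non-negative).
    distmat = simmat.max() - simmat
    np.fill_diagonal(distmat, 0.0)  # each window is at distance 0 from itself
    clusterer = HDBSCAN(min_cluster_size=min_cluster_size, metric="precomputed")
    return clusterer.fit_predict(distmat.astype(np.float64))
```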

I tried the code in its current state in the tutorial, and I get 4 'valid' clusters (cluster label >= 0) and 5 windows classified as noise. If I implement these modifications, I get 3 equal 'valid' clusters and 45 windows classified as noise, which makes more sense to me.

Thanks!

rcojocaru commented 4 years ago

I think I have managed to understand why you do the clustering directly on the similarity matrix. I had never seen it done this way, but I think it makes sense: for each window, you use its similarity to all the other windows as a set of features on which the final clustering is based. This is what makes the code flexible enough to catch certain patterns in real data. The self-similarity values (the main diagonal of the similarity matrix) may be a bit problematic here, as they can sometimes be very high, but with many neurons they will not really affect the outcome.
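
To illustrate how I now read the current behaviour, here is a sketch of my understanding, not the actual editsim.pyx code; zeroing the diagonal is a hypothetical tweak to blunt the self-similarity issue mentioned above:

```python
import numpy as np
from hdbscan import HDBSCAN

def cluster_on_similarity_rows(simmat, min_cluster_size=5):
    # Each row of the similarity matrix is the feature vector of one
    # window; HDBSCAN then clusters windows in that feature space.
    feats = np.array(simmat, dtype=np.float64)
    np.fill_diagonal(feats, 0.0)  # hypothetical: suppress self-similarity
    clusterer = HDBSCAN(min_cluster_size=min_cluster_size)
    return clusterer.fit_predict(feats)
```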

The downside of this method is that some windows will be added to a cluster without actually having similar neuron activity (which I think is what is happening with the extra cluster in the tutorial, which even has 10 windows with 0 active neurons). There will also be "quiet clusters", formed out of windows with very little neuron activity that are not necessarily similar in any other way. I see you exclude some of these windows/clusters when you compute the profiles. It might be interesting to use this process to also trim down the cluster members, so that at the end you know how many windows actually correspond to the cluster profiles obtained.
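
As an example of the kind of trimming I have in mind (the helper trim_quiet_windows, the window format, and the min_active_neurons threshold are all hypothetical):

```python
import numpy as np

def trim_quiet_windows(windows, labels, min_active_neurons=2):
    # windows: iterable of (n_neurons x n_timebins) spike-count arrays.
    # labels: cluster labels from HDBSCAN, where -1 already means noise.
    # Windows with fewer than min_active_neurons active neurons are
    # relabeled as noise, mirroring the exclusion used for the profiles.
    trimmed = np.asarray(labels).copy()
    for i, window in enumerate(windows):
        active = np.count_nonzero(np.asarray(window).sum(axis=1))
        if active < min_active_neurons:
            trimmed[i] = -1
    return trimmed
```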

Anyhow, my initial issue is not really valid, sorry! Feel free to close it.

KeitaW commented 4 years ago

Sorry for the late reply. But

> The downside of this method is that some windows will be added to a cluster without actually having similar neuron activity (which I think is what is happening with the extra cluster in the tutorial, which even has 10 windows with 0 active neurons). There will also be "quiet clusters", formed out of windows with very little neuron activity that are not necessarily similar in any other way. I see you exclude some of these windows/clusters when you compute the profiles. It might be interesting to use this process to also trim down the cluster members, so that at the end you know how many windows actually correspond to the cluster profiles obtained.

part is surely an interesting enhancement of the feature. Let me take a look at the code and see how easy or hard it would be to implement.