scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Number of features has different (smaller) size after clustering #133

gabrielspadon opened this issue 7 years ago

gabrielspadon commented 7 years ago

While analyzing the clustering results, I noticed that the output always contains one element fewer than the input dataset. By output I mean the Single Linkage Tree produced by the HDBSCAN library, whether plotted or converted to a Pandas or NumPy representation.

import pandas as pd
import hdbscan as db
from sklearn.datasets import make_blobs

blobs, _ = make_blobs(n_samples=1000, n_features=10)

clusterer = db.HDBSCAN(min_cluster_size=10, min_samples=2,
                       approx_min_span_tree=False, gen_min_span_tree=True,
                       match_reference_implementation=True,
                       core_dist_n_jobs=-1).fit(blobs)

# Fails: the tree has one row fewer than the number of input samples.
assert len(clusterer.single_linkage_tree_.to_pandas()) == len(blobs), 'Must be equal!'

The missing element suggests that the data changed inside the algorithm. If the order or number of elements changes, or if the behavior is non-deterministic, it becomes impossible to keep track of individual elements.

I checked this issue extensively and could not find what was making the data change.

gabrielspadon commented 7 years ago

Digging deeper, I found this piece of documentation, where the stated shape of the tree indicates that the number of rows is always smaller by one, but there is no explanation of why this is necessary.

single_linkage_tree : ndarray, shape (n_samples - 1, 4)
    The single linkage tree produced during clustering in scipy hierarchical clustering format (see http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html).
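
For reference, a minimal sketch of what the documented shape means in practice, reusing `clusterer` and `blobs` from the snippet above (so n_samples = 1000):

tree = clusterer.single_linkage_tree_.to_pandas()
print(len(tree), len(blobs))   # 999 1000
# The row count is n_samples - 1, matching the documented shape, so the
# assertion that passes is:
assert len(tree) == len(blobs) - 1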

In view of this, I'm asking for help in understanding why this is the case.

lmcinnes commented 7 years ago

No elements are being removed; this is due to the format in which the dendrogram data is stored. To quote the relevant scipy documentation on the format (as linked in the docstring):

A (n−1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n+i. A cluster with an index less than n corresponds to one of the n original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.

In other words, each row of the array (or DataFrame, if returning the pandas format) represents a merge operation, and after n-1 merge operations we arrive at the root node, so there are always n-1 rows. The data points appear as indices in the first two columns; they are the entries with a value less than n (values of n or greater refer to internal nodes of the cluster tree).
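
To make the format concrete, here is a small self-contained sketch (using scipy's `linkage` directly rather than hdbscan, with made-up toy data) showing the n - 1 merge rows and that every original observation still appears exactly once as a leaf index below n:

import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy data: five 1-d points, so n = 5 and the linkage matrix has 4 rows.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
Z = linkage(X, method='single')
n = len(X)
print(Z.shape)  # (4, 4), i.e. (n - 1, 4): one row per merge

def describe(idx):
    # Indices below n are original observations; n and above are merged clusters.
    return f'point {int(idx)}' if idx < n else f'cluster {int(idx)}'

for i, (a, b, dist, size) in enumerate(Z):
    print(f'merge {i}: {describe(a)} + {describe(b)} -> cluster {n + i} '
          f'(distance={dist:.2f}, size={int(size)})')

# No observation is dropped: each of the n original indices appears exactly
# once as a leaf in the first two columns.
leaves = [int(v) for v in Z[:, :2].ravel() if v < n]
assert sorted(leaves) == list(range(n))

The same reading applies to the rows returned by `clusterer.single_linkage_tree_.to_pandas()` in the snippet at the top of the issue.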

gabrielspadon commented 7 years ago

Thank you for the answer, and sorry for the mess. It all makes sense now.