Open KukumavMozolo opened 6 months ago
@KukumavMozolo any success with this? My (limited) experience is that hdbscan does not support sparse matrices for the distance matrix.
Unfortunately I gave up on it for now and just uniformly down-sampled the data :(
However, I revisited the problem and managed to precompute a sparse distance matrix semi-efficiently using numba (and numba-progress for verbosity), storing only distances smaller than a threshold. I also put an epsilon value on the diagonal of the distance matrix; I am not sure this is necessary. The example code below is for data encoded in scipy's sparse CSR format. One has to hand in `data`, `indices`, and `indptr` directly, since `csr_matrix` is not well supported by numba. With the distance matrix computed, one has to set hdbscan's metric to `"precomputed"`:
```python
import numba
import numpy as np
from numba import prange


@numba.jit(nogil=True, parallel=True)
def _pairwise_distances_sparse_dist(n_points: int, indptr: np.ndarray, indices: np.ndarray, data: np.ndarray, metric: str, progress, threshold: float, epsilon: float, tup_type):
    if metric == "euclidean_distance":
        func = sparse_euclidean  # numba-compatible distance between two sparse rows
    else:
        raise NotImplementedError
    # one typed list of (distance, column) tuples per row; a comprehension avoids
    # aliasing the same list n_points times, as `[...] * n_points` would
    results = [numba.typed.List.empty_list(tup_type) for _ in range(n_points)]
    for i in prange(n_points):
        i_start = indptr[i]
        i_end = indptr[i + 1]
        i_indices = indices[i_start:i_end]
        i_data = data[i_start:i_end].astype(np.float32)
        data_col = numba.typed.List.empty_list(tup_type)
        data_col.append((np.float32(epsilon), int(i)))  # epsilon on the diagonal
        for j in range(i + 1, n_points):  # upper triangle only
            j_start = indptr[j]
            j_end = indptr[j + 1]
            j_indices = indices[j_start:j_end]
            j_data = data[j_start:j_end].astype(np.float32)
            res = func(i_indices, i_data, j_indices, j_data)
            if res < threshold:  # keep only distances below the threshold
                data_col.append((res, j))
        results[i] = data_col
        progress.update(1)
    # flatten the per-row lists into CSR arrays
    # (renamed so they do not shadow the input `data`/`indices`)
    out_data = []
    out_indices = []
    row_ptr = [0]
    for idx, res in enumerate(results):
        row_ptr.append(row_ptr[idx] + len(res))
        for (d, c) in res:
            out_data.append(d)
            out_indices.append(c)
        results[idx] = numba.typed.List.empty_list(tup_type)  # free the row as we go
    return np.array(out_data), np.array(out_indices), np.array(row_ptr)
```
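`sparse_euclidean` is not defined above; any numba-compatible function that takes two rows as (indices, values) pairs and returns a scalar distance works. A minimal sketch of what it could look like, as a merge over the two sorted index arrays (the name and signature just mirror the call above; I believe pynndescent ships numba-compiled sparse distances along these lines too):

```python
@numba.njit(nogil=True)
def sparse_euclidean(ind1, data1, ind2, data2):
    # merge the two sorted index arrays, accumulating squared differences
    result = 0.0
    i1 = 0
    i2 = 0
    while i1 < ind1.shape[0] and i2 < ind2.shape[0]:
        j1 = ind1[i1]
        j2 = ind2[i2]
        if j1 == j2:  # index present in both rows
            d = data1[i1] - data2[i2]
            result += d * d
            i1 += 1
            i2 += 1
        elif j1 < j2:  # index only in the first row
            result += data1[i1] * data1[i1]
            i1 += 1
        else:  # index only in the second row
            result += data2[i2] * data2[i2]
            i2 += 1
    while i1 < ind1.shape[0]:
        result += data1[i1] * data1[i1]
        i1 += 1
    while i2 < ind2.shape[0]:
        result += data2[i2] * data2[i2]
        i2 += 1
    # cast so it matches the float32 tuples collected by the caller
    return np.float32(np.sqrt(result))
```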
This potentially produces a disconnected graph in some cases, so one needs to join the disconnected components. Here I just picked the first node of each component and connected it to the first node of every other component via a uniform max_distance value. This might affect the solution hdbscan computes, so one might want to do this differently (e.g. connect the components at their closest points, using the actual distance).
```python
import time

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from tqdm import tqdm


def _connect_components(distance_matrix: csr_matrix, max_distance: float) -> csr_matrix:
    n_components, labels = connected_components(csgraph=distance_matrix, directed=False, return_labels=True)
    unique_labels = np.unique(labels)
    start = time.time()
    # first node encountered with each label becomes that component's representative
    label_idx_dict = {}
    for idx, l in enumerate(labels):
        if l not in label_idx_dict:
            label_idx_dict[l] = idx
    end = time.time()
    print(f"Computing label_idx_dict took {end - start} seconds")
    others = unique_labels.copy()
    rows = []
    cols = []
    data = []
    # connect every pair of representatives symmetrically with max_distance
    for l in tqdm(unique_labels, total=n_components):
        node_idx = label_idx_dict[l]  # the representative node index, not the label itself
        others = others[1:]
        for other in others:
            other_idx = label_idx_dict[other]
            rows.append(node_idx)
            cols.append(other_idx)
            data.append(max_distance)
            rows.append(other_idx)
            cols.append(node_idx)
            data.append(max_distance)
    new_mat = csr_matrix((data, (rows, cols)), shape=distance_matrix.shape)
    start = time.time()
    distance_matrix = distance_matrix + new_mat
    end = time.time()
    print(f"Computing merged distance_matrix took {end - start} seconds")
    new_n_components, new_labels = connected_components(csgraph=distance_matrix, directed=False, return_labels=True)
    print(new_n_components)  # should now print 1
    return distance_matrix
```
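Putting the pieces together looks roughly like this (a sketch, not a definitive recipe: the threshold, epsilon, max_distance, and min_cluster_size values are placeholders, the toy `X` stands in for real data, and `ProgressBar` is from the numba-progress package; since the matrix above only stores the upper triangle, I symmetrize it before clustering):

```python
import hdbscan
import numba
import numpy as np
import scipy.sparse as sp
from numba_progress import ProgressBar

# toy stand-in for the real data: a sparse CSR matrix of high-dimensional points
X = sp.random(1000, 5000, density=0.001, format="csr", dtype=np.float32)
n_points = X.shape[0]

# element type of the (distance, column) tuples collected per row
tup_type = numba.types.Tuple((numba.types.float32, numba.types.int64))

with ProgressBar(total=n_points) as progress:
    d, ind, ptr = _pairwise_distances_sparse_dist(
        n_points, X.indptr, X.indices, X.data,
        "euclidean_distance", progress, 10.0, 1e-8, tup_type)

dist = sp.csr_matrix((d, ind, ptr), shape=(n_points, n_points))
dist = dist.maximum(dist.T)  # only the upper triangle was stored; symmetrize
dist = _connect_components(dist, max_distance=20.0)

clusterer = hdbscan.HDBSCAN(min_cluster_size=15, metric="precomputed")
labels = clusterer.fit_predict(dist)
```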
Hi!
So I am working on the following problem: I have millions of sparse data points that are very high dimensional.
Using a sparse precomputed distance matrix seems like one way to feed this data into hdbscan.
My current idea is to only store those distances that are below a certain threshold, or to use a fixed number of distances for every point, and then ensure that there are no disconnected components in the resulting graph. How do hdbscan's hyperparameters interact with the required level of sparsity of that matrix? E.g. given a fixed `min_cluster_size`, `min_samples`, and `cluster_selection_epsilon`, how would that constrain the threshold or the number of distances per point, so that the resulting clustering is no different from when providing the full distance matrix?
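For the fixed-number-of-distances variant, one way I could imagine building such a matrix is sklearn's `kneighbors_graph`, which accepts sparse CSR input. A minimal sketch, with the neighbor count and data shape chosen arbitrarily:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph

# toy stand-in for the real data
X = sp.random(10000, 5000, density=0.001, format="csr", dtype=np.float32)

# keep only the 30 smallest distances per point (placeholder value)
dist = kneighbors_graph(X, n_neighbors=30, mode="distance", metric="euclidean")
dist = dist.maximum(dist.T)  # symmetrize so the graph is undirected

# check whether the resulting graph is still connected
n_components, _ = connected_components(dist, directed=False)
print(n_components)
```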