scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

HDBScan performance issue with large dataset #645

Open divya-agrawal3103 opened 1 month ago

divya-agrawal3103 commented 1 month ago

Hi Team,

We are currently running the HDBSCAN algorithm on a large and diverse dataset using one of our products to execute the script in Python. Below is the script we are using along with the input data:

from datetime import datetime
import pandas as pd
import modelerpy

# Install the required packages into the product's Python environment
modelerpy.installPackage('scikit-learn')
import sklearn
modelerpy.installPackage('cython')
modelerpy.installPackage('hdbscan')
import hdbscan

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, OneHotEncoder
import pkg_resources
from sklearn.decomposition import PCA

data = pd.read_csv("sample.csv")
cluster_data = data.drop(['Customer ID', 'Campaign ID', 'Response'], axis=1)

# One-hot encode the categorical columns and robust-scale the numeric ones
categorical_features = ['Gender', 'Marital Status']
numeric_features = list(set(cluster_data.columns) - set(categorical_features))
preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(), categorical_features),
    ('numeric', RobustScaler(), numeric_features)
], remainder='passthrough')
normalized = preprocessor.fit_transform(cluster_data)
normalized_df = pd.DataFrame(normalized, columns=preprocessor.get_feature_names_out())

# Reduce the preprocessed data to two dimensions before clustering
pca = PCA(n_components=2)
pca_result = pca.fit_transform(normalized_df)

print('build model start')
print(datetime.now().time())
try:
    model = hdbscan.HDBSCAN(
        min_cluster_size=1000,
        min_samples=5,
        metric="euclidean",
        alpha=1.0,
        p=1.5,
        algorithm="prims_kdtree",
        leaf_size=30,
        approx_min_span_tree=True,
        cluster_selection_method="eom",
        allow_single_cluster=False,
        gen_min_span_tree=True,
        prediction_data=True
    ).fit(pca_result)
    print('build model end')
    print(datetime.now().time())
    #print(model)
    print("Cluster labels:")
    print(model.labels_)
    print("\nNumber of clusters:")
    print(len(set(model.labels_)) - (1 if -1 in model.labels_ else 0))
    print("\nCluster membership probabilities:")
    print(model.probabilities_)
    print("\nOutlier scores:")
    print(model.outlier_scores_)
except Exception as e:
    # Handle any exception raised during model fitting
    print(f"An error occurred: {e}")

Sample file: sample.csv

We have performed preprocessing steps including one-hot encoding, scaling, and dimensionality reduction (PCA). With algorithm="prims_kdtree" the script executes in approximately 8 minutes. However, switching the algorithm to "best", "boruvka_kdtree", or "boruvka_balltree" causes a failure within a few minutes with the error message:

"An error occurred: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by excessive memory usage causing the Operating System to kill the worker."

Note: When executing the script using Jupyter Notebook, we obtain results for "best", "boruvka_kdtree", "boruvka_balltree", "prims_balltree", and "prims_kdtree" algorithms within a reasonable time.
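For reference, a hedged editor's sketch (not something tried in this thread): the Boruvka code paths compute core distances in parallel worker processes, and HDBSCAN's core_dist_n_jobs parameter controls how many workers are used. Reducing it to 1 keeps that computation in a single process, which may avoid the "worker was unexpectedly terminated" failure when per-worker memory is the problem. The settings below simply mirror the original script.

import hdbscan

# Hedged sketch: same data and key settings as above, but with core_dist_n_jobs=1
# so the Boruvka core-distance computation does not spawn extra worker processes
# (the processes that are killed when memory runs out or a segfault occurs).
model = hdbscan.HDBSCAN(
    min_cluster_size=1000,
    min_samples=5,
    metric="euclidean",
    algorithm="boruvka_kdtree",
    core_dist_n_jobs=1,          # limit parallel core-distance workers
    approx_min_span_tree=True,
    cluster_selection_method="eom",
).fit(pca_result)                # pca_result as computed in the script above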

Could you please help us with the following questions?

  1. Why do "best", "boruvka_kdtree", and "boruvka_balltree" algorithms fail while "prims_balltree" and "prims_kdtree" do not?
  2. What are the recommended best practices for optimizing HDBSCAN algorithm performance with large and varied datasets?
  3. Does HDBSCAN support spilling to disk?

Your insights and guidance would be greatly appreciated.

Bokang-ctrl commented 1 month ago

Since you mentioned that the execution succeeds in a Jupyter Notebook, the problem could be memory usage; it seems the environment executing your script is not stable.

For optimization, I would suggest ensuring that you have enough memory and CPU resources to handle the process. You could also leverage GPU acceleration.
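On the GPU suggestion, a hedged sketch (assuming a CUDA-capable GPU and the RAPIDS cuML library, neither of which is used anywhere in this thread): cuML ships a GPU implementation of HDBSCAN with a scikit-learn-like interface that can be fitted on the same PCA output.

import numpy as np
from cuml.cluster import HDBSCAN as cuHDBSCAN   # requires RAPIDS cuML and a CUDA GPU

# Fit on the 2-D PCA result from the original script; cuML accepts NumPy arrays
# and moves the data to the GPU internally.
gpu_model = cuHDBSCAN(min_cluster_size=1000, min_samples=5).fit(
    np.asarray(pca_result, dtype=np.float32)
)
print(gpu_model.labels_)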

divya-agrawal3103 commented 1 month ago

Hi @Bokang-ctrl, thanks for your response. Could you please also clarify the two questions below?

1. What are the recommended best practices for optimizing HDBSCAN algorithm performance with large and varied datasets?
2. Does HDBSCAN support spilling to disk?

Thanks

Bokang-ctrl commented 1 month ago

Hi @divya-agrawal3103, apologies for only getting back to you now. To answer your questions:

1. I would recommend using PCA for dimensionality reduction, which will reduce the number of features and make the model more efficient. Try different scaling techniques (RobustScaler, StandardScaler, and MinMaxScaler) and check which one gives the best results.

Also try tuning your parameters; see the attached picture ("hyper params HDBSCAN") for the way I tuned mine. I'm pretty sure there are other ways, but these are what I can think of. A sketch of both suggestions follows after this list.

2. Regarding spilling to disk, I asked ChatGPT and this was the response: "HDBSCAN itself does not natively support spilling to disk. The algorithm is designed to work in-memory, which means it requires sufficient RAM to handle the dataset being processed. However, you can manage large datasets using the following strategies:"
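As a concrete illustration of the scaler comparison and parameter tuning suggested in point 1 above, here is a hedged editor's sketch. It assumes the cluster_data, numeric_features, and pca_result variables from the original script; the specific parameter values are illustrative, not recommendations from this thread.

from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler
import hdbscan

# Compare scalers by a simple proxy for cluster quality: the fraction of points
# labelled as noise (-1). Only the numeric columns are scaled here.
scalers = {'robust': RobustScaler(), 'standard': StandardScaler(), 'minmax': MinMaxScaler()}
for name, scaler in scalers.items():
    scaled = scaler.fit_transform(cluster_data[numeric_features])
    labels = hdbscan.HDBSCAN(min_cluster_size=1000, min_samples=5).fit_predict(scaled)
    print(name, 'noise fraction:', (labels == -1).mean())

# Small grid over min_cluster_size / min_samples on the PCA output.
for mcs in (500, 1000, 2000):
    for ms in (5, 15, 50):
        labels = hdbscan.HDBSCAN(min_cluster_size=mcs, min_samples=ms).fit_predict(pca_result)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        print('min_cluster_size=%d min_samples=%d -> %d clusters' % (mcs, ms, n_clusters))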