divya-agrawal3103 closed this issue 5 months ago
That is indeed strange behaviour. By 10% of the input data I presume you mean clustering roughly 16,000 points in 15 dimensions. If so, 5hrs+ is remarkably slow, and even two minutes is a bit slow. I can cluster 16,000 15-dimensional points with your parameters in about 4 seconds (TruncatedSVD to 15 dimensions on top of MNIST). For scaling context, I can handle 70,000 15-dimensional points in about 30 seconds.
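For reference, a benchmark along those lines could be reproduced with something like the sketch below (assuming the OpenML copy of MNIST via fetch_openml; timings will obviously vary by machine):

import time
import hdbscan
from sklearn.datasets import fetch_openml
from sklearn.decomposition import TruncatedSVD

# Load MNIST (70,000 x 784) and reduce to 15 dimensions
X, _ = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X15 = TruncatedSVD(n_components=15).fit_transform(X)

# Cluster a 16,000-point subsample with roughly the reported parameters
start = time.time()
labels = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=5,
                         metric='euclidean', algorithm='best',
                         leaf_size=30).fit_predict(X15[:16000])
print(f"clustered in {time.time() - start:.1f}s")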
My best guess is that something strange is going on with your data as it is loaded from your csv. Is it properly numeric data? Or do you have 15 string columns that are being loaded as categorical values and transformed via a one-hot encoder or some such thing? Have you loaded it into a numpy array?
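As a quick sanity check (assuming your file is called sample.csv; adjust as needed), something like this will show whether the columns are really numeric and what ends up in the numpy array:

import pandas as pd

df = pd.read_csv('sample.csv')
print(df.dtypes)   # any 'object' columns are strings/categoricals, not numbers
print(df.shape)

# Only numeric columns should go straight into HDBSCAN
X = df.select_dtypes(include='number').to_numpy()
print(X.shape, X.dtype)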
As an aside, I think your parameter of p=1.5 is being ignored. It is a parameter for Minkowski distance and should be ignored when your metric='euclidean'.
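If you actually wanted p=1.5 to take effect, the metric would need to be Minkowski; a minimal sketch (using a random stand-in for your data) would be:

import numpy as np
import hdbscan

X = np.random.rand(1000, 15)  # stand-in for your numeric data

# p only takes effect with the Minkowski metric; with metric='euclidean' it is likely ignored
model = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=5,
                        metric='minkowski', p=1.5).fit(X)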
On Mon, Apr 22, 2024 at 5:30 AM divya-agrawal3103 wrote:
Hi,
I am attempting to execute a stream that uses the HDBSCAN clustering algorithm on a set of input data to generate a model. When I select the algorithm as Best and randomly pass 10% of the total input data (the input data is a csv file with 15 columns and ~169,379 rows), the stream executes and never finishes; I tracked it for 5 hrs 9 mins and then had to stop.
This is the piece of code from the python script that is used to build the model; it runs forever and takes all the time.
hdb = hdbscan.hdbscan_.HDBSCAN(
    min_cluster_size=param['min_cluster_size'],
    min_samples=param['min_samples'],
    metric=param['metric'],
    alpha=param['alpha'],
    p=param['p'],
    algorithm=param['algorithm'],
    leaf_size=param['leaf_size'],
    approx_min_span_tree=param['approx_min_span_tree'],
    cluster_selection_method=param['cluster_selection_method'],
    allow_single_cluster=param['allow_single_cluster'],
    gen_min_span_tree=param['gen_min_span_tree'],
    prediction_data=True
).fit(X)
Below are the inputs we are feeding:
min_cluster_size = 50
min_samples = 5
metric = euclidean
alpha = 1.0
p = 1.5
algorithm = best
leaf_size = 30
approx_min_span_tree = True
cluster_selection_method = eom
allow_single_cluster = False
gen_min_span_tree = True
Can you help us with this? 5hrs+ seems like a lot of time, and we need to optimise it. Note: this only happens when we choose Best as the algorithm and pass 10% of the input data; with other algorithms it finishes in a reasonable time, and if we choose Best but only pass 8% of the input data, it finishes within 2 minutes.
Hi @jc-healy Thanks for the swift response. I am attaching the input data file here; as far as I can see it includes categorical columns (Gender, Marital Status). Could you please try using this input to test? Really appreciate your time! sample.zip
Hi @jc-healy We are stuck and really looking forward to any inputs from your side to resolve the problem. Thank you in advance.
Hi there, I grabbed your data and filtered out the categorical columns (and your customer ID column) before hitting it with hdbscan, and it took 3 to 5 minutes for me to cluster the 198,000 records.
Looking at your data I have two recommendations for clustering. First, your numeric values are on vastly different scales, so Euclidean distance over this data will be dominated by your Income column, which is on a vastly different scale than "Members Within Household". To fix that I'd use something like a RobustScaler from sklearn.preprocessing to normalize your numeric columns. You can do fancier things, but that's a pretty solid first thing to try.
I'd also one-hot encode your two categorical fields to convert them to numeric. I'd do this in a pipeline using sklearn's OneHotEncoder. Again, you can get fancier but this is a good start.
As general good practice I'd suggest wrapping your preprocessing in a ColumnTransformer: it keeps track of your column transformations so they can be applied consistently to future data. Not strictly necessary here, but still a good habit.
Here is some sample code to get you started:
import hdbscan
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, OneHotEncoder

data = pd.read_csv('sample.csv')

# Drop identifier/response columns that shouldn't drive the clustering
cluster_data = data.drop(['Customer ID', 'Campaign ID', 'Response'], axis=1)

categorical_features = ['Gender', 'Marital Status']
numeric_features = list(set(cluster_data.columns) - set(categorical_features))

# One-hot encode the categorical columns and robustly scale the numeric ones
preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(), categorical_features),
    ('numeric', RobustScaler(), numeric_features)
], remainder='passthrough')

normalized = preprocessor.fit_transform(cluster_data)
normalized_df = pd.DataFrame(normalized, columns=preprocessor.get_feature_names_out())

model = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=5).fit(normalized_df)
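Once that has run, a quick way to sanity-check the result (continuing from the snippet above) is to look at the cluster sizes; -1 is hdbscan's noise label:

# Count points per cluster; a huge -1 group means most points were treated as noise
print(pd.Series(model.labels_).value_counts())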
Cheers, John
Hi @jc-healy Thanks a lot for the detailed analysis. Will try to incorporate the suggestions. Appreciate your time.
Closing this for now