lmcinnes / umap

Uniform Manifold Approximation and Projection

UMAP and HDBSCAN on 7.5 million Dataset #1133

Open raina678 opened 5 months ago

raina678 commented 5 months ago

Hello,

I'm working with a very large dataset consisting of 7.5 million rows and 18 columns, which represents customer purchase behavior. I initially used UMAP for dimensionality reduction and attempted to tune the UMAP hyperparameters on a 5% subset of the dataset. However, I found that not all UMAP hyperparameters scale well, so I aim to tune both UMAP and HDBSCAN hyperparameters on the full dataset.

I'm using AWS SageMaker, so memory is not a constraint. Initially, I used the CPU implementation, but a single iteration was very time-consuming. Therefore, I switched to using RAPIDS AI for the GPU implementation of both UMAP and HDBSCAN. This approach ran significantly faster on G5 instances, with UMAP taking 16 minutes and HDBSCAN taking 1 hour and 16 minutes.

Afterwards, I used the Silhouette Score for evaluation (cuML's HDBSCAN doesn't expose the relative_validity_ / DBCV attribute). However, when attempting to run Bayesian Optimization, I encountered memory errors because the G5 instance has only one GPU with 24GB of memory.

Question: How can I effectively tune hyperparameters for such a large dataset? Is it reasonable to tune the hyperparameters on a 5% subset and then apply those hyperparameters to the entire dataset?

Any ideas or references would be greatly appreciated. Thank you! @lmcinnes

lmcinnes commented 5 months ago

HDBSCAN should not really be taking that long, especially the GPU implementation, if you are reducing to a reasonable number of dimensions. So something odd is going on there.

raina678 commented 5 months ago

Thank you for your reply!

I am performing dimensionality reduction to 8 dimensions using UMAP. I am utilizing the ml.g5.2xlarge instance with the PyTorch 2.0.0 Python 3.10 GPU Optimized image.

Initially, I am applying a Power Transformer to preprocess my numerical columns. Following this, I apply UMAP for dimension reduction, and then use HDBSCAN for clustering. Below is the code I am using:

import pandas as pd  # needed for the DataFrame construction below

from cuml.manifold import UMAP as cumlUMAP
from cuml.cluster import HDBSCAN as cumlHDBSCAN
from sklearn.preprocessing import PowerTransformer
from sklearn.metrics import silhouette_score

# Apply Power transformer
def transform_data(df):
    # Separate numerical and categorical columns
    numerical_cols = df.select_dtypes(include=['int', 'float']).columns
    categorical_cols = df.select_dtypes(include=['category']).columns

    # Apply PowerTransformer to numerical data
    transformer = PowerTransformer()
    transformed_num = transformer.fit_transform(df[numerical_cols])

    # Keep the original index so the categorical columns align row-by-row
    transformed_df = pd.DataFrame(transformed_num, columns=numerical_cols, index=df.index)
    transformed_df[categorical_cols] = df[categorical_cols]

    return transformed_df

transformed_df = transform_data(sample_data)

# Apply UMAP
umap_model = cumlUMAP(n_neighbors=11, min_dist=0.14, n_components=8, metric='euclidean', random_state=42, verbose=6)
embedding = umap_model.fit_transform(transformed_df)

# Apply HDBSCAN
clusterer = cumlHDBSCAN(gen_min_span_tree=True)
labels = clusterer.fit_predict(embedding)

# Calculate silhouette score on the sampled embeddings and labels
silhouette = silhouette_score(embedding, labels, sample_size=300000)

lmcinnes commented 5 months ago

Hmm, not sure why HDBSCAN is so slow then. I would not imagine it would take that long. You may want to try using fast_hdbscan instead -- on a machine with a reasonable number of cores it should be faster than that despite being CPU based.

I also wouldn't use silhouette score; it's not the best for these purposes. DBCV would be best, but Davies-Bouldin might also work okay.
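
For reference, a minimal sketch of what that swap might look like (min_cluster_size / min_samples are placeholder values, and embedding is the GPU UMAP output from the code above, converted to a NumPy array if cuML returns a device object):

# Minimal sketch: cluster the UMAP embedding with fast_hdbscan on CPU,
# then score the result with Davies-Bouldin on the clustered points only.
import numpy as np
from fast_hdbscan import HDBSCAN
from sklearn.metrics import davies_bouldin_score

labels = HDBSCAN(min_cluster_size=1000, min_samples=10).fit_predict(embedding)

mask = labels != -1  # -1 marks noise points
if mask.any() and len(np.unique(labels[mask])) > 1:
    print("Davies-Bouldin:", davies_bouldin_score(embedding[mask], labels[mask]))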

raina678 commented 5 months ago

Thank you for your response!

A single iteration of UMAP using the CPU implementation took about 4.5 hours. Would you recommend performing UMAP with the GPU implementation followed by fast_hdbscan, and then using the relative_validity_ (DBCV) metric?

Also, do you think Bayesian optimization would be effective for tuning the parameters of UMAP and HDBSCAN together?

lmcinnes commented 5 months ago

I think GPU UMAP and fast_hdbscan should work well together. I'm not sure Bayesian optimization will be any better than a well planned grid-search strategy, but it depends on how many parameters you are going to try to optimize over.
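
To illustrate the grid-search idea, a rough sketch (the grids and the score_clustering helper are illustrative placeholders, not a prescribed recipe); the main point is to reuse each UMAP embedding across the whole HDBSCAN grid so the expensive embedding step runs only once per UMAP setting:

# Rough sketch of a planned grid search over UMAP and HDBSCAN parameters.
# score_clustering is a hypothetical helper standing in for whatever
# validity metric you settle on (DBCV, Davies-Bouldin, ...).
from itertools import product

from cuml.manifold import UMAP as cumlUMAP
from fast_hdbscan import HDBSCAN

umap_grid = list(product([10, 25, 50], [0.0, 0.1], [5, 8]))   # n_neighbors, min_dist, n_components
hdbscan_grid = list(product([500, 2000, 10000], [10, 50]))    # min_cluster_size, min_samples

results = []
for n_neighbors, min_dist, n_components in umap_grid:
    emb = cumlUMAP(n_neighbors=n_neighbors, min_dist=min_dist,
                   n_components=n_components).fit_transform(transformed_df)
    # convert emb to a NumPy array here if cuML returns a device object
    for min_cluster_size, min_samples in hdbscan_grid:
        labels = HDBSCAN(min_cluster_size=min_cluster_size,
                         min_samples=min_samples).fit_predict(emb)
        params = (n_neighbors, min_dist, n_components, min_cluster_size, min_samples)
        results.append((params, score_clustering(emb, labels)))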

raina678 commented 5 months ago

Thank you so much again! I'll try GPU UMAP and fast_hdbscan!

After checking, I found that fast_hdbscan does not expose relative_validity_. Moreover, calculating metrics like DBCV and Davies-Bouldin for a dataset with 7.5 million records could be quite time-consuming. I'm interested in your thoughts on which evaluation metric to use and the best approach.

Would it be reasonable to sample 5% or 10% of the data after performing HDBSCAN clustering and calculate the DBCV score on the sampled data?
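
For concreteness, a sketch of that sampling idea, assuming the reference hdbscan package is installed for its DBCV implementation (hdbscan.validity.validity_index) and that embedding and labels from the steps above are NumPy arrays; the 5% sample size is illustrative:

# Sketch: approximate DBCV on a random 5% subsample of the clustered embedding.
import numpy as np
from hdbscan.validity import validity_index

rng = np.random.default_rng(42)
sample_idx = rng.choice(len(embedding), size=int(0.05 * len(embedding)), replace=False)

sub_emb = embedding[sample_idx].astype(np.float64)  # validity_index expects float64
sub_labels = labels[sample_idx]

# Depending on the hdbscan version you may want to drop noise (-1) points first
print("Approximate DBCV:", validity_index(sub_emb, sub_labels))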

raina678 commented 3 months ago

Hello,

I worked on a smaller dataset using GPU UMAP and Fast HDBSCAN, evaluating the results with the Davies-Bouldin metric. After performing Bayesian optimization, I found hyperparameters that resulted in a score of less than 2. However, the issue is that nearly 90% of the data points end up in a single cluster. Despite trying various parameter combinations, this issue persists. Is there a way to address this problem, or should I re-run HDBSCAN on the dominant cluster using the same UMAP embedding?

For reference, here’s how the data looks in the 2D UMAP embedding, with the center being very dense.

[image: 2D UMAP embedding with a very dense central region]

lmcinnes commented 3 months ago

If you really need smaller clusters then cluster_selection_method="leaf" should work for you. You may also want to look at using datashader for plotting/visualization, since there is a lot of overplotting when dealing with 7.5 million points (or even 100,000 points).
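
A rough sketch of both suggestions, assuming fast_hdbscan mirrors hdbscan's cluster_selection_method parameter; embedding_2d stands in for a 2D embedding as a NumPy array, and the cluster/canvas sizes are arbitrary:

# Sketch: leaf cluster selection, then a datashader plot of the 2D embedding.
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf
from fast_hdbscan import HDBSCAN

labels = HDBSCAN(min_cluster_size=1000, cluster_selection_method="leaf").fit_predict(embedding_2d)

plot_df = pd.DataFrame(embedding_2d, columns=["x", "y"])
plot_df["label"] = pd.Categorical(labels)

canvas = ds.Canvas(plot_width=800, plot_height=800)
agg = canvas.points(plot_df, "x", "y", ds.count_cat("label"))  # per-pixel counts per cluster
img = tf.shade(agg, how="eq_hist")  # histogram equalization copes with heavy overplotting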

raina678 commented 3 months ago

Thank you for your reply!

I'm working with around 1 million data points. My objective function constraints include limiting the cluster count to between 5 and 20, with outliers kept under 20%. The best hyperparameters yield a Davies-Bouldin score of 1.4, as shown below. Additionally, I've attached my objective function for reference:

from sklearn.metrics import davies_bouldin_score

def umap_hdbscan_objective(n_neighbors, min_dist, n_components, min_samples, min_cluster_size, outlier_percentage=0.2):
    """
    Objective function for optimizing UMAP and HDBSCAN parameters using the Davies-Bouldin Index

    Returns:
        float: Negated Davies-Bouldin score if clustering is valid (the optimizer
        maximizes, so values closer to 0 are better), otherwise a large negative
        penalty indicating invalid clustering

    Notes:
    - Clustering is considered failed if:
        - No valid embeddings or labels are produced (returns -1000.0)
        - All points are classified as outliers (returns -1000.1)
        - Fewer than 5 clusters are detected (returns -1000.2)
        - More than 20 clusters are detected (returns -1000.3)
        - The number of outliers exceeds the specified percentage of the total data points (returns -1000.4)
    """
    umap_embedding, cluster_labels = umap_hdbscan(
        df=transformed_df,
        n_neighbors=round(n_neighbors),
        min_dist=min_dist,
        n_components=round(n_components),
        # metric=metric,
        min_samples=round(min_samples),
        min_cluster_size=round(min_cluster_size),
        verbose=True
    )

    if umap_embedding is None or cluster_labels is None:
        print('Clustering failed or no valid embeddings/labels')
        return -1000.0  # Large negative penalty so the maximizer avoids this region

    if max(cluster_labels) == -1:
        print('No clusters found - all points are outliers')
        return -1000.1  # Large negative penalty so the maximizer avoids this region

    # Count clusters, excluding the noise label (-1)
    num_clusters = len(set(cluster_labels) - {-1})
    if num_clusters < 5:
        print(f'Too few clusters detected: {num_clusters}')
        return -1000.2  # Large negative penalty so the maximizer avoids this region
    if num_clusters > 20:
        print(f'Too many clusters detected: {num_clusters}')
        return -1000.3  # Large negative penalty so the maximizer avoids this region

    # check for outliers count
    total_points = len(transformed_df)
    outlier_count = (cluster_labels == -1).sum()

    # Calculate the threshold based on the specified percentage of total data points
    outlier_threshold = outlier_percentage * total_points

    # Check outlier condition
    if outlier_count > outlier_threshold:
        print(f'Outliers exceed {outlier_percentage*100}% of the total number of data points: {outlier_count}')
        return -1000.4  # Large negative penalty so the maximizer avoids this region

    # cluster_size_counts = pd.Series(cluster_labels).value_counts()
    # min_cluster_size = cluster_size_counts.min()
    # max_cluster_size = cluster_size_counts.max()

    # Score only the clustered points so noise (-1) is not treated as a cluster
    mask = cluster_labels != -1
    db_score = davies_bouldin_score(umap_embedding[mask], cluster_labels[mask])
    print(f"Davies-Bouldin score: {db_score}")

    return -db_score  # Negated because the optimizer maximizes; values closer to 0 are better

[images: results for the best hyperparameters referenced above]
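
For completeness, an objective like this gets wired into the optimizer roughly as sketched below (assuming the bayes_opt package, which isn't stated in the thread; the parameter bounds are illustrative only):

# Sketch: driving the objective with bayes_opt's BayesianOptimization.
# The optimizer maximizes, which is why the objective returns the negated
# Davies-Bouldin score and large negative penalties for invalid clusterings.
from bayes_opt import BayesianOptimization

pbounds = {
    "n_neighbors": (10, 100),
    "min_dist": (0.0, 0.5),
    "n_components": (5, 15),
    "min_samples": (5, 100),
    "min_cluster_size": (500, 20000),
}

optimizer = BayesianOptimization(f=umap_hdbscan_objective, pbounds=pbounds, random_state=42)
optimizer.maximize(init_points=5, n_iter=25)
print(optimizer.max)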

raina678 commented 3 months ago

Hi,

I used cluster_selection_method="leaf" for a dataset of around 1 million data points and applied Bayesian optimization. However, the results were not satisfactory.

I'm looking for suggestions on how to improve the clustering results, or alternative approaches to achieve the desired clustering constraints.

lmcinnes commented 3 months ago

I'm not sure how to achieve all your desiderata here. You could just run K-Means with k=20 and get something that meets your requirements, though that may not be the result you actually want. It may be the case that the data simply does not cluster in the way you want it to.
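
For reference, a minimal sketch of that K-Means fallback with scikit-learn (cuML's KMeans would work the same way on GPU; embedding is the UMAP embedding from earlier):

# Sketch: force exactly 20 clusters with K-Means on the UMAP embedding.
# This satisfies the cluster-count constraint but does not guarantee the
# partitions correspond to meaningful density-based clusters.
from sklearn.cluster import KMeans

kmeans_labels = KMeans(n_clusters=20, n_init=10, random_state=42).fit_predict(embedding)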