Open divya-agrawal3103 opened 4 months ago
Since you mentioned that the execution succeeds in Jupyter Notebook, the problem is likely memory usage; the environment running your script does not seem to be stable.
To optimize, I would first make sure the machine running the script has enough memory and CPU for the workload. You could also leverage GPU acceleration; see the sketch below.
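A rough idea of what the GPU route could look like with RAPIDS cuML, which ships a GPU HDBSCAN with a similar interface. I have not run this on your data, so treat it as a sketch with placeholder parameter values:

```python
# Sketch only: assumes an NVIDIA GPU plus the RAPIDS cuML/cuDF libraries are installed,
# and that "sample.csv" already holds the preprocessed numeric feature matrix.
import cudf
from cuml.cluster import HDBSCAN

X = cudf.read_csv("sample.csv")                            # data is loaded straight into GPU memory
clusterer = HDBSCAN(min_cluster_size=50, min_samples=10)   # illustrative values, tune for your data
labels = clusterer.fit_predict(X)
print("clusters found:", int(labels.max()) + 1)            # noise points are labelled -1
```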
Hi @Bokang-ctrl, thanks for your response. Could you please also clarify the two questions below?
1. What are the recommended best practices for optimizing HDBSCAN algorithm performance with large and varied datasets?
2. Does HDBSCAN support spilling to disk?
Thanks
Hi @divya-agrawal3103, apologies for only getting back to you now. To answer your questions:
1. I would recommend using PCA for dimensionality reduction: fewer features make HDBSCAN faster and less affected by the curse of dimensionality. Try different scaling techniques (RobustScaler, StandardScaler and MinMaxScaler) and check which one gives the best results.
Also tune your parameters; the attached picture shows how I tuned mine, and there is a sketch of the full pipeline at the end of this comment. I'm sure there are other approaches, but these are the ones I can think of.
2. For spilling to disk, I asked ChatGPT, and this was the response: HDBSCAN itself does not natively support spilling to disk. The algorithm is designed to work in memory, so it requires enough RAM to hold the dataset being processed. However, you can manage large datasets using the following strategies (a small memmap sketch follows the list):
- Dask: use Dask to handle large datasets and parallelize computations. Dask can spill intermediate results to disk, letting you work with datasets larger than the available memory.
- Memory-mapped arrays: use numpy.memmap to keep the data on disk while treating it as if it were in memory.
- External libraries: for large-scale clustering, consider libraries such as Faiss, which handle large datasets efficiently and can be combined with HDBSCAN for nearest-neighbour search.
- Data subsetting: process subsets of your data sequentially and then combine the results, if possible, to stay within memory constraints.
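To make point 1 concrete, here is a minimal sketch of a scale → encode → reduce → cluster pipeline, assuming scikit-learn and the hdbscan package. The column handling, component count and HDBSCAN parameters are placeholders you would tune on your own data:

```python
import pandas as pd
import hdbscan
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

df = pd.read_csv("sample.csv")
categorical_cols = df.select_dtypes(include="object").columns
numeric_cols = df.select_dtypes(include="number").columns

# Scale numeric features, one-hot encode categoricals, then reduce dimensionality with PCA.
preprocess = make_pipeline(
    ColumnTransformer([
        ("num", RobustScaler(), numeric_cols),   # also try StandardScaler / MinMaxScaler
        # sparse_output=False needs scikit-learn >= 1.2 (use sparse=False on older versions)
        ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_cols),
    ]),
    PCA(n_components=10),   # illustrative; keep enough components to cover most of the variance
)
X = preprocess.fit_transform(df)

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=50,    # illustrative starting values; tune on your data
    min_samples=10,
    algorithm="prims_kdtree",
    core_dist_n_jobs=-1,    # parallelise the core-distance computation across all cores
)
labels = clusterer.fit_predict(X)
```

And for the memory-mapped option under point 2, the basic numpy.memmap pattern looks like the following. Note that the hdbscan library will still build its own in-memory structures, so this mainly helps when the raw feature matrix itself is what exhausts your RAM:

```python
import numpy as np
import hdbscan

# One-off: dump the preprocessed feature matrix (X from the pipeline above) to disk.
mm = np.memmap("features.dat", dtype="float64", mode="w+", shape=X.shape)
mm[:] = X
mm.flush()

# Later runs: memory-map the file instead of holding the whole matrix in RAM up front.
n_rows, n_cols = X.shape   # the shape has to be known or recorded separately
X_disk = np.memmap("features.dat", dtype="float64", mode="r", shape=(n_rows, n_cols))
labels = hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(X_disk)
```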
Hi Team,
We are currently running the HDBSCAN algorithm on a large and diverse dataset, using one of our products to execute the Python script. Below is the script we are using, along with the input data:
Sample file: sample.csv
We have performed preprocessing steps including OneHotEncoding, Scaling, and Dimensionality Reduction. The script executes in approximately 8 minutes. However, switching the algorithm from "prims_kdtree" to "best", "boruvka_kdtree", or "boruvka_balltree" results in a failure within a few minutes with the error message:
Note: When executing the script using Jupyter Notebook, we obtain results for "best", "boruvka_kdtree", "boruvka_balltree", "prims_balltree", and "prims_kdtree" algorithms within a reasonable time.
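The clustering call itself boils down to switching the `algorithm` parameter of `hdbscan.HDBSCAN`, roughly as follows (the values below are placeholders, not the exact ones from our script):

```python
import hdbscan

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=50,        # placeholder value, not the exact one from our script
    algorithm="prims_kdtree",   # completes in ~8 minutes; switching to "best", "boruvka_kdtree"
                                # or "boruvka_balltree" fails within a few minutes in our product
)
labels = clusterer.fit_predict(X_reduced)  # X_reduced: output of the preprocessing steps above
```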
Could you please help us with the following questions?
Your insights and guidance would be greatly appreciated.