kalininalab / DataSAIL

DataSAIL is a tool to split datasets while reducing information leakage.
https://datasail.readthedocs.io
MIT License

Support other clustering algorithms #25

Open isty2e opened 1 month ago

isty2e commented 1 month ago

Is your feature request related to a problem? Please describe.

This is not a problem per se, but AgglomerativeClustering and SpectralClustering in sklearn.cluster are not always favorable, especially for large datasets, because of how their cost scales with dataset size (see the benchmark in the HDBSCAN docs). For example, I usually use genieclust and would like to use it instead of the sklearn clusterers, which is not possible in the current implementation.

Describe the solution you'd like

A Clusterer base class could be implemented to provide a common interface, via inheritance, for both sklearn clusterers and other kinds of clusterers. An instance of it (or the class itself) could then be passed as an argument when splitting. Alternatively, the choice could be handled with if-else statements in datasail.cluster.clustering.additional_clustering(), but that might be less elegant.
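A minimal sketch of what such an interface could look like. Every name here (`Clusterer`, `FitPredictClusterer`, `ThresholdClusterer`, `split_by_cluster`) is hypothetical and not part of DataSAIL's current API. The adapter assumes only that the wrapped estimator exposes an sklearn-style `fit_predict(X)` method.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Sequence


class Clusterer(ABC):
    """Hypothetical base class; not part of DataSAIL's current API."""

    @abstractmethod
    def cluster(self, features: Sequence[Sequence[float]]) -> List[int]:
        """Return one integer cluster label per input row."""


class FitPredictClusterer(Clusterer):
    """Adapter for any estimator exposing an sklearn-style fit_predict(X)."""

    def __init__(self, estimator: Any) -> None:
        self.estimator = estimator

    def cluster(self, features: Sequence[Sequence[float]]) -> List[int]:
        # Convert labels (often NumPy integers) to plain Python ints.
        return [int(label) for label in self.estimator.fit_predict(features)]


class ThresholdClusterer(Clusterer):
    """Toy stand-in for a custom algorithm: cuts on the first feature."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def cluster(self, features: Sequence[Sequence[float]]) -> List[int]:
        return [0 if row[0] < self.threshold else 1 for row in features]


def split_by_cluster(
    features: Sequence[Sequence[float]], clusterer: Clusterer
) -> Dict[int, List[int]]:
    """Hypothetical splitting hook: group row indices by cluster label."""
    groups: Dict[int, List[int]] = {}
    for index, label in enumerate(clusterer.cluster(features)):
        groups.setdefault(label, []).append(index)
    return groups
```

With something like this, the splitting code would depend only on `Clusterer.cluster()`. An sklearn estimator such as `AgglomerativeClustering` could be passed wrapped in `FitPredictClusterer`, and so could any other estimator that provides `fit_predict`, without further changes to the splitting logic.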

Describe alternatives you've considered

Alternatively, the sklearn clusterers could be replaced with ones from the fastclust package.

Old-Shatterhand commented 1 month ago

Dear @isty2e,

Thank you for your feedback and suggestions. We will definitely consider these for future versions and improvements of DataSAIL. Customized clustering is indeed something we have not yet considered or implemented.

Best, Roman