scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Algorithm parameter "best" option? #83

Open DataWaveAnalytics opened 7 years ago

DataWaveAnalytics commented 7 years ago

Hi,

According to the API documentation, the algorithm parameter is set as follows:

algorithm : string, optional (default='best')
Exactly which algorithm to use; hdbscan has variants specialised for different characteristics of the data. By default this is set to best which chooses the "best" algorithm given the nature of the data. You can force other options if you believe you know better. Options are:

  • best
  • generic
  • prims_kdtree
  • prims_balltree
  • boruvka_kdtree
  • boruvka_balltree
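For readers who want to force one of these options rather than rely on the default, here is a minimal sketch. The option names are copied from the documentation quoted above; `hdbscan_kwargs` and `cluster` are hypothetical helper names introduced only for illustration, and `min_cluster_size=5` is just an example value. The `hdbscan` import is deferred so the validation helper can be used without the package installed.

```python
# Option names copied from the API documentation quoted in this issue.
ALGORITHM_OPTIONS = (
    "best",
    "generic",
    "prims_kdtree",
    "prims_balltree",
    "boruvka_kdtree",
    "boruvka_balltree",
)


def hdbscan_kwargs(algorithm="best", min_cluster_size=5):
    """Build keyword arguments for hdbscan.HDBSCAN (hypothetical helper).

    Rejects unknown option names early instead of waiting for the fit.
    """
    if algorithm not in ALGORITHM_OPTIONS:
        raise ValueError(
            f"unknown algorithm {algorithm!r}; expected one of {ALGORITHM_OPTIONS}"
        )
    return {"algorithm": algorithm, "min_cluster_size": min_cluster_size}


def cluster(X, algorithm="best", min_cluster_size=5):
    """Fit HDBSCAN with an explicit algorithm choice and return cluster labels."""
    import hdbscan  # imported lazily; requires the hdbscan package

    clusterer = hdbscan.HDBSCAN(**hdbscan_kwargs(algorithm, min_cluster_size))
    return clusterer.fit_predict(X)
```

Calling `cluster(X, algorithm="boruvka_kdtree")` on a NumPy array `X` would then bypass the 'best' heuristic and use that variant directly.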

Is there any comparison available of these options on different real datasets?

Is there an explanation for the following statement?

By default this is set to best which chooses the “best” algorithm given the nature of the data.

Thank you for the contribution.

lmcinnes commented 7 years ago

Actually no, I don't have a good comprehensive comparison, and right now the 'best' option is a heuristic based on some (not ready for publication) grid-search style comparisons between the different approaches. The 'best' option should always exist (and does for similar sklearn classes), but exactly how to do it is another thing. I would be exceedingly happy if you wanted to do such a comprehensive comparison, and I would be more than happy to add it to the official documentation, as well as using it to better define the selection process when 'best' is selected.
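As a starting point for such a comparison, a rough timing harness might look like the sketch below. It is not the heuristic or the grid search mentioned above, only one assumed way to collect wall-clock timings per option and dataset. The names `time_algorithms`, `fastest_per_dataset`, and `fit_hdbscan` are hypothetical; the fit function is passed in as a parameter so the harness itself does not depend on the library.

```python
import time


def time_algorithms(fit, algorithms, datasets, repeats=3):
    """Time fit(X, algorithm) for every dataset/algorithm pair.

    Returns {(dataset_name, algorithm): best wall-clock seconds over `repeats` runs}.
    """
    results = {}
    for name, X in datasets.items():
        for algorithm in algorithms:
            timings = []
            for _ in range(repeats):
                start = time.perf_counter()
                fit(X, algorithm)
                timings.append(time.perf_counter() - start)
            results[(name, algorithm)] = min(timings)
    return results


def fastest_per_dataset(results):
    """Pick the algorithm with the lowest recorded time for each dataset."""
    best = {}
    for (name, algorithm), seconds in results.items():
        if name not in best or seconds < best[name][1]:
            best[name] = (algorithm, seconds)
    return {name: algorithm for name, (algorithm, _) in best.items()}


def fit_hdbscan(X, algorithm):
    """Fit function for the real library; requires the hdbscan package."""
    import hdbscan

    hdbscan.HDBSCAN(algorithm=algorithm).fit(X)
```

With real data, something like `time_algorithms(fit_hdbscan, ["generic", "prims_kdtree", "boruvka_kdtree"], {"my_data": X})` would produce the timing table, and `fastest_per_dataset` summarises it. Timing alone says nothing about memory use, which a fuller comparison would also need to record.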
