lbehnke / hierarchical-clustering-java

Implementation of an agglomerative hierarchical clustering algorithm in Java. Different linkage approaches are supported.
141 stars 79 forks source link

weighted average linking strategy #9

Closed tyrcho closed 9 years ago

tyrcho commented 9 years ago

Hi,

I'd like to use a new linking strategy which takes into account another dimension : the weight of elements I'm trying to sort into clusters.

So I'd need to add a new array together with the elements with their weights and a new Strategy which would make an weighted average of the distances. This also means that the weights need to be carried together with the clusters and passed to the strategy.

What's you advice ? Is it something compatible with what you already implemented ? Would you accept a PR with an implementation of that ?

My use case : I'm trying to sort out decks in card games (magic, hearthstone) into archetypes, and I'd like to take into account the popularity of the decks.

lbehnke commented 9 years ago

Hi tyrcho, so far I've not thought about implementing the average linking strategy using a weighted pair group method (because i don't need it). But if you manage to provide the implementation of such linkage strategy (and unit tests) I would certainly accept a pull request. The main problem with the current software design is probably the LinkageStrategy interface which would need to change. Hence this extension will be quite invasive. But this is rather a weakness of the current design. So go for it!

tyrcho commented 9 years ago

Thanks for the quick reply ! I gave a shot at the implementation but I realized it's going to be more complex than I had in mind.

Adding weights to elements adds more than just weights to distance (which doesn't mean much). Actually it means I need to keep track of the coordinates of all my elements instead of just the distance matrix, and build the barycentre of the elements when I put them in a cluster. It would change a LOT of the current interfaces ...

I'll keep you posted if I have something concrete.

Edit : after second thought, maybe it's not that bad ... I wonder if the distance of the new cluster to other elements can be considered as the weighted average of distances from the 2 agglomerated elements. I still need to check it works regardless of dimensions (in my context I'll have up to several hundred dimensions), my math skills are a bit rusty ...

lbehnke commented 9 years ago

Seems to work nicely. I merged dev branch into master.

lbehnke commented 9 years ago

Created new release 1.1.0 and published it on maven central.