huonw / cogset

Generic implementations of clustering algorithms.
http://huonw.github.io/cogset/cogset
Apache License 2.0
20 stars 5 forks source link

Hierarchical bottom-up clustering #2

Closed aomader closed 5 years ago

aomader commented 9 years ago

This PR should work as common ground for discussion and thoughts on how to model the described topic appropriately.

aomader commented 9 years ago

I just implemented SLINK as default algorithm for single linkage. Furthermore I added some tests (using quicktest) and benchmarks to compare the SLINK implementation with the previous naive implementation.

Some bench results:

test hierarchical::benches::single_naive_d1_n0010 ... bench:      33,450 ns/iter (+/- 331)
test hierarchical::benches::single_naive_d1_n0100 ... bench:  37,539,939 ns/iter (+/- 177,819)
test hierarchical::benches::single_slink_d1_n0010 ... bench:       3,104 ns/iter (+/- 56)
test hierarchical::benches::single_slink_d1_n0100 ... bench:     122,957 ns/iter (+/- 1,208)
test hierarchical::benches::single_slink_d1_n1000 ... bench:  13,442,927 ns/iter (+/- 3,590,680)
aomader commented 9 years ago

@huonw I would call my implementation finished, so now is probably a good time to read it and provide feedback.

However, I'm going to improve the implementation over time, adding more methods and algorithms, but I would like to finish a bit so I can use this one in my own projects. So more PR will come over time.

I have no idea why the coveralls coverage decreased, since all my public items are properly documented. The Travis error on nightly is due to a compiler panic (see ticket).

aomader commented 9 years ago

@huonw May I ask what prevents this PR?

huonw commented 8 years ago

I've looked through pretty much all of it, except for the details of slink/clink/using_pointer_representation, and it seems pretty good. I've left some comments and would love to hear your thoughts on them (you're the expert, so be sure to tell me if anything is mistaken/irrelevant).

aomader commented 8 years ago

I think we are still missing one important use case: Finding the clusters with below a certain distance threshold. This doesn't require to compute the dendrogram at a whole, because we can stop prematurely. I think we should capture this in the API as well with the proper optimizations in place.

I'll write an addition for this one, this might change the main API of Agglomerative a bit.

aomader commented 8 years ago

What about the following interface. First of we have a trait for agglomerative clustering like

pub trait AgglomerativeClustering<'a, P: Point> {
    fn cluster(&'a self, threshold: f64) -> Vec<Vec<&'a P>>;
    fn compute_dendrogram(&'a self) -> Box<Dendrogram<&'a P>>;
}

which allows to directly cluster points (without requiring to compute the dendrogram in its entirety) and to compute the dendrogram, which is useful for visualizing or repeated clustering approach (I'm about to add a method to extract clusters given a dendrogram, which is much faster).

I transform the different linkage criteria into structs that implement the AgglomerativeClustering trait, i.e., SingleLinkage, CompleteLinkage and CLINK.

Furthermore, I would create CustomLinkage trait, which implements clusters() and dendrogram() but adds a new unimplemented method distance():

trait CustomLinkage {
    fn distance<F: Fn(usize, usize) -> f64>(&self, dist: F, a: &[usize], b: &[usize]) -> f64;
}

impl<T: CustomLinkage> AgglomerativeClustering for T {
    fn cluster(&'a self, threshold: f64) -> Vec<Vec<&'a P>> {
        // naive impl using self.distance()
    }

    fn compute_dendrogram(&'a self) -> Box<Dendrogram<&'a P>> {
        // naive impl using self.distance()
    }
}

What do you think? In my opinion this seems very clean and obvious as well as providing static dispatch, similar to your proposal with a LinkageCriterion trait.

For the CustomLinkage implementation I'm still missing a way to pass the points. I could add points() method to the CustomLinkage trait which my naive implementation could use, but this seems sub-optimal. Any ideas?

huonw commented 8 years ago

That's an interesting idea!

Maybe the points could be passed in from the outside? I.e. cluster and compute_dendrogram take &'a [P]? (This would possibly allow replacing the &'a self with &self or even &mut self.)

aomader commented 8 years ago

@huonw I implemented the new idea. What do you think? I do have some more ideas to simplify the code-base and to improve the performance. But since these changes shouldn't affect the public API I would like to do them in a second PR. So, does anything prevent you from merging this one?

aomader commented 8 years ago

@huonw ?