Closed kyleabeauchamp closed 10 years ago
Overall -1. This needs to be thought out more.
(1) Our clustering pipeline is conceptually structured around ideas of distance metrics, and that's totally abandoned here, since kmeans is only for euclidean. If nothing else, we should explain this at the script-level documentation and throw nice errors there, so users aren't only seeing the TypeErrors in a traceback. (2) Why are we using sklearn kmeans when sklearn isn't currently a dependency, and there is an implementation of kmeans available in scipy which is already a dependency?
(1). We can add enough error checking to make this work properly. We already have several choices of metrics and clustering algorithms that are incompatible, so there's no reason to reject kmeans for this reason alone. MSMBuilder3 will provide an opportunity for a clean rewrite of things.
(2). IMHO sklearn is probably better maintained and with a cleaner interface. I'm worried about the fact that clustering is buried deep within scipy and possibly maintained by people without domain expertise. We also currently have optional dependencies, so that's really nothing new.
I'm worried about the fact that clustering is buried deep within scipy and possibly maintained by people without domain expertise.
Really? You think there's a bug in scipy's kmeans? Do you actually have any evidence for that? I'm highly skeptical If there's a convincing reason that the sklearn implementation is actually better, (faster, more scalable), etc, I'm all ears. But we shouldn't make MSMBuilder even harder to install just because we prefer programming against the sklearn API.
We can switch to scipy, the choice of library doesn't really matter.
The point is that this is pretty easy to add and could be useful to some people.
Closing this temporarily, will be reappear soon.
We should probably support kmeans...