scalanlp / nak

The Nak Machine Learning Library
Apache License 2.0

Improve k-means code. #helpwanted #10

Closed: jasonbaldridge closed 11 years ago

jasonbaldridge commented 11 years ago

The current k-means implementation is something I did for homework assignments for teaching NLP courses at UT Austin. It can handle a fair amount, but it runs out of steam (in particular, memory) for larger datasets, especially if they have a lot of features. It currently uses dense vectors to represent the features for each data point, so it should be a fairly straightforward win to change this to use sparse vectors instead.
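To make the dense-versus-sparse point concrete, here is a minimal sketch in plain Scala. It is not the nak or Breeze code: `SparsePoint` and `SparseKMeansSketch` are hypothetical names introduced only for illustration. Each data point stores just its non-zero features, while centroids stay dense, so the distance computation touches only the non-zero entries of the point.

```scala
// Hypothetical sketch: these names are illustrative, not part of nak or Breeze.
// A sparse point keeps only its non-zero features as parallel arrays.
final case class SparsePoint(indices: Array[Int], values: Array[Double], dim: Int)

object SparseKMeansSketch {
  // Squared L2 norm of a dense centroid; can be computed once per iteration and cached.
  def normSq(c: Array[Double]): Double = c.map(v => v * v).sum

  // Squared Euclidean distance between a sparse point x and a dense centroid c, using
  //   ||x - c||^2 = ||c||^2 + sum over non-zero i of (x_i^2 - 2 * x_i * c_i).
  // With ||c||^2 cached, the cost is proportional to the number of non-zeros in x.
  def squaredDistance(x: SparsePoint, centroid: Array[Double], centroidNormSq: Double): Double = {
    var acc = centroidNormSq
    var k = 0
    while (k < x.indices.length) {
      val v = x.values(k)
      acc += v * v - 2.0 * v * centroid(x.indices(k))
      k += 1
    }
    acc
  }

  // Index of the nearest centroid for one sparse point.
  def nearest(x: SparsePoint,
              centroids: IndexedSeq[Array[Double]],
              normsSq: IndexedSeq[Double]): Int =
    centroids.indices.minBy(j => squaredDistance(x, centroids(j), normsSq(j)))
}
```

With this layout, memory for the data set grows with the total number of non-zero features rather than with (number of points) × (number of features), which is where the dense representation runs out of room.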

dlwh commented 11 years ago

As is my (bad) habit, the K-means(++) impl in Breeze is generic on vector type, so it can use SparseVectors.

-- David

On Tue, Apr 16, 2013 at 12:45 PM, Jason Baldridge notifications@github.com wrote:

The current k-means implementation is something I did for homework assignments for teaching NLP courses at UT Austin. It can handle a fair amount, but it runs out of steam (in particular, memory) for larger datasets, especially if they have a lot of features. It currently uses dense vectors to represent the features for each data point, so it should be a fairly straightforward win to change this to use sparse vectors instead.


jasonbaldridge commented 11 years ago

Awesome. This may be sorted out directly as we transition things from Breeze then.
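For readers unfamiliar with what "generic on vector type" buys here, the following is a rough sketch of the idea in plain Scala. It does not reproduce Breeze's actual API: `VectorLike`, its two instances, and `GenericKMeansSketch.assign` are hypothetical names, and only the assignment step of k-means is shown. The point is that the clustering logic is written once and the choice of dense or sparse storage is made by the caller.

```scala
// Hypothetical sketch: VectorLike and its instances are illustrative, not Breeze's API.
trait VectorLike[V] {
  def squaredDistance(a: V, b: V): Double
}

object VectorLike {
  // Dense instance: every coordinate is stored.
  implicit val dense: VectorLike[Array[Double]] = new VectorLike[Array[Double]] {
    def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
      require(a.length == b.length, "dimension mismatch")
      var acc = 0.0
      var i = 0
      while (i < a.length) { val d = a(i) - b(i); acc += d * d; i += 1 }
      acc
    }
  }

  // Sparse instance: only non-zero coordinates are stored, keyed by feature index.
  implicit val sparse: VectorLike[Map[Int, Double]] = new VectorLike[Map[Int, Double]] {
    def squaredDistance(a: Map[Int, Double], b: Map[Int, Double]): Double =
      (a.keySet ++ b.keySet).iterator.map { i =>
        val d = a.getOrElse(i, 0.0) - b.getOrElse(i, 0.0)
        d * d
      }.sum
  }
}

object GenericKMeansSketch {
  // Assignment step of k-means, written once for any V that has a VectorLike instance.
  def assign[V](points: Seq[V], centroids: IndexedSeq[V])(implicit ops: VectorLike[V]): Seq[Int] =
    points.map(p => centroids.indices.minBy(j => ops.squaredDistance(p, centroids(j))))
}
```

Calling `GenericKMeansSketch.assign` with `Array[Double]` points picks the dense instance, and calling it with `Map[Int, Double]` points picks the sparse one, with no change to the clustering code itself.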