legel / words2map

online natural language processing with word vectors
http://web.archive.org/web/20160806040004if_/http://blog.yhat.com/posts/words2map.html
MIT License

Clustering in 2D - is that the best choice? #2

Open robinlabs opened 8 years ago

robinlabs commented 8 years ago

Question to the Yhat folks: why cluster in 2D? Granted, clustering in 300D is hard :) Still, the 2D projection must introduce significant metric distortion. Why not a middle ground, say, 5-10D? Have you tried that?

legel commented 8 years ago

Thanks @robinlabs, that's definitely a great question.

The short answer is you're right: 2D is not necessarily optimal. It's clearly nice for data visualization, although 3D would probably be even cooler...

In any case, I haven't yet tried full HDBSCAN clustering in 300D, so that could be interesting to try out. It would also be interesting to see whether there's some way to estimate a "maximum likelihood" value for D that balances preservation of information against suppression of noise in the derived 300D vectors. t-SNE helps reduce the noise that naturally emerges when averaging 25 completely different vectors for keywords found online...

Definitely hope to improve this, and any ideas / contributions are welcome!
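For readers following along, here is a minimal sketch of the pipeline described above: average several word vectors into one derived 300D vector per keyword, then reduce with t-SNE. The data here is synthetic random noise standing in for real word2vec vectors, and the keyword/vector counts are illustrative assumptions, not the project's actual parameters.

```python
# Sketch of the derive-then-reduce pipeline under discussion, on synthetic data.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for 25 word vectors per keyword; in words2map these would come
# from a pretrained 300-D word2vec model, not random noise.
n_keywords = 40
derived = np.stack([
    rng.normal(size=(25, 300)).mean(axis=0)  # average 25 vectors into one
    for _ in range(n_keywords)
])

# Reduce the derived 300-D vectors to 2-D with t-SNE for visualization;
# some of the averaging noise is suppressed in the low-D embedding.
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(derived)
print(coords.shape)  # one 2-D point per keyword
```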

robinlabs commented 8 years ago

HDBSCAN may very well break in 300D, but 5-10D may be reasonable: it forces less metric distortion while still giving quite a bit of noise suppression. If you do try that, it would be interesting to know the results!


legel commented 8 years ago

Definitely!

I suspect that at some point 3D HDBSCAN is going to be awesome to set up (probably when we're hooking up an internal dashboard for overlap.ai), and around that time I'll check on all of this and report back.