keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Is Word Clustering Possible In Keras? #840

Closed NickShahML closed 8 years ago

NickShahML commented 8 years ago

Hey everyone,

I've been trying to figure out a way to cluster words based upon similarity.

Suppose you read many books that together use 100k different words. It would be great if you could make ~1000 clusters with approx. 100 words per cluster, where the words in each cluster are similar to each other: "dog" and "cat" in one cluster, "truck" and "car" in another.

I saw that there's the well-made skipgram word-embedding script example: https://github.com/fchollet/keras/blob/master/examples/skipgram_word_embeddings.py

And I also saw that word2vec has made word clusters: https://code.google.com/p/word2vec/

I know that they usually apply k-means on top of the word vectors created. I thought it would be good to start a discussion about this in case other Keras users are interested in the same thing.
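
For concreteness, here is a minimal sketch of that k-means step, assuming the word vectors already exist as a NumPy array. The `embeddings` values below are made-up toy numbers standing in for trained vectors, and the `kmeans` helper is a hypothetical name, not part of Keras or word2vec.

```python
import numpy as np


def kmeans(vectors, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct rows of the data.
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every vector to every centroid.
        dists = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = vectors[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Toy stand-in for learned embeddings (hypothetical values, not real output).
words = ["dog", "cat", "truck", "car"]
embeddings = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

labels, _ = kmeans(embeddings, k=2)

# Group words by their assigned cluster id.
clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(int(label), []).append(word)
print(clusters)
```

At the 100k-word, 1000-cluster scale the all-pairs distance array here would get large, so a library implementation such as scikit-learn's `KMeans` or `MiniBatchKMeans` is probably the more practical choice.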

Thanks!

jmhessel commented 8 years ago

This might be more appropriate for the keras-users google group, but a few observations:

Most tasks like the one you describe end up reducing to a factorization of some co-occurrence matrix: either word-word within a context window (this gives word2vec) or word-document across the whole corpus (this gives a topic model).
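
As a rough illustration of the word-word case, the sketch below counts co-occurrences in a small window and factorizes the log-scaled counts with a truncated SVD. This is a count-based analogue rather than word2vec itself, and the toy `corpus`, the `window` size, and the `dim` value are all assumptions chosen for the example.

```python
import numpy as np

# Toy corpus (hypothetical); a real one would be far larger.
corpus = [
    "the dog chased the cat",
    "the cat chased the dog",
    "the truck passed the car",
    "the car passed the truck",
]
window = 2  # assumed context window size

tokens = [sentence.split() for sentence in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric word-word co-occurrence counts within the window.
cooc = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[index[w], index[sent[j]]] += 1

# Truncated SVD of the log-scaled counts: keep the top `dim` directions.
dim = 2
u, s, _ = np.linalg.svd(np.log1p(cooc))
word_vectors = u[:, :dim] * s[:dim]

for w in vocab:
    print(w, np.round(word_vectors[index[w]], 2))
```

The rows of `word_vectors` can then be clustered like any other embedding, for example with k-means.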

I would recommend trying something like latent Dirichlet allocation (LDA) before neural network methods. Here's a cool browser plugin that runs on State of the Union addresses: http://mimno.infosci.cornell.edu/jsLDA/

NickShahML commented 8 years ago

Thank you @jmhessel , I will repost this on the keras users group. I appreciate the plugin.

Good to know about LDA vs. neural net methods. Definitely going to check out that plugin!

srilekha1993 commented 6 years ago

Can anyone implement an RBM using Keras and share the code?