nltk / nltk

NLTK Source
https://www.nltk.org
Apache License 2.0

word embedding #798

Closed stevenbird closed 9 years ago

stevenbird commented 9 years ago

Implement some word embedding algorithms

dimazest commented 9 years ago

Hi,

In my experience, a typical word vector workflow looks like this, whether the vectors are distributional (based on co-occurrence counts) or distributed (learned).

  1. Decide on the source corpus used to build the vectors. I use the BNC and BNCReader for this (it's common to use Wikipedia or ukWaC; as far as I know, there are no readers for them).
  2. Decide the target and context (in the distributional approach) words. Target words are the words for which vectors are built, context words label vector dimensions.
    • It's common to take the N most frequent words in the corpus as context words. Again, I don't know if there is a standard way to obtain this information.
    • Words might be POS tagged.
    • Sometimes dependencies are taken into account.
  3. Extract co-occurrence statistics, or learn the model.
    • Here is how I do it. I've found Pandas very efficient and easy to use in this kind of data processing.
    • It's not clear what the best format to store spaces is, as they become pretty large. I store spaces in HDF files using Pandas, and they load pretty fast. An efficient data format is essential, because it takes quite a lot of time to load the published word2vec vectors using gensim.
  4. Weight the model. scikit-learn provides most of the methods used in the related literature. Here is how I weight the space.
  5. Perform experiments. There are many options for computing similarity (e.g. cosine similarity or correlation); again, scipy and scikit-learn provide most of the functionality. Also, there are several common datasets that word vectors are evaluated on, so it would be nice to have an interface to them as well (a rough sketch of steps 1–5 is given below).
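
For concreteness, here is a minimal sketch of steps 1–5 using only numpy, pandas and scipy; the toy corpus, window size and word lists are placeholders standing in for a real corpus reader, and none of this is existing NLTK code:

import numpy as np
import pandas as pd
from collections import Counter
from itertools import chain
from scipy.spatial.distance import cosine

# Toy corpus standing in for BNC / Wikipedia / ukWaC sentences.
sentences = [
    "the car drove down the road".split(),
    "the bus drove down the street".split(),
]

# Step 2: context words = N most frequent words in the corpus.
freqs = Counter(chain.from_iterable(sentences))
contexts = [w for w, _ in freqs.most_common(5)]
targets = ["car", "bus"]

# Step 3: co-occurrence counts within a symmetric window.
window = 2
counts = pd.DataFrame(0, index=targets, columns=contexts)
for sent in sentences:
    for i, w in enumerate(sent):
        if w not in targets:
            continue
        for c in sent[max(0, i - window): i + window + 1]:
            if c in contexts and c != w:
                counts.loc[w, c] += 1

# Step 4: PPMI weighting.
total = counts.values.sum()
p_wc = counts / total
p_w = counts.sum(axis=1) / total
p_c = counts.sum(axis=0) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(p_wc.div(p_w, axis=0).div(p_c, axis=1))
ppmi = pmi.clip(lower=0).fillna(0)

# Step 5: cosine similarity between two target vectors.
sim = 1 - cosine(ppmi.loc["car"], ppmi.loc["bus"])
print(round(sim, 2))

The weighting step could equally be done with scikit-learn transformers, and the resulting DataFrame could be stored in HDF as described above.
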

To sum up: word embeddings require fast IO operations. I don't think pickle would be sufficient (mainly because of the various versions, and because it's now common for researchers to release their vectors). It would be cool if NLTK provided basic functionality and an API to the existing tools.

On the other hand, NLTK should not reimplement what is already available in scikit-learn, scipy, gensim and dissect. Providing a unified API to them is probably a good idea.

P.S. I would be happy to work on this and move relevant code from fowler.corpora to NLTK. P.P.S. https://github.com/maciejkula/glove-python requires gcc 4.9!

alvations commented 9 years ago

gensim has a range of very good vector space tools. I'm also working on a Python-based neural net with as few dependencies as possible. Currently I depend on scipy and numpy, and I want to cut scipy out because of its own dependencies.

What if we "port" code from gensim? Would that be okay?

kmike commented 9 years ago

Gensim is good! -1 for "porting" gensim code, because it means more code for NLTK to maintain with no clear benefit.

dimazest commented 9 years ago

Agreed, it isn't worth reimplementing gensim inside NLTK.

However, would it make sense to implement functionality for word similarity tasks (for example, wordsim353) in NLTK? Should NLTK then depend on gensim?

Gensim's Dictionary is useful for deciding which context words to use when building co-occurrence based vectors. But I didn't find code that builds a co-occurrence matrix.

It would be nice to do something like this:

>>> corpus = BNCReader(...)
>>> space = CooccurrenceSpace(
...    corpus,
...    targets=['car', 'bus'],
...    contexts=corpus.most_frequent_words(2000),
...    weighting='ppmi',
... )

>>> space.similarity('car', 'bus', measure='cosine')
0.82
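
A minimal sketch of how such a class could be implemented, using gensim's Dictionary (mentioned above) to pick the context vocabulary; the class, its constructor arguments and the PPMI/cosine details are assumptions for illustration, not an existing API:

import numpy as np
from gensim.corpora import Dictionary


class CooccurrenceSpace:
    """Hypothetical target x context co-occurrence space with PPMI weighting."""

    def __init__(self, sentences, targets, n_contexts=2000, window=5):
        # Use gensim's Dictionary to pick the most frequent context words.
        dictionary = Dictionary(sentences)
        dictionary.filter_extremes(no_below=1, no_above=1.0, keep_n=n_contexts)
        self.contexts = list(dictionary.token2id)
        self.targets = list(targets)
        t_idx = {w: i for i, w in enumerate(self.targets)}
        c_idx = {w: j for j, w in enumerate(self.contexts)}

        # Count target/context co-occurrences within a symmetric window.
        counts = np.zeros((len(self.targets), len(self.contexts)))
        for sent in sentences:
            for i, w in enumerate(sent):
                if w not in t_idx:
                    continue
                for c in sent[max(0, i - window): i + window + 1]:
                    if c in c_idx and c != w:
                        counts[t_idx[w], c_idx[c]] += 1

        # PPMI weighting of the raw counts.
        total = counts.sum()
        pw = counts.sum(axis=1, keepdims=True) / total
        pc = counts.sum(axis=0, keepdims=True) / total
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log((counts / total) / (pw * pc))
        self.vectors = np.nan_to_num(np.maximum(pmi, 0.0))
        self._t_idx = t_idx

    def similarity(self, w1, w2, measure="cosine"):
        # Only cosine is sketched here.
        v1, v2 = self.vectors[self._t_idx[w1]], self.vectors[self._t_idx[w2]]
        denom = np.linalg.norm(v1) * np.linalg.norm(v2)
        return float(v1 @ v2 / denom) if denom else 0.0

With a real corpus reader, sentences would come from something like BNCReader's sents(), and the most_frequent_words call in the sketch above corresponds to the keep_n filtering here.
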
alvations commented 9 years ago

CooccurrenceSpace sounds a little weird, but I get the idea.

Is the assumption for the space plain bag-of-words? Or should the module allow different spaces with various kinds of information and density? E.g. skip-grams, or contextual BoW (I'm not sure whether the C in CBOW means contextual, but it does use the context window)?

A good thing is that we don't need compression modules, since a recent study has shown that skipping compression works best; but then it's a matter of 300k dimensions vs. 500 dimensions (users need bigger-than-laptop machines to train the space). See http://clic.cimec.unitn.it/marco/publications/acl2014/baroni-etal-countpredict-acl2014.pdf.

From a user perspective, my code for vector space related projects uses a mish-mash of NLTK + gensim + scikit-learn. I'm not sure whether it's best practice, but it's the fastest way to churn models out.
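
For illustration, such a mish-mash pipeline might look like the sketch below; the Brown corpus and the hyperparameters are placeholders, and the Word2Vec keyword names are those of recent gensim versions (older versions use slightly different names, e.g. size instead of vector_size):

import nltk
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity

# NLTK supplies the corpus and tokenisation (Brown is just a stand-in here).
sentences = [[w.lower() for w in sent] for sent in nltk.corpus.brown.sents()]

# gensim trains the embeddings.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

# scikit-learn handles the downstream similarity computation.
v1 = model.wv["money"].reshape(1, -1)
v2 = model.wv["bank"].reshape(1, -1)
print(cosine_similarity(v1, v2)[0, 0])
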

dimazest commented 9 years ago

The class name doesn't matter that much. The main idea is to make it easy to build the spaces used in the literature. It would be cool to have a common API that allows building common distributional (or distributed) models.

alvations commented 9 years ago

At some point, we should make the module scale to this: http://www.cs.bgu.ac.il/~yoavg/publications/acl2014syntemb.pdf

longdt219 commented 9 years ago

Hi all, here are some thoughts:

(1) Using word2vec through gensim is a good start. However, having a common API for word representations, either distributional or distributed, is crucial, since this is the main contribution. There is a big comparison of word embedding methods at wordvectors.org/index.php

(2) We can incorporate some pre-trained English models into our corpora.
(3) Allow loading external word embeddings (binary and text formats).
(4) Allow training the model using various configurations.
(5) Allow saving in binary / text format (currently gensim uses pickle); see the sketch below.
(6) Cross-lingual word embeddings are a possible extension (http://arxiv.org/pdf/1404.4641.pdf).
(7) Go beyond words: phrase embeddings, sentence embeddings, document embeddings, which are already implemented in gensim (http://radimrehurek.com/2014/12/doc2vec-tutorial/).
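
For points (3) and (5), recent gensim versions already provide the mechanics; a sketch with placeholder file names:

from gensim.models import KeyedVectors

# (3) Load externally released vectors in binary or text word2vec format
#     (the file names here are placeholders).
vectors = KeyedVectors.load_word2vec_format("released-vectors.bin", binary=True)

# (5) Save in text or binary word2vec format rather than pickle.
vectors.save_word2vec_format("vectors.txt", binary=False)
vectors.save_word2vec_format("vectors.bin", binary=True)
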

stevenbird commented 9 years ago

@longdt219 is there a good default pre-trained model that we could include with NLTK-data? I can add it, and then you can provide sample code for using it: load the model, then given some input, find the closest word, or compare two words, etc.
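
A hedged sketch of what that sample code could look like, assuming the model is shipped through nltk_data (the path below is a placeholder for whatever name it gets) and queried with gensim:

import gensim
from nltk.data import find

# Placeholder path; the actual name of the packaged model may differ.
model_path = str(find("models/word2vec_sample/pruned.word2vec.txt"))
model = gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=False)

# Closest words to a given input.
print(model.most_similar("car", topn=5))

# Similarity between two words.
print(model.similarity("car", "bus"))
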

stevenbird commented 9 years ago

@longdt219 can we please have an example based on the word2vec tutorial? Create a good model using one of the word2vec datasets, and prune the less-frequent words so that the model size is less than 100MB. Then we need the code for training this, and for using it to do one of the standard word similarity tasks.
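
A rough sketch of the training-and-pruning side, assuming the text8 dataset from the word2vec page and recent gensim parameter names; min_count is the knob that drops less-frequent words and keeps the saved model small:

from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus

# Train on a standard word2vec dataset (text8 here); a higher min_count
# prunes less-frequent words and shrinks the saved model.
sentences = Text8Corpus("text8")
model = Word2Vec(sentences, vector_size=100, window=5, min_count=50, workers=4)

# Save only the word vectors, in the plain text word2vec format.
model.wv.save_word2vec_format("pruned.word2vec.txt", binary=False)

# Quick check that the vectors behave sensibly.
print(model.wv.similarity("woman", "man"))
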

longdt219 commented 9 years ago

Created PR #971

stevenbird commented 9 years ago

Sorry not to be clearer. The best way to provide an example is to create a file nltk/test/gensim.doctest. Then you can explain the steps in prose, and provide doctest examples etc. The result will then appear at: http://www.nltk.org/howto/

@longdt219 would you please rework your example to be a doctest file instead of a package?

longdt219 commented 9 years ago

Sure @stevenbird. Do you think we should provide visualization of the embeddings? It would be a cool thing to do. t-SNE is a package that most people use, and it's free for non-commercial purposes.
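
A sketch of such a visualization, using scikit-learn's TSNE (rather than the standalone t-SNE package) and matplotlib; model is assumed to hold already-loaded word vectors as in the earlier examples, and the word list is arbitrary:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Take a small set of words and their vectors from the loaded model.
words = ["car", "bus", "train", "apple", "banana", "orange"]
vectors = np.array([model[w] for w in words])

# Project the vectors down to 2-D and label each point with its word.
coords = TSNE(n_components=2, perplexity=2.0, random_state=0).fit_transform(vectors)
plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y))
plt.show()
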

stevenbird commented 9 years ago

@longdt219 that sounds like a good thing to include, thanks. Note that the word embedding data is now in the repository, but with a more specific name; please see https://github.com/nltk/nltk_data/commit/a6db6934fdebbee698f077f203d9b204ae351934

longdt219 commented 9 years ago

Hi @stevenbird, could you please close this issue? It is resolved in PR #971.

stevenbird commented 9 years ago

Thanks @longdt219