Closed stevenbird closed 9 years ago
Hi,
In my experience, a typical word-vector workflow looks like this, whether the vectors are distributional (based on co-occurrence counts) or distributed (learned).
To sum up: word embeddings require fast I/O operations. I don't think pickle would be sufficient, mainly because of format versioning, and because it's now common for researchers to 'release' their vectors. It would be cool if NLTK provided basic functionality and an API for working with released vectors.
On the other hand, NLTK should not reimplement what is already implemented in scikit-learn, SciPy, gensim, and dissect. Providing a unified API to them is probably a good idea.
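The unified-API idea could look something like the sketch below; `VectorSpaceModel` and `DictBackedModel` are illustrative names for this discussion, not an existing NLTK or gensim API:

```python
import math

# Hypothetical sketch: a common interface that count-based backends
# (scipy/dissect) and learned backends (gensim) could both implement.
class VectorSpaceModel:
    def vector(self, word):
        """Return the vector for `word` as a sequence of floats."""
        raise NotImplementedError

    def similarity(self, w1, w2):
        """Cosine similarity between two words' vectors."""
        u, v = self.vector(w1), self.vector(w2)
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0


class DictBackedModel(VectorSpaceModel):
    """Toy backend that keeps vectors in a plain dict."""

    def __init__(self, vectors):
        self._vectors = vectors

    def vector(self, word):
        return self._vectors[word]
```

A real backend would wrap, say, a gensim model in the same interface, so user code never cares how the vectors were produced.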
P.S. I would be happy to work on this and move relevant code from fowler.corpora to NLTK.
P.P.S. https://github.com/maciejkula/glove-python requires gcc 4.9!
gensim
has a range of very good vector space tools. I'm also working on a Python-based neural net with as few dependencies as possible. Currently I depend on SciPy and NumPy, and I want to cut SciPy out because of its dependencies.
What if we "port" code from gensim? Would that be okay?
Gensim is good! -1 for "porting" gensim code, because it means more code for NLTK to maintain with no clear benefit.
Agreed, it isn't worth reimplementing gensim inside NLTK.
However, would it make sense to implement functionality for word similarity tasks (for example, wordsim353) in NLTK? Should NLTK then depend on gensim?
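A wordsim353-style evaluation usually reports the Spearman correlation between model similarities and human judgments, which can be done without any heavy dependencies. A minimal sketch; `evaluate_similarity` is an illustrative name, not an existing NLTK function:

```python
import math

def _ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(xs), _ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - mx) ** 2 for a in rx) *
                    sum((b - my) ** 2 for b in ry))
    return num / den

def evaluate_similarity(model_sim, gold_pairs):
    """gold_pairs: [(w1, w2, human_score), ...]; model_sim(w1, w2) -> float."""
    gold = [s for _, _, s in gold_pairs]
    pred = [model_sim(a, b) for a, b, _ in gold_pairs]
    return spearman(pred, gold)
```

Any backend (gensim or otherwise) that can score a word pair could plug into `model_sim`, which is one argument for keeping the evaluation code independent of gensim.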
Gensim's Dictionary
is useful for deciding which context words to use when building co-occurrence based vectors. But I didn't find code that builds a co-occurrence matrix.
It would be nice to do something like this:
>>> corpus = BNCReader(...)
>>> space = CooccurrenceSpace(
... corpus,
... targets=['car', 'bus'],
... contexts=corpus.most_frequent_words(2000),
... weighting='ppmi',
... )
>>> space.similarity('car', 'bus', measure=np.cos)
0.82
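A rough idea of what such a space could do under the hood: count context words in a window around each target, reweight the counts with positive PMI, and compare the resulting vectors by cosine. This is a hypothetical, dependency-free sketch, not existing NLTK code:

```python
from collections import Counter
import math

def cooccurrence_counts(tokens, targets, contexts, window=2):
    """Count context words within `window` positions of each target word."""
    targets, contexts = set(targets), set(contexts)
    counts = {t: Counter() for t in targets}
    for i, tok in enumerate(tokens):
        if tok in targets:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i and tokens[j] in contexts:
                    counts[tok][tokens[j]] += 1
    return counts

def ppmi(counts):
    """Reweight raw co-occurrence counts by positive pointwise mutual
    information, dropping non-positive entries."""
    total = sum(sum(c.values()) for c in counts.values())
    ctx_totals = Counter()
    for c in counts.values():
        ctx_totals.update(c)
    vectors = {}
    for t, c in counts.items():
        t_total = sum(c.values())
        vec = {}
        for w, n in c.items():
            pmi_val = math.log(n * total / (t_total * ctx_totals[w]))
            if pmi_val > 0:
                vec[w] = pmi_val
        vectors[t] = vec
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse {word: weight} vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

The proposed `CooccurrenceSpace` class would just bundle these three steps behind a corpus-reader-friendly constructor.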
CooccurrenceSpace
sounds a little odd, but I get the idea.
Is the assumption that the space is a plain bag of words? Or should the module allow different spaces of varying information and density, e.g. skip-grams, or continuous BoW (CBOW), which does use the context window?
A good thing is that we don't need compression modules, since a recent study has shown that no compression works best; but it's a matter of 300k dimensions vs. 500 dimensions, so users need bigger-than-laptop machines to train the space. See http://clic.cimec.unitn.it/marco/publications/acl2014/baroni-etal-countpredict-acl2014.pdf .
From a user perspective, my code for vector space related projects uses a mish-mash of NLTK + gensim + scikit-learn. I'm not sure whether it's best practice, but it's the fastest way to churn models out.
Class names don't matter that much. The main idea is to make it easy to build the spaces used in the literature. It would be cool to have a common API for building common distributional (and distributed) models.
At some point, we should make the module scale to this: http://www.cs.bgu.ac.il/~yoavg/publications/acl2014syntemb.pdf
Hi all, here are some thoughts:
(1) Using word2vec through gensim is a good start. However, having a common API for word representations, whether distributional or distributed, is crucial, since this is the main contribution. There is a big comparison of word embedding methods at wordvectors.org/index.php
(2) We can incorporate a pre-trained English model into our corpora.
(3) Allow loading external word embeddings (binary and text formats).
(4) Allow training the model with various configurations.
(5) Allow saving in binary / text format (currently gensim uses pickle).
(6) Cross-lingual word embeddings are a possible extension (http://arxiv.org/pdf/1404.4641.pdf).
(7) Go beyond words: phrase, sentence, and document embeddings, which are already implemented in gensim (http://radimrehurek.com/2014/12/doc2vec-tutorial/).
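Regarding point (5), the word2vec text format is simple enough to read and write without pickle: a header line with the vocabulary size and dimensionality, then one line per word with its components. A minimal sketch under that assumption (not gensim's actual implementation):

```python
def save_word2vec_text(vectors, path):
    """Write {word: [floats]} as a 'vocab_size dim' header followed by
    one 'word v1 v2 ...' line per word (the word2vec text format)."""
    dim = len(next(iter(vectors.values())))
    with open(path, "w", encoding="utf-8") as f:
        f.write("%d %d\n" % (len(vectors), dim))
        for word, vec in vectors.items():
            f.write(word + " " + " ".join("%f" % x for x in vec) + "\n")

def load_word2vec_text(path):
    """Read the format written by save_word2vec_text back into a dict."""
    with open(path, encoding="utf-8") as f:
        n_words, dim = (int(x) for x in f.readline().split())
        vectors = {}
        for line in f:
            parts = line.rstrip("\n").split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
        return vectors
```

The text format is larger on disk than binary, but it is portable across library versions, which addresses the pickle concern raised earlier in the thread.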
@longdt219 is there a good default pre-trained model that we could include with NLTK-data? I can add it, and then you can provide sample code for using it: load the model, then given some input, find the closest word, or compare two words, etc.
@longdt219 can we please have an example based on the word2vec tutorial: create a good model using one of the word2vec datasets, and prune the less frequent words so that the model size is under 100MB. Then we need the code for training this, and for using it on one of the standard word similarity tasks.
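The pruning step could be as simple as filtering the vector table against corpus frequencies; `prune_model` below is a hypothetical helper sketched for this thread, not an NLTK function:

```python
from collections import Counter

def prune_model(vectors, corpus_tokens, keep=100000):
    """Drop vectors for all but the `keep` most frequent corpus words,
    shrinking the saved model (e.g. to stay under a 100MB budget)."""
    freq = Counter(corpus_tokens)
    keep_words = {w for w, _ in freq.most_common(keep)}
    return {w: v for w, v in vectors.items() if w in keep_words}
```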
Created PR #971
Sorry not to be clearer. The best way to provide an example is to create a file nltk/test/gensim.doctest
. Then you can explain the steps in prose, provide doctest examples, etc. The result will then appear at: http://www.nltk.org/howto/
@longdt219 would you please rework your example to be a doctest file instead of a package?
Sure @stevenbird. Do you think we should provide visualization of the embeddings? It would be a cool thing to do. t-SNE is the package most people use, and it's free for non-commercial purposes.
@longdt219 that sounds like a good thing to include, thanks. Note that the word embedding data is now in the repository, but with a more specific name; please see https://github.com/nltk/nltk_data/commit/a6db6934fdebbee698f077f203d9b204ae351934
Hi @stevenbird, could you close this issue? Resolved in PR #971.
Thanks @longdt219
Implement some word embedding algorithms