.. -*- coding: utf-8; -*-
Loaders and savers for different implementations of `word embedding <https://en.wikipedia.org/wiki/Word_embedding>`_. The motivation of this project is that it is cumbersome to write loaders for different pretrained word embedding files. This project provides a simple interface for loading pretrained word embedding files in different formats.
.. code:: python

    from word_embedding_loader import WordEmbedding

    wv = WordEmbedding.load('path/to/embedding.bin')
    print(wv.vectors[wv.vocab[b'is']])
    wv.save('path/to/save.txt', 'word2vec', binary=False)
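Since ``wv.vectors`` is a matrix with one row per vocabulary word and ``wv.vocab`` maps words to row indices, ordinary NumPy operations apply to the loaded arrays. A minimal sketch of cosine similarity over such arrays (the toy ``vectors`` and ``vocab`` below are made-up stand-ins, not data loaded by the library):

.. code:: python

    import numpy as np

    # Toy stand-ins for wv.vectors / wv.vocab (keys are bytes, since the
    # library leaves the vocab undecoded).
    vectors = np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [1.0, 1.0]])
    vocab = {b'cat': 0, b'dog': 1, b'pet': 2}

    def cosine(u, v):
        # Cosine similarity between two 1-D vectors.
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    sim = cosine(vectors[vocab[b'cat']], vectors[vocab[b'pet']])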
This project currently supports the following formats:

- `GloVe <https://nlp.stanford.edu/projects/glove/>`_, Global Vectors for Word Representation, by Jeffrey Pennington, Richard Socher, and Christopher D. Manning from the Stanford NLP group.
- `word2vec <https://code.google.com/archive/p/word2vec/>`_, by Mikolov.

  - text (the ``-binary 0`` option (the default))
  - binary (the ``-binary 1`` option)

- `gensim <https://radimrehurek.com/gensim/>`_ 's ``models.word2vec`` module (coming)

Sometimes, you want to combine an external program with a word embedding file of your own choice. This project also provides a simple executable to convert one word embedding format to another.
.. code:: bash

    word-embedding-loader convert -t glove test/word_embedding_loader/word2vec.bin test.bin

Run with ``--help`` to see other options:

.. code:: bash

    word-embedding-loader --help
    word-embedding-loader convert --help
This project does not decode the vocab; it is up to users to determine the encoding and decode the bytes.

.. code:: python

    decoded_vocab = {k.decode('latin-1'): v for k, v in wv.vocab.items()}
.. note::

    Encoding of pretrained word2vec is latin-1. Encoding of pretrained glove is utf-8.
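To make the note above concrete, here is a self-contained sketch of decoding a bytes-keyed vocab (the byte strings are made up for illustration; pick ``'latin-1'`` or ``'utf-8'`` to match the pretrained file):

.. code:: python

    # Hypothetical undecoded vocab, as the loaders would leave it:
    # bytes keys mapped to row indices into the vectors matrix.
    raw_vocab = {b'the': 0, b'caf\xc3\xa9': 1}  # these bytes happen to be utf-8

    # Use 'utf-8' for pretrained GloVe, 'latin-1' for pretrained word2vec.
    decoded_vocab = {k.decode('utf-8'): v for k, v in raw_vocab.items()}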
This project uses Cython to build some modules, so you need Cython for development.

.. code:: bash

    pip install -r requirements.txt

If the environment variable ``DEVELOP_WE`` is set, it will try to rebuild the ``.pyx`` modules.

.. code:: bash

    DEVELOP_WE=1 python setup.py test