maxoodf / word2vec

word2vec++ is a Distributed Representations of Words (word2vec) library and tools implementation, written from scratch in C++11
Apache License 2.0
c-plus-plus doc2vec doc2vec-word2vec machine-learning machine-learning-algorithms ml nlp nlp-machine-learning word2vec word2vec-algorithm word2vec-en word2vec-model word2vec-ru

word2vec++

Introduction

word2vec++ is a library and a set of tools implementing Distributed Representations of Words (word2vec).

Building

You will need a C++11-compatible compiler and CMake 3.1 or higher. Execute the following commands:

  1. git clone https://github.com/maxoodf/word2vec.git word2vec++
  2. cd word2vec++
  3. mkdir build
  4. cd build
  5. cmake -DCMAKE_BUILD_TYPE=Release ../
  6. make
  7. cd ../

After a successful build you will find the compiled tools in the ./bin directory, libword2vec.a in ./bin/lib, and the examples in ./bin/examples.

Training the model

The training utility is called w2v_trainer and is located in the project's bin directory. Run ./w2v_trainer without parameters to print brief help information, including the available training parameters.

For example, to train a model from the corpus.txt file and save it to model.w2v using Skip-Gram, Negative Sampling with 10 examples, a vector size of 500, a downsampling threshold of 1e-5, and 3 iterations, with all other parameters left at their defaults, run:
./w2v_trainer -f ./corpus.txt -o ./model.w2v -g -n 10 -s 500 -l 1e-5 -i 3
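
To give a feel for what the downsampling threshold in the command above does, here is a minimal sketch of the subsampling rule used by the original word2vec C code. It assumes the threshold passed to w2v_trainer plays the same role in word2vec++. The function name keepProbability and its parameters are hypothetical and are not part of the word2vec++ API.

```cpp
#include <algorithm>
#include <cmath>

// Probability of keeping a single occurrence of a word during training,
// following the subsampling rule of the original word2vec C code.
// Assumption: word2vec++ applies its downsampling threshold the same way.
//   wordFreq  - the word's share of all corpus tokens (count / total tokens)
//   threshold - the downsampling threshold, e.g. 1e-5
double keepProbability(double wordFreq, double threshold) {
    if (wordFreq <= 0.0 || threshold <= 0.0) {
        return 1.0; // nothing to downsample
    }
    const double ratio = threshold / wordFreq;
    // sqrt(t/f) + t/f, capped at 1 so that rare words are always kept
    return std::min(1.0, std::sqrt(ratio) + ratio);
}
```

With a threshold of 1e-5, a word that makes up 0.1% of the corpus (wordFreq = 1e-3) would be kept about 11% of the time under this rule, while a word whose frequency is at or below the threshold is always kept.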

Basic usage

You can download one or more models (833 MB each) trained on an 11.8 GB corpus of English texts:

Implementation improvements vs. the original C code