rlebret / hpca

C++ implementation of the Hellinger PCA for computing word embeddings.
GNU General Public License v3.0

hpca is a C++ toolkit providing an efficient implementation of the Hellinger PCA for computing word embeddings. See the EACL 2014 paper for more details.

PREREQUISITES

This project requires:

BUILDING

This project uses the Cross-platform Make (CMake) build system. However, we have conveniently provided a wrapper configure script and Makefile so that the typical build invocation of ./configure followed by make will work. For a list of all possible build targets, use the command make help.

NOTE: Users of CMake may believe that the top-level Makefile has been generated by CMake; it hasn't, so please do not delete that file.

INSTALLING

Once the project has been built (see "BUILDING"), execute sudo make install.

See Install for more details.

GETTING WORD EMBEDDINGS

This package includes 9 different tools: preprocess, vocab, stats, cooccurrence, pca, embeddings, inference, eval and neighbors.

Corpus preprocessing

Converts the corpus to lowercase and/or replaces all numbers with a special token ('0').

The corpus needs to be a tokenized plain text file containing only the sentences of the corpus.

Before running the preprocess tool, the authors strongly recommend running a tokenizer, e.g. the Stanford Tokenizer.

java -cp stanford-parser.jar edu.stanford.nlp.process.PTBTokenizer -preserveLines corpus-sentences.txt > corpus-token.txt

preprocess options:

Example:

preprocess -input-file corpus-token.txt -output-file corpus-clean.txt -lower 1 -digit 1 -verbose 1 -threads 8 -gzip 0
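The two cleanup operations are simple to picture. A minimal Python sketch (illustrative only, not the actual hpca code; mapping each individual digit to '0' is an assumption about how numbers are replaced):

```python
import re

def preprocess_line(line, lower=True, digit=True):
    """Sketch of the preprocess step: lowercase the text and map
    digits to the special token '0' (per-digit mapping is assumed)."""
    if lower:
        line = line.lower()
    if digit:
        line = re.sub(r"[0-9]", "0", line)
    return line

print(preprocess_line("Born in 1984 in London ."))
# -> born in 0000 in london .
```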

Vocabulary extraction

Extracting the words with their respective frequencies.

vocab options:

Example:

vocab -input-file corpus-clean.txt -vocab-file vocab.txt -threads 8 -verbose 1
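Conceptually, this step is a frequency count over the whitespace-tokenized corpus. A sketch of the same computation in Python (not the vocab source):

```python
from collections import Counter

def build_vocab(lines):
    """Count word frequencies over a whitespace-tokenized corpus,
    most frequent first (sketch of vocab's output)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts.most_common()

vocab = build_vocab(["the cat sat", "the dog sat"])
# [('the', 2), ('sat', 2), ('cat', 1), ('dog', 1)]
```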

Corpus statistics

Outputting descriptive statistics about the corpus, such as the number of word types and their probability of occurrence. This tool is helpful for defining the context vocabulary before constructing the co-occurrence matrix.

stats options:

Example:

stats -vocab-file vocab.txt
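The statistics in question are straightforward to derive from the vocabulary counts. A sketch (illustrative; the actual output format of stats may differ):

```python
def corpus_stats(vocab):
    """Number of word types and each word's probability of occurrence,
    given (word, count) pairs as produced by vocab."""
    total = sum(count for _, count in vocab)
    probs = {word: count / total for word, count in vocab}
    return len(vocab), probs

n_types, probs = corpus_stats([("the", 60), ("cat", 30), ("sat", 10)])
# n_types == 3, probs["the"] == 0.6
```

Probabilities like these are what the frequency bounds of cooccurrence (-upper-bound, -lower-bound) are compared against.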

Getting co-occurrence probability matrix

Constructing word-word cooccurrence statistics from the corpus. The user should supply a vocabulary file, as produced by vocab. The context vocabulary can be defined either using bounds on word appearance frequencies or using a predefined context vocabulary.

cooccurrence options:

Example:

cooccurrence -input-file corpus-clean.txt -vocab-file vocab.txt -output-dir path_to_dir -min-freq 100 -cxt-size 5 -dyn-cxt 1 -memory 4.0 -upper-bound 1.0 -lower-bound 0.00001 -verbose 1 -threads 8
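The window-based counting can be sketched as follows (plain Python, symmetric context window; the dynamic context weighting enabled by -dyn-cxt and the memory-bounded on-disk accumulation are omitted):

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=5):
    """Count (word, context word) pairs within a symmetric window."""
    counts = defaultdict(int)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, words[j])] += 1
    return counts

counts = cooccurrence_counts(["the cat sat on the mat"], window=2)
```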

cooccurrence will create the following files in the directory specified by the -output-dir option:

Performing Hellinger PCA

Randomized SVD with respect to the Hellinger distance. The user should supply the directory where files produced by cooccurrence are.

Let A be a sparse matrix to be analyzed with n rows and m columns, and r be the rank of a truncated SVD (with r < min(n,m)). Formally, the SVD of A is a factorization of the form A = U S Vᵀ.

Unfortunately, computing the SVD can be extremely time-consuming as A is often a very large matrix. Thus, we turn to randomized methods, which offer significant speedups over classical methods. This tool uses modern randomized matrix approximation techniques, developed in (amongst others) Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, a 2009 paper by Nathan Halko, Per-Gunnar Martinsson and Joel A. Tropp.

This tool uses the external redsvd library, which implements this randomized SVD using Eigen3.
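The idea can be sketched in a few lines of NumPy (a toy illustration, not the redsvd implementation): take the element-wise square root of the co-occurrence probability matrix, since the Hellinger distance between distributions p and q is (1/√2)·‖√p − √q‖₂, then apply the Halko-Martinsson-Tropp range finder before a small exact SVD:

```python
import numpy as np

def randomized_svd(A, rank, oversample=10, seed=0):
    """Basic randomized SVD (Halko, Martinsson & Tropp): project A onto a
    random subspace, orthonormalize, then take an exact SVD of the small
    projected matrix."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], rank + oversample))
    Q, _ = np.linalg.qr(A @ Omega)   # orthonormal basis for the (approximate) range of A
    B = Q.T @ A                      # small matrix sharing A's top spectrum
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :rank], S[:rank], Vt[:rank]

# Hellinger PCA = truncated SVD of the element-wise square root
# of the co-occurrence probability matrix.
rng = np.random.default_rng(1)
P = rng.random((50, 30))
P /= P.sum(axis=1, keepdims=True)    # each row is a probability distribution
U, S, Vt = randomized_svd(np.sqrt(P), rank=5)
```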

pca options:

Example:

pca -input-dir path_to_cooccurrence_files -rank 300

pca will create the following files in the directory specified by the -input-dir option:

Extracting word embeddings

Generating word embeddings from the Hellinger PCA. The user should supply the directory where files produced by pca are.

embeddings options:

Example:

embeddings -input-dir path_to_svd_files -output-name words.txt -eig 0.0 -dim 100 -norm 0
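A plausible reading of the flags, sketched in NumPy (the exact semantics of -eig and -norm are assumptions: -eig scales the left singular vectors by S^eig, -dim truncates, -norm L2-normalizes each row):

```python
import numpy as np

def extract_embeddings(U, S, dim, eig=0.0, norm=False):
    """Turn PCA factors into word embeddings (flag semantics assumed):
    scale U by S**eig, keep the first `dim` columns, optionally normalize."""
    E = U[:, :dim] * (S[:dim] ** eig)
    if norm:
        E = E / np.linalg.norm(E, axis=1, keepdims=True)
    return E
```

With eig 0.0, as in the example above, the embeddings are simply the truncated left singular vectors.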

Inferring new word embeddings

Inferring new word embeddings from an existing Hellinger PCA. The user should supply the directories containing the cooccurrence statistics for the new words and the files produced by pca.

inference options:

Example:

inference -cooc-dir path_to_cooccurrence_files -pca-dir path_to_svd_files -output-name inference_words.txt -eig 0.0 -dim 100 -norm 0
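Conceptually, inference projects the new words' square-rooted co-occurrence rows onto the principal axes learned by pca. Since A = U S Vᵀ, training rows map to U S under A ↦ A V, so rescaling by S^(eig−1) reproduces the U S^eig embeddings. A NumPy sketch (the scaling conventions are assumptions, not the inference source):

```python
import numpy as np

def infer_embeddings(P_new, Vt, S, dim, eig=0.0):
    """Project new co-occurrence probability rows into the existing
    embedding space (scaling conventions assumed)."""
    V = Vt.T[:, :dim]
    coords = np.sqrt(P_new) @ V                # equals U * S for in-sample rows
    return coords * (S[:dim] ** (eig - 1.0))   # rescale to U * S**eig

# Sanity check: inferring the training rows recovers U (for eig = 0).
rng = np.random.default_rng(0)
P = rng.random((20, 12))
P /= P.sum(axis=1, keepdims=True)
U, S, Vt = np.linalg.svd(np.sqrt(P), full_matrices=False)
E = infer_embeddings(P, Vt, S, dim=4, eig=0.0)
```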

Evaluating word embeddings

This tool provides a quick evaluation of the word embeddings produced by embeddings for an English corpus. Console output can be redirected to a file.

It contains the following evaluation datasets:

eval options:

Example:

eval -word-file words.txt -vocab-file target_vocab.txt -ws353 1 -rg65 1 -rw 1 -syn 1 -sem 1 -verbose 1 > words-eval.txt

NOTE: To speed up the implementation of the analogies, candidate solutions come from a closed vocabulary.
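For the word-similarity part of the evaluation, the standard metric is the Spearman rank correlation between cosine similarities and human judgments. A sketch (illustrative, not the eval source; tie handling is omitted):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation (no tie handling; illustrative only)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

def similarity_eval(emb, pairs, gold):
    """Correlate cosine similarity of word pairs with human ratings,
    as done for datasets such as WordSim-353."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    scores = np.array([cos(emb[w1], emb[w2]) for w1, w2 in pairs])
    return spearman(scores, np.array(gold))
```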

Computing word embeddings nearest neighbors

An exploratory tool for assessing word embedding quality. The user should supply the file containing the word embeddings and its corresponding vocabulary. By default, this tool runs in interactive mode. Otherwise, a file containing a list of words can be provided.

neighbors options:

Example:

neighbors -word-file words.txt -vocab-file target_vocab.txt -list-file words_list.txt -top 5 > nearest_neighbors.txt
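Under the hood, such a tool typically ranks the whole vocabulary by cosine similarity to the query vector. A sketch (illustrative, not the neighbors source):

```python
import numpy as np

def nearest_neighbors(word, vocab, E, top=5):
    """Return the `top` words closest to `word` by cosine similarity."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    i = vocab.index(word)
    sims = En @ En[i]
    order = np.argsort(-sims)
    return [(vocab[j], float(sims[j])) for j in order if j != i][:top]

vocab = ["cat", "kitten", "car"]
E = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
nn = nearest_neighbors("cat", vocab, E, top=2)
```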

FULL EXAMPLE

For a full demo example, run:

./demo.sh

This script will download a tokenized version of the Reuters Corpus Volume I (RCV1) and compute word embeddings out of it.

AUTHORS

ACKNOWLEDGEMENTS