The software can be used to compute a Generalized Language Model, which is yet another means of computing a language model. As shown in this publication, Generalized Language Models can outperform Modified Kneser-Ney Smoothing by 10 to 25% in terms of perplexity.
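Perplexity is derived from the cross-entropy of a model on held-out test sequences: the lower the value, the better the model predicts the test data. The following is a minimal sketch of that relationship, not code from the toolkit. The function names `cross_entropy` and `perplexity` are hypothetical, and both base-2 logarithms and averaging per test sequence are assumptions made for illustration.

```python
import math


def cross_entropy(log_probs):
    """Average negative log probability over all test sequences."""
    if not log_probs:
        raise ValueError("need at least one log probability")
    return -sum(log_probs) / len(log_probs)


def perplexity(log_probs, base=2.0):
    """Perplexity is the log base raised to the cross-entropy.

    `base` must match the base of the logarithms in `log_probs`
    (base 2 is an assumption here).
    """
    return base ** cross_entropy(log_probs)


# Toy input: four test sequences, each assigned probability 1/8 by the model.
log_probs = [math.log2(1 / 8)] * 4

entropy = cross_entropy(log_probs)
ppl = perplexity(log_probs)

print(entropy)  # 3.0 bits per sequence
print(ppl)      # 8.0
```

A model that assigns higher probabilities to the same test sequences yields a lower cross-entropy and therefore a lower perplexity.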
git clone git@github.com:renepickhardt/generalized-language-modeling-toolkit.git
cd generalized-language-modeling-toolkit
sudo chmod a+x mvn.sh
You will need to install Maven in order to build the project.
sudo apt-get install maven2
You need to copy config.sample.txt to config.txt and read the instructions in config.sample.txt.
cp config.sample.txt config.txt
emacs config.txt
After you have set all your directories in config.txt, you can run the project:
./mvn.sh
Since Generalized Language Models can become very large, the software is written to use the hard disk. In this sense you can theoretically run the program with very little memory. Still, we recommend 16 GB of main memory for the large English Wikipedia data sets.
We tried to avoid frequent disk accesses. Still, the program will execute much faster if you store your data on a solid-state disk.
You need to have a file called `normalized.txt` which serves as your input. This file should contain one sentence per line. You will learn language models based on this file.
Please refer to http://glm.rene-pickhardt.de/data in order to download preprocessed and formatted data sets.
If you wish to parse the data yourself (e.g. because you want to use a newer Wikipedia dump), refer to https://github.com/mkrnr/lexer-parser
You have to start with a file called `normalized.txt`, which has to be stored in your data directory (according to `config.txt`). `mvn.sh` will compile the program and start the flow of the following steps (which can be configured by switching the fields in `config.txt` from `true` to `false`):
* `normalized.txt` is split into `training.txt` and `testing.txt` according to the datasplit parameters in `config.txt`.
* `index.txt`: this index is used to split the language models into files of equal size.
* `absolute` and `continuation`
  * The various models are stored in folders like `11111`, meaning a regular 5-gram, or `11011`, meaning a skipped 5-gram at the third position.
* `testing.txt`: `testing-samples-4.txt`, for example, contains about 100k sequences of 4 words to be tested.
* `mod-kneser-ney-complex-backoffToCont-3.txt`: depending on your configuration, the files could be named with `simple` instead of `complex` (complex meaning GLM, simple meaning LM). By exchanging the `3` you can have different model lengths. These files contain the testing samples with the log of their probabilities.
* `mod*.txt`: in this way you can calculate the entropy for all files and experiments.

If this software or data is of any help to your research, please be so fair and cite the original publication, which is also in the home directory of [this git repository](https://github.com/renepickhardt/generalized-language-modeling-toolkit/raw/master/A Generalized Language Model as the Combination of Skipped n-grams and Modified Kneser-Ney Smoothing.pdf). You might want to use the following BibTeX entry:
@inproceedings{Pickhardt:2014:GLM,
author = {Pickhardt, Rene and Gottron, Thomas and Körner, Martin and Wagner, Paul Georg and Speicher, Till and Staab, Steffen},
title = {A Generalized Language Model as the Combination of Skipped n-grams and Modified Kneser Ney Smoothing},
year = {2014},
booktitle = {ACL'14: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics},
}
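The folder names from the processing steps, such as `11111` and `11011`, can be read as patterns over the positions of an n-gram: a `1` keeps the word at that position and a `0` skips it. The sketch below illustrates this reading on a toy 5-gram. It is not taken from the toolkit: the function names are hypothetical, and the `_` placeholder for a skipped word is an assumption, since how the toolkit represents skipped words internally is not described here.

```python
def apply_skip_pattern(words, pattern, skip_symbol="_"):
    """Keep words where the pattern has '1'; put skip_symbol where it has '0'.

    skip_symbol is a hypothetical placeholder chosen for this illustration.
    """
    if len(words) != len(pattern):
        raise ValueError("pattern and word sequence must have the same length")
    if set(pattern) - {"0", "1"}:
        raise ValueError("pattern may only contain '0' and '1'")
    return [w if bit == "1" else skip_symbol for w, bit in zip(words, pattern)]


def skipped_positions(pattern):
    """Return the 1-based positions that the pattern skips."""
    return [i for i, bit in enumerate(pattern, start=1) if bit == "0"]


five_gram = ["the", "quick", "brown", "fox", "jumps"]

regular = apply_skip_pattern(five_gram, "11111")  # regular 5-gram, nothing skipped
skipped = apply_skip_pattern(five_gram, "11011")  # third position skipped

print(regular)
print(skipped)
print(skipped_positions("11011"))
```

Under this reading, `11011` turns the toy 5-gram into a sequence in which only the third word is replaced by the placeholder, matching the "skipped 5-gram at the third position" description.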
The Generalized Language Models evolved from Paul Georg Wagner's and Till Speicher's Young Scientists project called Typology, which I advised in 2012. The Typology project explored and evaluated an idea I had (inspired by the PhD thesis of Adam Schenker) of representing text as a graph in which the edges would encode relationships (nowadays known as skipped bi-grams). The graph was used to produce an answer to the next-word prediction problem applied to word suggestions in keyboards of modern smartphones. From the convincing results I developed the theory of Generalized Language Models. Most of the code was written by my student assistant Martin Körner, who also wrote his bachelor's thesis about the implementation of a preliminary version of the Generalized Language Models. This thesis is a nice reference if you want to get an understanding of modified Kneser-Ney smoothing for standard language models. In terms of notation and building of Generalized Language Models it is outdated.
If you have questions, feel free to contact me via the issue tracker. You can find my email address on my blog or in the paper.