psaux / berkeleylm

Automatically exported from code.google.com/p/berkeleylm

How to train on Google n-grams #5

GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
I see the example file for training on the Google n-grams.
However, I don't know how the Google n-gram directory should be laid out.

What directory structure should I have?
This is how I currently have things laid out:
.
./web_5gram_2
./web_5gram_2/data
./web_5gram_2/data/3gms
./web_5gram_2/data/4gms
./web_5gram_2/docs
./web_5gram_v1_1.btw
./web_5gram_v1_1.btw/data
./web_5gram_v1_1.btw/data/1gms
./web_5gram_v1_1.btw/data/2gms
./web_5gram_v1_1.btw/data/3gms
./web_5gram_v1_1.btw/docs
./web_5gram_4
./web_5gram_4/data
./web_5gram_4/data/4gms
./web_5gram_4/data/5gms
./web_5gram_4/docs
./web_5gram_5
./web_5gram_5/data
./web_5gram_5/data/5gms
./web_5gram_5/docs
./web_5gram_6
./web_5gram_6/data
./web_5gram_6/data/5gms
./web_5gram_6/docs
./web_5gram_3
./web_5gram_3/data
./web_5gram_3/data/4gms
./web_5gram_3/docs

From looking at src/edu/berkeley/nlp/lm/io/GoogleLmReader.java
it seemed that I should make one directory, alldata/, and put every data file 
in there. However, this didn't work either.

What is the correct way to lay out the ngram directory?

Original issue reported on code.google.com by tur...@gmail.com on 19 Nov 2011 at 11:55

GoogleCodeExporter commented 9 years ago
Hi, Joseph, thanks for your question.

The format should be as in the example directory: 
test/edu/berkeley/nlp/lm/io/googledir

Specifically, the directory should look like this:

# 1gms/vocab_cs.gz [here, vocab_cs.gz should contain the unigram frequencies, sorted in decreasing order of frequency]
# 2gms/2gm-0001.gz 2gm-0002.gz …
# 3gms/3gm-0001.gz 3gm-0002.gz … 
# ...
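
To illustrate the sort order expected for vocab_cs.gz, here is a minimal Python sketch that builds it from an unsorted unigram file. It is not part of BerkeleyLM. The function name `write_vocab_cs` is hypothetical, and the input format (one `word<TAB>count` entry per line, gzipped) is an assumption that should be checked against your copy of the data.

```python
import gzip


def write_vocab_cs(unigram_path, vocab_cs_path):
    """Write the unigrams to vocab_cs_path in decreasing order of count.

    Assumption: every input line has the form word<TAB>count.
    Returns the number of entries written.
    """
    entries = []
    with gzip.open(unigram_path, "rt", encoding="utf-8") as src:
        for line in src:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the last tab so the count is always the final field.
            word, count = line.rsplit("\t", 1)
            entries.append((word, int(count)))

    # Highest count first; ties are broken by word to keep output deterministic.
    entries.sort(key=lambda entry: (-entry[1], entry[0]))

    with gzip.open(vocab_cs_path, "wt", encoding="utf-8", newline="\n") as dst:
        for word, count in entries:
            dst.write(f"{word}\t{count}\n")
    return len(entries)
```

The whole vocabulary is held in memory while sorting, which is simple but may need a few gigabytes for a very large unigram file.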

Given your directory structure, you will need to create an [n]gms directory for 
each n=1..5, and then copy or soft-link all files of each order into the 
corresponding [n]gms directory. You might also need to create vocab_cs.gz by 
sorting the unigram file, though it comes with at least the English 
distribution (in 1gms). 
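
The soft-linking step could be scripted roughly as follows. This is a minimal sketch, not a BerkeleyLM utility: the function name `link_ngram_files` and the `ORDER_DIR` pattern are hypothetical, and it assumes that every data file lives in a directory named `1gms` through `5gms` somewhere under the source root, as in the layout from the question.

```python
import os
import re

# Hypothetical pattern for the per-order data directories ("1gms" .. "5gms").
ORDER_DIR = re.compile(r"^[1-5]gms$")


def link_ngram_files(source_root, target_root):
    """Symlink every file found in an [n]gms directory under source_root
    into target_root/[n]gms. Returns the number of links created."""
    # Create one directory per n-gram order, n = 1..5.
    for n in range(1, 6):
        os.makedirs(os.path.join(target_root, f"{n}gms"), exist_ok=True)

    linked = 0
    for dirpath, _dirnames, filenames in os.walk(source_root):
        order = os.path.basename(dirpath)
        if not ORDER_DIR.match(order):
            continue  # skip docs/ and other non-data directories
        for name in sorted(filenames):
            link = os.path.join(target_root, order, name)
            if os.path.lexists(link):
                continue  # already linked on an earlier run
            os.symlink(os.path.abspath(os.path.join(dirpath, name)), link)
            linked += 1
    return linked
```

For example, `link_ngram_files(".", "../alldata")` run from the directory that holds the `web_5gram_*` folders would collect all files into `../alldata/[n]gms`. Symlinks are used rather than copies so the large data files are not duplicated; the directory to pass to the training script is then `alldata`.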

I have added additional documentation about this to the example script for the 
next release. 

Original comment by adpa...@gmail.com on 20 Nov 2011 at 5:00

GoogleCodeExporter commented 9 years ago
Thanks, that worked great.

Original comment by tur...@gmail.com on 24 Nov 2011 at 2:56