pauldb89 / OxLM

OxLM: Oxford Neural Language Modelling Toolkit
http://www.clg.ox.ac.uk/

Using OxLM with Moses #1

Closed lvapeab closed 9 years ago

lvapeab commented 9 years ago

Hi there,

I'm trying to use OxLM as a feature in the Moses decoder. I compiled Moses with the --with-oxlm option and specified the OxLM feature in the Moses config file. The problem comes up when I launch the MERT script: it loads the models, but when it reads the phrase table it crashes with the following error:

Reading <moses-working-dir>/filtered/phrase-table.0-0.1.1.gz
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************Cache file not found
moses: <path-to-oxlm>/src/third_party/eigen/Eigen/src/Core/Block.h:118: Eigen::Block<XprType,
BlockRows, BlockCols, InnerPanel>::Block(XprType&, Eigen::Block<XprType, BlockRows, 
BlockCols, InnerPanel>::Index) [with XprType = const Eigen::Map<Eigen::Matrix<float, -1, -1>, 0, 
Eigen::Stride<0, 0> >; int BlockRows = -1; int BlockCols = 1; bool InnerPanel = true; 
Eigen::Block<XprType, BlockRows, BlockCols, InnerPanel>::Index = long int]: Assertion `(i>=0) && ( 
((BlockRows==1) && (BlockCols==XprType::ColsAtCompileTime) && i<xpr.rows()) ||
((BlockRows==XprType::RowsAtCompileTime) && (BlockCols==1) && i<xpr.cols()))' failed.

I think I'm following the right steps. Do you have any clue about the crash?

Thank you very much.

pauldb89 commented 9 years ago

The error you describe is Eigen's way of catching out-of-range accesses at runtime, before they turn into segmentation faults. Since the error is quite generic, I can't pinpoint the problem from it alone.
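
To illustrate what the assertion means, here is a minimal Eigen snippet (not OxLM code, just a reproduction of the same check) that asks for a column past the end of a mapped matrix and fails with the same Block.h assertion:

#include <Eigen/Dense>

int main() {
  float data[6] = {0, 1, 2, 3, 4, 5};
  Eigen::Map<const Eigen::MatrixXf> m(data, 3, 2);  // 3 rows, 2 columns
  Eigen::VectorXf c = m.col(2);                     // only columns 0 and 1 exist: the assertion fires here
  return static_cast<int>(c(0));
}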

Can you paste the commands you ran before getting this error and what your configuration files looked like? Thanks!

lvapeab commented 9 years ago

For training the translation model:

/home/alvaro/smt/software/mosesdecoder/scripts/training/train-model.perl \
-external-bin-dir /home/alvaro/smt/software/mosesdecoder/bin/training-tools \
-mgiza  \
-root-dir /home/alvaro/smt/tasks/turista/MOSES/mosesgit \
-corpus /home/alvaro/smt/tasks/turista/MOSES/corpus/turista -f es -e en \
-alignment grow-diag-final-and -reordering msd-bidirectional-fe \
-lm 0:5:/home/alvaro/smt/tasks/turista/MOSES/LM/en.5.lm 

Launching MERT:

/home/alvaro/smt/software/mosesdecoder/scripts/training/mert-moses.pl \
/home/alvaro/smt/tasks/turista/MOSES/corpus/turista.dev.es \     
/home/alvaro/smt/tasks/turista/MOSES/corpus/turista.dev.en \
/home/alvaro/smt/software/mosesdecoder/bin/moses \
/home/alvaro/smt/tasks/turista/MOSES/mosesgit/model/mosesOxLM.ini \
--mertdir /home/alvaro/smt/software/mosesdecoder/bin \
--rootdir /home/alvaro/smt/software/mosesdecoder/scripts \
--working-dir=/home/alvaro/smt/tasks/turista/MOSES/mosesgit/mert-OxLM 

And finally, the mosesOxLM.ini config file looks like:

#########################
### MOSES CONFIG FILE ###
#########################

# input factors
[input-factors]
0

# mapping steps
[mapping]
0 T 0

[distortion-limit]
6

# feature functions
[feature]
UnknownWordPenalty
WordPenalty
PhrasePenalty
PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=/home/alvaro/smt/tasks/turista/MOSES/mosesgit/model/phrase-table.gz input-factor=0 output-factor=0
LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/home/alvaro/smt/tasks/turista/MOSES/mosesgit/model/reordering-table.wbe-msd-bidirectional-fe.gz
Distortion
SRILM name=LM0 factor=0 path=/home/alvaro/smt/tasks/turista/MOSES/LM/en.5.lm order=5
OxFactoredMaxentLM name=LM1 path=/home/alvaro/smt/tasks/turista/MOSES/LM/en.5.OxLM.bin order=5

# dense weights for feature functions
[weight]
UnknownWordPenalty0= 1
WordPenalty0= -1
PhrasePenalty0= 0.2
TranslationModel0= 0.2 0.2 0.2 0.2
LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3
Distortion0= 0.3
LM0= 0.5
LM1= 0.5

Thanks!

pauldb89 commented 9 years ago

Can you also copy & paste the commands you used to train the neural language model (including any commands for preparing the data) and the oxlm configuration file? It might make my job a bit easier.

lvapeab commented 9 years ago

I was testing the software on a toy task (around 10K sentences, a vocabulary of ~500 words, and no OOV words in the development data), so the commands for replacing rare or OOV words with <UNK> did not seem necessary. The corpus is tokenized with the Moses tokenizer tool:

~/smt/software/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en  < /home/alvaro/smt/tasks/turista/DATA/e-tr >/home/alvaro/smt/tasks/turista/DATA/e-train

I used Brown clustering:

./wcluster --c 66 --text /home/alvaro/smt/tasks/turista/DATA/e-train --output_dir /home/alvaro/oxLM/clusters
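
The cluster assignments end up in a file named paths inside the output directory, which is what the class-file option in the configuration below points to. Each line pairs a bit-string cluster path with a word and its corpus count, along these lines (made-up sample, tab-separated):

00	the	1842
010	hotel	97
011	room	213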

My configuration file for the neural LM, oxLMMaxEnt.ini, was:

iterations=10
minibatch-size=100
lambda-lbl=2
word-width=200
step-size=0.06
order=5
randomise=true
diagonal-contexts=true
activation=2

class-file=/home/alvaro/oxLM/clusters/paths

feature-context-size=5
min-ngram-freq=1
filter-contexts=true

input=/home/alvaro/smt/tasks/turista/DATA/e-train
test-set=/home/alvaro/smt/tasks/turista/DATA/e-dev

Finally, I trained the language model with:

/home/alvaro/smt/software/OxLM/bin/train_maxent_sgd -c /home/alvaro/oxLM/oxLMMaxEnt.ini \
--model-out /home/alvaro/smt/tasks/turista/MOSES/LM/en.5.OxLM.bin

pauldb89 commented 9 years ago

In general, it's not recommended to skip the step where rare words are replaced with <unk>. The goal of this step is not only to reduce the vocabulary to make training manageable, but also to enable us to learn a distributed representation for unknown words during training.
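
As a rough sketch of what that preprocessing step does (this is just the idea, not the actual OxLM preprocessing scripts): count word frequencies over the training text and rewrite every word below a chosen cutoff as <unk>:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>

int main(int argc, char** argv) {
  if (argc < 2) return 1;    // usage: replace_rare corpus.txt > corpus.unk.txt
  const int kMinCount = 2;   // assumed cutoff; choose per corpus
  std::ifstream in(argv[1]);
  std::unordered_map<std::string, int> counts;
  std::string line, word;

  // First pass: count word frequencies.
  while (std::getline(in, line)) {
    std::istringstream tokens(line);
    while (tokens >> word) ++counts[word];
  }

  // Second pass: emit the corpus with rare words replaced by <unk>.
  in.clear();
  in.seekg(0);
  while (std::getline(in, line)) {
    std::istringstream tokens(line);
    bool first = true;
    while (tokens >> word) {
      std::cout << (first ? "" : " ") << (counts[word] < kMinCount ? "<unk>" : word);
      first = false;
    }
    std::cout << "\n";
  }
  return 0;
}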

I think the problem in your case is the following: the parallel corpus contains target words which do not exist in the monolingual data you used to train the language model. You learn translation rules containing such target words. When decoding, OxLM replaces these words with <unk> (because no distributed representation was learned for them during training) and assumes that a distributed representation for <unk> is available. In reality, you don't have that either, so it ends up accessing something out of bounds. Since I don't have access to your data, can you confirm whether this is the case?
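
Schematically, the failure mode looks something like this (an illustrative sketch, not the actual OxLM code):

#include <string>
#include <unordered_map>
#include <vector>

int main() {
  // Vocabulary learned from the monolingual LM data; <unk> was never seen during training.
  std::unordered_map<std::string, int> vocab = {{"the", 0}, {"house", 1}};
  const int kUnkIndex = static_cast<int>(vocab.size());  // fallback slot that was never actually trained

  // One learned representation per training word: only indices 0 and 1 exist.
  std::vector<std::vector<float>> representations(vocab.size(), std::vector<float>(200, 0.0f));

  std::string target_word = "castle";  // appears in a translation rule, unknown to the LM
  auto it = vocab.find(target_word);
  int index = (it != vocab.end()) ? it->second : kUnkIndex;

  // index == 2 here, one past the end; in OxLM the analogous lookup into the
  // Eigen representation matrix is what trips the Block.h assertion above.
  const std::vector<float>& representation = representations.at(index);  // throws std::out_of_range
  return static_cast<int>(representation.size());
}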

lvapeab commented 9 years ago

Ok, you are right, I needed the <unk> representation. I preprocessed the corpus with your scripts and everything works like a charm :)

I'm still a bit confused, since the scenario you described is not quite what happened: I do not have any target words outside the training vocabulary of the language model. But I guess the neural model needs the <unk> representation anyway.

Anyway, this was just a tiny experiment with a toy corpus to test the software. I'm now moving on to larger tasks, where corpus preprocessing and vocabulary reduction are mandatory, so this problem should not appear anymore.

Thank you so much for the help!

pauldb89 commented 9 years ago

I've just remembered this: when the phrase table is loaded, Moses scores the translation rules (I'm not sure why, because this is done again when decoding). Since the rules may have an incomplete n-gram context, they get padded with <unk>s automatically, hence the need for an <unk> representation even when there are no OOV words.
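
Schematically (a sketch of the behaviour as I understand it, not the Moses/OxLM source):

#include <iostream>
#include <string>
#include <vector>

// An order-n model scores a word given n-1 context words; when a translation rule
// supplies fewer than that, the missing positions are filled with <unk>.
std::vector<std::string> PadContext(std::vector<std::string> context, size_t order) {
  while (context.size() < order - 1) {
    context.insert(context.begin(), "<unk>");
  }
  return context;
}

int main() {
  // A phrase-table rule exposes only two context words, but the LM order is 5.
  std::vector<std::string> padded = PadContext({"the", "house"}, 5);
  for (const auto& w : padded) std::cout << w << " ";  // prints: <unk> <unk> the house
  std::cout << "\n";
  return 0;
}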

I'll add a warning message for when people attempt to train models without <unk> in the training data. Thanks for the feedback and let me know if you run into any problems.