moses-smt / mgiza

A word alignment tool based on the famous GIZA++, extended to support multi-threading, resumed training, and incremental training.

mgiza++ force alignment: segmentation fault when reloading a big N table #2


lefterav commented 9 years ago

I am trying to produce word alignments for individual sentences. For this purpose I am using the "force align" functionality of mgiza++. Unfortunately, when I load a big N table (fertility), mgiza crashes with a segmentation fault.

In particular, I initially ran mgiza on the full training parallel corpus using the default settings of the Moses script:

/project/qtleap/software/moses-2.1.1/bin/training-tools/mgiza  -CoocurrenceFile /local/tmp/elav01/selection-mechanism/systems/de-en/training/giza.1/en-de.cooc -c /local/tmp/elav01/selection-mechanism/systems/de-en/training/prepared.1/en-de-int-train.snt -m1 5 -m2 0 -m3 3 -m4 3 -model1dumpfrequency 1 -model4smoothfactor 0.4 -ncpus 24 -nodumps 0 -nsmooth 4 -o /local/tmp/elav01/selection-mechanism/systems/de-en/training/giza.1/en-de -onlyaldumps 0 -p0 0.999 -s /local/tmp/elav01/selection-mechanism/systems/de-en/training/prepared.1/de.vcb -t /local/tmp/elav01/selection-mechanism/systems/de-en/training/prepared.1/en.vcb

Afterwards, via the mgiza force-align script, I run the following command:

/project/qtleap/software/moses-2.1.1/mgizapp-code/mgizapp//bin/mgiza giza.en-de/en-de.gizacfg -c /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/prepared./en-de.snt -o /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/giza./en-de -s /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/prepared./de.vcb -t /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/prepared./en.vcb -m1 0 -m2 0 -mh 0 -coocurrence /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/giza./en-de.cooc -restart 11 -previoust giza.en-de/en-de.t3.final -previousa giza.en-de/en-de.a3.final -previousd giza.en-de/en-de.d3.final -previousn giza.en-de/en-de.n3.final -previousd4 giza.en-de/en-de.d4.final -previousd42 giza.en-de/en-de.D4.final -m3 0 -m4 1

This runs fine, until I get the following error:

  We are going to load previous N model from giza.en-de/en-de.n3.final

Reading fertility table from giza.en-de/en-de.n3.final

Segmentation fault (core dumped)

The N table that is failing has about 300k entries. For this reason, I thought I should check whether the size is the problem, so I truncated the table to 60k entries. And it works! But the alignments are not good.

I am struggling to fix this, so any help would be appreciated. I am running a freshly installed mgiza on Ubuntu 12.04.

hala-maghout commented 9 years ago

Hi, I'm having the same problem mentioned above by Lefteris. I'm running the latest version of MGIZA on openSUSE 12.2. I ran the force-align-moses script to align new data. The error message I get when loading the N table is:

We are going to load previous N model from giza.ja-en/ja-en.n3.final

Reading fertility table from giza.ja-en/ja-en.n3.final

./force-align-moses.sh: line 40: 984 Segmentation fault $MGIZA giza.$TGT-$SRC/$TGT-$SRC.gizacfg -c $ROOT/corpus/$TGT-$SRC.snt -o $ROOT/giza.${TGT}-${SRC}/$TGT-${SRC} -s $ROOT/corpus/$SRC.vcb -t $ROOT/corpus/$TGT.vcb -m1 0 -m2 0 -mh 0 -coocurrence $ROOT/giza.${TGT}-${SRC}/$TGT-${SRC}.cooc -restart 11 -previoust giza.$TGT-$SRC/$TGT-$SRC.t3.final -previousa giza.$TGT-$SRC/$TGT-$SRC.a3.final -previousd giza.$TGT-$SRC/$TGT-$SRC.d3.final -previousn giza.$TGT-$SRC/$TGT-$SRC.n3.final -previousd4 giza.$TGT-$SRC/$TGT-$SRC.d4.final -previousd42 giza.$TGT-$SRC/$TGT-$SRC.D4.final -m3 0 -m4 1

I have 787264 entries in the ja-en.n3.final file. I reduced the N table size and it also worked. Any suggestions on how to solve this?

Many thanks

prajdabre commented 9 years ago

Hello,

I think this problem occurs in the file NTables.cpp.

More specifically, in the following lines of code:

while (!inf.eof()) {
  nFert++;
  inf >> ws >> tok;                              // read the next token id
  if (tok > MAX_VOCAB_SIZE) {
    cerr << "NTables:readNTable(): unrecognized token id: " << tok << '\n';
    exit(-1);
  }
  for (i = 0; i < MAX_FERTILITY; i++) {
    inf >> ws >> prob;                           // one probability per fertility value
    getRef(tok, i) = prob;                       // write into the fertility table
  }
}

Maybe at some point an index violation (an out-of-bounds write) occurs.

Perhaps MAX_FERTILITY is at fault?

I am just speculating.
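
If it is an out-of-bounds write, a defensive version of that loop might look something like the sketch below. To be clear, this is only a sketch: inf, tok, prob, getRef, MAX_VOCAB_SIZE and MAX_FERTILITY are the names quoted above from NTables.cpp, while fert_size (the table's real allocated bound) is a hypothetical name I am using for illustration.

// Sketch: skip out-of-range token ids instead of writing past the table.
// fert_size is hypothetical; it stands for however many rows getRef() can hold.
while (inf >> ws >> tok) {                       // also avoids the eof()-loop pitfall
  nFert++;
  bool ok = (tok <= MAX_VOCAB_SIZE && tok < fert_size);
  if (!ok)
    cerr << "NTables:readNTable(): token id out of range, skipping: " << tok << '\n';
  for (i = 0; i < MAX_FERTILITY; i++) {
    inf >> ws >> prob;                           // always consume the row to stay in sync
    if (ok)
      getRef(tok, i) = prob;                     // write only when the index is in bounds
  }
}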

Hope this helps.

hieuhoang commented 9 years ago

I'm closing this issue 'cos it hasn't been answered for a while. Reopen if u wanna carry on chatting

lefterav commented 9 years ago

This is a show-stopper for the force alignment feature and, as it seems, it has not been solved. I would like to keep this open. I would be happy to help with further debugging.

hieuhoang commented 9 years ago

no worries. It might be a good idea to make your data available so people can reproduce it. Otherwise the issue isn't gonna get anywhere

alvations commented 7 years ago

I'm having the same problem with Chinese-English. mgiza on en-zh works, but on zh-en it died after the HMM training started, following Model 1:

Normalizing T 
 DONE Normalizing 
Model1: (5) TRAIN CROSS-ENTROPY 7.45211 PERPLEXITY 175.109
Model1: (5) VITERBI TRAIN CROSS-ENTROPY 8.16385 PERPLEXITY 286.791
Model 1 Iteration: 5 took: 107 seconds
Entire Model1 Training took: 525 seconds
NOTE: I am doing iterations with the HMM model!
Read classes: #words: 316590  #classes: 50
Actual number of read words: 316592 stored words: 316356
Read classes: #words: 825717  #classes: 50
Actual number of read words: 825719 stored words: 824545

==========================================================
Hmm Training Started at: Thu Nov 24 09:35:52 2016

-----------
Hmm: Iteration 1
Dump files 0 it 1 noIterations 5 dumpFreq 0
Reading more sentence pairs into memory ... 
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
Segmentation fault (core dumped)

I suspect it's the fertility too, but it's rather strange, because I would expect it to be <= MAX_FERTILITY, since the default ratio in clean-corpus-n.perl is set to 9 and MAX_FERTILITY is set to 9. Oh wait, it's zero-indexed, so < MAX_FERTILITY corresponds to ratio=9 in clean-corpus-n.perl.

This unusually high fertility will almost always happen when aligning logographic languages (Japanese/Chinese) to alphabetic ones. But such pairs are rather rare (< 200K sentence pairs out of my 10M sample), and most of them are probably misaligned sentences or non-monotonic sentence alignments.

I'm trying to turn the max ratio down to 5 at cleaning, and I suppose mgiza will be happy. Let's see in 5-6 hours.


So the training works when I have the cleaning ratio set to 5, 6, 7, 8 and even 9.

I've double-checked: if the ratio is set <= 9 when cleaning, this shouldn't occur. I don't know how, but I had rogue lines with ratio > 9 that snuck in, and mgiza by default doesn't like that.
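
For anyone who wants to find such rogue lines up front (both the zero-length sentences and the ratio violations), here is a small standalone check. It's only a sketch, not part of mgiza; the two file arguments are whatever tokenized parallel files you feed to training, and the 9.0 threshold just mirrors the default ratio in clean-corpus-n.perl.

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Count whitespace-separated tokens in one line.
static size_t count_tokens(const std::string &line) {
  std::istringstream iss(line);
  std::string tok;
  size_t n = 0;
  while (iss >> tok) ++n;
  return n;
}

int main(int argc, char **argv) {
  if (argc != 3) {
    std::cerr << "usage: " << argv[0] << " corpus.src corpus.tgt\n";
    return 1;
  }
  std::ifstream src(argv[1]), tgt(argv[2]);
  std::string sline, tline;
  const double max_ratio = 9.0;  // default ratio in clean-corpus-n.perl
  size_t lineno = 0;
  while (std::getline(src, sline) && std::getline(tgt, tline)) {
    ++lineno;
    const size_t s = count_tokens(sline), t = count_tokens(tline);
    if (s == 0 || t == 0)
      std::cout << lineno << ": zero-length sentence (src=" << s
                << " tgt=" << t << ")\n";
    else if (s > max_ratio * t || t > max_ratio * s)
      std::cout << lineno << ": ratio over " << max_ratio
                << " (src=" << s << " tgt=" << t << ")\n";
  }
  return 0;
}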