Open lefterav opened 9 years ago
Hi, I'm having the same problem mentioned above by Lefteris. I'm running the latest version of MGIZA on openSUSE 12.2. I ran the force-align-moses script to align new data. The error message I get when loading the N table is:
We are going to load previous N model from giza.ja-en/ja-en.n3.final
Reading fertility table from giza.ja-en/ja-en.n3.final
./force-align-moses.sh: line 40: 984 Segmentation fault $MGIZA giza.$TGT-$SRC/$TGT-$SRC.gizacfg -c $ROOT/corpus/$TGT-$SRC.snt -o $ROOT/giza.${TGT}-${SRC}/$TGT-${SRC} -s $ROOT/corpus/$SRC.vcb -t $ROOT/corpus/$TGT.vcb -m1 0 -m2 0 -mh 0 -coocurrence $ROOT/giza.${TGT}-${SRC}/$TGT-${SRC}.cooc -restart 11 -previoust giza.$TGT-$SRC/$TGT-$SRC.t3.final -previousa giza.$TGT-$SRC/$TGT-$SRC.a3.final -previousd giza.$TGT-$SRC/$TGT-$SRC.d3.final -previousn giza.$TGT-$SRC/$TGT-$SRC.n3.final -previousd4 giza.$TGT-$SRC/$TGT-$SRC.d4.final -previousd42 giza.$TGT-$SRC/$TGT-$SRC.D4.final -m3 0 -m4 1
I have 787264 entries in the ja-en.n3.final file. When I reduced the N table size, it worked. Any suggestions on how to solve this?
Many thanks
Hello,
I think that this problem occurs in the file NTables.cpp.
More specifically, in the following lines of code:
    while (!inf.eof()) {
        nFert++;
        inf >> ws >> tok;
        if (tok > MAX_VOCAB_SIZE) {
            cerr << "NTables:readNTable(): unrecognized token id: " << tok << '\n';
            exit(-1);
        }
        for (i = 0; i < MAX_FERTILITY; i++) {
            inf >> ws >> prob;
            getRef(tok, i) = prob;
        }
    }
Maybe at some point an out-of-bounds index access happens.
Perhaps MAX_FERTILITY is at fault?
I am just speculating.
Hope this helps.
I'm closing this issue because it hasn't been answered for a while. Reopen if you want to carry on the discussion.
This is a show-stopper for the force alignment feature, and it seems it has not been solved. I would like to keep this open. I would be happy to help with further debugging.
No worries. It might be a good idea to make your data available so people can reproduce it; otherwise the issue isn't going to get anywhere.
I'm having the same problem with Chinese-English. mgiza on en-zh works, but on zh-en it died after HMM training started, following Model 1:
Normalizing T
DONE Normalizing
Model1: (5) TRAIN CROSS-ENTROPY 7.45211 PERPLEXITY 175.109
Model1: (5) VITERBI TRAIN CROSS-ENTROPY 8.16385 PERPLEXITY 286.791
Model 1 Iteration: 5 took: 107 seconds
Entire Model1 Training took: 525 seconds
NOTE: I am doing iterations with the HMM model!
Read classes: #words: 316590 #classes: 50
Actual number of read words: 316592 stored words: 316356
Read classes: #words: 825717 #classes: 50
Actual number of read words: 825719 stored words: 824545
==========================================================
Hmm Training Started at: Thu Nov 24 09:35:52 2016
-----------
Hmm: Iteration 1
Dump files 0 it 1 noIterations 5 dumpFreq 0
Reading more sentence pairs into memory ...
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
Segmentation fault (core dumped)
I suspect it's the fertility too, but it's rather strange, because I would expect it to be < MAX_FERTILITY, since the default ratio with clean-corpus-n.perl is set to 9 and MAX_FERTILITY is set to 9. Oh wait, it's zero-indexed, so <= MAX_FERTILITY corresponds to ratio=9 in clean-corpus-n.perl.
Unusually high fertility will almost always happen, especially when aligning logographic languages (Japanese/Chinese) to alphabetic ones. But such pairs are rather rare (< 200K sentence pairs in my 10M sample), and most probably part of them are misaligned sentences or non-monotonic sentence alignments.
I'm trying to turn the max ratio down to 5 at cleaning, and I suppose mgiza would be happy. Let's see in 5-6 hours.
So the training works when I have fertility set to 5, 6, 7, 8 and even 9.
I've double-checked: if the ratio is set <= 9 when cleaning, this shouldn't occur. I don't know how, but I had rogue lines with ratio > 9 that snuck in, and mgiza by default doesn't like that.
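To catch such rogue lines before training, one could scan the parallel corpus for empty sides or length ratios above the clean-corpus-n.perl default. A minimal sketch, assuming plain whitespace tokenization (the function names are hypothetical, not part of Moses or MGIZA):

```cpp
#include <sstream>
#include <string>

// Count whitespace-separated tokens in one corpus line.
static size_t tokenCount(const std::string& line) {
    std::istringstream ss(line);
    std::string tok;
    size_t n = 0;
    while (ss >> tok) ++n;
    return n;
}

// True if the sentence pair should be dropped: either side is empty
// (the "Forbidden zero sentence length" case above), or the length
// ratio exceeds maxRatio (9 is the clean-corpus-n.perl default).
static bool badPair(size_t ns, size_t nt, double maxRatio = 9.0) {
    if (ns == 0 || nt == 0) return true;
    return ns > maxRatio * nt || nt > maxRatio * ns;
}
```

In practice one would std::getline over the source and target files in lockstep, report the line numbers where badPair(tokenCount(src), tokenCount(tgt)) holds, and remove those pairs before rerunning mgiza.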
I am trying to produce word alignments for individual sentences. For this purpose I am using the "force align" functionality of mgiza++. Unfortunately, when I load a big N table (fertility), mgiza crashes with a segmentation fault.
In particular, I have initially run mgiza on the full training parallel corpus using the default settings of the Moses script:
Afterwards, by executing the mgiza force-align script, I ran the following command
This runs fine, until I get the following error:
The n-table that is failing has about 300k entries. For this reason, I thought I should check whether the size is the problem, so I truncated the table to 60k entries. And it works! But the alignments are not good.
I am struggling to fix this, so any help would be appreciated. I am running a freshly installed mgiza on Ubuntu 12.04.