moses-smt / mgiza

A word alignment tool based on the famous GIZA++, extended to support multi-threading, resumable training, and incremental training.

Error in running mgiza, one thread return error code 139 #5

Closed ChenyangLiu closed 9 years ago

ChenyangLiu commented 9 years ago

I used 64 CPUs to run mgiza, and my training schedule is five iterations of IBM Model 1, five of HMM, three of Model 3, and three of Model 4. When I run the first HMM iteration, thread No. 9 always fails: it returns error code 139 and produces no output files. What happened? Could you tell me what code 139 means? Thank you.

alvations commented 9 years ago

Did you use train-model.perl from mosesdecoder? Or did you use mgiza natively? What is the command you use to call mgiza?

hieuhoang commented 9 years ago

I don't think you should use mgiza with more than 8 threads.

I seem to remember a conversation with someone who knew the code: it has a multi-threading limit due to some obscure internal data structure, or the way it names intermediate files, etc. I'm not sure whether these things have been fixed; I assume they haven't.


Hieu Hoang Researcher New York University, Abu Dhabi http://www.hoang.co.uk/hieu

talltoon commented 9 years ago

I run mgiza with up to 10 threads without problems. But any more threads will bring mgiza to its knees.


Guchun Zhang

Machine Translation Project Lead Alpha CRC | Cambridge, UK Direct: +44 1223 431035

www.alphacrc.com

www.linkedin.com/company/alpha-crc

gzhang@alphacrc.com

Alpha CRC = Global, Scalable, In-House Production

ghost commented 9 years ago

The issue is that the thread id is converted to a file suffix string by doing '0' + thread, so 10 threads should be fine, since that covers '0' through '9'. But I thought that was patched by now?
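The suffix bug described above is easy to reproduce. mgiza itself is C++, but the same pattern can be sketched in Java: adding a thread id to the character '0' yields a valid digit only for ids 0 through 9; id 10 lands on ':' (the next ASCII code point), producing broken intermediate file names.

```java
public class SuffixDemo {
    // Buggy pattern (mirrors the "'0' + thread" idiom): valid only for ids 0-9.
    public static String buggySuffix(int threadId) {
        return String.valueOf((char) ('0' + threadId));
    }

    // Fixed pattern: format the id as a full decimal string; any thread count works.
    public static String fixedSuffix(int threadId) {
        return Integer.toString(threadId);
    }

    public static void main(String[] args) {
        System.out.println(buggySuffix(9));   // "9"  - a valid suffix
        System.out.println(buggySuffix(10));  // ":"  - '0' + 10 is ASCII 58, a bogus suffix
        System.out.println(fixedSuffix(10));  // "10" - correct for any thread count
    }
}
```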


ChenyangLiu commented 9 years ago

In fact, I exec mgiza from my Java program. It's a distributed cluster, and I finished training once. But when I changed the corpus, it failed. Training IBM Model 1 still succeeded.

mosesadmin commented 9 years ago

Does it work when you only use 10 threads or less?


ChenyangLiu commented 9 years ago

I wrote a simple MapReduce-like program, and I use it to run mgiza on my cluster.

I split the corpus into 64 parts and mapped them across the cluster. Each machine runs mgiza independently, with at most 4 threads per machine, so I don't think too many threads is the reason.

The MapReduce program works well with other programs, and mgiza succeeded with a different corpus, so I suspect there is some problem with my corpus. But the corpus is too big to inspect directly, so I want to know what return code 139 means.

hieuhoang commented 9 years ago

Ah, I see. 139 just means there was a segfault; it's not very informative. What is the EXACT mgiza command that your MapReduce program ran when this happened?

Did you clean your training data before giving it to mgiza? I.e., make sure that it is encoded in UTF-8, that there is no great disparity between the lengths of the source and target sentences, etc. The Moses script clean-corpus-n.perl would have done this for you.
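For reference, exit code 139 follows the POSIX convention for fatal signals: a process killed by a signal is reported as 128 plus the signal number, and SIGSEGV is signal 11, so 128 + 11 = 139. A minimal decoder (a sketch; `describeExit` is a hypothetical helper, not part of mgiza):

```java
public class ExitCodeDemo {
    // Decode a shell-style exit status: values above 128 mean "killed by signal N".
    public static String describeExit(int exitCode) {
        if (exitCode > 128) {
            return "killed by signal " + (exitCode - 128);  // 139 -> signal 11 (SIGSEGV)
        }
        return "exited normally with status " + exitCode;
    }

    public static void main(String[] args) {
        System.out.println(describeExit(139));  // killed by signal 11
        System.out.println(describeExit(0));    // exited normally with status 0
    }
}
```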

ChenyangLiu commented 9 years ago

I used Java to run mgiza. Here is my code.

// Launch mgiza as a child process with a custom environment and working directory
Runtime rt = Runtime.getRuntime();
Process process = rt.exec(cmd, envs, OLConfig.getHomeDir());

cmd: exec/wa_bin/amd64/mgiza /localtmp/mgiza.conf

The contents of the .conf file are:

adbackoff 0
c /disk8/cowork-tmp/outback/liuchy_exec_mgiza_wa_tool_4_6al79r.0.9-tn1sem/localtmp/corpus.snt
compactadtable 1
compactalignmentformat 0
coocurrencefile /disk8/cowork-tmp/outback/liuchy_exec_mgiza_wa_tool_4_6al79r.0.9-tn1sem/localtmp/corpus.cooc
countcutoff 1e-06
countcutoffal 1e-05
countincreasecutoff 1e-07
countincreasecutoffal 1e-07
deficientdistortionforemptyword 0
depm4 76
depm5 68
dictionary 
dopeggingyn 0
emalignmentdependencies 2
emalsmooth 0.2
emprobforempty 0.4
emsmoothhmm 2
log 0
manlexfactor1 0
manlexfactor2 0
manlexmaxmultiplicity 20
maxfertility 6
maxsentencelength 101
mincountincrease 1e-07
model23smoothfactor 0
model5smoothfactor 0.1
nbestalignments 0
ncpus 1
nodumps 0
nofiledumpsyn 0
nsmooth 4
nsmoothgeneral 0
o /disk8/cowork-tmp/outback/liuchy_exec_mgiza_wa_tool_4_6al79r.0.9-tn1sem/localtmp/norm
onlyaldumps 0
p 0
p0 0.999
peggedcutoff 0.03
pegging 0
probcutoff 1e-07
probsmooth 1e-07
readtableprefix 
s /disk8/cowork-tmp/outback/liuchy_exec_mgiza_wa_tool_4_6al79r.0.9-tn1sem/localtmp/src.vocb
t /disk8/cowork-tmp/outback/liuchy_exec_mgiza_wa_tool_4_6al79r.0.9-tn1sem/localtmp/tgt.vocb
tc 
verbose 0
verbosesentence -10
m1 0
mh 1
m3 0
m4 0
t1 0
th 1
t3 0
t4 0
restart 4
dumpcount 1
countoutputprefix /disk8/cowork-tmp/outback/liuchy_exec_mgiza_wa_tool_4_6al79r.0.9-tn1sem/localtmp/partial
previoust /disk8/cowork-tmp/outback/liuchy_exec_mgiza_wa_tool_4_6al79r.0.9-tn1sem/localtmp/t.model

And I'm sure that the corpus is clean, because mgiza succeeds when I swap src and tgt.
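As a side note, the launch code above discards the child's exit status, which makes a silent crash like this hard to notice. A more defensive launcher (a minimal sketch; `runMgiza` and its parameters are hypothetical, not part of the original program) forwards the child's stdout/stderr and returns its exit code, so a 139 from one worker becomes visible:

```java
import java.io.File;
import java.io.IOException;

public class RunMgiza {
    // Launch a binary with one config-file argument and report its exit status.
    // A status of 139 means the child died of SIGSEGV (128 + signal 11).
    public static int runMgiza(String binary, String configFile, String workDir) {
        try {
            ProcessBuilder pb = new ProcessBuilder(binary, configFile);
            pb.directory(new File(workDir));
            pb.inheritIO();  // surface the child's stdout/stderr for debugging
            return pb.start().waitFor();
        } catch (IOException | InterruptedException e) {
            return -1;  // treat a failure to launch as an error code
        }
    }
}
```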

hieuhoang commented 9 years ago

OK, I see. I don't really know what this is; it looks like a debug message from MapReduce.

I'm not sure I can help you. You should try to debug mgiza yourself by running it directly on the command line (i.e. not through the Java code). Run it with a small number of threads and a small amount of data, then gradually increase the data and the threads until a segfault happens. Isolate the problem.

ChenyangLiu commented 9 years ago

Thank you for your help. I'm going to close this issue and try your method.

Thanks!

jtv commented 9 years ago

On 28/07/15 21:58, Kenneth Heafield wrote:

The issue is converting thread id to file suffix string by doing '0' + thread. So 10 threads should be fine since it does '0' through '9'. But I thought that was patched by now?

I did patch that one, though of course it's possible that I missed a spot. Everywhere I could find that pattern, it now uses multiple digits.

Jeroen