moses-smt / mosesdecoder

Moses, the machine translation system
http://www.statmt.org/moses
GNU Lesser General Public License v2.1

Regression, segmentation fault in mosesserver #76

Closed proycon closed 9 years ago

proycon commented 9 years ago

A bug has appeared in mosesserver that was not present in an older version. I'm running the latest git version. I get a segmentation fault which appears nondeterministically (an underlying memory issue, perhaps?). The bug only appears with the server and not with the normal moses.

I start Mosesserver as follows:

mosesserver --server-port 8080 -xml-input inclusive -f ep7os12-mosesbaseline/fallback.moses.ini -n-best-list ep7os12-mosesbaseline/nbest.txt 25

Then I provide Moses input with XML markup; only one small L1 fragment in an L2 context is to be translated. Moses is trained to translate English to German:

<w translation="Oft">Oft</w><wall/><w translation="gibt">gibt</w><wall/><w translation="es">es</w><wall/>various<wall/>reasons<wall/><w translation="für">für</w><wall/><w translation="das">das</w><wall/><w translation="Dilemma">Dilemma</w><wall/><w translation=".">.</w><wall/>

This often goes well for a few sentences but then breaks. Here's a gdb trace of when it fails:

[contrib/server/mosesserver.cpp:708] Listening on port 8080
[contrib/server/mosesserver.cpp:234] Input: OftgibtesvariousreasonsfürdasDilemma.
Translating: Oft gibt es various reasons für das Dilemma .
Line 0: Collecting options took 0.000134339 seconds at moses/Manager.cpp:110
Line 0: Search took 0.00144825 seconds
[contrib/server/mosesserver.cpp:340] Output: Oft gibt es verschiedene Gründe für das Dilemma .
[Thread 0x7ffff7ff1300 (LWP 29516) exited]
[New Thread 0x7ffff7ff1300 (LWP 29559)]
[contrib/server/mosesserver.cpp:234] Input: OftgibtesvariousreasonsfürdasDilemma.
Translating: Oft gibt es various reasons für das Dilemma .
Line 0: Collecting options took 0.000134644 seconds at moses/Manager.cpp:110
Line 0: Search took 0.00143984 seconds
[contrib/server/mosesserver.cpp:340] Output: Oft gibt es verschiedene Gründe für das Dilemma .

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe97ec700 (LWP 27461)]
0x000000000055f9c7 in Moses::ThreadPool::Execute (this=0x22fc5f28) at moses/ThreadPool.cpp:59
59 if (task->DeleteAfterExecution()) {
(gdb) bt
#0 0x000000000055f9c7 in Moses::ThreadPool::Execute (this=0x22fc5f28) at moses/ThreadPool.cpp:59
#1 0x00000000006aa964 in thread_proxy ()
#2 0x00007ffff73a7e9a in start_thread (arg=0x7fffe97ec700) at pthread_create.c:308
#3 0x00007ffff648e31d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4 0x0000000000000000 in ?? ()

As you can see, in this example I passed the same input twice: the first time it went fine, and the second time it segfaulted. In a prior version all was fine.

I'm hoping somebody more familiar with the Moses codebase has an idea what might be wrong?

proycon commented 9 years ago

(the attempted fix didn't work either)

I suspect this bug appeared with https://github.com/moses-smt/mosesdecoder/commit/eaa7652bacea22379753db0f2a244a9ec4f00d22

proycon commented 9 years ago

Confirmed, reverting that commit fixes the problem, but I suppose retaining the commit and a real fix is preferred.

hieuhoang commented 9 years ago

Is it possible for you to make available the model files you're using so I can reproduce the problem? I can't revert the commit; it's there to fix another problem.

proycon commented 9 years ago

Of course, I put them here: http://lst.science.ru.nl/~proycon/mosesissue76.tar.bz2

It includes a small python script that acts as a client and sends the example input 1000 times.

hieuhoang commented 9 years ago

Thanks. Just looking at it now. It may be that the thread pool isn't entirely thread-safe. Gotta wait 'til Barry gets back in a few days. Go with what you have for now.

proycon commented 9 years ago

Ok, thanks. No problem, I have a good workaround with the revert of that commit.

hieuhoang commented 9 years ago

We had a look at this. It was hard to diagnose, but the fix was eventually easy: https://github.com/moses-smt/mosesdecoder/commit/c8e49177a6c00283d9fec34291ea31b000dccc50 There was a race which meant that task->DeleteAfterExecution() in ThreadPool.cpp could be called after the task had already been deleted in another thread. That call has now been moved to before the task is run, so it happens before the task can ever be deleted.

If you have time, I would be grateful if you could test the new code and let us know whether it works OK.
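
For readers following along, here is a minimal sketch of the ordering described above. It is not the actual Moses code: Task, CallerOwnedTask, PoolOwnedTask, ExecuteTask and RunDemo are hypothetical names made up for illustration, and the "other thread" that frees the task is simulated by a callback so that the example stays single-threaded and deterministic. The point is only that the ownership flag is read while the task is certainly still alive, and the task pointer is not touched again after Run() unless the pool itself owns the task.

```cpp
#include <functional>
#include <utility>

// Hypothetical stand-in for a thread-pool task (not the Moses class itself).
class Task {
public:
  virtual ~Task() {}
  virtual void Run() = 0;
  // true: the pool deletes the task; false: whoever submitted it owns it.
  virtual bool DeleteAfterExecution() const { return true; }
};

// A task owned by its submitter. The callback simulates another thread
// noticing that the work is finished and freeing the task straight away.
class CallerOwnedTask : public Task {
public:
  explicit CallerOwnedTask(std::function<void(Task*)> on_done)
      : on_done_(std::move(on_done)) {}
  void Run() override {
    // Move the callback into a local first, because calling it may destroy *this.
    std::function<void(Task*)> done = std::move(on_done_);
    done(this);
    // No member access after this point.
  }
  bool DeleteAfterExecution() const override { return false; }
private:
  std::function<void(Task*)> on_done_;
};

// A task the pool is expected to delete; it counts its own destruction.
class PoolOwnedTask : public Task {
public:
  explicit PoolOwnedTask(int* destroyed) : destroyed_(destroyed) {}
  ~PoolOwnedTask() override { ++*destroyed_; }
  void Run() override {}
private:
  int* destroyed_;
};

// Worker-side step: query ownership BEFORE running the task.
// Returns whether the pool deleted the task.
bool ExecuteTask(Task* task) {
  const bool delete_after = task->DeleteAfterExecution();  // task is still alive here
  task->Run();  // a caller-owned task may already be gone when this returns
  if (delete_after) {
    delete task;  // safe: nobody else owns this task
  }
  return delete_after;
}

struct DemoResult {
  bool pool_deleted_caller_task;
  bool pool_deleted_pool_task;
  int caller_frees;
  int pool_frees;
};

DemoResult RunDemo() {
  int caller_frees = 0;
  int pool_frees = 0;
  // The submitter frees its task as soon as Run() reports completion.
  Task* owned_by_caller = new CallerOwnedTask([&caller_frees](Task* t) {
    delete t;
    ++caller_frees;
  });
  const bool r1 = ExecuteTask(owned_by_caller);
  const bool r2 = ExecuteTask(new PoolOwnedTask(&pool_frees));
  return DemoResult{r1, r2, caller_frees, pool_frees};
}
```

With the call order reversed (Run() first, then task->DeleteAfterExecution()), the first ExecuteTask call in RunDemo would dereference a task that has already been freed, which is the kind of use-after-free the backtrace at ThreadPool.cpp:59 points to.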

proycon commented 9 years ago

Thanks! My test indeed runs fine now!