moses-smt / mosesdecoder

Moses, the machine translation system
http://www.statmt.org/moses
GNU Lesser General Public License v2.1

Multithreading is broken for hierarchical Moses #39

Closed mjpost closed 7 years ago

mjpost commented 11 years ago

Hierarchical multithreading appears to be broken (haven't tested with phrase-based). For the big picture, here is a plot of decoding times with Moses for a large German-English grammar (Europarl + Common Crawl) and three large LMs on a 64-core machine:

[image: moses 1 (decoding time vs. number of threads)]

The saturation point is reached pretty quickly --- around 16 threads.

Looking at the log files, I see that, with 1 thread, the maximum reported sentence-level decoding time (newstest2012) is about 6 seconds, and with 48 threads, it is over 3,000 seconds. Also, the logfile with large thread counts does not contain the logging output for all sentences.

I suspect there is a problem with your locking!

kpu commented 11 years ago

Y = time, X = CPUs?
With or without tcmalloc?

And yes there is a problem with locking.

mjpost commented 11 years ago

Whoops, sorry. y = end-to-end CPU time in seconds (including model loading etc.), x = # threads. Everything was run on a 64-processor machine with 256 GB of RAM, with Moses compiled the default way (./bjam), and over a networked file system.

mjpost commented 11 years ago

Also, each point is the minimum of at least 15 runs taken at various times over a period of days.

kpu commented 11 years ago

I'd run cat >/dev/null on all your files before starting the clock to minimize the risk from network filesystems.

Can you run bjam --debug-configuration? Near the top you will see a series of bash commands and return codes. Look specifically for these:

bash -c "g++ -ltcmalloc_minimal -x c++ - <<<'int main() {}' -o /home/kpu/mosesdecoder/dummy >/dev/null 2>/dev/null && rm /home/kpu/mosesdecoder/dummy 2>/dev/null" 0 bash -c "g++ -static -ltcmalloc_minimal -x c++ - <<<'int main() {}' -o /home/kpu/mosesdecoder/dummy >/dev/null 2>/dev/null && rm /home/kpu/mosesdecoder/dummy 2>/dev/null" 1

If you have a 1 for both, then you have not compiled with tcmalloc. tcmalloc makes threading substantially faster.

I still agree with you that Moses has locking issues, though.

kpu commented 11 years ago

Also see the flags section of http://www.statmt.org/moses/?n=Moses.Optimize .

mjpost commented 11 years ago

I only see the non-static call, and it returned 1.

We didn't do any of the optimizations; we just took Moses as-is. I'll rerun an experiment with the entire thing running from local disk (using the cat trick) and let you know what kind of difference that makes.

hieuhoang commented 11 years ago

there's been public and edinburgh-internal discussions about threading & locking. it is a shame that it bottoms out at 16 threads but tbh, i don't know how to make it any better without reducing functionality in a significant way.

eg. there's locking on the FactorCollection (vocab object to most people). How can it be squeezed any more? spinlock? perfect hash?

mjpost commented 11 years ago

It doesn't appear to be disk issues. I put everything on a local disk, did the cat trick for all models, and then decoded with the SCFG model. Still 1500 seconds (3x Joshua) for 48 threads.

I looked into installing tcmalloc, but the INSTALL file said to install libunwind first for 64-bit architectures, and I'm scared off for the moment by the prospect of an unbounded chain of dependencies...

kpu commented 11 years ago

tcmalloc is part of Google perf tools that also includes stack trace stuff. However, you don't need stack traces.

See Moses's BUILD-INSTRUCTIONS.txt, section "ADVICE ON INSTALLING EXTERNAL LIBRARIES". The configure command for tcmalloc is: ./configure --prefix=$PREFIX --libdir=$LIBDIR --enable-shared --enable-static --enable-minimal

hieuhoang commented 11 years ago

3x Joshua as in 3 times slower than Joshua, with default settings for each decoder? What's the load/decode time breakdown?

imo, ken's way of measuring model score v. time is a good method as we don't have to argue about the setting of each and every parameter.

The load/decode time and ratio should be clearly given, otherwise we're just measuring how fast you are at loading.

mjpost commented 11 years ago

That's a fair suggestion, which is relevant to the Joshua / Moses comparison, although it doesn't address the issue that Moses seems to saturate at around 16 threads (which I'm looking into with a Moses recompile with tcmalloc).

Does Moses sort the entire grammar at load time? Joshua amortizes sorting (only sorting a grammar trie node when it's needed), so if Moses does it upfront, I'd have to rerun Joshua turning off amortization in order for the comparison to be fair.
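For reference, the amortized scheme is roughly: a grammar trie node keeps its rules unsorted until the first lookup touches it. A minimal sketch, with made-up class and member names (this is not actual Joshua or Moses code):

```cpp
#include <algorithm>
#include <vector>

struct Rule {
  float score;
  // ... target side, alignments, etc. omitted
};

class GrammarNode {
 public:
  void AddRule(const Rule &r) {
    rules_.push_back(r);
    sorted_ = false;
  }

  // Sort on first access only: loading stays cheap, and nodes that no test
  // sentence ever reaches are never sorted at all.
  const std::vector<Rule> &SortedRules() {
    if (!sorted_) {
      std::sort(rules_.begin(), rules_.end(),
                [](const Rule &a, const Rule &b) { return a.score > b.score; });
      sorted_ = true;
    }
    return rules_;
  }

 private:
  std::vector<Rule> rules_;
  bool sorted_ = false;
};
```

In a multithreaded decoder the lazy sort itself would need synchronization or per-thread copies, which is part of why upfront sorting is the simpler thing to compare against.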

hieuhoang commented 11 years ago

Yes, the memory-based pt sorts every node at initial load time. The binary pt sorts before decoding each sentence.

I used simple comparisons when I was developing the syntax decoder. But I think ken's method is the only one I would trust for a proper comparison. We can't even compare when we have the same cube pruning limit and max span lengths; our sentence lengths are different for the same sentence!


hieuhoang commented 11 years ago

It's a shame it's maxed out at 16 threads. Not a high priority, but we may get round to fixing it if the NSA gets a new batch of Facebook statuses to translate...


kpu commented 11 years ago

In Matt's defense, rumor has it that he has time-accuracy curves as well. It's just that this issue is about Moses's excessive lock contention and he has provided the most relevant plot.

The phrase cache has a process-wide mutex (not even read-write) for LRU. I haven't seen evidence that LRU is the best eviction strategy, but even if it is, 64-bit ints are atomic on x86_64 so that mutex should be ifdef'd out on the most commonly used architecture.
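For illustration only (not the actual Moses cache code), the lock-free stamping would look something like this with C++11 atomics; on x86_64 the underlying operation is a single instruction:

```cpp
#include <atomic>
#include <cstdint>

struct CacheEntry {
  // ... cached translation options ...
  std::atomic<std::uint64_t> last_used{0};
};

class LruClock {
 public:
  // Stamp an entry on every hit without taking a process-wide mutex.
  // Relaxed ordering is enough: the stamp is only an eviction heuristic.
  void Touch(CacheEntry &e) {
    e.last_used.store(clock_.fetch_add(1, std::memory_order_relaxed),
                      std::memory_order_relaxed);
  }

 private:
  std::atomic<std::uint64_t> clock_{0};
};
```

Eviction would still need some coordination; only the hot lookup/touch path becomes lock-free.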

The impact of tcmalloc is substantial: see http://www.mail-archive.com/moses-support@mit.edu/msg07303.html so I'm going to insist on tcmalloc curves anyway.

hieuhoang commented 11 years ago

fair point. it is useful to know where it bottoms out. Now just need a 16+ core laptop to debug the problem.

i don't think the caching is used for hiero/syntax decoding.

my reason for wanting matt to use your method is that it's been the only occasion when timing results were useful and interesting. It tells us the relative speed, the convergence, and the search errors.

comparing with default parameters is as useful as a 3-legged elephant. We have wildly different parameters. Comparing with the same parameters only gives a ballpark figure 'cos the devil is in the details.

pjwilliams commented 11 years ago

I'm also keen to see the tcmalloc curves. I spent some time looking at thread count vs decoding time when I hacked in the multithreaded support for moses_chart a couple of years back and unless I missed something -- definitely possible -- there was actually very little locking being done explicitly in Moses; the main problem seemed to me to be our use of dynamic memory allocation where obviously there has to be some kind of locking behind the scenes. I added per-thread object pools for a couple of the most frequently new-ed objects and that helped somewhat. tcmalloc made a significant difference as well, as Ken says. But there's probably a lot more we can do.
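For concreteness, a per-thread pool of the kind described above can be as small as this. It is a hypothetical sketch, not the pools that were actually added to Moses:

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Each thread owns one pool per hot object type, so recycling storage never
// touches the global allocator lock.
template <class T>
class ThreadLocalPool {
 public:
  ~ThreadLocalPool() {
    for (void *p : free_) ::operator delete(p);
  }

  // Raw storage for one T; construct into it with placement new.
  void *Allocate() {
    if (free_.empty()) return ::operator new(sizeof(T));
    void *p = free_.back();
    free_.pop_back();
    return p;
  }

  // Destroy the object and keep its storage for reuse by this thread only.
  void Release(T *p) {
    p->~T();
    free_.push_back(p);
  }

 private:
  std::vector<void *> free_;
};

// Usage sketch (Hypothesis stands in for any frequently new-ed object):
//   thread_local ThreadLocalPool<Hypothesis> pool;
//   Hypothesis *h = new (pool.Allocate()) Hypothesis(...);
//   pool.Release(h);
```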

Phil


mjpost commented 11 years ago

Update: Here are numbers for 16+ threads, decoding under the old setting (from before, i.e., no tcmalloc etc).

threads run#    runtime - loadtime = decodetime
16  1   1589.000 - 1313 = 276.000
16  2   2770.000 - 2595 = 175.000
16  3   1362.000 - 1184 = 178.000
16  4   1359.000 - 1182 = 177.000
16  5   1741.000 - 1565 = 176.000
32  1   1540.000 - 1351 = 189.000
32  2   1432.000 - 1245 = 187.000
32  3   1476.000 - 1288 = 188.000
32  4   1407.000 - 1214 = 193.000
32  5   1397.000 - 1185 = 212.000
48  1   1587.000 - 1201 = 386.000
48  2   1600.000 - 1235 = 365.000
48  3   1549.000 - 1195 = 354.000
48  4   1549.000 - 1172 = 377.000
48  5   1516.000 - 1170 = 346.000

I then recompiled Moses with Ken's help: adding tcmalloc, setting the max LM order, removing debugging. I reserved a 64-CPU machine and was the sole user of it. Here are early numbers:

threads run#    runtime - loadtime = decodetime
16  1   1191.000 - 1017 = 174.000
32  1   2046.000 - 1823 = 223.000
48  1   1381.000 - 997 = 384.000

It looks like there is essentially no difference with high numbers of threads. I will update these numbers if further runs result in meaningfully lower average runtimes.

bhaddow commented 11 years ago

Hieu - it's not really the comparison with Joshua that matters. Moses should get faster as you add more threads, not slower. A lot of our servers are 24 core, and if we buy another lot then they'll probably have more cores so it would be good to be able to use them. I'm guessing it's the vocab lock as I don't think the t-options cache is used in chart-based Moses.

rsennrich commented 11 years ago

apart from memory consumption, is there any need to share the vocab across threads? If not, then it may be worth profiling scalability (in terms of both memory and speed) with thread-local vocabs.
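For what it's worth, the thread-local variant is trivial to prototype. A minimal sketch with a hypothetical Vocab class (not FactorCollection itself), using C++11 thread_local:

```cpp
#include <string>
#include <unordered_map>

class Vocab {
 public:
  unsigned Add(const std::string &word) {
    auto it = ids_.find(word);
    if (it != ids_.end()) return it->second;
    unsigned id = static_cast<unsigned>(ids_.size());
    ids_.emplace(word, id);
    return id;
  }

 private:
  std::unordered_map<std::string, unsigned> ids_;
};

// One private vocabulary per decoding thread; no locking anywhere.
thread_local Vocab per_thread_vocab;
```

The catch is memory (one copy per thread) and that IDs/pointers are no longer comparable across threads, which matters wherever a Factor pointer is used as a key.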

hieuhoang commented 11 years ago

good idea! i can't think of any reason to have a global vocab.

i'm gonna test whether it actually is the vocab locking that's the problem. If it is, then i'll do as you suggest.

give me a bit of time

kpu commented 11 years ago

I care about memory consumption of the vocab because I have big language models. This would be less of an issue if the language model didn't have to give its vocab to FactorCollection (and instead words were looked up on demand).

Might I suggest this more complicated option? The phrase table and language model always intern their strings at load time in a global process-wide FactorCollection. Then the only thing we have to handle is OOVs. This would be done sentence-locally in an object. However, both will provide const Factor * so recovering strings is the same and most code isn't changed. The only place where we would need a change is converting strings to const Factor *, which shouldn't be happening except to read the source sentence. (I realize this happens in OnDiskPt and consider that a severe performance bug.)

bhaddow commented 11 years ago

I'd be interested to see how these curves look for pb moses with the on-disk table. My feeling is that the lock contention could kick in much sooner because of the global cache lock.

bhaddow commented 11 years ago

Also, it would be interesting to run Matt's tests with the lock in FactorCollection removed. It might crash or give weird translations, but the timing figures could tell us if this lock is the problem.

hieuhoang commented 11 years ago

was thinking of testing with locks removed. Will preload FactorCollection with vocab if i have to. This will give us an indication of whether the problem is actually with this lock or not. I've got a strange premonition it's not.

there's an issue with rico's idea - it won't work with translation option caching.

spinlock anyone? http://www.boost.org/doc/libs/1_53_0/doc/html/atomic/usage_examples.html
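Along the lines of that boost example, a minimal spinlock looks like this (written here with C++11 std::atomic_flag rather than boost::atomic); it only pays off when the critical section is a handful of instructions:

```cpp
#include <atomic>

class SpinLock {
 public:
  void lock() {
    // Busy-wait until we win the flag; acquire pairs with the release below.
    while (flag_.test_and_set(std::memory_order_acquire)) {
      // optionally pause/yield here to be kinder under heavy contention
    }
  }
  void unlock() { flag_.clear(std::memory_order_release); }

 private:
  std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Works as a drop-in with std::lock_guard<SpinLock> since it provides lock()/unlock().
```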

kpu commented 11 years ago

there's an issue with rico's idea - it won't work with translation option caching.

There's a bug with the way phrase tables work in Moses. They don't tell FactorCollection their vocab at the beginning and then do number mapping.

hieuhoang commented 11 years ago

that makes some features much harder to implement in the decoder, eg. randomized and continuous space LM, 'cos they don't know their vocab before decoding. I'd prefer to deal with the locks.

lemme just see whether it is the vocab lock before we argue what to do about it

ugermann commented 11 years ago

Have you considered maintaining two vocabularies in the decoder: one fixed, read-only for known vocab items, and one dynamic with a lock for unknown items? That way you don't have to lock the vocabulary for look-up of known vocab items, which will be the vast majority.
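A sketch of that split, with made-up class names: a frozen table built at load time that is read with no locking at all, plus a small mutex-protected table that only ever sees OOVs.

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

class TwoTierVocab {
 public:
  // Called once, single-threaded, while the models are loaded.
  void AddKnown(const std::string &w) {
    frozen_.emplace(w, static_cast<unsigned>(frozen_.size()));
  }

  // Hot path: lookup in the frozen table takes no lock. Only unknown words
  // fall through to the locked dynamic table.
  unsigned Lookup(const std::string &w) {
    auto it = frozen_.find(w);
    if (it != frozen_.end()) return it->second;

    std::lock_guard<std::mutex> guard(dynamic_mutex_);
    auto ins = dynamic_.emplace(
        w, static_cast<unsigned>(frozen_.size() + dynamic_.size()));
    return ins.first->second;
  }

 private:
  std::unordered_map<std::string, unsigned> frozen_;   // never modified after load
  std::unordered_map<std::string, unsigned> dynamic_;  // OOVs only
  std::mutex dynamic_mutex_;
};
```

The frozen table must genuinely never change after the worker threads start; only then is the unlocked lookup safe.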


jganitkevitch commented 11 years ago

I don't think there's actually any need to share OOV vocab items across threads, so the writable Vocab could just be thread-local, except perhaps when you have a generating module like a transliterator or morphological component that you won't want to re-run.

cfedermann commented 11 years ago

off-topic: the last time somebody used a transliterator component, it affected your last name, Juri :P


ugermann commented 11 years ago

We are currently working (among other things) on interactive MT, in which case you do want the writable Vocab (and other components) to be global so that information can be shared among threads. Regardless of whether you choose to have one dynamic Vocab per thread or a single shared one, the Vocab should offer two access functions: one declared const without locking, and one declared non-const with locking.


kpu commented 11 years ago

Why would randomized or continuous space LM be hard? All words come from the phrase table or are OOVs. So the LMs don't have to report their vocab if they don't want to. But the phrase table should, so that the only runtime vocab lookup is the source sentence. If the LMs want to go from const Factor * to string, that's just a method call anyway.

kpu commented 11 years ago

Uli, the object you're designing is either flawed or what we have now. It's not safe in general to do a const lookup at the same time as an add. So the object would have to do a read lock. Which is what it does now. The design goal is to not have a read lock at all.

ugermann commented 11 years ago

I'm talking about at least two distinct objects. One is read-only, global and never changes once it's been constructed. Each thread gets a Vocab const* pointer or Vocab const& reference to it, so calls to the lookup function will invoke the non-locking const member lookup function. In addition, there is one or more writable, locking dynamic object(s) (either one global or one per thread) that the respective thread will consult (with automatic addition) if the first lookup fails.


hieuhoang commented 11 years ago

everyone can chill the hell out. Whatever it is, it's not the vocab lock that's causing the slowdown so i ain't gonna do any of the suggestions.

threads | baseline | no vocab lock
1  | 326.5 | 296.750
5  | 87.75 |
10 | 58.75 | 58.25
15 | 61    |
20 | 68.5  | 68.25

bhaddow commented 11 years ago

That kind of makes things worse, because now we have no hints as to what's causing the problem. We need some way of instrumenting the locking. A quick search turned up mutrace, and also one of the valgrind tools. Would it be worth a try?

hieuhoang commented 11 years ago

yep, we had no idea what the problem was, and now we still have no idea, but it's not vocab locking. And now it's more likely to be something which is more difficult to fix, like malloc as phil said. Anything is worth a try, any more suggestions welcomed

dowobeha commented 11 years ago

Is there anything available for C++ similar to what JRat ( http://jrat.sourceforge.net) provides for Java? That is, an analysis of which functions take how much time and how frequently each function is called?

mjpost commented 11 years ago

Has anyone on the Moses team tried to replicate my curves? I can give you the model I used, although any old model should work, of course.

hieuhoang commented 11 years ago

ya, thx. scratch head

[image: chart_1]

bhaddow commented 11 years ago

Which model did you use for this?

kpu commented 11 years ago

Tetsuo suggests spinlocks. Since our lock areas are small this should help. Also, we should look into acquiring or writing a concurrent hash table.
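Short of a truly lock-free table, lock striping already removes most contention: shard the table and give each shard its own mutex, so threads only collide when they hash to the same shard. A rough sketch, not a proposal for the exact Moses data structure:

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_set>
#include <vector>

class StripedStringSet {
 public:
  explicit StripedStringSet(std::size_t num_shards = 64) : shards_(num_shards) {}

  // Insert if absent; returns true if this call did the insertion.
  // Only the shard the key hashes to is locked.
  bool Insert(const std::string &key) {
    Shard &s = shards_[hasher_(key) % shards_.size()];
    std::lock_guard<std::mutex> guard(s.mutex);
    return s.set.insert(key).second;
  }

 private:
  struct Shard {
    std::mutex mutex;
    std::unordered_set<std::string> set;
  };
  std::hash<std::string> hasher_;
  std::vector<Shard> shards_;
};
```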

Also, if people are being spammed by this, you can select unwatch thread from the page.

hieuhoang commented 11 years ago

hierarchical trained on europarl, filtered and binarized pt. It's on our syn server @ /home/s0565741/workspace/experiment/data/issues/locking/fr-en

spinlock is a good idea but we need to know where the problem is 1st. It's not vocab

XapaJIaMnu commented 10 years ago

I have been investigating this bug in the past month or so. All tests were done on a machine with 32 cores (16 real and 16 virtual), translating 3000 sentences from newstest2011.input.lc.1. Before testing, everything related (lexical reordering, phrase tables, europarl) was put in RAM (cat * > /dev/null).

Here is what I found: Phrase based moses also suffers from the problem IF the phraseDictionaryOnDisk phrase table is used.

Threads | Time (3000 sentences in minutes)
1  | 43m47s
2  | 24m9s
4  | 14m40s
8  | 11m34s
16 | 14m36s
32 | 13m54s

Note the machine has only 16 real cores. The additional 16 are hyper threads, so they shouldn't bring any significant speedup, yet we see a slight improvement when using 32 threads.

Same test setting, but using phraseDictionaryCompact:

Threads | Time (3000 sentences in minutes)
1  | 26m21s
2  | 13m36s
4  | 7m01s
8  | 3m57s
16 | 2m22s
32 | 2m39s

This phrase table scales as expected with the number of real cores and, as expected, suffers once computation is extended to hyperthreaded cores.

Based on those results I suspect hierarchical moses multithreading is not broken, but there is an issue with the particular phrase table used. Soon there is going to be support in phraseDictionaryCompact for hierarchical moses and then we can test that hypothesis.

Because phraseDictionaryOnDisk benefits from hyperthreaded cores, I suspect the issue might be processor cache writes that force invalidation in other (physical) CPUs. With hyperthreading, a cache modification by one hyperthread is seen automatically by the other thread running on the same CPU. (This is purely speculation on my part.)
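If that speculation is right, the usual check is to pad per-thread state onto its own cache line and see whether the scaling changes. Illustrative only, assuming a 64-byte line on x86_64:

```cpp
#include <cstdint>

// alignas(64) rounds sizeof up to a full cache line, so counters updated by
// different threads never share a line and never ping-pong between cores.
struct alignas(64) PerThreadCounters {
  std::uint64_t lookups = 0;
  std::uint64_t misses = 0;
};

PerThreadCounters counters[64];  // one slot per decoding thread
```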

Unfortunately I am finding it very hard to conduct my performance testing as I require exclusive access to a machine with large number of cores, which is hard to get. If you have a free machine around and you can give me access, I can continue working on it.

emjotde commented 10 years ago

Nikolay, I have a machine with 64 physical cores standing in my living room. If Edinburgh does not have anything similar I can give you remote access for a while. It is not exactly free as other things are running on it, but I can pause them once you get around to testing.

hieuhoang commented 10 years ago

it might be a good idea to wait 'til we have your compact pt for syntax decoding. Then we can test pb and hiero at the same time, using the same pt implementations

pjwilliams commented 10 years ago

Hi,

although the results aren't directly comparable, here are some numbers for multithreaded string-to-tree decoding using the in-memory rule table. (String-to-tree and hierarchical phrase-based use the same code for decoding, though there are substantial differences in parsing effort, numbers of translation options, etc. between the two types of model.)

Threads | Time (500 sentences in seconds)
1  | 2570
2  | 1572
4  | 948
8  | 621
12 | 506
24 | 473

The numbers are from magni, which (I think) has 12 real cores and 24 hyperthreads. I don't know enough about multithreaded programming to say whether our implementation is "broken," or to what degree, though I'd guess there are plenty of places where the code is suboptimal in terms of contention, data proximity, and what-have-you.

I'll be very interested to see the phraseDictionaryCompact results for hierarchical phrase-based decoding when those experiments are run.

Phil


XapaJIaMnu commented 10 years ago

Okay, after the weekend of benchmarks, I have some results to report: it appears that phraseDictionaryCompact also slows down as the number of threads increases.

Threads | Time (3000 sentences)
1  | 50m17.446s
2  | 25m11.630s
4  | 12m50.097s
8  | 7m2.488s
16 | 4m36.476s
24 | 5m43.218s
32 | 6m31.476s

This however is a separate problem from the one that is exhibited by phraseDictionaryOnDisk.

When running phraseDictionaryOnDisk and examining function call times using perf, it turns out that the culprit is std::locale

This is moses running phraseDictionaryOnDisk with two threads: [image: two threads]

4 threads: [image: four threads]

16 threads: [image: sixteen threads]

With a little googling it appears that this is a well-known problem when using std::somethingstream. Apparently locale() carries a global lock that you can't easily get rid of. GCC 4.5 should have made things a bit better, but only a bit. Here are a couple of links describing similar problems:

CPP forums, StackOverflow, GCC Bugzilla (evidently not completely resolved...)

A solution to the problem would be to move away from std::somethingstream in phraseDictionaryOnDisk and rely on something else for string operations that doesn't carry global locks.
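As a sketch of what that "something else" could look like: parse numbers straight out of a character buffer with strtof instead of routing them through an istringstream, whose construction touches the global std::locale. A hypothetical helper, not the actual OnDiskPt code:

```cpp
#include <cstdlib>
#include <vector>

// Parse a whitespace-separated list of floats from a C string without
// constructing any std::istream (and hence without touching std::locale).
std::vector<float> ParseScores(const char *p) {
  std::vector<float> scores;
  char *end = nullptr;
  for (float f = std::strtof(p, &end); end != p; f = std::strtof(p, &end)) {
    scores.push_back(f);
    p = end;
  }
  return scores;
}
```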

emjotde commented 10 years ago

Hi, the issue with the compact phrase table is probably the internal cache, right? You can disable that by binarizing with the option: -encoding None


hieuhoang commented 10 years ago

what's the disadvantage of disabling the cache? Is it slower with a small number of threads?

emjotde commented 10 years ago

The CompactPT cache is part of the dynamic programming algorithm for the smallest compression method. Currently, it is not thread-local (unless someone else did that), but I guess there is no reason not to make it thread-local. This would only affect translation speed in the case of an unpruned phrase table with hundreds of thousands of translations for "the" and the like.
