marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

GPU translation currently broken? #27

Closed bhaddow closed 7 years ago

bhaddow commented 7 years ago

It outputs 3 lines and then exits with no error message. The script can be found in /home/bhaddow/experiments/wmt17/backtrans/bench/translate-gpu.sh

Configuration (tried changing thread and batch settings):

```yaml
# Paths are relative to config file location
relative-paths: yes

# performance settings
beam-size: 5
normalize: yes
gpu-threads: 4
cpu-threads: 2
mini-batch: 60
maxi-batch: 1200

# scorer configuration
scorers:
  F0:
    path: model.npz
    type: Nematus

# scorer weights
weights:
  F0: 1.0

# vocabularies
source-vocab: vocab.en.json
target-vocab: vocab.cs.json
```

emjotde commented 7 years ago

Confirmed. Guys, we need regression tests. I am close to banning all commits to master for a while and reverting back to code before the batch decoding. It's good to have people trying current code, but we are breaking basic functions too often.

emjotde commented 7 years ago

@bhaddow revision 63c70c91b29b35220c2f89d648658a8ab5103cb3 seems to be fine for GPU decoding. The new batch-decoding works, for instance, with the parameters

`--beam-size 5 --mini-batch 100 --maxi-batch 1000`

which will preload 10 mini-batches and process them by length.
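
For readers unfamiliar with the two options, here is a minimal sketch of the idea (illustrative only, assumed names, not amun's actual scheduler): a maxi-batch worth of sentences is read ahead and sorted by length, then cut into mini-batches so each mini-batch holds similarly sized sentences and wastes little padding.

```cpp
#include <algorithm>
#include <string>
#include <vector>

using Sentence = std::vector<std::string>;  // one tokenized input line

// Sort a preloaded maxi-batch by length, then cut it into mini-batches.
std::vector<std::vector<Sentence>>
makeMiniBatches(std::vector<Sentence> maxiBatch, std::size_t miniBatchSize) {
  std::sort(maxiBatch.begin(), maxiBatch.end(),
            [](const Sentence& a, const Sentence& b) { return a.size() < b.size(); });

  std::vector<std::vector<Sentence>> miniBatches;
  for (std::size_t i = 0; i < maxiBatch.size(); i += miniBatchSize)
    miniBatches.emplace_back(
        maxiBatch.begin() + i,
        maxiBatch.begin() + std::min(i + miniBatchSize, maxiBatch.size()));
  return miniBatches;
}
```

With `--mini-batch 100 --maxi-batch 1000` this would yield 10 mini-batches per read-ahead, matching the description above.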

hieuhoang commented 7 years ago

Agreed. We can base the regression tests on the pre-batch results.

Hieu Sent while bumping into things


emjotde commented 7 years ago

Branch dev for all new stuff. I would freeze master for a while after we fix this issue.

emjotde commented 7 years ago

@bhaddow Weird, recompilation seems to help?

bhaddow commented 7 years ago

I deleted CMakeFiles, CMakeCache.txt, cmake_install.cmake and bin from my build directory, then reran cmake and make, but I still have the problem.

However, I can confirm that 63c70c9 is fine.

emjotde commented 7 years ago

Your YAML file is also a bit odd: set cpu-threads to 0 and just keep gpu-threads at 1 or more. I am thinking of reverting master to 63c70c9 for now; the current version is in branch dev.

hieuhoang commented 7 years ago

Can you hold off on reverting? Let me retest it. I'm pretty sure GPU translation is working; I check it for speed and accuracy every time I check in.

emjotde commented 7 years ago

I think the problem is the use of cpu-threads and gpu-threads together. That seems to break stuff for me.

hieuhoang commented 7 years ago

That's unlikely to work with batching, so the problem won't go away by reverting.

The real issue is how we support different devices that need different batching (e.g. GPUs with different amounts of RAM and different core counts, CPUs).

emjotde commented 7 years ago

It even breaks with mini-batch 1 and maxi-batch 1. Maybe we should disable it? Either use CPU or GPU?

hieuhoang commented 7 years ago

lemme check

bhaddow commented 7 years ago

So why have the two options if they don't work together? Just --threads was fine.

bhaddow commented 7 years ago

And how do you get a minibatch of 100 to work? On a 1080, it works with 10, but crashes (memory) with 20. This is with 1 gpu thread and no cpu threads.

emjotde commented 7 years ago

They used to work together before we had batching. I still think they should work; it would just require more complex scheduling. However, I am not sure combining them actually makes sense when the GPU is 50x faster than the CPU: you are more likely to introduce slow-downs than speed-ups.

Having both options was meant to let the same binary run as a pure CPU or a pure GPU version. You can use --cpu-threads 16 and --gpu-threads 0, which uses only the CPU. Or you can use --cpu-threads 0 and --gpu-threads 3, which uses only the GPU and can give minor speed-ups compared to a single GPU thread.

hieuhoang commented 7 years ago

They should work together. Eventually we will probably have to bite the bullet and make things like batch size and thread count device-specific parameters.

bhaddow commented 7 years ago

I understand now. It wasn't clear to me why the threads parameter got split into two, and I thought boosting both might help. IMHO it would be clearer if --device were used to specify the device and --threads the number of threads, unless there really is a use case for combining CPU and GPU threads.

emjotde commented 7 years ago

Maybe we should switch to string device identifiers? --device gpu0,gpu1,gpu2,gpu3 or --device cpu; --threads would then mean threads per device, and mixing, e.g. gpu0,cpu, would be forbidden.

kpu commented 7 years ago

GNU parallel --sshlogin style?
gpu0/4,gpu1/3

emjotde commented 7 years ago

Do we need to be able to specify threads in such a fine-grained way? --device gpu0,gpu1 --threads 3 would mean 6 actual threads. Confusing?

For the pure CPU version (when compiled without cuda support) we can just do --threads and not display a device option.

hieuhoang commented 7 years ago

+1 for @emjotde's idea. We can support CPUs and GPUs of different kinds if we keep maxi-batch a global parameter but make mini-batch device-specific, e.g.

--device GPU0,GPU1,CPU --threads 1,1,32 --mini-batch 100,20,1

bhaddow commented 7 years ago

This is starting to look like a Moses command line that you end up having to write a perl script to generate. Most people don't want to mix CPUs and GPUs, and want the same number of threads and batch size on each GPU. So use @emjotde's original suggestion -- with perhaps an advanced formulation for heterogeneous environments -- if you really want to support them.

emjotde commented 7 years ago

@bhaddow has a point. Let's keep it simple.

hieuhoang commented 7 years ago

It will have to be implemented eventually, when we buy new GPUs to slot into existing PCs alongside old GPUs. Supporting CPUs alongside GPUs is a proxy for that heterogeneous environment.

Having an advanced formulation isn't really an option, because we would then have two formats to look after.

emjotde commented 7 years ago

On zisa:

```
cd /home/marcin/un/en-fr.marian
/home/marcin/amunmt/build/bin/amun -m ../en-fr/model.npz.dev.npz -s train.tok.clean.bpe.en.json -t train.tok.clean.bpe.fr.json -n -b 12 -d 1 --maxi-batch 1 --mini-batch 1 -i dev.tok.bpe.en
```

gives me a segfault with current master. Inside Transpose?

hieuhoang commented 7 years ago

Barry's models work OK with 1 gpu, 12 cpu, and 1 cpu + 12 gpu. It doesn't work with 2 GPUs, but that gave no speed benefit anyway because batching is already saturating the cores; it should still work though, so I'll debug it later.

1 GPU is 30 times faster than 12 CPU cores. About 10 sentences differ out of 10k sentences.

emjotde commented 7 years ago

Still, something is very wrong, see segfault above.

hieuhoang commented 7 years ago

I'll take a look tomorrow.

emjotde commented 7 years ago

Actually, that might be because someone is hogging all the GPUs on zisa, completely filled to the brim. TensorFlow, I presume.

kpu commented 7 years ago

That would be a bug in TensorFlow ;-). I've killed the job since it was hogging GPUs and not using them.

emjotde commented 7 years ago

OK. No segfault now. Thanks. Lesson: we need better CUDA error reporting :)

bhaddow commented 7 years ago

I'm now thinking it was a silent out-of-memory error. --gpu-threads 4 --mini-batch 100 causes the silent exit; increasing mini-batch to 1000 gives a bad_alloc. That doesn't really explain why the issue went away when I reverted to an earlier version, unless it was some coincidental collision with our TensorSlow friend.

emjotde commented 7 years ago

Reduce gpu-threads to 1. The potential speed-up from multiple GPU threads is minimal with batching, but it probably multiplies peak memory usage.

hieuhoang commented 7 years ago

@emjotde the zisa model works for me on my machine if I change -d 1 -> -d 0

emjotde commented 7 years ago

@hieuhoang yes, it was a memory problem. We should add checks on allocation, so it throws an error instead of just segfaulting.

hieuhoang commented 7 years ago

agreed. I'll use my little laptop

hieuhoang commented 7 years ago

I was thinking: adding allocation checks doesn't solve the underlying problem of failures during translation. Ideally, GPU amun should NEVER hit a runtime memory problem during translation, otherwise it would be unpredictable in production. We can achieve this by only allocating memory during init.

Most of the model data is allocated at init, but there are some significant memory allocations for each batch, especially shared memory and thread-specific memory.
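
A minimal sketch of that "allocate only at init" idea (assumed names, not existing amun code): grab one device buffer up front and hand out slices from it per batch, so an out-of-memory condition surfaces at startup rather than unpredictably mid-translation.

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <stdexcept>

// One fixed device allocation made at init; per-batch workspace is carved
// out of it, so cudaMalloc can never fail in the middle of decoding.
class DeviceArena {
public:
  explicit DeviceArena(std::size_t bytes) : size_(bytes), offset_(0) {
    if (cudaMalloc(&base_, bytes) != cudaSuccess)
      throw std::runtime_error("GPU out of memory at startup");
  }
  ~DeviceArena() { cudaFree(base_); }

  // Bump-allocate a slice; exhaustion is an explicit error, not a segfault.
  void* allocate(std::size_t bytes) {
    if (offset_ + bytes > size_)
      throw std::runtime_error("arena exhausted: increase the init allocation");
    void* p = static_cast<char*>(base_) + offset_;
    offset_ += bytes;
    return p;
  }

  void reset() { offset_ = 0; }  // reuse the workspace for the next batch

private:
  void* base_;
  std::size_t size_, offset_;
};
```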

emjotde commented 7 years ago

Define significant. Shared memory is usually a couple of bytes per thread; also, doesn't that come from an entirely different pool?

What we need at the very least are checks like this around CUDA invocations: https://github.com/emjotde/Marian/blob/master/src/tensors/tensor_gpu.h#L37
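
The linked check is along those lines; here is a sketch of the usual pattern (macro name illustrative, not necessarily what that file uses) for turning CUDA errors into exceptions instead of silent exits or segfaults:

```cpp
#include <cuda_runtime.h>
#include <sstream>
#include <stdexcept>

// Wrap every CUDA runtime call so a failure (e.g. out of memory)
// throws a descriptive exception instead of being ignored.
#define CUDA_CHECK(expr)                                          \
  do {                                                            \
    cudaError_t rc = (expr);                                      \
    if (rc != cudaSuccess) {                                      \
      std::ostringstream msg;                                     \
      msg << "CUDA error: " << cudaGetErrorString(rc)             \
          << " at " << __FILE__ << ":" << __LINE__;               \
      throw std::runtime_error(msg.str());                        \
    }                                                             \
  } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&ptr, bytes));
//   CUDA_CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
```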

hieuhoang commented 7 years ago

You're right. A lot of the shared-memory allocation varies with batch size; it's more like a couple of KB now rather than bytes. However, it's not the cause of the memory issues when I use Barry's model with mini-batch=40.

I found the cause of the crash: it's not shared memory but global memory. GPU/matrix::Resize is being called with rows=499200, cols=2048, which is 7.6GB. I traced it back to the Broadcast() call at L135 of Alignment::GetAlignedSourceContext(). Basically, the matrix size depends on the mini-batch size and the source length.

The only fix I can think of is an option for a fixed maximum input length; any input longer than that is either truncated or causes a throw.

emjotde commented 7 years ago

Actually, I just fixed that in Marian yesterday. We can do it the same way here: perform the broadcast virtually with a simultaneous reduce, so that the huge matrix is never constructed but reduced immediately. For that you need to fold the subsequent vector multiplication into the broadcast. Tomasz and I will fix it together. It will also make things a good deal faster.
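
A minimal CUDA sketch of that idea, with illustrative names and shapes (not the actual amun/Marian code): instead of materializing the broadcast matrix tanh(state + context) with one row per (decoder state, source position) pair and then multiplying it by the attention vector, each thread computes its reduced energy directly, so only the small energies matrix is ever stored.

```cpp
#include <cuda_runtime.h>
#include <math.h>

// Fused broadcast + reduce for attention energies (sketch).
// state:    [numStates x dim]    decoder states (batch * beam rows)
// context:  [srcLen x dim]       projected source context
// v:        [dim]                attention vector
// energies: [numStates x srcLen] output, the only matrix materialized
__global__ void fusedEnergies(const float* state, const float* context,
                              const float* v, float* energies,
                              int numStates, int srcLen, int dim) {
  int s = blockIdx.x;                             // decoder state index
  int j = blockIdx.y * blockDim.x + threadIdx.x;  // source position
  if (s < numStates && j < srcLen) {
    float sum = 0.f;
    for (int k = 0; k < dim; ++k)                 // reduce on the fly; the
      sum += v[k] * tanhf(state[s * dim + k] +    // broadcast is never stored
                          context[j * dim + k]);
    energies[s * srcLen + j] = sum;
  }
}

// Possible launch configuration:
//   dim3 block(256), grid(numStates, (srcLen + 255) / 256);
//   fusedEnergies<<<grid, block>>>(state, context, v, energies,
//                                  numStates, srcLen, dim);
```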

hieuhoang commented 7 years ago

Ah, OK. Another solution would be variable-sized mini-batches, but that's for another time.

emjotde commented 7 years ago

On the other hand, where does the number 499200 come from? What is its factorization into sentence length, beam size and batch size?

hieuhoang commented 7 years ago

Err, good question: there's a sentence of length 1040 in the input. We should protect ourselves against that.
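
To make the factorization explicit: dividing out the offending sentence leaves a factor of 480, which is the combined mini-batch times beam expansion. The split of 480 into 40 x 12 below uses the mini-batch and beam sizes mentioned earlier in the thread and is my reading of the numbers, not something stated here.

$$
499200 \;=\; 1040 \times 480 \;=\; \underbrace{1040}_{\text{source length}} \times \underbrace{40}_{\text{mini-batch}} \times \underbrace{12}_{\text{beam}}
$$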

emjotde commented 7 years ago

slicing and dicing.

hieuhoang commented 7 years ago

mini-batch 1

hieuhoang commented 7 years ago

discard

bhaddow commented 7 years ago

Please don't discard - that will really screw things up. But yes, I shouldn't have given it that sentence in the first place.


emjotde commented 7 years ago

I would cut it into max-length pieces somehow and translate piece by piece? Messes up bookkeeping a bit.
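
A rough sketch of that piece-by-piece approach (illustrative, assuming tokenized input; not existing amun code): each over-long sentence is cut into chunks of at most some maximum length, the chunks are translated as separate sentences, and the outputs are re-joined afterwards, which is where the extra bookkeeping comes in.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Cut an over-long tokenized sentence into pieces of at most maxLength tokens.
std::vector<std::vector<std::string>>
splitLongSentence(const std::vector<std::string>& tokens, std::size_t maxLength) {
  std::vector<std::vector<std::string>> pieces;
  for (std::size_t i = 0; i < tokens.size(); i += maxLength)
    pieces.emplace_back(tokens.begin() + i,
                        tokens.begin() + std::min(i + maxLength, tokens.size()));
  return pieces;
}
```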

hieuhoang commented 7 years ago

Should we truncate? Perhaps even to length 0. It's obviously garbage input and pointless to try to translate. FYI, the offending line is:

As for the possibility of running against Hillary Clinton , Ru@@ bio was quick to criticize the presump@@ tive Democratic nom@@ inee . \ n \ u@@ 00@@ 22 , " " duration " : 127 , " date " : " 2015 @-@ 01 @-@ 12 18 : 50 : 00 , " " key@@ words " : " Video , News , CBS , Marco Ru@@ bio , , " " image " : { " path " : " http : \ / \ / c@@ b@@ s@@ news@@ 2.@@ c@@ b@@ si@@ stat@@ ic@@ .com \ / hub \ / i \ / r \ / 2015 \ / 01 \ / 13 \ / 7@@ f@@ d@@ 70@@ 6@@ 68 @-@ e@@ eff @-@ 45@@ f@@ 0 @-@ b@@ 4@@ c@@ 3 @-@ 5@@ b@@ 3@@ b@@ 2@@ b@@ 2@@ b@@ 94@@ 57 \ / thumbnail \ / 140@@ x@@ 90 \ / 1@@ e@@ 1@@ d@@ 08@@ 29@@ d@@ c@@ f@@ 5@@ a@@ 65@@ d@@ 26@@ fac@@ 6@@ da@@ 2@@ b@@ 7@@ a@@ 9@@ de@@ 4 \ / em@@ 01@@ 12@@ cor@@ des@@ 3@@ 30@@ 4@@ 29@@ 6@@ 40@@ x@@ 36@@ 0.@@ j@@ pg , " " full " : " http : \ / \ / c@@ b@@ s@@ news@@ 1.@@ c@@ b@@ si@@ stat@@ ic@@ .com \ / hub \ / i \ / r \ / 2015 \ / 01 \ / 13 \ / 7@@ f@@ d@@ 70@@ 6@@ 68 @-@ e@@ eff @-@ 45@@ f@@ 0 @-@ b@@ 4@@ c@@ 3 @-@ 5@@ b@@ 3@@ b@@ 2@@ b@@ 2@@ b@@ 94@@ 57 \ / thumbnail \ / 9@@ 40@@ x@@ 470 \ / 6@@ a@@ 16@@ 68@@ 85@@ f@@ 4@@ d@@ 7@@ 114@@ 6@@ f@@ 25@@ 10@@ a@@ 0@@ c@@ f@@ 7@@ ef@@ 47@@ 12 \ / em@@ 01@@ 12@@ cor@@ des@@ 3@@ 30@@ 4@@ 29@@ 6@@ 40@@ x@@ 36@@ 0.@@ j@@ pg " } , " show@@ Style " : " video @-@ logo @-@ evening @-@ news , " " suppress " : false , " season " : " 1 , " " episode " : " 1 , " " mp@@ x@@ Re@@ f@@ Id " : " nc@@ O@@ t n@@ ET@@ br@@ HY@@ TT@@ Ab@@ X@@ 1@@ DEV@@ Q@@ 5@@ TJ@@ r@@ L@@ 1@@ mr@@ c@@ M , " " segment " : 1 , " i@@ pad@@ Block " : false , " topic " : " evening @-@ news , " " topic name " : " CBS Even@@ ing News , " " topic parent slug " : " evening @-@ news , " " topic parent " : " CBS Even@@ ing News , " " collection " : null , " pid " : " p@@ x@@ I@@ 2@@ b@@ 1@@ R@@ NAT@@ Es " } , " dz@@ Wo@@ R@@ D@@ 3@@ H@@ 8@@ ROM " : { " id " : " 3@@ b@@ 48@@ 76@@ f@@ 5 @-@ 29@@ f@@ 0 @-@ 4@@ aa@@ 9 @-@ 9@@ 29@@ a @-@ 43@@ d@@ 800@@ f@@ 39@@ 1@@ f@@ 8 , " " medi@@ as " : { " desktop " : { " pid " : " dz@@ Wo@@ R@@ D@@ 3@@ H@@ 8@@ ROM , " " bit@@ rate " : " 7@@ 64@@ 000 , " " uri " : " rt@@ mp : \ / \ / c@@ p@@ 98@@ 36@@ 3.@@ ed@@ ge@@ f@@ cs.@@ net \ / on@@ demand \ / ? 
au@@ th = c@@ bs \ u@@ 00@@ 26@@ ai@@ f@@ p = v@@ 001 \ u@@ 00@@ 26@@ sli@@ st = media \ / 2015 \ / 01 \ / 12 \ / 38@@ 3@@ 40@@ 30@@ 7@@ 58@@ 43 \ / \ u@@ 00@@ 3@@ C@@ break \ u@@ 00@@ 3@@ E@@ media \ / 2015 \ / 01 \ / 12 \ / 38@@ 3@@ 40@@ 30@@ 7@@ 58@@ 43 \ / en la@@ po@@ ok 3@@ 30@@ 407 79@@ 6.@@ mp@@ 4 " } , " tablet " : { " pid " : " X@@ O@@ 2@@ U@@ pi@@ P@@ f@@ 27 S , " " bit@@ rate " : " 7@@ 40@@ 000 , " " uri " : " http : \ / \ / down@@ lo@@ ad@@ .c@@ b@@ s@@ new@@ s.com \ / media \ / mp@@ x \ / 2015 \ / 01 \ / 12 \ / 38@@ 3@@ 40@@ 30@@ 7@@ 58@@ 43 \ / en la@@ po@@ ok 3@@ 30@@ 407 74@@ 0.@@ mp@@ 4 " } , " mobile " : { " pid " : " On@@ Y@@ V@@ R@@ CL@@ W@@ g@@ DR , " " bit@@ rate " : " 24@@ 0000 , " " uri " : " http : \ / \ / down@@ lo@@ ad@@ .c@@ b@@ s@@ new@@ s.com \ / media \ / mp@@ x \ / 2015 \ / 01 \ / 12 \ / 38@@ 3@@ 40@@ 30@@ 7@@ 58@@ 43 \ / en la@@ po@@ ok 3@@ 30@@ 407 24@@ 0.@@ mp@@ 4 " } , " i@@ os " : { " pid " : " B@@ kn@@ 9@@ D@@ 9@@ G@@ sy@@ GT@@ t , " " bit@@ rate " : " 500@@ 000 , " " uri " : " http : \ / \ / i@@ pad @-@ stream@@ ing@@ .c@@ b@@ s@@ new@@ s.com \ / media \ / mp@@ x \ / 2015 \ / 01 \ / 12 \ / 38@@ 3@@ 40@@ 30@@ 7@@ 58@@ 43 \ / en la@@ po@@ ok 3@@ 30@@ 421 500 \ / en la@@ po@@ ok 3@@ 30@@ 421 500@@ .@@ m@@ 3@@ u@@ 8 " } } , " slug " : " dis@@ ney@@ land @-@ meas@@ les @-@ outbreak @-@ spreads , " " title " : " Disneyland meas@@ les outbreak spreads , " " dek " : " The people sick@@ ened were all at Disneyland between December 15 @-@ 20 .

emjotde commented 7 years ago

This is important information about Disneyland!