marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

Translating a file #26

Closed: mehmedes closed this issue 7 years ago

mehmedes commented 7 years ago

Is it possible to translate a file consisting of one sentence per line?

I tried

./amun -c config.ens.yml -i source.txt > target.txt

but ended up with no translation. target.txt was created but didn't contain any content.

hieuhoang commented 7 years ago

try ./amun -c config.ens.yml < source.txt > target.txt until we fix it

hieuhoang commented 7 years ago

fixed https://github.com/emjotde/amunmt/commit/27d6fb9014f03a5f01fb80b570e9a18600cd4ec0

mehmedes commented 7 years ago

Great! Works perfectly!

mehmedes commented 7 years ago

Is there anything I need to consider when translating a file of several hundred MB?

The translation crashes stating

Killed

emjotde commented 7 years ago

This probably reads your whole file into memory and then gets killed for filling it. @hieuhoang shouldn't maxi-batch take care of this?

hieuhoang commented 7 years ago

That's more the job of the queuing mechanism, or it could be a leak. I've replicated the problem.

This is going to take a while to sort out.

hieuhoang commented 7 years ago

@mehmedes - how many lines does it translate before it crashes, and approximately how much memory did it use before it crashed? How much memory does your machine have?

Can you make your input & models available for download?

kpu commented 7 years ago

https://github.com/emjotde/amunmt/blob/master/src/common/decoder_main.cpp#L59

You want a bounded producer-consumer queue on the ThreadPool.
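
For reference, here is a minimal sketch of the kind of bounded producer-consumer queue kpu describes, using only C++11 primitives. The class and its Push/Pop names are illustrative, not amunmt's actual API; the point is that Push blocks once the queue holds capacity items, so a fast reader thread can never buffer the whole input file:

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

template <class T>
class BoundedQueue {
 public:
  explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

  // Producer side: blocks while the queue is full, bounding memory use.
  void Push(T item) {
    std::unique_lock<std::mutex> lock(mutex_);
    notFull_.wait(lock, [this] { return queue_.size() < capacity_; });
    queue_.push(std::move(item));
    notEmpty_.notify_one();
  }

  // Consumer side: blocks while the queue is empty.
  T Pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    notEmpty_.wait(lock, [this] { return !queue_.empty(); });
    T item = std::move(queue_.front());
    queue_.pop();
    notFull_.notify_one();
    return item;
  }

 private:
  const std::size_t capacity_;
  std::queue<T> queue_;
  std::mutex mutex_;
  std::condition_variable notFull_;
  std::condition_variable notEmpty_;
};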

hieuhoang commented 7 years ago

I can't see the problem with L59.

Aye - bounded queue. Know of any we can steal?

emjotde commented 7 years ago

About L59, what happens to sentences when it is smaller than maxiBatch?

mehmedes commented 7 years ago

It translates 20 lines and uses 87% MEM before it crashes. I run a virtual machine with 7.2 GB RAM. I use Rico's WMT16 de->en translation model:

wget -r --cut-dirs=2 -e robots=off -nH -np -R index.html* http://data.statmt.org/rsennrich/wmt16_systems/de-en/

My configs are:

allow-unk: false
batch-size: 1
beam-size: 12
bpe:
  - /home/sariyildiznureddin/de-en/deen.bpe
bunch-size: 1
cpu-threads: 8
devices: [0]
gpu-threads: 0
n-best: false
no-debpe: false
normalize: false
relative-paths: true
return-alignment: false
scorers:
  F0:
    path: /home/sariyildiznureddin/de-en/model-ens1.npz
    type: Nematus
  F1:
    path: /home/sariyildiznureddin/de-en/model-ens2.npz
    type: Nematus
  F2:
    path: /home/sariyildiznureddin/de-en/model-ens3.npz
    type: Nematus
  F3:
    path: /home/sariyildiznureddin/de-en/model-ens4.npz
    type: Nematus
show-weights: false
softmax-filter:
  []
source-vocab:
  - /home/sariyildiznureddin/de-en/vocab.de.json
target-vocab: /home/sariyildiznureddin/de-en/vocab.en.json
weights:
  F0: 1
  F1: 1
  F2: 1
  F3: 1
wipo: false

emjotde commented 7 years ago

@mehmedes your commands are fine. This is our bug to fix, hold on :)

kpu commented 7 years ago

I have a bounded threadpool, but it's not C++11 style. Modifying the one you're using.
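
As a rough illustration of how such a queue caps memory once it sits between the input reader and the translation workers (this reuses the hypothetical BoundedQueue sketched above; amunmt's actual ThreadPool wiring will differ):

#include <iostream>
#include <string>
#include <thread>
#include <vector>

int main() {
  BoundedQueue<std::string> queue(100);  // capacity, not file size, bounds memory

  // Consumer threads: pop one sentence at a time and translate it.
  std::vector<std::thread> workers;
  for (int i = 0; i < 8; ++i) {
    workers.emplace_back([&queue] {
      for (;;) {
        std::string line = queue.Pop();
        if (line.empty()) break;  // sentinel; real code needs a better poison pill
        // translate(line);       // placeholder for the actual decoding call
      }
    });
  }

  // Producer: the reader blocks in Push once the queue is full.
  std::string line;
  while (std::getline(std::cin, line)) queue.Push(line);
  for (std::size_t i = 0; i < workers.size(); ++i) queue.Push("");
  for (auto& w : workers) w.join();
  return 0;
}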

hieuhoang commented 7 years ago

@kpu cheers.

@emjotde - it reads in another line. If EOF then it goes to L70. Seems to work ok (see the sketch after this list).

@mehmedes:

  1. I've not tested amunmt with multiple models, there may be bugs. Can you try with just 1 model? It definitely shouldn't crash after 20 sentences.
  2. Can you replace batch-size: 1 with the following:
     mini-batch: 100
     maxi-batch: 2000
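
For reference, a minimal sketch of the read loop discussed above (decoder_main.cpp around L59-L70), assuming lines accumulate into a maxi-batch that is flushed when full and flushed once more, however small, at EOF; the translate call is a placeholder, not amunmt's actual function:

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

int main() {
  const std::size_t maxiBatchSize = 2000;  // lines collected per maxi-batch
  std::vector<std::string> maxiBatch;
  std::string line;
  while (std::getline(std::cin, line)) {
    maxiBatch.push_back(line);
    if (maxiBatch.size() >= maxiBatchSize) {
      // translate(maxiBatch);  // hand the full batch to the decoder
      maxiBatch.clear();
    }
  }
  if (!maxiBatch.empty()) {
    // translate(maxiBatch);    // EOF: flush the final, smaller batch
  }
  return 0;
}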

emjotde commented 7 years ago

@hieuhoang He is using the CPU version, so mini-batches won't work.

hieuhoang commented 7 years ago

I know, just making sure.


kpu commented 7 years ago

See if I fixed it in 20a04ea9853c460752f05c811aa6f45361f14df2.

mehmedes commented 7 years ago

@hieuhoang I get

terminate called after throwing an instance of 'YAML::ParserException'
  what():  yaml-cpp: error at line 21, column 9: illegal map value
Aborted (core dumped)

@kpu I'll try.
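
That "illegal map value" is what yaml-cpp reports when a second key-value pair follows on the same line, so one guess (not confirmed in the thread) is that the two new settings were pasted onto a single line; each needs its own line:

# likely culprit: both settings on one line -> "illegal map value"
# mini-batch: 100 maxi-batch: 2000

# each setting on its own line parses cleanly
mini-batch: 100
maxi-batch: 2000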

mehmedes commented 7 years ago

Looks good. So far no crash!

hieuhoang commented 7 years ago

Yep, me too. Mem doesn't grow, no slowdown.

mehmedes commented 7 years ago

The translation crashed after 5300 lines. I had it running for 4 hours.

Would this now be due to my low memory?

hieuhoang commented 7 years ago

can you please make your files available for download so we can replicate it

mehmedes commented 7 years ago

I used the following model...

wget -r --cut-dirs=2 -e robots=off -nH -np -R index.html* http://data.statmt.org/rsennrich/wmt16_systems/de-en/

...to translate this file...

wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2007.de.shuffled.gz

with this shell command...

#!/bin/sh

# this sample script translates a test set, including
# preprocessing (tokenization, truecasing, and subword segmentation),
# and postprocessing (merging subword units, detruecasing, detokenization).

# instructions: set the paths to mosesdecoder and to the amun binary and config
# (amun applies BPE itself via its config), then run "./translate.sh < input_file > output_file"

# suffix of source language
SRC=de

# suffix of target language
TRG=en

# path to moses decoder: https://github.com/moses-smt/mosesdecoder
mosesdecoder=/home/sariyildiznureddin/mosesdecoder

# preprocess
$mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l $SRC | \
$mosesdecoder/scripts/tokenizer/tokenizer.perl -l $SRC | \
$mosesdecoder/scripts/recaser/truecase.perl -model truecase-model.$SRC | \

# translate
/home/sariyildiznureddin/amunmt/build/bin/amun -c /home/sariyildiznureddin/amunmt/build/bin/config.ens.yml | \

sed 's/\@\@ //g' | \
$mosesdecoder/scripts/recaser/detruecase.perl | \
$mosesdecoder/scripts/tokenizer/detokenizer.perl -l $TRG -penn

... and these configs ...

allow-unk: false
batch-size: 1
beam-size: 12
bpe:
  - /home/sariyildiznureddin/de-en/deen.bpe
bunch-size: 1
cpu-threads: 8
devices: [0]
gpu-threads: 0
n-best: false
no-debpe: false
normalize: false
relative-paths: true
return-alignment: false
scorers:
  F0:
    path: /home/sariyildiznureddin/de-en/model-ens1.npz
    type: Nematus
  F1:
    path: /home/sariyildiznureddin/de-en/model-ens2.npz
    type: Nematus
  F2:
    path: /home/sariyildiznureddin/de-en/model-ens3.npz
    type: Nematus
  F3:
    path: /home/sariyildiznureddin/de-en/model-ens4.npz
    type: Nematus
show-weights: false
softmax-filter:
  []
source-vocab:
  - /home/sariyildiznureddin/de-en/vocab.de.json
target-vocab: /home/sariyildiznureddin/de-en/vocab.en.json
weights:
  F0: 1
  F1: 1
  F2: 1
  F3: 1
wipo: false

:D

hieuhoang commented 7 years ago

Looks like it uses at least 6GB. If your machine has 8GB, I would not be surprised if it crashed due to running out of memory on a particularly long sentence.

mehmedes commented 7 years ago

Ok, thanks! I'll increase the RAM and do a new run.

hieuhoang commented 7 years ago

I've been running your setup for 9 hours on my server with 128GB RAM and 12 1.7GHz cores. Still going; it has translated 25k sentences so far, currently using 8.5GB RAM.