Closed: mehmedes closed this issue 7 years ago
try ./amun -c config.ens.yml < source.txt > target.txt until we fix it
Great! Works perfectly!
Is there anything I need to consider when translating a file with several hundred MB?
The translation crashes stating
Killed
This probably reads your whole file into memory and then gets killed when it fills up. @hieuhoang shouldn't maxi-batch take care of this?
That's more the job of the queuing mechanism, or it could be a leak. I've replicated the problem. This is going to take a while to sort out.
@mehmedes - how many lines does it translate before it crashes, and approximately how much memory did it use before crashing? How much memory does your machine have?
Can you make your input & models available for download?
https://github.com/emjotde/amunmt/blob/master/src/common/decoder_main.cpp#L59
You want a bounded producer-consumer queue on the ThreadPool.
i can't see the problem with L59.
aye - bounded queue. Know of any we can steal?
About L59: what happens when the number of sentences read is smaller than maxiBatch?
It translates 20 lines and uses 87% of memory before it crashes. I'm running a virtual machine with 7.2 GB RAM. I use Rico's WMT16 de-en translation model:
wget -r --cut-dirs=2 -e robots=off -nH -np -R index.html* http://data.statmt.org/rsennrich/wmt16_systems/de-en/
My configs are:
allow-unk: false
batch-size: 1
beam-size: 12
bpe:
  - /home/sariyildiznureddin/de-en/deen.bpe
bunch-size: 1
cpu-threads: 8
devices: [0]
gpu-threads: 0
n-best: false
no-debpe: false
normalize: false
relative-paths: true
return-alignment: false
scorers:
  F0:
    path: /home/sariyildiznureddin/de-en/model-ens1.npz
    type: Nematus
  F1:
    path: /home/sariyildiznureddin/de-en/model-ens2.npz
    type: Nematus
  F2:
    path: /home/sariyildiznureddin/de-en/model-ens3.npz
    type: Nematus
  F3:
    path: /home/sariyildiznureddin/de-en/model-ens4.npz
    type: Nematus
show-weights: false
softmax-filter: []
source-vocab:
  - /home/sariyildiznureddin/de-en/vocab.de.json
target-vocab: /home/sariyildiznureddin/de-en/vocab.en.json
weights:
  F0: 1
  F1: 1
  F2: 1
  F3: 1
wipo: false
@mehmedes your commands are fine. This is our bug to fix, hold on :)
I have a bounded threadpool, but it's not C++11 style. Modifying the one you're using.
@kpu cheers
@emjotde - it reads in another line; if it's EOF, it goes to L70. Seems to work OK, @mehmedes
@hieuhoang He is using the CPU version, so mini-batches won't work.
i know, just making sure
See if I fixed it in 20a04ea9853c460752f05c811aa6f45361f14df2.
@hieuhoang I get
terminate called after throwing an instance of 'YAML::ParserException'
what(): yaml-cpp: error at line 21, column 9: illegal map value
Aborted (core dumped)
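For anyone hitting the same thing: yaml-cpp typically reports "illegal map value" when an unquoted scalar value itself contains `: `, or when a nested mapping is mis-indented. A minimal, hypothetical reproduction (not taken from the actual config):

```yaml
# Broken: the second ": " makes yaml-cpp see a map value inside a value
label: model: v1

# Fixed: quote the scalar
label: "model: v1"
```

So the error at line 21, column 9 points at whichever value in config.ens.yml the parser can no longer read as a plain scalar or a properly indented map.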
@kpu I'll try.
Looks good. So far no crash!
yep, me too. Mem doesn't grow. No slowdown
The translation crashed after 5300 lines; I had it running for 4 hours.
Could this now be due to my low memory?
can you please make your files available for download so we can replicate it
I used the following model...
wget -r --cut-dirs=2 -e robots=off -nH -np -R index.html* http://data.statmt.org/rsennrich/wmt16_systems/de-en/
...to translate this file...
wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2007.de.shuffled.gz
with this shell command...
#!/bin/sh
# this sample script translates a test set, including
# preprocessing (tokenization, truecasing, and subword segmentation),
# and postprocessing (merging subword units, detruecasing, detokenization).
# instructions: set paths to mosesdecoder, subword_nmt, and nematus,
# then run "./translate.sh < input_file > output_file"
# suffix of source language
SRC=de
# suffix of target language
TRG=en
# path to moses decoder: https://github.com/moses-smt/mosesdecoder
mosesdecoder=/home/sariyildiznureddin/mosesdecoder
# preprocess
$mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l $SRC | \
$mosesdecoder/scripts/tokenizer/tokenizer.perl -l $SRC | \
$mosesdecoder/scripts/recaser/truecase.perl -model truecase-model.$SRC | \
# translate
/home/sariyildiznureddin/amunmt/build/bin/amun -c /home/sariyildiznureddin/amunmt/build/bin/config.ens.yml | \
sed 's/\@\@ //g' | \
$mosesdecoder/scripts/recaser/detruecase.perl | \
$mosesdecoder/scripts/tokenizer/detokenizer.perl -l $TRG -penn
... and these configs ...
allow-unk: false
batch-size: 1
beam-size: 12
bpe:
  - /home/sariyildiznureddin/de-en/deen.bpe
bunch-size: 1
cpu-threads: 8
devices: [0]
gpu-threads: 0
n-best: false
no-debpe: false
normalize: false
relative-paths: true
return-alignment: false
scorers:
  F0:
    path: /home/sariyildiznureddin/de-en/model-ens1.npz
    type: Nematus
  F1:
    path: /home/sariyildiznureddin/de-en/model-ens2.npz
    type: Nematus
  F2:
    path: /home/sariyildiznureddin/de-en/model-ens3.npz
    type: Nematus
  F3:
    path: /home/sariyildiznureddin/de-en/model-ens4.npz
    type: Nematus
show-weights: false
softmax-filter: []
source-vocab:
  - /home/sariyildiznureddin/de-en/vocab.de.json
target-vocab: /home/sariyildiznureddin/de-en/vocab.en.json
weights:
  F0: 1
  F1: 1
  F2: 1
  F3: 1
wipo: false
:D
Looks like it uses at least 6GB. If your machine has 8GB, I would not be surprised that it crashes due to running out of memory on a particularly long sentence.
Ok. Thanks! I'll increase the RAM and do another run.
I've been running your setup for 9 hours on my server with 128GB RAM and 12 1.7GHz cores. Still going; it has translated 25k sentences and is currently using 8.5GB of RAM.
Is it possible to translate a file consisting of one sentence per line?
I tried
but ended up with no translation. target.txt was created but was empty.