Closed: duduscript closed this issue 6 years ago.

I train a transformer model with En-Fr data. I have run it several times, but it seems to deadlock every time after finishing a validation; the log is as follows:

[2018-09-19 20:47:48] Training started
[2018-09-19 20:47:48] [memory] Reserving 237 MB, device gpu0
[2018-09-19 20:47:48] [memory] Reserving 237 MB, device gpu1
[2018-09-19 20:47:48] Loading model from model/model.npz
[2018-09-19 20:47:49] [memory] Reserving 237 MB, device cpu0
[2018-09-19 20:47:49] [memory] Reserving 118 MB, device gpu0
[2018-09-19 20:47:49] [memory] Reserving 118 MB, device gpu1
[2018-09-19 20:47:49] [memory] Reserving 237 MB, device gpu0
[2018-09-19 20:47:49] [memory] Reserving 237 MB, device gpu1
[2018-09-19 20:47:49] [memory] Reserving 118 MB, device gpu0
[2018-09-19 20:47:49] [memory] Reserving 118 MB, device gpu1
[2018-09-19 20:54:13] Ep. 1 : Up. 10500 : Sen. 2874986 : Cost 4.30 : Time 754.72s : 5910.11 words/s : L.r. 1.9688e-04
[2018-09-19 21:00:35] Ep. 1 : Up. 11000 : Sen. 3010777 : Cost 4.11 : Time 382.12s : 11597.02 words/s : L.r. 2.0625e-04
[2018-09-19 21:06:57] Ep. 1 : Up. 11500 : Sen. 3148646 : Cost 3.95 : Time 381.95s : 11534.43 words/s : L.r. 2.1563e-04
[2018-09-19 21:13:20] Ep. 1 : Up. 12000 : Sen. 3281766 : Cost 3.84 : Time 382.49s : 11587.70 words/s : L.r. 2.2500e-04
[2018-09-19 21:19:36] Ep. 1 : Up. 12500 : Sen. 3417524 : Cost 3.75 : Time 376.79s : 11559.61 words/s : L.r. 2.3438e-04
[2018-09-19 21:25:55] Ep. 1 : Up. 13000 : Sen. 3554128 : Cost 3.68 : Time 378.95s : 11500.20 words/s : L.r. 2.4375e-04
[2018-09-19 21:32:23] Ep. 1 : Up. 13500 : Sen. 3694291 : Cost 3.61 : Time 387.31s : 11723.47 words/s : L.r. 2.5313e-04
[2018-09-19 21:38:42] Ep. 1 : Up. 14000 : Sen. 3830735 : Cost 3.60 : Time 379.74s : 11483.31 words/s : L.r. 2.6250e-04
[2018-09-19 21:45:05] Ep. 1 : Up. 14500 : Sen. 3967136 : Cost 3.55 : Time 382.41s : 11608.55 words/s : L.r. 2.7188e-04
[2018-09-19 21:51:27] Ep. 1 : Up. 15000 : Sen. 4104151 : Cost 3.53 : Time 381.73s : 11533.49 words/s : L.r. 2.8125e-04
[2018-09-19 21:51:27] Saving model weights and runtime parameters to model/model.npz.orig.npz
[2018-09-19 21:51:28] Saving model weights and runtime parameters to model/model.iter15000.npz
[2018-09-19 21:51:29] Saving model weights and runtime parameters to model/model.npz
[2018-09-19 21:51:30] Saving Adam parameters to model/model.npz.optimizer.npz
[2018-09-19 21:51:38] [valid] Ep. 1 : Up. 15000 : ce-mean-words : 2.28124 : new best
[2018-09-19 21:51:41] [valid] Ep. 1 : Up. 15000 : perplexity : 9.78885 : new best
I use 2 Tesla M40 cards and I find that GPU utilization is 0. Could the problem be that there is not enough memory? The command in my shell script is as follows:
$MARIAN/build/marian \
--model model/model.npz --type transformer \
--train-sets data/corpus.bpe.en data/corpus.bpe.fr \
--max-length 100 \
--vocabs model/vocab.enfr.yml model/vocab.enfr.yml \
--mini-batch-fit -w 10000 --maxi-batch 1000 \
--early-stopping 10 --cost-type=ce-mean-words \
--valid-freq 5000 --save-freq 5000 --disp-freq 500 \
--valid-metrics ce-mean-words perplexity translation \
--valid-sets data/valid.bpe.en data/valid.bpe.fr \
--valid-script-path ./scripts/validate.sh \
--valid-translation-output data/valid.bpe.en.output --quiet-translation \
--valid-mini-batch 64 \
--beam-size 6 --normalize 0.6 \
--log model/train.log --valid-log model/valid.log \
--enc-depth 6 --dec-depth 6 \
--transformer-heads 8 \
--transformer-postprocess-emb d \
--transformer-postprocess dan \
--transformer-dropout 0.1 --label-smoothing 0.1 \
--learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report \
--optimizer-params 0.9 0.98 1e-09 --clip-norm 5 \
--tied-embeddings-all \
--devices $GPUS --sync-sgd --seed 1111 \
--exponential-smoothing
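For reference, GPU utilization can be sampled while training runs with a standard nvidia-smi query; the exact flags below are one illustrative option, not taken from my setup:

# print GPU index, utilization and memory use once per second
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 1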
My guess is that there is something wrong with your ./scripts/validate.sh. For instance, the script might wait for stdin or for an argument, blocking the validation.
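One way to check this (with the paths assumed from your command above) is to run the script by hand on the last translation output with stdin closed, so a blocking read fails immediately instead of hanging:

# if this hangs or prints nothing, the script is blocking on stdin or a missing argument
bash ./scripts/validate.sh data/valid.bpe.en.output < /dev/null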
Hi @snukky, here is my validation script:
#!/bin/bash
cat $1 \
| sed 's/\@\@ //g' \
| ../tools/moses-scripts/scripts/recaser/detruecase.perl 2>/dev/null \
| ../tools/moses-scripts/scripts/tokenizer/detokenizer.perl -l fr 2>/dev/null \
| ../tools/moses-scripts/scripts/generic/multi-bleu-detok.perl data/valid.fr \
| sed -r 's/BLEU = ([0-9.]+),.*/\1/'
@snukky I am very sorry that I haven't found the right way to format my code. I also tried changing the workspace to 8000, but it did not work.
Do data/valid.bpe.en.output and data/valid.fr look OK? How large is your validation set? This is a restarted training, right? Were the previous validation translations successful?
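A quick sanity check (file names assumed from your command) is to make sure the translation output is complete, i.e. that it has exactly as many lines as the reference:

# the two files should report the same number of lines
wc -l data/valid.bpe.en.output data/valid.fr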
@snukky Both data/valid.bpe.en.output and data/valid.fr look OK; there is less than 1M of data in the validation set. I tried both restarting training and continuing training, but it stopped after a validation every time.
Can you replace --valid-metrics ce-mean-words perplexity translation with --valid-metrics ce-mean-words perplexity bleu? This will use the internal BLEU scorer on the segmented data, so it will overestimate quality, but maybe we can exclude a few things if it works.
1M sentences is a huge validation set, so translation and postprocessing will take a while. Did you make sure that the output translation file contains all translated sentences? To debug, use the bleu metric as @emjotde suggested and/or use a small validation set and set, for instance, --valid-freq 100.
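For reference, only these two options of the training command would change, roughly like this (shown as they would appear inside the command):

--valid-metrics ce-mean-words perplexity bleu \
--valid-freq 100 \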
Is this 1M sentences or 1 MB of total size in bytes?
@snukky @emjotde Sorry, it is 1 MB of total size in bytes. I think the problem is in my shell script. I cannot access the training machine right now; I will clone this repo and train again tomorrow.
Hi @snukky @emjotde, I tried again and hit the same problem. I trained an En-Fr model from scratch and it stopped running after a validation. I ran the nvidia-smi command several times and the GPU utilization was always 0%; I am sure the process is stopped. I started it with:
./run-me.sh 0 1
And here are my scripts. run-me.sh:
#!/bin/bash -v

MARIAN=~/marian-dev/build

SRC=en
TRG=fr
ST=$SRC$TRG

# if we are in WSL, we need to add '.exe' to the tool names
if [ -e "/bin/wslpath" ]
then
    EXT=.exe
fi

MARIAN_TRAIN=$MARIAN/marian$EXT
MARIAN_DECODER=$MARIAN/marian-decoder$EXT
MARIAN_VOCAB=$MARIAN/marian-vocab$EXT
MARIAN_SCORER=$MARIAN/marian-scorer$EXT

# set chosen gpus
GPUS=0
if [ $# -ne 0 ]
then
    GPUS=$@
fi
echo Using GPUs: $GPUS

if [ ! -e $MARIAN_TRAIN ]
then
    echo "marian is not installed in $MARIAN, you need to compile the toolkit first"
    exit 1
fi

if [ ! -e ../tools/moses-scripts ] || [ ! -e ../tools/subword-nmt ] || [ ! -e ../tools/sacreBLEU ]
then
    echo "missing tools in ../tools, you need to download them first"
    exit 1
fi

if [ ! -e "data/corpus.$SRC" ]
then
    ./scripts/download-files.sh
fi

mkdir -p model

# preprocess data
if [ ! -e "data/corpus.bpe.$SRC" ]
then
    LC_ALL=C.UTF-8 ../tools/sacreBLEU/sacrebleu.py -t wmt13 -l $SRC-$TRG --echo src > data/valid.$SRC
    LC_ALL=C.UTF-8 ../tools/sacreBLEU/sacrebleu.py -t wmt13 -l $SRC-$TRG --echo ref > data/valid.$TRG
    LC_ALL=C.UTF-8 ../tools/sacreBLEU/sacrebleu.py -t wmt14 -l $SRC-$TRG --echo src > data/test2014.$SRC
    LC_ALL=C.UTF-8 ../tools/sacreBLEU/sacrebleu.py -t wmt15 -l $SRC-$TRG --echo src > data/test2015.$SRC
    LC_ALL=C.UTF-8 ../tools/sacreBLEU/sacrebleu.py -t wmt16 -l $SRC-$TRG --echo src > data/test2016.$SRC
    ./scripts/preprocess-data.sh
fi

# create common vocabulary
if [ ! -e "model/vocab.$ST.yml" ]
then
    cat data/corpus.bpe.$SRC data/corpus.bpe.$TRG | $MARIAN_VOCAB --max-size 36000 > model/vocab.$ST.yml
fi

# train model
if [ ! -e "model/model.npz" ]
then
    $MARIAN_TRAIN \
        --model model/model.npz --type transformer \
        --train-sets data/corpus.bpe.$SRC data/corpus.bpe.$TRG \
        --max-length 100 \
        --vocabs model/vocab.$ST.yml model/vocab.$ST.yml \
        --mini-batch-fit -w 10000 --maxi-batch 1000 \
        --early-stopping 10 --cost-type=ce-mean-words \
        --valid-freq 5000 --save-freq 5000 --disp-freq 500 \
        --valid-metrics ce-mean-words perplexity translation \
        --valid-sets data/valid.bpe.$SRC data/valid.bpe.$TRG \
        --valid-script-path "bash ./scripts/validate.sh" \
        --valid-translation-output data/valid.bpe.$SRC.output --quiet-translation \
        --valid-mini-batch 64 \
        --beam-size 6 --normalize 0.6 \
        --log model/train.log --valid-log model/valid.log \
        --enc-depth 6 --dec-depth 6 \
        --transformer-heads 8 \
        --transformer-postprocess-emb d \
        --transformer-postprocess dan \
        --transformer-dropout 0.1 --label-smoothing 0.1 \
        --learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report \
        --optimizer-params 0.9 0.98 1e-09 --clip-norm 5 \
        --tied-embeddings-all \
        --devices $GPUS --sync-sgd --seed 1111 \
        --exponential-smoothing
fi

# find best model on dev set
ITER=`cat model/valid.log | grep translation | sort -rg -k12,12 -t' ' | cut -f8 -d' ' | head -n1`

# translate test sets
for prefix in test2014 test2015 test2016
do
    cat data/$prefix.bpe.$SRC \
        | $MARIAN_DECODER -c model/model.npz.decoder.yml -m model/model.iter$ITER.npz -d $GPUS -b 12 -n -w 6000 \
        | sed 's/\@\@ //g' \
        | ../tools/moses-scripts/scripts/recaser/detruecase.perl \
        | ../tools/moses-scripts/scripts/tokenizer/detokenizer.perl -l $TRG \
        > data/$prefix.$TRG.output
done

# calculate bleu scores on test sets
LC_ALL=C.UTF-8 ../tools/sacreBLEU/sacrebleu.py -t wmt14 -l $SRC-$TRG < data/test2014.$TRG.output
LC_ALL=C.UTF-8 ../tools/sacreBLEU/sacrebleu.py -t wmt15 -l $SRC-$TRG < data/test2015.$TRG.output
LC_ALL=C.UTF-8 ../tools/sacreBLEU/sacrebleu.py -t wmt16 -l $SRC-$TRG < data/test2016.$TRG.output
download-files.sh:
#!/bin/bash -v
mkdir -p data
cd data
# get En-Fr training data
wget -nc http://www.statmt.org/europarl/v7/fr-en.tgz
wget -nc http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz
wget -nc http://data.statmt.org/wmt17/translation-task/training-parallel-nc-v12.tgz
# extract data
tar -xf fr-en.tgz
tar -xf training-parallel-commoncrawl.tgz
tar -xf training-parallel-nc-v12.tgz
# create corpus files
cat europarl-v7.fr-en.fr commoncrawl.fr-en.fr training/news-commentary-v12.fr-en.fr > corpus.fr
cat europarl-v7.fr-en.en commoncrawl.fr-en.en training/news-commentary-v12.fr-en.en > corpus.en
# clean
rm -r europarl-* commoncrawl.* training/ *.tgz
cd ..
preprocess-data.sh:
#!/bin/bash -v
# suffix of source language files
SRC=en
# suffix of target language files
TRG=fr
# number of merge operations
bpe_operations=32000
# path to moses decoder: https://github.com/moses-smt/mosesdecoder
mosesdecoder=../tools/moses-scripts
# path to subword segmentation scripts: https://github.com/rsennrich/subword-nmt
subword_nmt=../tools/subword-nmt
# tokenize
for prefix in corpus valid test2014 test2015 test2016
do
    cat data/$prefix.$SRC \
        | $mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l $SRC \
        | $mosesdecoder/scripts/tokenizer/tokenizer.perl -a -l $SRC > data/$prefix.tok.$SRC
    test -f data/$prefix.$TRG || continue
    cat data/$prefix.$TRG \
        | $mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l $TRG \
        | $mosesdecoder/scripts/tokenizer/tokenizer.perl -a -l $TRG > data/$prefix.tok.$TRG
done

# clean empty and long sentences, and sentences with high source-target ratio (training corpus only)
mv data/corpus.tok.$SRC data/corpus.tok.uncleaned.$SRC
mv data/corpus.tok.$TRG data/corpus.tok.uncleaned.$TRG
$mosesdecoder/scripts/training/clean-corpus-n.perl data/corpus.tok.uncleaned $SRC $TRG data/corpus.tok 1 100

# train truecaser
$mosesdecoder/scripts/recaser/train-truecaser.perl -corpus data/corpus.tok.$SRC -model model/tc.$SRC
$mosesdecoder/scripts/recaser/train-truecaser.perl -corpus data/corpus.tok.$TRG -model model/tc.$TRG

# apply truecaser (cleaned training corpus)
for prefix in corpus valid test2014 test2015 test2016
do
    $mosesdecoder/scripts/recaser/truecase.perl -model model/tc.$SRC < data/$prefix.tok.$SRC > data/$prefix.tc.$SRC
    test -f data/$prefix.tok.$TRG || continue
    $mosesdecoder/scripts/recaser/truecase.perl -model model/tc.$TRG < data/$prefix.tok.$TRG > data/$prefix.tc.$TRG
done

# train BPE
cat data/corpus.tc.$SRC data/corpus.tc.$TRG | $subword_nmt/learn_bpe.py -s $bpe_operations > model/$SRC$TRG.bpe

# apply BPE
for prefix in corpus valid test2014 test2015 test2016
do
    $subword_nmt/apply_bpe.py -c model/$SRC$TRG.bpe < data/$prefix.tc.$SRC > data/$prefix.bpe.$SRC
    test -f data/$prefix.tc.$TRG || continue
    $subword_nmt/apply_bpe.py -c model/$SRC$TRG.bpe < data/$prefix.tc.$TRG > data/$prefix.bpe.$TRG
done
validate.sh:
#!/bin/bash
cat $1 \
| sed 's/\@\@ //g' \
| ../tools/moses-scripts/scripts/recaser/detruecase.perl 2>/dev/null \
| ../tools/moses-scripts/scripts/tokenizer/detokenizer.perl -l fr 2>/dev/null \
| ../tools/moses-scripts/scripts/generic/multi-bleu-detok.perl data/valid.fr \
| sed -r 's/BLEU = ([0-9.]+),.*/\1/'
The marian command in marian-dev was compiled with the profile flag; could that be the problem?
@duduscript we dropped the ball on this one. Do you still have that problem?
I also observed what appeared to be a deadlock after validation, but in reality it was some weird race condition that seemed not to detect when the data-prefetch thread had produced data, and would keep kicking it off again and again. In my case, this led to either no log output for 4 hours (at which point our server farm killed it), or failing with std::bad_alloc.
This bug has since been fixed, but I don't yet see the change in the public master.
@duduscript, if the process keeps allocating memory while hanging in the deadlock, then with luck that problem should go away after the next merge from our internal master.
Didn't this only happen after we started kicking the hornets' nest that was prefetching? This is from before that time.
Hi @emjotde @frankseide, thank you. I worked around this problem with a shell script that kills and restarts the process whenever the hang happens. We use Marian for benchmark work and that work is finished; the problem does not seem to affect the inference results.
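For reference, a minimal sketch of such a watchdog (hypothetical: ./train.sh is assumed to wrap the marian training command shown above, and the stall detection via the log file's modification time is an illustration, not my exact script; the loop is stopped manually once training converges):

#!/bin/bash
# hypothetical watchdog: restart training whenever the log stops growing;
# marian continues from model/model.npz and its optimizer state on restart
while true
do
    ./train.sh &
    PID=$!
    while kill -0 $PID 2>/dev/null
    do
        sleep 600
        # if train.log has not grown for 10 minutes, assume a hang and restart
        if [ $(( $(date +%s) - $(stat -c %Y model/train.log) )) -gt 600 ]
        then
            kill $PID    # in practice you may need to kill the whole process group
        fi
    done
done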
@duduscript so just to clarify, it was something inside the shell script that caused the hang?
@frankseide There is enough memory; it seems that memory is not the problem.
@emjotde I think the problem is in my shell script, but I am not sure where.
No, the bug was that the prefetch mechanism would keep prefetching, adding more and more minibatches to the internal queue, but the foreground thread did not notice and kept telling the prefetch thread to keep going. Eventually, it would fill the entire available RAM and fail with an out-of-memory error. Anyway, if that was the problem, it will be solved soon, once we complete merging our latest internal master to the public one.
I think these might have been unrelated issues. Closing this then.
@frankseide Maybe you are right; I did not notice the RAM usage change. Thank you.