marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

deadlock in training #215

Closed duduscript closed 6 years ago

duduscript commented 6 years ago

I am training a transformer model on en-fr data. I have run it several times, and every time it seems to deadlock when it finishes a batch. The log is as follows:

[2018-09-19 20:47:48] Training started
[2018-09-19 20:47:48] [memory] Reserving 237 MB, device gpu0
[2018-09-19 20:47:48] [memory] Reserving 237 MB, device gpu1
[2018-09-19 20:47:48] Loading model from model/model.npz
[2018-09-19 20:47:49] [memory] Reserving 237 MB, device cpu0
[2018-09-19 20:47:49] [memory] Reserving 118 MB, device gpu0
[2018-09-19 20:47:49] [memory] Reserving 118 MB, device gpu1
[2018-09-19 20:47:49] [memory] Reserving 237 MB, device gpu0
[2018-09-19 20:47:49] [memory] Reserving 237 MB, device gpu1
[2018-09-19 20:47:49] [memory] Reserving 118 MB, device gpu0
[2018-09-19 20:47:49] [memory] Reserving 118 MB, device gpu1
[2018-09-19 20:54:13] Ep. 1 : Up. 10500 : Sen. 2874986 : Cost 4.30 : Time 754.72s : 5910.11 words/s : L.r. 1.9688e-04
[2018-09-19 21:00:35] Ep. 1 : Up. 11000 : Sen. 3010777 : Cost 4.11 : Time 382.12s : 11597.02 words/s : L.r. 2.0625e-04
[2018-09-19 21:06:57] Ep. 1 : Up. 11500 : Sen. 3148646 : Cost 3.95 : Time 381.95s : 11534.43 words/s : L.r. 2.1563e-04
[2018-09-19 21:13:20] Ep. 1 : Up. 12000 : Sen. 3281766 : Cost 3.84 : Time 382.49s : 11587.70 words/s : L.r. 2.2500e-04
[2018-09-19 21:19:36] Ep. 1 : Up. 12500 : Sen. 3417524 : Cost 3.75 : Time 376.79s : 11559.61 words/s : L.r. 2.3438e-04
[2018-09-19 21:25:55] Ep. 1 : Up. 13000 : Sen. 3554128 : Cost 3.68 : Time 378.95s : 11500.20 words/s : L.r. 2.4375e-04
[2018-09-19 21:32:23] Ep. 1 : Up. 13500 : Sen. 3694291 : Cost 3.61 : Time 387.31s : 11723.47 words/s : L.r. 2.5313e-04
[2018-09-19 21:38:42] Ep. 1 : Up. 14000 : Sen. 3830735 : Cost 3.60 : Time 379.74s : 11483.31 words/s : L.r. 2.6250e-04
[2018-09-19 21:45:05] Ep. 1 : Up. 14500 : Sen. 3967136 : Cost 3.55 : Time 382.41s : 11608.55 words/s : L.r. 2.7188e-04
[2018-09-19 21:51:27] Ep. 1 : Up. 15000 : Sen. 4104151 : Cost 3.53 : Time 381.73s : 11533.49 words/s : L.r. 2.8125e-04
[2018-09-19 21:51:27] Saving model weights and runtime parameters to model/model.npz.orig.npz
[2018-09-19 21:51:28] Saving model weights and runtime parameters to model/model.iter15000.npz
[2018-09-19 21:51:29] Saving model weights and runtime parameters to model/model.npz
[2018-09-19 21:51:30] Saving Adam parameters to model/model.npz.optimizer.npz
[2018-09-19 21:51:38] [valid] Ep. 1 : Up. 15000 : ce-mean-words : 2.28124 : new best
[2018-09-19 21:51:41] [valid] Ep. 1 : Up. 15000 : perplexity : 9.78885 : new best

duduscript commented 6 years ago

I use 2 Tesla M40 cards and I see that GPU utilization is 0. Could the problem be that there is not enough memory?
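One way to watch GPU utilization and memory side by side while training runs, using standard nvidia-smi query options (just a monitoring suggestion, not part of the original command):

nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv -l 5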

The command in the shell script is as follows:

$MARIAN/build/marian \
    --model model/model.npz --type transformer \
    --train-sets data/corpus.bpe.en data/corpus.bpe.fr \
    --max-length 100 \
    --vocabs model/vocab.enfr.yml model/vocab.enfr.yml \
    --mini-batch-fit -w 10000 --maxi-batch 1000 \
    --early-stopping 10 --cost-type=ce-mean-words \
    --valid-freq 5000 --save-freq 5000 --disp-freq 500 \
    --valid-metrics ce-mean-words perplexity translation \
    --valid-sets data/valid.bpe.en data/valid.bpe.fr \
    --valid-script-path ./scripts/validate.sh \
    --valid-translation-output data/valid.bpe.en.output --quiet-translation \
    --valid-mini-batch 64 \
    --beam-size 6 --normalize 0.6 \
    --log model/train.log --valid-log model/valid.log \
    --enc-depth 6 --dec-depth 6 \
    --transformer-heads 8 \
    --transformer-postprocess-emb d \
    --transformer-postprocess dan \
    --transformer-dropout 0.1 --label-smoothing 0.1 \
    --learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report \
    --optimizer-params 0.9 0.98 1e-09 --clip-norm 5 \
    --tied-embeddings-all \
    --devices $GPUS --sync-sgd --seed 1111 \
    --exponential-smoothing
snukky commented 6 years ago

My guess is that there is something wrong with your ./scripts/validate.sh. For instance, the script might be waiting for stdin or for an argument, which would block the validation.
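A quick way to check that (paths taken from the training command above; just a debugging sketch) is to run the script by hand on the existing translation output and see whether it prints a score or blocks:

# run the validation script manually on the last translation output
bash ./scripts/validate.sh data/valid.bpe.en.output
# if it hangs, retry with an empty stdin to rule out a read from the terminal
bash ./scripts/validate.sh data/valid.bpe.en.output < /dev/null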

duduscript commented 6 years ago

Hi @snukky, here is my validation script:

#!/bin/bash
cat $1 \
    | sed 's/\@\@ //g' \
    | ../tools/moses-scripts/scripts/recaser/detruecase.perl 2>/dev/null \
    | ../tools/moses-scripts/scripts/tokenizer/detokenizer.perl -l fr 2>/dev/null \
    | ../tools/moses-scripts/scripts/generic/multi-bleu-detok.perl data/valid.fr \
    | sed -r 's/BLEU = ([0-9.]+),.*/\1/'
duduscript commented 6 years ago

@snukky I am very sorry that I haven't found the right way to format my code.

duduscript commented 6 years ago

I have tried changing the workspace to 8000, but that did not work.

snukky commented 6 years ago

Do data/valid.bpe.en.output and data/valid.fr look OK? How large is your validation set?

This is a restarted training, right? The previous validation translations were successful?

duduscript commented 6 years ago

@snukky Both data/valid.bpe.en.output and data/valid.fr look OK, and there is less than 1M of data in the validation set. I tried both restarting the training and continuing it, but it stopped after a validation every time.

emjotde commented 6 years ago

Can you replace --valid-metrics ce-mean-words perplexity translation with --valid-metrics ce-mean-words perplexity bleu ?

This will use the internal BLEU scorer on the segmented data, so it will overestimate quality, but maybe we can rule out a few things if it works.

snukky commented 6 years ago

1M sentences is a huge validation set, so translation and postprocessing will take a while. Did you make sure that the output translation file contains all translated sentences? To debug, use the bleu metric as @emjotde suggested and/or use a small validation set and set, for instance, --valid-freq 100.
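For the small validation set, something like this would do (file names here are only an example, not from the thread):

# build a small validation set from the first 200 lines of the existing files
head -n 200 data/valid.bpe.en > data/valid.small.bpe.en
head -n 200 data/valid.bpe.fr > data/valid.small.bpe.fr
head -n 200 data/valid.fr     > data/valid.small.fr
# then point --valid-sets (and the reference in validate.sh) at the small files
# and train with --valid-freq 100 to reach validation quickly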

emjotde commented 6 years ago

Is this 1M sentences or 1MB of total size in bytes?

duduscript commented 6 years ago

@snukky @emjotde Sorry, it is 1MB of total size in bytes. I think the problem is with my shell script. I cannot access the training machine right now; I will clone this repo and train again tomorrow.

duduscript commented 6 years ago

Hi @snukky @emjotde, I tried again and ran into the same problem. I trained an en-fr model from scratch and it stops running after a validation. I ran the nvidia-smi command several times and the GPU utilization is always 0%; I am sure the process has stopped. I start training with:

./run-me.sh 0 1

And here are my scripts. run-me.sh:

#!/bin/bash -v

MARIAN=~/marian-dev/build
SRC=en
TRG=fr
ST=$SRC$TRG

# if we are in WSL, we need to add '.exe' to the tool names
if [ -e "/bin/wslpath" ]
then
    EXT=.exe
fi

MARIAN_TRAIN=$MARIAN/marian$EXT
MARIAN_DECODER=$MARIAN/marian-decoder$EXT
MARIAN_VOCAB=$MARIAN/marian-vocab$EXT
MARIAN_SCORER=$MARIAN/marian-scorer$EXT

# set chosen gpus
GPUS=0
if [ $# -ne 0 ]
then
    GPUS=$@
fi
echo Using GPUs: $GPUS

if [ ! -e $MARIAN_TRAIN ]
then
    echo "marian is not installed in $MARIAN, you need to compile the toolkit first"
    exit 1
fi

if [ ! -e ../tools/moses-scripts ] || [ ! -e ../tools/subword-nmt ] || [ ! -e ../tools/sacreBLEU ]
then
    echo "missing tools in ../tools, you need to download them first"
    exit 1
fi

if [ ! -e "data/corpus.$SRC" ]
then
    ./scripts/download-files.sh
fi

mkdir -p model

# preprocess data
if [ ! -e "data/corpus.bpe.$SRC" ]
then
    LC_ALL=C.UTF-8 ../tools/sacreBLEU/sacrebleu.py -t wmt13 -l $SRC-$TRG --echo src > data/valid.$SRC
    LC_ALL=C.UTF-8 ../tools/sacreBLEU/sacrebleu.py -t wmt13 -l $SRC-$TRG --echo ref > data/valid.$TRG

    LC_ALL=C.UTF-8 ../tools/sacreBLEU/sacrebleu.py -t wmt14 -l $SRC-$TRG --echo src > data/test2014.$SRC
    LC_ALL=C.UTF-8 ../tools/sacreBLEU/sacrebleu.py -t wmt15 -l $SRC-$TRG --echo src > data/test2015.$SRC
    LC_ALL=C.UTF-8 ../tools/sacreBLEU/sacrebleu.py -t wmt16 -l $SRC-$TRG --echo src > data/test2016.$SRC

    ./scripts/preprocess-data.sh
fi

# create common vocabulary
if [ ! -e "model/vocab.$ST.yml" ]
then
    cat data/corpus.bpe.$SRC data/corpus.bpe.$TRG | $MARIAN_VOCAB --max-size 36000 > model/vocab.$ST.yml
fi

# train model
if [ ! -e "model/model.npz" ]
then
    $MARIAN_TRAIN \
        --model model/model.npz --type transformer \
        --train-sets data/corpus.bpe.$SRC data/corpus.bpe.$TRG \
        --max-length 100 \
        --vocabs model/vocab.$ST.yml model/vocab.$ST.yml \
        --mini-batch-fit -w 10000 --maxi-batch 1000 \
        --early-stopping 10 --cost-type=ce-mean-words \
        --valid-freq 5000 --save-freq 5000 --disp-freq 500 \
        --valid-metrics ce-mean-words perplexity translation \
        --valid-sets data/valid.bpe.$SRC data/valid.bpe.$TRG \
        --valid-script-path "bash ./scripts/validate.sh" \
        --valid-translation-output data/valid.bpe.$SRC.output --quiet-translation \
        --valid-mini-batch 64 \
        --beam-size 6 --normalize 0.6 \
        --log model/train.log --valid-log model/valid.log \
        --enc-depth 6 --dec-depth 6 \
        --transformer-heads 8 \
        --transformer-postprocess-emb d \
        --transformer-postprocess dan \
        --transformer-dropout 0.1 --label-smoothing 0.1 \
        --learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report \
        --optimizer-params 0.9 0.98 1e-09 --clip-norm 5 \
        --tied-embeddings-all \
        --devices $GPUS --sync-sgd --seed 1111 \
        --exponential-smoothing
fi

# find best model on dev set
ITER=`cat model/valid.log | grep translation | sort -rg -k12,12 -t' ' | cut -f8 -d' ' | head -n1`

# translate test sets
for prefix in test2014 test2015 test2016
do
    cat data/$prefix.bpe.$SRC \
        | $MARIAN_DECODER -c model/model.npz.decoder.yml -m model/model.iter$ITER.npz -d $GPUS -b 12 -n -w 6000 \
        | sed 's/\@\@ //g' \
        | ../tools/moses-scripts/scripts/recaser/detruecase.perl \
        | ../tools/moses-scripts/scripts/tokenizer/detokenizer.perl -l $TRG \
        > data/$prefix.$TRG.output
done

# calculate bleu scores on test sets
LC_ALL=C.UTF-8 ../tools/sacreBLEU/sacrebleu.py -t wmt14 -l $SRC-$TRG < data/test2014.$TRG.output
LC_ALL=C.UTF-8 ../tools/sacreBLEU/sacrebleu.py -t wmt15 -l $SRC-$TRG < data/test2015.$TRG.output
LC_ALL=C.UTF-8 ../tools/sacreBLEU/sacrebleu.py -t wmt16 -l $SRC-$TRG < data/test2016.$TRG.output

download-files.sh:

#!/bin/bash -v

mkdir -p data
cd data

# get En-Fr training data
wget -nc http://www.statmt.org/europarl/v7/fr-en.tgz
wget -nc http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz
wget -nc http://data.statmt.org/wmt17/translation-task/training-parallel-nc-v12.tgz

# extract data
tar -xf fr-en.tgz
tar -xf training-parallel-commoncrawl.tgz
tar -xf training-parallel-nc-v12.tgz

# create corpus files
cat europarl-v7.fr-en.fr commoncrawl.fr-en.fr training/news-commentary-v12.fr-en.fr > corpus.fr
cat europarl-v7.fr-en.en commoncrawl.fr-en.en training/news-commentary-v12.fr-en.en > corpus.en

# clean
rm -r europarl-* commoncrawl.* training/ *.tgz

cd ..

preprocess-data.sh:

#!/bin/bash -v

# suffix of source language files
SRC=en

# suffix of target language files
TRG=fr

# number of merge operations
bpe_operations=32000

# path to moses decoder: https://github.com/moses-smt/mosesdecoder
mosesdecoder=../tools/moses-scripts

# path to subword segmentation scripts: https://github.com/rsennrich/subword-nmt
subword_nmt=../tools/subword-nmt

# tokenize
for prefix in corpus valid test2014 test2015 test2016
do
    cat data/$prefix.$SRC \
        | $mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l $SRC \
        | $mosesdecoder/scripts/tokenizer/tokenizer.perl -a -l $SRC > data/$prefix.tok.$SRC

    test -f data/$prefix.$TRG || continue

    cat data/$prefix.$TRG \
        | $mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l $TRG \
        | $mosesdecoder/scripts/tokenizer/tokenizer.perl -a -l $TRG > data/$prefix.tok.$TRG
done

# clean empty and long sentences, and sentences with high source-target ratio (training corpus only)
mv data/corpus.tok.$SRC data/corpus.tok.uncleaned.$SRC
mv data/corpus.tok.$TRG data/corpus.tok.uncleaned.$TRG
$mosesdecoder/scripts/training/clean-corpus-n.perl data/corpus.tok.uncleaned $SRC $TRG data/corpus.tok 1 100

# train truecaser
$mosesdecoder/scripts/recaser/train-truecaser.perl -corpus data/corpus.tok.$SRC -model model/tc.$SRC
$mosesdecoder/scripts/recaser/train-truecaser.perl -corpus data/corpus.tok.$TRG -model model/tc.$TRG

# apply truecaser (cleaned training corpus)
for prefix in corpus valid test2014 test2015 test2016
do
    $mosesdecoder/scripts/recaser/truecase.perl -model model/tc.$SRC < data/$prefix.tok.$SRC > data/$prefix.tc.$SRC
    test -f data/$prefix.tok.$TRG || continue
    $mosesdecoder/scripts/recaser/truecase.perl -model model/tc.$TRG < data/$prefix.tok.$TRG > data/$prefix.tc.$TRG
done

# train BPE
cat data/corpus.tc.$SRC data/corpus.tc.$TRG | $subword_nmt/learn_bpe.py -s $bpe_operations > model/$SRC$TRG.bpe

# apply BPE
for prefix in corpus valid test2014 test2015 test2016
do
    $subword_nmt/apply_bpe.py -c model/$SRC$TRG.bpe < data/$prefix.tc.$SRC > data/$prefix.bpe.$SRC
    test -f data/$prefix.tc.$TRG || continue
    $subword_nmt/apply_bpe.py -c model/$SRC$TRG.bpe < data/$prefix.tc.$TRG > data/$prefix.bpe.$TRG
done

validate.sh:

#!/bin/bash

cat $1 \
    | sed 's/\@\@ //g' \
    | ../tools/moses-scripts/scripts/recaser/detruecase.perl 2>/dev/null \
    | ../tools/moses-scripts/scripts/tokenizer/detokenizer.perl -l fr 2>/dev/null \
    | ../tools/moses-scripts/scripts/generic/multi-bleu-detok.perl data/valid.fr \
    | sed -r 's/BLEU = ([0-9.]+),.*/\1/'
duduscript commented 6 years ago

The marian command in marian-dev is compiled with the profile flag. Could that be the problem?

emjotde commented 6 years ago

@duduscript we dropped the ball on this one. Do you still have that problem?

frankseide commented 6 years ago

I also observed what appeared to be a deadlock after validation, but in reality it was some weird race condition that seemed to not detect when the data-prefetch thread had produced data and would keep kicking it off over and over again. In my case, this led either to no log output for 4 hours (at which point our server farm killed the job) or to failing with std::bad_alloc.

This bug has since been fixed, but I don't yet see the change in the public master.

@duduscript, if the process keeps allocating memory while it is hanging in the deadlock, then with luck that problem should go away after the next merge from our internal master.
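One way to check whether the hanging process keeps allocating memory (a hypothetical monitoring sketch, not something from this thread):

# sample the resident memory of the hanging marian process once a minute;
# a steady climb in RSS would point to the runaway-prefetch bug
PID=$(pgrep -f marian | head -n1)
while kill -0 "$PID" 2>/dev/null
do
    ps -o rss= -p "$PID"
    sleep 60
done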

emjotde commented 6 years ago

Didn't this only happen after we started kicking the hornets' nest that was prefetching? This is from before that time.

duduscript commented 6 years ago

Hi @emjotde @frankseide, thank you. I worked around this problem with a shell script that kills and restarts the process whenever the hang happens. We use marian for benchmarking work, and that work is now finished. The problem does not seem to affect the inference results.
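A minimal sketch of such a kill-and-restart watchdog (hypothetical; the poster's actual script is not shown), assuming TRAIN_CMD holds the full marian training command and that marian resumes from the last checkpoint in model/ when restarted:

#!/bin/bash
TRAIN_CMD="./scripts/train.sh"   # placeholder for the full marian training command

while true
do
    $TRAIN_CMD &
    PID=$!

    # watchdog: kill the run if model/train.log goes stale for 30 minutes
    (
        while sleep 300
        do
            if [ -n "$(find model/train.log -mmin +30 2>/dev/null)" ]
            then
                kill "$PID"
                exit 0
            fi
        done
    ) &
    WATCHDOG=$!

    wait "$PID"                       # blocks until training exits or is killed
    STATUS=$?
    kill "$WATCHDOG" 2>/dev/null
    [ "$STATUS" -eq 0 ] && break      # clean exit: training finished, stop restarting
done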

emjotde commented 6 years ago

@duduscript so just to clarify, it was something inside the shell script that caused the hang?

duduscript commented 6 years ago

@frankseide There is enough memory; it seems that memory is not the problem.

duduscript commented 6 years ago

@emjotde I think the problem is in the shell script, but I am not sure where.

frankseide commented 6 years ago

No, the bug was that the prefetch mechanism would keep prefetching, adding more and more minibatches to the internal queue, but the foreground thread did not notice and kept telling the prefetch thread to keep going. Eventually, it would fill all available RAM and fail with an out-of-memory error. Anyway, if that was the problem, it will be solved soon, once we finish merging our latest internal master into the public one.

emjotde commented 6 years ago

I think these might have been unrelated issues. Closing this then.

duduscript commented 6 years ago

@frankseide Maybe you are right; I did not notice whether the RAM usage changed. Thank you.