marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

Pre-layer_normalize with deep depth model is not working in current version #652

Open lkfo415579 opened 4 years ago

lkfo415579 commented 4 years ago

Bug description

I usually used pre-layer normalization to train deep-depth transformers in an earlier version (v1.7.6), and it worked.

But when I switched to the newest version of Marian, the loss increases from about 10 up to 50000 at the beginning of training; it looks like a gradient explosion (somehow?). I looked at the code handling transformer-preprocess and it seems fine, so I don't know why this is happening.

Even if I keep training until 50000 steps, the loss just keeps growing larger and larger (like 10 times larger?).
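For context, the difference between the two residual wirings can be sketched as follows. This is a minimal NumPy sketch, not Marian's actual implementation; in Marian's flag notation, pre-norm corresponds to `--transformer-preprocess n --transformer-postprocess da`, while the default post-norm setup is `--transformer-postprocess dan`:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize the last dimension to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def post_norm_block(x, sublayer):
    # postprocess "dan": dropout (omitted here), add residual, then normalize
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # preprocess "n", postprocess "da": normalize first, then add residual
    return x + sublayer(layer_norm(x))
```

With pre-norm the identity path from input to output is never normalized away, which is the usual reason it scales more stably to 20+ encoder layers, so a divergence with this wiring in a newer version does look like a regression rather than expected behavior.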

How to reproduce

    $MARIAN_TRAIN \
        --model $MODEL_DIR/model_revo.npz \
        --train-sets $CORPUS_DIR/$TRAIN.$SRCL $CORPUS_DIR/$TRAIN.$TGTL \
        --max-length 140 \
        --vocabs $MODEL_DIR/vocab.$SRCL.yml $MODEL_DIR/vocab.$TGTL.yml \
        --maxi-batch 1000 --mini-batch 64  -w 9000 --max-length-crop \
        --early-stopping 10 --cost-type=ce-mean-words \
        --valid-freq 2500 --save-freq 2500 --disp-freq 1 \
        --valid-metrics ce-mean-words perplexity translation \
        --valid-sets $CORPUS_DIR/$VALID.$SRCL $CORPUS_DIR/$VALID.$TGTL \
        --valid-script-path "bash ./validate-"$SRCL\-$TGTL".sh" \
        --valid-translation-output $OUTPUT_DIR/$MODEL_NAME.tf.$SRCL$TGTL.single --quiet-translation \
        --valid-mini-batch 30 \
        --beam-size 6 --normalize 1.0 \
        --log $MODEL_DIR/train.log --valid-log $MODEL_DIR/valid.log \
        --task transformer-base \
        --devices $GPUS --seed $ID$ID$ID$ID --keep-best --overwrite \
        --enc-depth 20 --dec-depth 6 --transformer-preprocess n --transformer-postprocess da \
        --transformer-postprocess-emb d \
        --disp-label-counts --tied-embeddings-all false --clip-norm 5


emjotde commented 4 years ago

You mean this --transformer-preprocess n --transformer-postprocess da, right?

I will take a look. Not really aware of any changes here. Is this happening for all data-sets? Can you verify it is fine with the exact same settings and data for the older version?
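For readers following along: in these flags, each character names one operation applied in order around a sub-layer (d = dropout, a = add the residual input, n = layer normalization). A hedged interpreter sketch of that convention (`apply_ops` is a hypothetical helper for illustration, not Marian's API):

```python
import numpy as np

def apply_ops(ops, x, residual, eps=1e-6):
    """Apply each op character in `ops` left to right to `x`."""
    for op in ops:
        if op == "d":      # dropout; skipped here so the sketch is deterministic
            pass
        elif op == "a":    # add the skip connection
            x = x + residual
        elif op == "n":    # layer-normalize over the model dimension
            mu = x.mean(-1, keepdims=True)
            x = (x - mu) / (x.std(-1, keepdims=True) + eps)
        else:
            raise ValueError(f"unknown op: {op}")
    return x
```

Under this reading, `--transformer-preprocess n --transformer-postprocess da` means `apply_ops("n", x, x)` before the sub-layer and `apply_ops("da", out, x)` after it.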

lkfo415579 commented 4 years ago

@emjotde Yes, it is happening for all data sets (WMT17 zh-en, WMT14 de-en, CCMT2020 zh-en). Here is my older version's run.me script and its training log.

P.S.: I also tried the depth-scaling method, and it trains a 20-depth model stably. But my friend said pre-normalization can achieve better BLEU, which is why I tried these experiments.

    $MARIAN_TRAIN \
        --model $MODEL_DIR/model_revo.npz --type transformer \
        --train-sets $CORPUS_DIR/$TRAIN.$SRCL $CORPUS_DIR/$TRAIN.$TGTL \
        --max-length 140 \
        --vocabs $MODEL_DIR/vocab.$SRCL.yml $MODEL_DIR/vocab.$TGTL.yml \
        --mini-batch-fit -w 9050 --maxi-batch 20000 \
        --early-stopping 10 --cost-type=ce-mean-words \
        --valid-freq 2500 --save-freq 2500 --disp-freq 1 \
        --valid-metrics ce-mean-words perplexity translation \
        --valid-sets $CORPUS_DIR/$VALID.$SRCL $CORPUS_DIR/$VALID.$TGTL \
        --valid-script-path "bash ./validate-"$SRCL\-$TGTL".sh" \
        --valid-translation-output $OUTPUT_DIR/$MODEL_NAME.tf.$SRCL$TGTL.single --quiet-translation \
        --valid-mini-batch 30 \
        --beam-size 6 --normalize 1.0 \
        --log $MODEL_DIR/train.log --valid-log $MODEL_DIR/valid.log \
        --enc-depth 20 --dec-depth 6 \
        --transformer-heads 8 \
        --transformer-postprocess-emb d \
        --transformer-preprocess n \
        --transformer-postprocess da \
        --transformer-dropout 0.1 --transformer-dropout-attention 0 --label-smoothing 0.1 \
        --learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report \
        --optimizer-params 0.9 0.98 1e-09 \
        --devices $GPUS --sync-sgd --seed $ID$ID$ID$ID --keep-best --overwrite \
        --exponential-smoothing --after-batches 150000 --print_mod --disp-label-counts \
        --update_cycle 1

# Notes: update_cycle is the same as --optimizer-delay, and print_mod is just a custom function that prints the norm of the source embedding.
[2020-05-11 12:57:18] Ep. 1 : Up. 1 : Sen. 744 : Cost 11.12840176 * 26,040 after 26,040 : Time 38.71s : 672.72 words/s : L.r. 1.8750e-08
[2020-05-11 12:57:18] Ep. 1 : Up. 2 : Sen. 1,096 : Cost 11.10829163 * 14,432 after 40,472 : Time 0.55s : 26041.43 words/s : L.r. 3.7500e-08
[2020-05-11 12:57:19] Ep. 1 : Up. 3 : Sen. 2,648 : Cost 11.17746735 * 23,280 after 63,752 : Time 0.66s : 35126.33 words/s : L.r. 5.6250e-08
[2020-05-11 12:57:20] Ep. 1 : Up. 4 : Sen. 4,200 : Cost 11.20205307 * 27,936 after 91,688 : Time 0.72s : 38565.99 words/s : L.r. 7.5000e-08
[2020-05-11 12:57:20] Ep. 1 : Up. 5 : Sen. 4,530 : Cost 11.18721485 * 11,880 after 103,568 : Time 0.41s : 29324.50 words/s : L.r. 9.3750e-08
[2020-05-11 12:57:21] Ep. 1 : Up. 6 : Sen. 4,882 : Cost 11.12029457 * 18,355 after 121,923 : Time 0.60s : 30588.81 words/s : L.r. 1.1250e-07
[2020-05-11 12:57:21] Ep. 1 : Up. 7 : Sen. 5,626 : Cost 11.15162086 * 11,160 after 133,083 : Time 0.58s : 19111.01 words/s : L.r. 1.3125e-07
[2020-05-11 12:57:22] Ep. 1 : Up. 8 : Sen. 6,270 : Cost 11.18948174 * 14,812 after 147,895 : Time 0.46s : 32348.38 words/s : L.r. 1.5000e-07
[2020-05-11 12:57:22] Ep. 1 : Up. 9 : Sen. 7,014 : Cost 11.12756062 * 28,272 after 176,167 : Time 0.69s : 41120.13 words/s : L.r. 1.6875e-07
[2020-05-11 12:57:23] Ep. 1 : Up. 10 : Sen. 7,366 : Cost 11.11898613 * 17,142 after 193,309 : Time 0.60s : 28808.16 words/s : L.r. 1.8750e-07

train.log of CCMT2020 zh-en using the older version
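As a side note on the update_cycle / --optimizer-delay equivalence mentioned above: both mean accumulating gradients over N mini-batches before taking a single optimizer step, which emulates an N-times-larger effective batch. A minimal SGD sketch of that behavior (plain NumPy, not Marian code):

```python
import numpy as np

def train_with_delay(param, grads, delay, lr=0.1):
    """Apply one SGD update per `delay` accumulated mini-batch gradients."""
    accum = np.zeros_like(param)
    for i, g in enumerate(grads, start=1):
        accum += g
        if i % delay == 0:                 # time for an optimizer step
            param = param - lr * accum / delay
            accum = np.zeros_like(param)   # reset for the next cycle
    return param
```

With delay=1 this is plain SGD; larger values trade update frequency for a smoother, larger-batch gradient estimate.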