marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

NaN/Inf percentage 0.50 in 10 gradient updates, but cost-scaling factor 7.62939e-06 is already at minimum #386

Open · nicolabertoldi opened this issue 2 years ago

nicolabertoldi commented 2 years ago

Bug description

After quite a lot of updates, the training cost becomes nan and the NaN/Inf warning below starts appearing:

[2022-05-09 06:43:30] [valid] Ep. 8 : Up. 535000 : translation : 49.5 : stalled 14 times (last best: 49.7)
[2022-05-09 06:45:23] Ep. 8 : Up. 535500 : Sen. 94,495,934 : Cost nan : Time 149.83s : 144461.61 words/s : gNorm 0.6068 : L.r. 5.1856e-05
[2022-05-09 06:47:14] Ep. 8 : Up. 536000 : Sen. 95,835,842 : Cost nan : Time 111.05s : 188680.04 words/s : gNorm 0.6060 : L.r. 5.1832e-05
[2022-05-09 06:49:06] Ep. 8 : Up. 536500 : Sen. 97,210,303 : Cost nan : Time 111.42s : 190628.95 words/s : gNorm 1.4893 : L.r. 5.1808e-05
[2022-05-09 06:50:57] Ep. 8 : Up. 537000 : Sen. 98,569,133 : Cost 2.46588373 : Time 111.03s : 191311.12 words/s : gNorm 4.6161 : L.r. 5.1784e-05
[2022-05-09 06:52:48] Ep. 8 : Up. 537500 : Sen. 99,951,273 : Cost 2.53674269 : Time 111.48s : 192168.72 words/s : gNorm 3.4782 : L.r. 5.1760e-05
[2022-05-09 06:54:39] Ep. 8 : Up. 538000 : Sen. 101,269,034 : Cost nan : Time 110.79s : 190254.29 words/s : gNorm 4.0652 : L.r. 5.1736e-05
[2022-05-09 06:55:01] NaN/Inf percentage 0.70 in 10 gradient updates, but cost-scaling factor 7.62939e-06 is already at minimum
[2022-05-09 06:55:03] NaN/Inf percentage 0.50 in 10 gradient updates, but cost-scaling factor 7.62939e-06 is already at minimum
[2022-05-09 06:55:05] NaN/Inf percentage 0.50 in 10 gradient updates, but cost-scaling factor 7.62939e-06 is already at minimum
[2022-05-09 06:55:08] NaN/Inf percentage 0.80 in 10 gradient updates, but cost-scaling factor 7.62939e-06 is already at minimum
[2022-05-09 06:55:10] NaN/Inf percentage 0.73 in 11 gradient updates, but cost-scaling factor 7.62939e-06 is already at minimum
[2022-05-09 06:55:12] NaN/Inf percentage 0.60 in 10 gradient updates, but cost-scaling factor 7.62939e-06 is already at minimum
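
For readers hitting the same message: mixed-precision trainers like Marian keep a dynamic cost-scaling (loss-scaling) factor that is multiplied into the loss before backpropagation; when a gradient update produces NaN/Inf, the factor is shrunk and the update skipped. The warning above fires when the factor has hit its configured floor and can shrink no further. Notably, 7.62939e-06 is exactly 2^-17, i.e. the result of repeated halving. Below is a minimal C++ sketch of the general technique, not Marian's actual implementation; the names, window size, and minimum are illustrative:

#include <cmath>
#include <cstdio>
#include <deque>

// Minimal sketch of dynamic cost (loss) scaling for mixed-precision
// training. Names and constants are hypothetical, NOT Marian's internals.
struct CostScaler {
  double factor  = 1.0;           // multiplied into the loss before backprop
  double minimum = 7.62939e-06;   // the floor reported in the log above (2^-17)
  size_t window  = 10;            // how many recent updates to consider
  std::deque<bool> nonFinite;     // true = that update had NaN/Inf gradients

  // Called once per gradient update.
  void update(bool gradHasNanInf) {
    nonFinite.push_back(gradHasNanInf);
    if(nonFinite.size() > window)
      nonFinite.pop_front();

    if(!gradHasNanInf)
      return;

    size_t bad = 0;
    for(bool b : nonFinite)
      bad += b ? 1 : 0;
    double pct = (double)bad / (double)nonFinite.size();

    if(factor / 2.0 >= minimum) {
      factor /= 2.0;  // shrink the scale and skip the poisoned update
    } else {
      // Mirrors the warning in the log: the factor cannot shrink further,
      // so the NaN/Inf gradients are no longer a scaling problem.
      std::printf("NaN/Inf percentage %.2f in %zu gradient updates, but "
                  "cost-scaling factor %g is already at minimum\n",
                  pct, nonFinite.size(), factor);
    }
    // A real implementation also grows the factor again after a stretch of
    // clean updates; omitted here for brevity.
  }
};

int main() {
  // The reported factor is exactly 2^-17:
  std::printf("%g\n", std::ldexp(1.0, -17));  // prints 7.62939e-06
}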

How to reproduce

The behavior can be reproduced with the following training command:

marian \
  --tempdir ./train_standard_fp16/marian_temp \
  --log ./train_standard_fp16/en__it/train_dir/train.log \
  --valid-log train_standard_fp16/valid.log \
  --model model/model.npz \
  --train-sets data/train.sl data/train.tl \
  --vocabs data/model.spv data/model.spv \
  --valid-sets ./data/dev.sl ./data/dev.tl \
  --valid-script-path ./validate.sh \
  --valid-translation-output dev.bpe.out \
  --valid-metrics ce-mean-words perplexity \
  --shuffle-in-ram --quiet-translation --keep-best \
  --devices 0 1 2 3 4 5 6 7 --fp16 \
  --save-from 20000 --valid-from 20000 \
  --type transformer --task transformer-big \
  --layer-normalization --label-smoothing 0.1 \
  --max-length 512 --mini-batch-fit --mini-batch 1000 --maxi-batch 1000 \
  --disp-freq 500 --valid-freq 5000 --valid-mini-batch 64 \
  --beam-size 12 --normalize 1 \
  --save-freq 5000 --early-stopping 5 \
  --cost-type ce-mean-words --learn-rate 0.0003 \
  --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report \
  --optimizer-params 0.9 0.98 1e-09 --clip-norm 5 \
  --sync-sgd --seed 314 --exponential-smoothing --workspace 8192

Context

With fp32 precision (the same command without --fp16), training runs without problems.
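
That fp32 observation is consistent with float16's narrow dynamic range (largest finite value 65504, smallest normal value about 6.1e-5): intermediate values that are harmless in fp32 overflow to inf in fp16, and the inf then propagates through the graph as NaN, which is exactly what the NaN/Inf gradient check catches. A minimal sketch of that failure mode, assuming a C++23 compiler with <stdfloat> support (e.g. recent GCC):

#include <cmath>
#include <cstdio>
#include <stdfloat>  // C++23, provides std::float16_t (e.g. GCC 13+)

int main() {
  // Products that are harmless in fp32 overflow to inf in fp16:
  std::float16_t x = 300.0f16;
  std::float16_t y = x * x;                        // 90000 > 65504 -> +inf
  std::printf("inf? %d\n", std::isinf((float)y));  // prints 1

  // Once an inf appears, downstream arithmetic turns it into NaN:
  std::float16_t z = y - y;                        // inf - inf -> NaN
  std::printf("nan? %d\n", std::isnan((float)z));  // prints 1

  // The same computation in fp32 stays finite:
  float xf = 300.0f;
  std::printf("fp32: %g\n", xf * xf);              // prints 90000
}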