The training stops after 10000 updates, and the process then runs for days without making any progress. With a validation frequency of 5000 I can see the output of the validation step, but afterwards training seems to halt, with the process fully using a single CPU core.

This is the output of top:
top - 14:41:49 up 4 days, 18:03,  1 user,  load average: 1.00, 1.02, 1.00
Tasks: 203 total,   1 running, 202 sleeping,   0 stopped,   0 zombie
%Cpu(s): 25.1 us,  0.2 sy,  0.0 ni, 74.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  15718.1 total,    228.6 free,   8498.2 used,   6991.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   6880.6 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 302151 ubuntu    20   0   25.9g   7.8g 147460 S  99.7  51.0 130:38.85 marian
   1475 ubuntu    20   0   13744   9764   2764 S   0.7   0.1   7:59.81 tmux: server
 302792 root     -51   0       0      0      0 S   0.3   0.0   0:31.97 irq/42-nvidia
 302797 root      20   0       0      0      0 S   0.3   0.0   0:14.71 nv_queue
      1 root      20   0  169004  10380   5768 S   0.0   0.1   0:23.19 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.04 kthreadd
      3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp
      4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp
      6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0:0H-kblo+
      9 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq
I'm training a transformer model on a corpus of 30M sentences with the following command line parameters:

My configuration is:

I'm also attaching the train.log.

Any idea what I could be doing wrong? Thanks in advance for your help.