marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

Cost is nan when adding guided alignment #417

Open mahmoudaymo opened 11 months ago

mahmoudaymo commented 11 months ago

Bug description

I trained a model for 5 epochs without guided alignment, then continued for 5 more epochs with guided alignment. Training without guided alignment went fine. However, once guided alignment was added (the second 5 epochs), the training cost is nan at every update.

How to reproduce

I ran this script:

```bash
#!/bin/bash

set -e

exp_dir=path_to_experiment_dir

exp=$exp_dir/basemodel
config=$exp/config.yml

/marian/build/marian -c $config \
    --valid-log $exp/valid.log \
    --log $exp/train.log \
    --model $exp/model.npz \
    --after 5e

exp=$exp_dir/finetuned
# This config is similar to the above except I unset --all-caps-every and --english-title-case-every params
config=$exp/config.yml

/marian/build/marian -c $config \
    --pretrained-model $pretrained_model_path \
    --valid-log $exp/valid.log \
    --log $exp/train.log \
    --model $exp/model.npz \
    --after 10e \
    --guided-alignment /Engines/MAS/ENUSDEDE/alignment/corpus.align \
    --guided-alignment-cost ce
```

marian.logs.txt
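The thread does not identify a cause. One input that can be checked independently of Marian is the file passed to --guided-alignment. The sketch below assumes, beyond what the report states, that corpus.align holds one line per sentence pair of whitespace-separated 0-based `src-trg` index pairs (the fast_align output style), computed on the same tokenization as the training corpus. The function names and the paths in the usage comment are hypothetical. It flags malformed pairs, out-of-range indices and mismatched line counts; whether such inconsistencies actually produce nan costs is not confirmed here.

```python
import re
from itertools import zip_longest
from typing import List, Tuple

# Assumed format of one alignment point: "<source index>-<target index>", 0-based.
PAIR = re.compile(r"^(\d+)-(\d+)$")


def check_alignment_line(src_len: int, trg_len: int, align_line: str) -> List[str]:
    """Return the problems found in one alignment line (empty list means OK)."""
    problems = []
    for tok in align_line.split():
        m = PAIR.match(tok)
        if m is None:
            problems.append(f"malformed pair {tok!r}")
            continue
        s, t = int(m.group(1)), int(m.group(2))
        if s >= src_len:
            problems.append(f"source index {s} out of range (length {src_len})")
        if t >= trg_len:
            problems.append(f"target index {t} out of range (length {trg_len})")
    return problems


def check_files(src_path: str, trg_path: str, align_path: str,
                max_reports: int = 10) -> List[Tuple[int, List[str]]]:
    """Check an alignment file against tokenized source and target files.

    Returns up to max_reports (line number, problems) tuples.
    """
    reports = []
    with open(src_path, encoding="utf-8") as fs, \
         open(trg_path, encoding="utf-8") as ft, \
         open(align_path, encoding="utf-8") as fa:
        for n, (s, t, a) in enumerate(zip_longest(fs, ft, fa), start=1):
            if s is None or t is None or a is None:
                reports.append((n, ["files have different numbers of lines"]))
                break
            problems = check_alignment_line(len(s.split()), len(t.split()), a)
            if problems:
                reports.append((n, problems))
                if len(reports) >= max_reports:
                    break
    return reports


# Hypothetical usage; the corpus file names are placeholders:
# for line_no, problems in check_files("corpus.src", "corpus.trg", "corpus.align"):
#     print(line_no, problems)
```

If the corpus is split into subwords before training, the token counts here only make sense when the alignment was computed on that same subword-level text.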


TransperfectAI commented 7 months ago

We are experiencing this issue too, even when training with guided alignment from the start. Could it be related to --guided-alignment-cost? We used to use mse and switched to ce when mse was no longer supported; the issue started for us after that change. The switch also means that, to resume training in an existing model directory, you need to edit the cost in model.npz.progress.yml, otherwise it throws an error.
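The manual edit described above can be scripted. This is a minimal sketch, not Marian tooling: it assumes the stored setting appears as a single top-level `guided-alignment-cost: mse` line, spelled like the command-line option, which the comment does not confirm. The helper names are hypothetical. It does a plain line-level substitution, so it needs no YAML library, and it writes a `.bak` copy before changing the file.

```python
import re
import shutil
from pathlib import Path


def set_yaml_scalar(text: str, key: str, value: str) -> str:
    """Replace the value of a simple `key: value` line, leaving other lines untouched."""
    pattern = re.compile(rf"^([ \t]*{re.escape(key)}[ \t]*:[ \t]*).*$", re.MULTILINE)
    # A function replacement avoids any backslash interpretation in `value`.
    new_text, count = pattern.subn(lambda m: m.group(1) + value, text)
    if count == 0:
        raise KeyError(f"{key!r} not found")
    return new_text


def patch_file(path: str, key: str, value: str) -> None:
    """Rewrite one scalar setting in place, keeping a backup next to the file."""
    p = Path(path)
    shutil.copy2(p, p.with_name(p.name + ".bak"))
    p.write_text(set_yaml_scalar(p.read_text(encoding="utf-8"), key, value),
                 encoding="utf-8")


# Hypothetical usage, with the file name taken from the comment above:
# patch_file("model.npz.progress.yml", "guided-alignment-cost", "ce")
```

A `KeyError` here means the key is not stored under that name in the file, in which case the file should be inspected by hand rather than patched blindly.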