Open mahmoudaymo opened 1 year ago
We are seeing this issue too, even when training with guided alignment from the start. Could it be related to `--guided-alignment-cost`? We used `mse` and switched to `ce` once `mse` was no longer supported; the issue started for us after that change.
It also means that to restart training in an existing directory, you need to edit the cost in `model.npz.progress.yml`; otherwise it throws an error.
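To find what would need editing before a restart, one option is to list every line of the progress file that mentions the cost. This is only a sketch: the key names stored in `model.npz.progress.yml` are not shown in this thread, so the hypothetical helper `show_cost_entries` below simply searches case-insensitively for `cost` and leaves the actual edit to you.

```bash
#!/bin/bash
# Hypothetical helper (not part of Marian): print, with line numbers, every
# line of a progress file that mentions "cost". No key names are assumed.
show_cost_entries() {
  local progress_file="$1"
  if [ ! -f "$progress_file" ]; then
    echo "no such file: $progress_file" >&2
    return 1
  fi
  # -n: show line numbers, -i: match "cost", "Cost", etc.
  grep -n -i 'cost' "$progress_file"
}
```

For example, `show_cost_entries "$exp/model.npz.progress.yml"` prints the candidate lines; back up the file before changing anything.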
Bug description
I trained a model for 5 epochs without guided alignment, then for 5 more epochs with guided alignment. Training without guided alignment went fine. However, once guided alignment was added (the second 5 epochs), the training cost is nan at every update.
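To confirm whether the cost is nan from the very first update after resuming or only turns nan later, it may help to locate the first affected line in the training log. A minimal sketch, assuming each update line in the log reports the cost as `Cost <value>`; both that format and the helper name `first_nan_line` are assumptions, so adjust the pattern to match your log:

```bash
#!/bin/bash
# Hypothetical helper: print the first log line (with its line number) whose
# cost is reported as nan. Assumes the log writes the cost as "Cost <value>".
first_nan_line() {
  local log_file="$1"
  # -m 1: stop after the first match, -i: case-insensitive, -E: extended regex
  grep -n -i -m 1 -E 'cost[[:space:]]+-?nan' "$log_file"
}
```

Running `first_nan_line "$exp/train.log"` on the fine-tuning log shows where the nan first appears; no output means no line matched the assumed pattern.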
How to reproduce
I ran this script:
```bash
#!/bin/bash
set -e

exp_dir=path_to_experiment_dir

exp=$exp_dir/basemodel
config=$exp/config.yml

/marian/build/marian -c $config \
    --valid-log $exp/valid.log \
    --log $exp/train.log \
    --model $exp/model.npz \
    --after 5e

exp=$exp_dir/finetuned
config=$exp/config.yml  # This config is similar to the above except I unset --all-caps-every and --english-title-case-every params

/marian/build/marian -c $config \
    --pretrained-model $pretrained_model_path \
    --valid-log $exp/valid.log \
    --log $exp/train.log \
    --model $exp/model.npz \
    --after 10e \
    --guided-alignment /Engines/MAS/ENUSDEDE/alignment/corpus.align \
    --guided-alignment-cost ce
```

marian.logs.txt
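Nothing in this thread establishes the alignment file as the cause, but a quick sanity check of the file passed to `--guided-alignment` is a cheap way to rule it out. The sketch below assumes the file holds one line per sentence pair, with whitespace-separated `i-j` index pairs (the format produced by tools such as fast_align). The helper name `check_alignment_file` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: summarize an alignment file, counting total lines,
# empty lines, and lines with a token that is not of the form "<int>-<int>".
# Assumes whitespace-separated i-j pairs, one sentence pair per line.
check_alignment_file() {
  local align_file="$1"
  awk '
    NF == 0 { empty++; next }                 # sentence pair with no links
    {
      for (k = 1; k <= NF; k++)
        if ($k !~ /^[0-9]+-[0-9]+$/) { bad++; break }
    }
    END { printf "lines=%d empty=%d malformed=%d\n", NR, empty, bad }
  ' "$align_file"
}
```

The `lines` count can then be compared against `wc -l` on the training corpus; a mismatch, or a nonzero `empty` or `malformed` count, would be worth mentioning in the report.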
Context
--build-info all