huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[Seq2Seq] (Byt5) zero loss #14132

Closed flozi00 closed 3 years ago

flozi00 commented 3 years ago

Environment info

Who can help

@patrickvonplaten @patil-suraj

Information

Model I am using (Bert, XLNet ...): byt5

For comparison (e.g., to rule out a coding mistake on my side), I also used other seq2seq models such as T5; these models work as expected.

The problem arises when using:

The tasks I am working on are:

To reproduce


Steps to reproduce the behavior:

  1. Run the official seq2seq example script with the Trainer (a minimal sketch of an equivalent setup follows below this list).
  2. Train a ByT5 model (any size) with the fp16 PyTorch backend.
  3. Watch the loss go to zero after some steps (at most 500 steps until zero loss in my experiments).
  4. Run inference and see that every output is the empty string ''.
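
A minimal sketch of an equivalent setup (not the exact command from the report): Seq2SeqTrainer with fp16=True on ByT5. The checkpoint name google/byt5-small, the toy in-memory dataset, and the encode() helper are placeholders for illustration; the actual runs used the official run_translation.py data pipeline, and fp16 mixed precision needs a CUDA GPU.

```python
# Sketch mirroring the reported configuration: Seq2SeqTrainer + ByT5 + fp16=True.
# NOTE: the toy dataset below only makes the sketch runnable; the report used
# the official run_translation.py data setup.
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-small")

def encode(example):
    # ByT5 is byte-level, so source and target are tokenized the same way.
    model_inputs = tokenizer(example["src"], truncation=True, max_length=64)
    labels = tokenizer(example["tgt"], truncation=True, max_length=64)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

raw = Dataset.from_dict({"src": ["hello world"] * 64, "tgt": ["hallo welt"] * 64})
train_ds = raw.map(encode, remove_columns=["src", "tgt"])

args = Seq2SeqTrainingArguments(
    output_dir="byt5-fp16-repro",      # placeholder
    per_device_train_batch_size=8,
    num_train_epochs=50,
    logging_steps=10,
    fp16=True,                         # the flag under suspicion in this issue
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```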

Expected behavior

The model trains with a reasonable loss and generates good text.

NielsRogge commented 3 years ago

Not sure if ByT5 supports fp16 training, cc @patrickvonplaten

patil-suraj commented 3 years ago

Hi!

I am not sure if we have tested ByT5 with the seq2seq scripts yet. Which script are you using, run_translation.py or run_summarization.py? It would be nice if you could post a snippet to reproduce this.

Also, note that T5 (and ByT5) models are trained with bf16, which may or may not work with fp16. See this discussion on the forum. However, that usually results in nan losses, which isn't the case here. So I can't be sure without looking at the command that you are using.
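
For intuition, here is a tiny illustrative sketch (not from the report) of why bf16-pretrained weights can break under fp16: bf16 keeps fp32's exponent range, while fp16 overflows above roughly 65504, so values that are finite in bf16 become inf in fp16 and can poison the loss.

```python
import torch

# bf16 shares fp32's exponent range; fp16 tops out around 65504.
x = torch.tensor([70000.0])
print(x.to(torch.bfloat16))  # a finite bf16 value close to 70000
print(x.to(torch.float16))   # tensor([inf], dtype=torch.float16)
```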

flozi00 commented 3 years ago

Removing the --fp16 argument fixes it when using the run_translation.py script.
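
In Seq2SeqTrainingArguments terms, that roughly amounts to the following sketch (output_dir is a placeholder; bf16 support depends on your GPU and transformers version):

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="byt5-run",   # placeholder
    fp16=False,              # i.e. drop --fp16 and train in full fp32
    # bf16=True,             # optional alternative on Ampere+ GPUs with recent transformers
)
```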