huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[Seq2Seq] (Byt5) zero loss #14132

Closed flozi00 closed 3 years ago

flozi00 commented 3 years ago

Environment info

Who can help

@patrickvonplaten @patil-suraj

Information

Model I am using (Bert, XLNet ...): byt5

For comparison (e.g., to rule out a coding mistake on my side), I also used other seq2seq models such as T5; these models work as expected.

The problem arises when using:

The tasks I am working on are:

To reproduce


Steps to reproduce the behavior:

  1. Run the official seq2seq example script with the Trainer (a minimal sketch of an equivalent setup follows below this list).
  2. Train a ByT5 model (any size) with the fp16 PyTorch backend.
  3. Watch the loss go to zero after some steps (at most 500 steps until zero loss in my experiments).
  4. Run inference and see that every output is the empty string ''.
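
A minimal sketch of an equivalent setup (not the exact command from the report): Seq2SeqTrainer with fp16=True on ByT5. The checkpoint name google/byt5-small, the toy in-memory dataset, and the encode() helper are placeholders for illustration; the actual runs used the official run_translation.py data pipeline, and fp16 mixed precision needs a CUDA GPU.

```python
# Sketch mirroring the reported configuration: Seq2SeqTrainer + ByT5 + fp16=True.
# NOTE: the toy dataset below only makes the sketch runnable; the report used
# the official run_translation.py data setup.
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-small")

def encode(example):
    # ByT5 is byte-level, so source and target are tokenized the same way.
    model_inputs = tokenizer(example["src"], truncation=True, max_length=64)
    labels = tokenizer(example["tgt"], truncation=True, max_length=64)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

raw = Dataset.from_dict({"src": ["hello world"] * 64, "tgt": ["hallo welt"] * 64})
train_ds = raw.map(encode, remove_columns=["src", "tgt"])

args = Seq2SeqTrainingArguments(
    output_dir="byt5-fp16-repro",      # placeholder
    per_device_train_batch_size=8,
    num_train_epochs=50,
    logging_steps=10,
    fp16=True,                         # the flag under suspicion in this issue
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```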

Expected behavior

The model trains with a reasonable loss and generates good text.

NielsRogge commented 3 years ago

Not sure if ByT5 supports fp16 training, cc @patrickvonplaten

patil-suraj commented 3 years ago

Hi!

I am not sure if we have tested ByT5 with the seq2seq scripts yet. Which script are you using, run_translation.py or run_summarization.py? It would be nice if you could post a snippet to reproduce this.

Also, note that T5 (and ByT5) models are trained with bf16, which may or may not work with fp16. See this discussion on the forum. However, that usually results in nan losses, which isn't the case here. So I can't be sure without looking at the command that you are using.
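
For intuition, here is a tiny illustrative sketch (not from the report) of why bf16-pretrained weights can break under fp16: bf16 keeps fp32's exponent range, while fp16 overflows above roughly 65504, so values that are finite in bf16 become inf in fp16 and can poison the loss.

```python
import torch

# bf16 shares fp32's exponent range; fp16 tops out around 65504.
x = torch.tensor([70000.0])
print(x.to(torch.bfloat16))  # a finite bf16 value close to 70000
print(x.to(torch.float16))   # tensor([inf], dtype=torch.float16)
```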

flozi00 commented 3 years ago

Removing the --fp16 argument fixes it when using the run_translation.py script.
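
In Seq2SeqTrainingArguments terms, that roughly amounts to the following sketch (output_dir is a placeholder; bf16 support depends on your GPU and transformers version):

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="byt5-run",   # placeholder
    fp16=False,              # i.e. drop --fp16 and train in full fp32
    # bf16=True,             # optional alternative on Ampere+ GPUs with recent transformers
)
```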