mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0

Translating corpus with teacher: `nan`s when using `float16` #244

AmitMY opened this issue 8 months ago. Status: Open

AmitMY commented 8 months ago

After training two reasonable teacher models, and seeing that their ensemble results are reasonable as well, I'm seeing that the "Translating corpus with teacher" output is bad: it repeats a random token and produces `nan` scores.

```
head /data/data/spoken-signed/spoken_to_signed/translated/corpus/file.00.nbest
0 ||| p412 p412 p412 p412 p412 p412 p412 p412 p412 p412 p412 p412 p412 p412 p412 p412 p412 p412 p412 p412 p412 p412 p412 p412 ||| F0= nan F1= nan ||| nan
```

All of the other lines look very similar.

GPU: NVIDIA GeForce RTX 2080 Ti; `decoding-teacher` is configured with `precision: float16`.

My current suspicion is that this is caused by `precision: float16`, but I can't immediately confirm it, since I destroyed a lot of my environment just to figure out what's wrong here...
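
For reference, this is roughly how that setting sits in my training config, under the pipeline's `marian-args` section (a sketch; the exact surrounding keys in my config may differ):

```yaml
marian-args:
  decoding-teacher:
    # suspected culprit: half-precision decoding on the RTX 2080 Ti
    precision: float16
```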

AmitMY commented 8 months ago

Confirmed: without `precision: float16` it decodes correctly, for example:

```
0 ||| $pt $bzs M p500 p500 S300 c0 r4 p482 p482 S17d c1 re p492 p522 S177 c0 re p490 p556 S22a c0 r4 p513 ||| F0= -12.5695 F1= -12.547 ||| -1.04652
```

(which, rendered as sign language, visualizes to an image; attachment not reproduced here)

eu9ene commented 8 months ago

Half precision is useful only for speeding up inference on supported GPUs during the translate step. We usually use this mode and it works. Generally, it shouldn't be used for evaluation unless you're planning to run your model in this mode.

AmitMY commented 8 months ago

The example here https://github.com/mozilla/firefox-translations-training/blob/main/configs/config.prod.yml#L57-L61 says "2080ti or newer", but for me, on a 2080 Ti, this setting causes `nan`s during inference (not during evaluation).
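
For anyone reading without following the link, the referenced block looks approximately like this (quoted from memory, so treat it as a sketch rather than the exact file contents):

```yaml
marian-args:
  decoding-teacher:
    # 2080ti or newer
    precision: float16
```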