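In the experiment, the evaluation pipeline differed slightly from the training pipeline:
a. the training pipeline filtered out long sentences, while the evaluation pipeline did not; this may have a significant influence;
b. the training pipeline used SpecAugment, while the evaluation pipeline did not; this should not influence evaluation results, since augmentation is normally applied only during training.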
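One way to check whether difference (a) matters is to apply the same length filter to the evaluation set and compare results on the filtered and unfiltered subsets. The snippet below is only a minimal sketch: the manifest layout (a list of dicts with a `duration` field in seconds), the file names, and the `MAX_DURATION` value are assumptions made for illustration, not values taken from the linked config.

```python
# Hypothetical evaluation manifest: one dict per utterance, duration in seconds.
eval_set = [
    {"audio": "a.wav", "duration": 3.2},
    {"audio": "b.wav", "duration": 18.5},
    {"audio": "c.wav", "duration": 12.0},
]

# Assumed threshold; replace with whatever the training pipeline actually uses.
MAX_DURATION = 16.0


def filter_by_duration(utterances, max_duration):
    """Keep only utterances whose duration does not exceed max_duration."""
    return [u for u in utterances if u["duration"] <= max_duration]


filtered = filter_by_duration(eval_set, MAX_DURATION)
dropped = len(eval_set) - len(filtered)

print(f"kept {len(filtered)} utterances, dropped {dropped}")
```

Scoring the model separately on `filtered` and on the full `eval_set` would show how much of the gap comes from long utterances that the model never saw during training.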
An experiment: https://docs.google.com/spreadsheets/d/1q-pInubS69ZMxlMOd-D42W1DA-D6jKXmFr9uV-tsAro/edit#gid=0&range=15:15
Possible reasons:
Config used in the experiment: https://github.com/ryanleary/mlperf-rnnt-ref/blob/4082f086ec4834886cceb927dbb1454eca44c68d/configs/rnnt.toml