When trying to pretrain t5-base, we are seeing that the pretraining loss starts at an enormous number (~160000). Even when pretraining smaller variants of t5, the initial loss always starts at around 160000.
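For reference, a back-of-envelope sanity check (my own arithmetic, not taken from the t5x code): a freshly initialized model should predict a roughly uniform distribution over T5's 32128-token vocabulary, so I would expect an initial cross-entropy of about ln(32128) ≈ 10.4 nats per target token, which is orders of magnitude below what I'm seeing:

```python
import math

# Back-of-envelope check (my own arithmetic, not from the t5x code).
# A freshly initialized model predicts a roughly uniform distribution,
# so the expected initial cross-entropy is ln(vocab_size) nats per token.
vocab_size = 32128  # T5's SentencePiece vocabulary size
per_token = math.log(vocab_size)
print(f"expected initial per-token loss: {per_token:.2f} nats")  # ~10.38

# If the logged value were a *sum* over target tokens rather than a mean,
# ~160000 would correspond to roughly this many tokens per batch:
print(f"implied tokens per batch: {160_000 / per_token:,.0f}")  # ~15,400
```

So my guess (possibly wrong) is that the logged train/loss is summed over the non-padding target tokens in the batch rather than averaged, but I'd appreciate confirmation.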
This is the train/loss curve for pretraining t5-base:
![pretrain_base](https://user-images.githubusercontent.com/25857728/218966633-64f9f11c-c957-4f45-a7fd-2f22b2af2e90.png)
This is the train/loss curve for pretraining t5-mini from https://arxiv.org/pdf/2109.10686.pdf:
![mini](https://user-images.githubusercontent.com/25857728/218968676-f79618b1-dc8d-4ee6-927b-f251d9e7d214.png)
As you can see, both curves begin at around 160000. In both cases, I am using the hyperparameters defined in https://github.com/google-research/t5x/blob/main/t5x/examples/t5/t5_1_0/base.gin for pretraining.
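To check the sum-vs-mean hypothesis, here is a toy JAX/Optax sketch. The shapes are my assumptions, not values read from base.gin: 128 sequences per batch with ~114 span-corruption target tokens each, and near-zero random logits standing in for a freshly initialized model:

```python
import jax
import optax

# Toy check of sum- vs mean-reduction (random stand-ins, not the real
# pipeline; batch and target length are my assumptions about the defaults).
batch, target_len, vocab = 8, 114, 32128
k1, k2 = jax.random.split(jax.random.PRNGKey(0))
logits = 0.01 * jax.random.normal(k1, (batch, target_len, vocab))  # ~uniform
targets = jax.random.randint(k2, (batch, target_len), 0, vocab)

per_token = optax.softmax_cross_entropy_with_integer_labels(logits, targets)
print("mean per-token loss:", float(per_token.mean()))  # ~10.4 (= ln 32128)

# Scaled to a hypothetical 128-sequence batch, a *summed* loss would be
# roughly 128 * 114 * 10.4 ≈ 1.5e5, i.e. the same ballpark as 160000.
print("hypothetical summed batch loss:", 128 * 114 * float(per_token.mean()))
```

If the logged train/loss is indeed a sum over tokens, the curves above would make sense; otherwise I'd love to know what normalization is applied.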
I am currently trying to reproduce the baseline results in Table 1 of the T5 paper (https://arxiv.org/pdf/1910.10683.pdf), and I was wondering whether there is a particular initialization I should be using to reproduce them.
If possible, would someone be able to share a loss curve from pretraining t5-base?