When trying to pretrain t5-base, we are seeing that the pretraining loss starts at an enormous number (~160000). Even when pretraining smaller variants of t5, the initial loss always starts at around 160000.
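For reference, a back-of-envelope sanity check (my own arithmetic, not taken from the t5x code): a freshly initialized model should predict a roughly uniform distribution over T5's 32128-token vocabulary, so I would expect an initial cross-entropy of about ln(32128) ≈ 10.4 nats per target token, which is orders of magnitude below what I'm seeing:

```python
import math

# Back-of-envelope check (my own arithmetic, not from the t5x code).
# A freshly initialized model predicts a roughly uniform distribution,
# so the expected initial cross-entropy is ln(vocab_size) nats per token.
vocab_size = 32128  # T5's SentencePiece vocabulary size
per_token = math.log(vocab_size)
print(f"expected initial per-token loss: {per_token:.2f} nats")  # ~10.38

# If the logged value were a *sum* over target tokens rather than a mean,
# ~160000 would correspond to roughly this many tokens per batch:
print(f"implied tokens per batch: {160_000 / per_token:,.0f}")  # ~15,400
```

So my guess (possibly wrong) is that the logged train/loss is summed over the non-padding target tokens in the batch rather than averaged, but I'd appreciate confirmation.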
This is the train/loss curve for pretraining t5-base:
![pretrain_base](https://user-images.githubusercontent.com/25857728/218966633-64f9f11c-c957-4f45-a7fd-2f22b2af2e90.png)
This is the train/loss curve for pretraining t5-mini from https://arxiv.org/pdf/2109.10686.pdf:
![mini](https://user-images.githubusercontent.com/25857728/218968676-f79618b1-dc8d-4ee6-927b-f251d9e7d214.png)
As you can see, both curves begin at around 160000. In both cases, I am using the hyperparameters defined in https://github.com/google-research/t5x/blob/main/t5x/examples/t5/t5_1_0/base.gin for pretraining.
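To check the sum-vs-mean hypothesis, here is a toy JAX/Optax sketch. The shapes are my assumptions, not values read from base.gin: 128 sequences per batch with ~114 span-corruption target tokens each, and near-zero random logits standing in for a freshly initialized model:

```python
import jax
import optax

# Toy check of sum- vs mean-reduction (random stand-ins, not the real
# pipeline; batch and target length are my assumptions about the defaults).
batch, target_len, vocab = 8, 114, 32128
k1, k2 = jax.random.split(jax.random.PRNGKey(0))
logits = 0.01 * jax.random.normal(k1, (batch, target_len, vocab))  # ~uniform
targets = jax.random.randint(k2, (batch, target_len), 0, vocab)

per_token = optax.softmax_cross_entropy_with_integer_labels(logits, targets)
print("mean per-token loss:", float(per_token.mean()))  # ~10.4 (= ln 32128)

# Scaled to a hypothetical 128-sequence batch, a *summed* loss would be
# roughly 128 * 114 * 10.4 ≈ 1.5e5, i.e. the same ballpark as 160000.
print("hypothetical summed batch loss:", 128 * 114 * float(per_token.mean()))
```

If the logged train/loss is indeed a sum over tokens, the curves above would make sense; otherwise I'd love to know what normalization is applied.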
I am currently trying to reproduce the baseline results in Table 1 of the T5 paper (https://arxiv.org/pdf/1910.10683.pdf), and I was wondering whether there is a particular initialization I should be using to reproduce them.
If possible, would someone be able to share a loss curve from pretraining t5-base?