google-research / t5x

Does fine-tuning re-initialize the optimizer state? #1478

Open garywei944 opened 9 months ago

garywei944 commented 9 months ago

I'm trying to reproduce the T5 1.1 fine-tuning results on GLUE from this paper: https://arxiv.org/pdf/2110.08529.pdf. I read through the configuration and the Adafactor implementation in this repo to make sure I'm using consistent hyper-parameters while training with Hugging Face and PyTorch.
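For reference, my Hugging Face setup looks roughly like the sketch below. It is only a sketch: the constant learning rate of `1e-3` with `scale_parameter=False` and `relative_step=False` is my reading of the commonly used T5 fine-tuning recipe, not something I have confirmed against the paper's exact settings.

```python
from transformers import Adafactor, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("google/t5-v1_1-base")

# Fixed learning rate, no relative-step schedule or warm-up init; the other
# values are the transformers defaults, spelled out here for comparison with
# the Adafactor settings in this repo.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
    clip_threshold=1.0,
    decay_rate=-0.8,
    weight_decay=0.0,
)
```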

However, I can't reproduce the reported performance. I preprocessed the GLUE dataset into inputs of the form `<task_name> sentence1: <sentence1> sentence2: <sentence2> <EOS>`, following the SeqIO GLUE preprocessor in the T5 repo.
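Concretely, my formatting is roughly the helper below. This is a sketch for the `sentence1`/`sentence2` style tasks such as MRPC; other GLUE tasks use their own field names (e.g. `premise:`/`hypothesis:`) in the SeqIO preprocessor, so treat it as illustrative.

```python
from typing import Optional

def glue_text_to_text(task_name: str, sentence1: str,
                      sentence2: Optional[str] = None) -> str:
    """Formats one GLUE example into the text layout described above."""
    parts = [task_name, f"sentence1: {sentence1}"]
    if sentence2 is not None:
        parts.append(f"sentence2: {sentence2}")
    return " ".join(parts)

# glue_text_to_text("mrpc", "He said hi.", "He greeted them.")
#   -> "mrpc sentence1: He said hi. sentence2: He greeted them."
# The EOS token is appended by the tokenizer, not written into the text.
```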

Another possible cause of the performance gap I can think of is that, when fine-tuning from the `.gin` config file, the optimizer state is also restored from the checkpoint before fine-tuning starts. If that is the case, it would be infeasible to reproduce with only the publicly released checkpoints.
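A minimal check along these lines would be to list a downloaded checkpoint directory and see whether Adafactor accumulators are stored at all. This assumes the T5X TensorStore layout I believe I'm seeing locally (weights under `target.*`, optimizer slots under `state.param_states.*`); please correct me if that's wrong.

```python
import os

# Hypothetical local path to a downloaded T5X checkpoint directory.
ckpt_dir = "/path/to/t5_1_1_base/checkpoint_1000000"

entries = sorted(os.listdir(ckpt_dir))

# Model weights appear under "target.*"; Adafactor accumulators, if they were
# saved, appear under "state.param_states.*".
has_optimizer_state = any(e.startswith("state.param_states") for e in entries)
print("optimizer state stored in checkpoint:", has_optimizer_state)
```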

I tried running this repo and setting breakpoints, but couldn't confirm the behavior. Could someone help me confirm whether the optimizer state is loaded from the checkpoint when running the fine-tuning scripts in this repo?