Reproducing the flan_v2 results of T5-xl

danczs commented 1 year ago

First, thanks for this excellent work. However, I met some problems when reproducing the results of T2-xl.

My setting is:

Pretrained model and optimizer: I used the T5-v1_1-xl pretrained model and following the training setting in "Scaling Instruction-Finetuned Language Models": batch size 64, Dropout 0.05, LR 5e-4, 38K steps, adafactor optimizer.

Data: For the data, I first used the training data provided by SirNeural and evaluated the model on MMLU. When I equally sampled the 5 datasets (i.e. cot, flanv2, t0, diglog, niv2), I got 45% 5-shot accuracy on MMLU, which is similar to the w/o mixture balancing result in the paper. However, after I mixed the data with the suggested rates here, the accuracy is not improved (44%).

Afterwards, I tried the data provided by Enrico Shippole and mixed the data following the suggested rates. But the accuracy became worse (42% on MMLU). I also tried to use a larger batch size (128, considering batch packing ) and deduplicate the data, which nearly didn't help.

Are there any suggestions to reproduce the MMLU results of the released Flan-xl-t5 model (49%) or even the results in the paper(52%) ? Thanks a lot.

shayne-longpre commented 1 year ago

@danczs Thanks for the question.

A couple thoughts:

I'm surprised that the SirNeural did so well. The reason we recommended Enrico's version is because it properly applies the dataset caps and sample balancing so the overall mixture isn't dominated by the largest datasets. Even if you balance the SirNeural subsets with the given mixture ratios, they should still be dominated by the largest datasets in each subset.
I would not hold too strictly to the hyperparameters we used in the paper, since that was with a particular internal configuration of TPUs that I've heard doesn't generalize well to GPUs or other TPU configurations necessarily.
Is the model still learning after 38k steps? We trained for longer I believe.
Surprised Enrico's version didn't do very well. If you just train with T0 or Flan2021 subsets does it do better than 42% -- I believe it should? Wondering if there is an issue with some subset of the data.

Sorry this could not be more helpful. It's hard to translate the internal code (which I no longer have access to) to external implementations. I would also note that my co-authors did A LOT of tuning and runs with internal configuration to get the 52% number. The variability per run with the same data can vary by 1-2% max performance, and then between checkpoints on the same run you might see another 1-2% variability even after its converged. (Just something to keep in mind.)

Best,

danczs commented 1 year ago

@shayne-longpre Thanks very much for your reply.

Yeah, SirNeural's dataset is much larger than Enrico's (e.g. 300 million vs 5 million instances for FLAN2021) and is at risk of duplication. However, it seems that Enrico's dataset has its own problem.
Thanks for your suggestion, I'll try more hyper-parameters.
Yeah, according to the trend, the loss will continue decreasing. But the model is a little overfitting, since the result is not improved or even worse after 20k steps. I'll try a much lower lr with more steps.
I tried Enrico's FLAN 2021 set for 40k steps and only got 35%.

Thanks for your explanations, it helps a lot.

shayne-longpre commented 1 year ago

@danczs Hmm I'm not sure why it was so low. I noticed that a few recent papers seem to have gotten strong results with a 100k sample of the training data (e.g. https://arxiv.org/pdf/2306.04751.pdf) and their training code is public.

Also, maybe Hyung Won's recent comments provide some insights here?

google-research / FLAN

Reproducing the flan_v2 results of T5-xl #80