google-research / FLAN


Reproducing the flan_v2 results of T5-xl #80

Open · danczs opened this issue 1 year ago

danczs commented 1 year ago

First, thanks for this excellent work. However, I ran into some problems when reproducing the results of T5-xl.

My setup is as follows:

Pretrained model and optimizer: I used the T5-v1_1-xl pretrained model and followed the training settings in "Scaling Instruction-Finetuned Language Models": batch size 64, dropout 0.05, LR 5e-4, 38K steps, Adafactor optimizer.
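For reference, here is roughly how I approximate that setup externally with Hugging Face Transformers (a sketch, not the internal T5X config; the `dropout_rate` override and the `optim="adafactor"` flag are my assumptions about how to map the paper's settings):

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments,
)

# Load T5-v1_1-xl and override the dropout rate used during finetuning.
tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xl")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/t5-v1_1-xl",
    dropout_rate=0.05,  # paper setting; forwarded to T5Config
)

# Approximate the paper's schedule: effective batch size 64, constant LR 5e-4,
# 38K steps, Adafactor. Batch size split / accumulation assumes a single device.
training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-xl-repro",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,   # 8 * 8 = effective batch size 64
    learning_rate=5e-4,
    lr_scheduler_type="constant",
    max_steps=38_000,
    optim="adafactor",
    bf16=True,
    logging_steps=100,
    save_steps=5_000,
)
```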

Data: I first used the training data provided by SirNeural and evaluated the model on MMLU. When I sampled the 5 datasets (i.e. cot, flanv2, t0, dialog, niv2) equally, I got 45% 5-shot accuracy on MMLU, which is similar to the w/o mixture balancing result in the paper. However, after I mixed the data with the suggested rates here, the accuracy did not improve (44%).
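Concretely, the mixing was done with `datasets.interleave_datasets`, along these lines (the file paths and rate values below are placeholders, not the exact suggested numbers):

```python
from datasets import load_dataset, interleave_datasets

# The five submixtures, each loaded as its own dataset (paths are placeholders).
names = ["cot", "flanv2", "t0", "dialog", "niv2"]
subsets = {
    name: load_dataset("json", data_files=f"{name}.jsonl", split="train")
    for name in names
}

# Placeholder mixture rates, normalized to probabilities; substitute the
# suggested rates here.
rates = {"cot": 0.05, "flanv2": 0.4, "t0": 0.3, "dialog": 0.05, "niv2": 0.2}
probs = [rates[name] for name in names]

mixed = interleave_datasets(
    [subsets[name] for name in names],
    probabilities=probs,
    seed=42,
    stopping_strategy="all_exhausted",  # keep sampling until every subset is used up
)
```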

Afterwards, I tried the data provided by Enrico Shippole and mixed it following the suggested rates, but the accuracy got worse (42% on MMLU). I also tried using a larger batch size (128, considering batch packing) and deduplicating the data, neither of which helped much.
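The deduplication I tried was just exact-match hashing over the (input, target) pairs, roughly like this (the `inputs`/`targets` field names are assumptions about the data schema):

```python
import hashlib

def dedup(dataset, input_key="inputs", target_key="targets"):
    """Drop examples whose (input, target) pair has already been seen."""
    seen = set()

    def is_new(example):
        key = hashlib.md5(
            (example[input_key] + "\x00" + example[target_key]).encode("utf-8")
        ).hexdigest()
        if key in seen:
            return False
        seen.add(key)
        return True

    # Single-process filter so the `seen` set is shared across all examples.
    return dataset.filter(is_new)

mixed_deduped = dedup(mixed)
```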

Are there any suggestions for reproducing the MMLU results of the released Flan-T5-xl model (49%), or even the results in the paper (52%)? Thanks a lot.
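In case the evaluation itself is the issue: my MMLU harness is essentially 5-shot rank classification over the answer letters, roughly like this simplified sketch (the prompt format and the per-subject loop are approximations of what I actually run):

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl").eval()
LETTERS = ["A", "B", "C", "D"]

def format_example(ex, with_answer=True):
    choices = "\n".join(f"{l}. {c}" for l, c in zip(LETTERS, ex["choices"]))
    answer = f" {LETTERS[ex['answer']]}" if with_answer else ""
    return f"{ex['question']}\n{choices}\nAnswer:{answer}"

@torch.no_grad()
def pick_choice(prompt):
    """Score each answer letter by its negative loss and return the argmax."""
    # T5 uses relative position embeddings, so inputs longer than 512 tokens
    # are allowed; truncate only as a safety net.
    enc = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
    scores = []
    for letter in LETTERS:
        labels = tokenizer(letter, return_tensors="pt").input_ids
        scores.append(-model(**enc, labels=labels).loss.item())
    return scores.index(max(scores))

subject = "abstract_algebra"  # loop over all 57 subjects for the full score
mmlu = load_dataset("cais/mmlu", subject)
shots = "\n\n".join(format_example(ex) for ex in mmlu["dev"])  # 5 exemplars

correct = sum(
    pick_choice(shots + "\n\n" + format_example(ex, with_answer=False)) == ex["answer"]
    for ex in mmlu["test"]
)
print(f"{subject}: {correct / len(mmlu['test']):.3f}")
```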

shayne-longpre commented 1 year ago

@danczs Thanks for the question.

A couple thoughts:

Sorry this could not be more helpful. It's hard to translate the internal code (which I no longer have access to) to external implementations. I would also note that my co-authors did A LOT of tuning and runs with the internal configuration to get the 52% number. Max performance can vary by 1-2% between runs on the same data, and between checkpoints of the same run you might see another 1-2% of variability even after it has converged. (Just something to keep in mind.)

Best,

danczs commented 1 year ago

@shayne-longpre Thanks very much for your reply.

Thanks for your explanations; they help a lot.

shayne-longpre commented 1 year ago

@danczs Hmm, I'm not sure why it was so low. I noticed that a few recent papers seem to have gotten strong results with a 100k-example sample of the training data (e.g. https://arxiv.org/pdf/2306.04751.pdf), and their training code is public.
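If you want to try that route, drawing a random 100k-example subset is straightforward with Hugging Face datasets, e.g. (a quick sketch; `mixed_dataset` is a stand-in for whatever object holds your full mixture):

```python
# Sketch: sample 100k examples from the combined training mixture.
# `mixed_dataset` is a placeholder for your datasets.Dataset object.
subset_100k = mixed_dataset.shuffle(seed=0).select(range(100_000))
```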

Also, maybe Hyung Won's recent comments provide some insights here?