Closed: 0xSage closed this issue 1 month ago
Can you update the results of this experiment? @tikikun @bachvudinh
Model loss exploding when the data is switched from transcription -> instruction
Learning:
Conclusion: The data segmenting strategy is not currently working for cross-domain training in a single pass, at least for this run.
A few more questions:
Can you please provide some more details on what you observed when the model loss started exploding? What specific metrics or graphs did you use to track the issue, and what was the exact behavior you saw?
What are the implications now if we can't get it to work with segmented data? What are the next steps?
Can you elaborate on what you mean by 'recovery is slower' and how we can address it by expanding the dataset length?
I want to make sure we're making the most of this experiment, given it used a substantial amount of burst compute, even if things didn't work out as planned.
The learnings will help us improve for the next iteration. Thanks!
We observed that the loss stabilized during the first training run, which consisted of only the transcription data. However, when training switched to the instruction data, the loss spiked significantly and showed no tendency to decrease. You can find the data at 2e-4.txt.
Next steps:
^ Ah yes, and these are also tangible steps we've taken to mitigate the issue:
- When training with the data mixing strategy, we tried different learning rates from 1e-4 to 2.5e-4 but observed the same loss pattern: the loss drops quickly to ~1.3 over the first 350k transcription samples, but once the sound instruction data starts, the loss explodes and, if training continues, it converges to around 6.9-7.0.
- So I decided to shuffle the whole dataset as in our previous training (v0.2), and the loss dropped quickly to ~1.3 in just 1,000 steps with no sign of exploding (see the sketch below for the difference between the two orderings).
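For reference, here is a minimal sketch of the difference between the two orderings (the example lists and sizes are placeholders, not our actual data pipeline):

```python
import random

# Placeholder example lists; in the real run these are the tokenized
# transcription and sound-instruction datasets.
transcription_data = [{"task": "transcription", "id": i} for i in range(350_000)]
instruction_data = [{"task": "instruction", "id": i} for i in range(100_000)]

# Segmented strategy (the one that exploded): all transcription examples first,
# then all instruction examples.
segmented = transcription_data + instruction_data

# Fully shuffled strategy (v0.2-style, the one that converged): interleave the
# two tasks uniformly so every batch mixes domains from step 0.
shuffled = transcription_data + instruction_data
random.shuffle(shuffled)
```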
This is interesting. So even after playing around with the LRs, the model couldn't converge on a new task (i.e., transcription -> instruction).
Q: Are we going back to an even mix? Q: Is the bigger question whether training on the transcription task is necessary at all? IIRC we introduced pretraining and transcription in the same run, without any ablation studies.
And thank you and @hahuyhoang411 for the late-night troubleshooting yesterday.
@bachvudinh please update the final training results and we can close this ticket.
MMLU results of the latest run, in which we mixed the transcription and instruct data equally:
- Checkpoint at end of epoch (7300):
Closing again, as this is already properly recorded.
Thanks @hahuyhoang411 for taking notes during the write-up process.
Problem
From @tikikun: After benchmarking the pretraining checkpoint on MMLU, we observed a significant degradation in the model's text capabilities. The introduction of new multilingual data caused the model's performance to drop sharply, with accuracy falling to approximately 0.377 at the last-step checkpoint --> We need to set up a series of experimental runs to attempt to recover the MMLU performance during the instruction tuning stage.
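To make these re-checks easy to reproduce, here is a minimal sketch of how an MMLU pass over a checkpoint could be scripted with EleutherAI's lm-evaluation-harness (this assumes a 0.4.x release of the harness; the checkpoint path and batch size are placeholders, not the settings used in the run above):

```python
# Minimal sketch: re-check MMLU on a given checkpoint.
# Assumes lm-evaluation-harness 0.4.x; path and batch size are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/checkpoint,dtype=bfloat16",
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=8,
)

# Print the aggregate MMLU block; the exact metric key (e.g. "acc,none")
# depends on the harness version.
print(results["results"]["mmlu"])
```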
Description
Instruction task (after a shorter pretrain, 3k steps). Training time: ~3,800 steps, ~3 hours on 8xH100.
Instruction task (after a longer pretrain, 9k steps). Training time: ~3,800 steps, ~3 hours on 8xH100.
Instruction & transcription task (after a longer pretrain, 9k steps). Training time: ~4,800 steps, ~4 hours on 8xH100.
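For bookkeeping, the three runs above can be summarized in a small config sketch (the run names are hypothetical; step counts and hardware are copied from the list above):

```python
# Summary of the three instruction-stage runs; names are hypothetical.
RUNS = {
    "instruct_after_short_pretrain": {
        "pretrain_steps": 3_000, "finetune_steps": 3_800,
        "tasks": ["instruction"], "hardware": "8xH100", "hours": 3,
    },
    "instruct_after_long_pretrain": {
        "pretrain_steps": 9_000, "finetune_steps": 3_800,
        "tasks": ["instruction"], "hardware": "8xH100", "hours": 3,
    },
    "instruct_and_transcription_after_long_pretrain": {
        "pretrain_steps": 9_000, "finetune_steps": 4_800,
        "tasks": ["instruction", "transcription"], "hardware": "8xH100", "hours": 4,
    },
}
```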
It seems that further pretraining slows down MMLU recovery, but not by much (55 vs. 51).
Per qualitative checking and scoring (Alpaca/OpenHermes audio), it's clear that further pretraining has a positive impact on the model's ability to converge on instruction following.
Adding the filtering token and transcription data in step 3 slows down MMLU recovery.
Hypothesis
We suspect that if we teach the model to do transcription first and then instruction later, the recovery speed will stay the same without mixing data, and we are still able to introduce the 1-1 mapping from transcription. A rough sketch of this staged schedule is below.
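The sketch below uses hypothetical stand-ins (`train_for_steps`, the dataset objects, and the step counts/learning rate are illustrative, not our actual training code):

```python
# Staged schedule: transcription only first, then instruction only,
# with no within-batch mixing. All names below are hypothetical stand-ins.
def staged_training(model, transcription_ds, instruction_ds, train_for_steps):
    # Stage 1: teach the 1-1 audio -> text mapping via transcription.
    model = train_for_steps(model, transcription_ds, num_steps=4_800, lr=2e-4)
    # Stage 2: switch entirely to sound-instruction data and check whether
    # MMLU recovers at the same speed as the unmixed baseline.
    model = train_for_steps(model, instruction_ds, num_steps=3_800, lr=2e-4)
    return model
```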
Results
Learnings