homebrewltd / ichigo

Llama3.1 learns to Listen

experiment: Segmented Training to Recover MMLU #59

Closed 0xSage closed 1 month ago

0xSage commented 1 month ago

Problem

From @tikikun: After benchmarking the pretraining checkpoint on MMLU, we observed a significant degradation in the model's text capabilities. The introduction of the new multilingual data caused performance to drop sharply, with accuracy falling to approximately 0.377 at the last-step checkpoint. We need to set up a series of experimental runs to attempt to recover MMLU performance during the instruction-tuning stage.
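For reference, a minimal sketch of how such an MMLU check can be scripted. The use of lm-evaluation-harness is an assumption (the issue does not say which benchmark harness was used), and the checkpoint path is a placeholder:

```python
# Sketch only: score a pretraining checkpoint on MMLU with lm-evaluation-harness.
# The harness choice, checkpoint path, and few-shot setting are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                  # Hugging Face backend
    model_args="pretrained=path/to/checkpoint",  # hypothetical checkpoint path
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=8,
)
# Per-task accuracies; the aggregate MMLU here was ~0.377 at the last checkpoint.
print(results["results"])
```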

Description

Hypothesis

[Image: hypothesis illustration]

We suspect that if we teach the model transcription first and instruction later, the recovery speed will stay the same even without mixing the data, while we can still introduce the 1-1 mapping via transcription.
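A minimal sketch of the segmented ordering in this hypothesis, using the Hugging Face `datasets` API; the file names are placeholders:

```python
# Sketch of the segmented ordering: all transcription examples first, then all
# instruction examples, with no shuffling, so the 1-1 audio-to-text mapping is
# learned before instruction tuning begins. File names are placeholders.
from datasets import concatenate_datasets, load_dataset

transcription = load_dataset("json", data_files="transcription.jsonl", split="train")
instruction = load_dataset("json", data_files="instruction.jsonl", split="train")

segmented = concatenate_datasets([transcription, instruction])
```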

Results

Learnings

0xSage commented 1 month ago

Can you update results of this experiment? @tikikun @bachvudinh

tikikun commented 1 month ago

Model loss explodes when the data switches from transcription -> instruction.

[Screenshot, 2024-09-17: training loss curve]

Learning:

Conclusion: the data-segmenting strategy is not currently working for cross-domain training in one go, at least for this run.

tikikun commented 1 month ago

2e-4.txt

0xSage commented 1 month ago

A few more questions:

  1. Can you please provide some more details on what you observed when the model loss started exploding? What specific metrics or graphs did you use to track the issue, and what was the exact behavior you saw?

  2. What are the implications now if we can't get it to work with segmented data? What are the next steps?

  3. Can you elaborate on what you mean by 'recovery is slower' and how we can address it by expanding the dataset length?

I want to make sure we're making the most of this experiment, given it used a substantial amount of burst compute, even if things didn't work out as planned.

The learnings will help us improve for the next iteration. Thanks!

tikikun commented 1 month ago
  1. We segmented the dataset into two segments:
    • The first 30% of the dataset contained only transcription data.
    • The last 70% of the dataset contained only instruction data.

We observed that the loss stabilized during the first training segment, which consisted of only the transcription data. However, when training switched to the instruction data, the loss spiked significantly and showed no tendency to decrease (see the monitoring sketch after this list). You can find the data in 2e-4.txt.

  2. Implications:
    • We were unable to get the model to recover the MMLU score faster based on segmenting the dataset according to our theory as described in the issue description.
    • A mixed dataset remains the most straightforward way for cross-domain training, at least in our case.
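
To make that boundary behaviour easier to see in future runs, here is a minimal sketch of a loss-spike check, assuming the Hugging Face `Trainer` is used; the callback, window, and threshold are illustrative choices, not the actual training code:

```python
# Illustrative only: flag a loss explosion such as the one seen at the
# transcription -> instruction boundary. Assumes the Hugging Face Trainer;
# the window size and spike factor are arbitrary choices for this sketch.
from transformers import TrainerCallback


class LossSpikeCallback(TrainerCallback):
    def __init__(self, window=50, spike_factor=3.0):
        self.window = window              # number of recent loss values to average
        self.spike_factor = spike_factor  # how far above the recent mean counts as a spike
        self.recent = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is None:
            return
        if len(self.recent) >= self.window:
            baseline = sum(self.recent) / len(self.recent)
            if loss > self.spike_factor * baseline:
                # e.g. "step 3200: loss 6.95 vs recent mean 1.31"
                print(f"step {state.global_step}: loss {loss:.2f} "
                      f"vs recent mean {baseline:.2f}")
        self.recent = (self.recent + [loss])[-self.window:]
```

Passing an instance via `Trainer(callbacks=[LossSpikeCallback()])` would surface the spike in the logs right where the data switches segments.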

Next steps:

  1. In theory, the MMLU score recovers roughly linearly with the number of training steps, as per our original observation. Therefore, if we have more data and train for longer, the MMLU score can potentially recover to a much better level.
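
As a back-of-the-envelope version of that linear argument (every number except the observed 0.377 is a hypothetical placeholder):

```python
# Back-of-the-envelope estimate of the extra steps needed for MMLU to recover,
# assuming roughly linear recovery per step. All numbers except the observed
# 0.377 are hypothetical placeholders, not measured values.
current_mmlu = 0.377          # observed at the last pretraining checkpoint
target_mmlu = 0.60            # hypothetical recovery target
recovery_per_1k_steps = 0.01  # hypothetical slope estimated from earlier runs

extra_steps = (target_mmlu - current_mmlu) / recovery_per_1k_steps * 1000
print(f"~{extra_steps:,.0f} additional training steps under the linear assumption")
```
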
bachvudinh commented 1 month ago
  1. When training with the data-segmenting strategy, we set different learning rates from 1e-4 to 2.5e-4 but observed the same loss pattern: the loss drops quickly to about 1.3 over the first 350k transcription samples, but when it reaches the sound-instruction data the loss explodes, and if training continues it converges to around 6.9-7.0.
  2. So I decided to shuffle the whole dataset, like in our previous training (v0.2), and the loss drops really fast to about 1.3 in just 1,000 steps with no sign that it will explode (see the sketch below).
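
A minimal sketch of the evenly mixed ordering described in point 2, using the Hugging Face `datasets` API; the file names are placeholders, and the 50/50 probabilities mirror the equal mix used in the final run:

```python
# Sketch of the shuffled, evenly mixed ordering (as in the v0.2-style run).
# File names are placeholders; 50/50 mirrors the equal mix described later.
from datasets import interleave_datasets, load_dataset

transcription = load_dataset("json", data_files="transcription.jsonl", split="train")
instruction = load_dataset("json", data_files="instruction.jsonl", split="train")

mixed = interleave_datasets(
    [transcription, instruction],
    probabilities=[0.5, 0.5],
    seed=42,
    stopping_strategy="all_exhausted",  # keep sampling until both sources are used up
)
```
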
tikikun commented 1 month ago

^ Ah yeah, and these are also tangible steps we've taken to mitigate the issue.

0xSage commented 1 month ago
  1. When training with the data-segmenting strategy, we set different learning rates from 1e-4 to 2.5e-4 but observed the same loss pattern: the loss drops quickly to about 1.3 over the first 350k transcription samples, but when it reaches the sound-instruction data the loss explodes, and if training continues it converges to around 6.9-7.0.
  2. So I decided to shuffle the whole dataset, like in our previous training (v0.2), and the loss drops really fast to about 1.3 in just 1,000 steps with no sign that it will explode.

This is interesting. So even after playing around with the LRs, the model couldn't converge on the new task (i.e. transcription -> instruction).

Q: Are we going back to an even mix?

Q: Is the bigger question whether training on the transcription task is necessary at all? IIRC we introduced pretraining and transcription in the same run, without any ablation studies.

And thank you and @hahuyhoang411 for the late-night troubleshooting yesterday.

tikikun commented 1 month ago

@bachvudinh please update final training result and we can close this ticket

bachvudinh commented 1 month ago

MMLU results of the latest run, in which we mixed the transcription and instruction data equally:

[Images: MMLU results at intermediate checkpoints]

- Checkpoint at end of epoch (7300):

[Image: MMLU result at the end-of-epoch checkpoint]

tikikun commented 1 month ago

Closing again, as this is already properly recorded.

cc @hahuyhoang411 for taking note of this in the write-up process.