homebrewltd / llama3-s

Llama3.1 learns to Listen

experiment: Segmented Training to Recover MMLU #59

Open 0xSage opened 1 day ago

0xSage commented 1 day ago

Problem

From @tikikun: After benchmarking the pretraining checkpoint on MMLU, we observed a significant degradation in the model's text capabilities. The introduction of new multilingual data caused performance to drop sharply, with accuracy falling to approximately 0.377 at the last-step checkpoint. We therefore need to set up a series of experimental runs to attempt to recover MMLU performance during the instruction-tuning stage.
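For reference, a minimal sketch of how a checkpoint can be scored on MMLU by comparing next-token log-probabilities of the answer letters. The checkpoint path is a placeholder and the test set is subsampled; this is not the exact harness behind the 0.377 number, just an approximation of the setup.

```python
# Minimal MMLU scoring sketch: pick the answer letter (A-D) whose token gets
# the highest next-token logit after the prompt. Checkpoint path is a placeholder.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

CKPT = "path/to/pretraining-checkpoint"  # placeholder, not the actual checkpoint
tok = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForCausalLM.from_pretrained(CKPT, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

ds = load_dataset("cais/mmlu", "all", split="test")
letters = ["A", "B", "C", "D"]
letter_ids = [tok(f" {l}", add_special_tokens=False).input_ids[-1] for l in letters]

subset = ds.select(range(500))  # quick subsample, not the full benchmark
correct = 0
for ex in subset:
    prompt = (
        ex["question"]
        + "\n"
        + "\n".join(f"{l}. {c}" for l, c in zip(letters, ex["choices"]))
        + "\nAnswer:"
    )
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    pred = torch.stack([next_token_logits[i] for i in letter_ids]).argmax().item()
    correct += int(pred == ex["answer"])

print(f"MMLU accuracy (subsample): {correct / len(subset):.3f}")
```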

Description

Hypothesis

[Image: hypothesis diagram]

We suspect that if we teach the model transcription first and instruction afterward, the recovery speed will stay the same as with mixed data, while still letting us introduce a 1-1 mapping through transcription.
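To make the segmented setup concrete, a minimal sketch of the intended data ordering, assuming two placeholder JSONL files rather than the actual training mixture: all transcription samples come before all instruction samples, and no global shuffle crosses the boundary.

```python
# Sketch of the "segmented" ordering: all transcription data first, then all
# instruction data, with no shuffling across the boundary.
# Dataset file names are placeholders, not the actual training mixture.
from datasets import load_dataset, concatenate_datasets

transcription = load_dataset("json", data_files="transcription.jsonl", split="train")
instruction = load_dataset("json", data_files="instruction.jsonl", split="train")

# Shuffle within each segment only, then concatenate in the desired order.
segmented = concatenate_datasets([
    transcription.shuffle(seed=42),
    instruction.shuffle(seed=42),
])

# The trainer must then iterate sequentially over `segmented`
# (i.e. a sequential sampler / dataloader with shuffle=False).
```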

Results

Learnings

0xSage commented 1 day ago

Can you update results of this experiment? @tikikun @bachvudinh

tikikun commented 1 day ago

Model loss explodes when the data switches from transcription -> instruction.

[Screenshot 2024-09-17 12:36: training loss]

Learning:

Conclusion: for this run, the data-segmenting strategy did not work for cross-domain training in one go.

tikikun commented 1 day ago

2e-4.txt
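As a side note, a rough sketch for plotting the attached log to inspect the loss curve around the data switch; the `step ... loss ...` line format and the boundary step are assumptions, not the guaranteed layout of 2e-4.txt.

```python
# Rough sketch: parse loss vs. step from a plain-text training log and plot it
# to visualise the spike at the data-switch boundary.
# The "step ... loss ..." line format is an assumption about the log layout.
import re
import matplotlib.pyplot as plt

BOUNDARY_STEP = 10_000  # placeholder: set to the step where data switches to instruction

steps, losses = [], []
with open("2e-4.txt") as f:
    for line in f:
        m = re.search(r"step[=:\s]+(\d+).*?loss[=:\s]+([\d.]+)", line, re.IGNORECASE)
        if m:
            steps.append(int(m.group(1)))
            losses.append(float(m.group(2)))

plt.plot(steps, losses)
plt.axvline(x=BOUNDARY_STEP, linestyle="--", label="transcription → instruction switch")
plt.xlabel("step")
plt.ylabel("training loss")
plt.legend()
plt.show()
```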

0xSage commented 1 day ago

A few more questions:

  1. Can you please provide some more details on what you observed when the model loss started exploding? What specific metrics or graphs did you use to track the issue, and what was the exact behavior you saw?

  2. What are the implications now if we can't get it to work with segmented data? What are the next steps?

  3. Can you elaborate on what you mean by 'recovery is slower' and how we can address it by expanding the dataset length?

I want to make sure we're making the most of this experiment, given it consumed a substantial amount of burst compute, even if things didn't work out as planned.

The learnings will help us improve for the next iteration. Thanks!

tikikun commented 1 day ago
  1. We segmented the dataset into two segments:
    • The first 30% of the dataset contained only transcription data.
    • The last 70% of the dataset contained only instruction data.

We observed that the loss stabilized during the first portion of the run, which consisted of only the transcription data. However, once training switched to the instruction data, the loss spiked significantly and showed no tendency to decrease. You can find the log in 2e-4.txt.

  2. Implications:
    • We were unable to make the model recover the MMLU score faster by segmenting the dataset according to the theory described in the issue description.
    • A mixed dataset remains the most straightforward approach for cross-domain training, at least in our case.

Next steps:

  1. In theory, the MMLU score recovers roughly linearly with the number of training steps, per our original observation. Therefore, if we have more data and train longer, the MMLU score can potentially recover to a much better number (see the toy extrapolation below).
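To illustrate the linear-recovery expectation, a toy extrapolation: every (step, accuracy) point except the 0.377 starting value is made up for illustration, not a measured result.

```python
# Toy linear extrapolation of MMLU recovery vs. training steps.
# Only the 0.377 starting point comes from the issue description;
# the other numbers are hypothetical placeholders.
import numpy as np

steps = np.array([0, 2000, 4000])     # hypothetical instruct-tuning steps
mmlu = np.array([0.377, 0.42, 0.46])  # hypothetical recovered accuracies

slope, intercept = np.polyfit(steps, mmlu, 1)
target = 0.60                         # hypothetical target MMLU
steps_needed = (target - intercept) / slope
print(f"≈{steps_needed:,.0f} steps to reach MMLU {target} if recovery stays linear")
```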
bachvudinh commented 1 day ago
  1. When training with the data segmenting strategy, we tried learning rates from 1e-4 to 2.5e-4 but observed the same loss pattern: the loss drops quickly to ~1.3 over the first 350k transcription samples, but once the sound instruction data starts the loss explodes, and if training continues it converges to around 6.9-7.0.
  2. So I decided to shuffle the whole dataset, like our previous training (v0.2), and the loss dropped to ~1.3 in just 1,000 steps with no sign of exploding (sketch of the mixed setup below).
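For contrast with the segmented ordering above, a minimal sketch of the mixed/shuffled setup, again with placeholder dataset files: the two sources are interleaved roughly evenly and shuffled globally.

```python
# Sketch of the mixed strategy: interleave transcription and instruction data
# roughly evenly and shuffle globally, as in the v0.2-style runs.
# Dataset file names and the 50/50 ratio are placeholders.
from datasets import load_dataset, interleave_datasets

transcription = load_dataset("json", data_files="transcription.jsonl", split="train")
instruction = load_dataset("json", data_files="instruction.jsonl", split="train")

mixed = interleave_datasets(
    [transcription, instruction],
    probabilities=[0.5, 0.5],            # even mix
    seed=42,
    stopping_strategy="all_exhausted",   # keep sampling until both sources are used up
).shuffle(seed=42)
```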
tikikun commented 1 day ago

^ Ah yes, and these are also tangible mitigation steps we've taken.

0xSage commented 1 day ago
> 1. When training with the data segmenting strategy, we tried learning rates from 1e-4 to 2.5e-4 but observed the same loss pattern: the loss drops quickly to ~1.3 over the first 350k transcription samples, but once the sound instruction data starts the loss explodes, and if training continues it converges to around 6.9-7.0.
> 2. So I decided to shuffle the whole dataset, like our previous training (v0.2), and the loss dropped to ~1.3 in just 1,000 steps with no sign of exploding.

This is interesting. So even after playing around with the LRs, the model couldn't converge on the new task (i.e., transcription -> instruction).

Q: Are we going back to an even mix?

Q: Is the bigger question whether training on the transcription task is necessary at all? IIRC we introduced pretraining and transcription in the same run, without any ablation studies.

And thank you and @hahuyhoang411 for the late-night troubleshooting yesterday.

tikikun commented 1 day ago

@bachvudinh please update final training result and we can close this ticket

bachvudinh commented 23 hours ago

MMLU results of the latest run, in which we mixed the transcription and instruct data equally:

[Images: MMLU results]

Checkpoint at end of epoch (7300):

[Image: MMLU results]