Closed: 0xSage closed this issue 1 month ago
Can you update the results of this experiment? @tikikun @bachvudinh
Model loss exploding when the data is switched from transcription -> instruction
Learning:
Conclusion: The data segmenting strategy is not currently working for cross-domain training in a single pass, at least for this run.
A few more questions:
Can you please provide some more details on what you observed when the model loss started exploding? What specific metrics or graphs did you use to track the issue, and what was the exact behavior you saw?
What are the implications now if we can't get it to work with segmented data? What are the next steps?
Can you elaborate on what you mean by 'recovery is slower' and how we can address it by expanding the dataset length?
I want to make sure we're making the most of this experiment, given it used a substantial amount of burst compute, even if things didn't work out as planned.
The learnings will help us improve for the next iteration. Thanks!
We observed that the loss stabilized during the first training run, which consisted of only the transcription data. However, when training switched to the instruction data, the loss spiked significantly and showed no tendency to decrease. You can find the data at 2e-4.txt.
Next steps:
^ Ah yes, and these are also tangible steps we've taken to mitigate the issue:
- When training with the data mixing strategy, we tried different learning rates from 1e-4 to 2.5e-4 but observed the same loss pattern: the loss drops quickly to ~1.3 over the first 350k transcription samples, but once the sound instruction data starts, the loss explodes and, if training continues, it converges to around 6.9-7.0.
- So I decided to shuffle the whole dataset as in our previous training (v0.2), and the loss dropped quickly to ~1.3 in just 1,000 steps with no sign of exploding (see the sketch below for the difference between the two orderings).
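For reference, here is a minimal sketch of the difference between the two orderings (the example lists and sizes are placeholders, not our actual data pipeline):

```python
import random

# Placeholder example lists; in the real run these are the tokenized
# transcription and sound-instruction datasets.
transcription_data = [{"task": "transcription", "id": i} for i in range(350_000)]
instruction_data = [{"task": "instruction", "id": i} for i in range(100_000)]

# Segmented strategy (the one that exploded): all transcription examples first,
# then all instruction examples.
segmented = transcription_data + instruction_data

# Fully shuffled strategy (v0.2-style, the one that converged): interleave the
# two tasks uniformly so every batch mixes domains from step 0.
shuffled = transcription_data + instruction_data
random.shuffle(shuffled)
```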
This is interesting. So even after playing around with the LRs, the model couldn't converge on a new task (i.e., transcription -> instruction).
Q: Are we going back to an even mix? Q: Is the bigger question whether training on the transcription task is necessary at all? IIRC we introduced pretraining and transcription in the same run, without any ablation studies.
And thank you and @hahuyhoang411 for the late-night troubleshooting yesterday.
@bachvudinh please update the final training results and we can close this ticket.
MMLU results of the latest run, in which we mixed the transcription and instruct data equally:
- Checkpoint at end of epoch (7300):
Closing again, as this is already properly recorded.
Thanks @hahuyhoang411 for taking notes during the write-up process.
Problem
From @tikikun: After benchmarking the pretraining checkpoint on MMLU, we observed a significant degradation in the model's text capabilities. The introduction of new multilingual data caused the model's performance to drop sharply, with accuracy falling to approximately 0.377 at the last-step checkpoint --> We need to set up a series of experimental runs to attempt to recover the MMLU performance during the instruction tuning stage.
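To make these re-checks easy to reproduce, here is a minimal sketch of how an MMLU pass over a checkpoint could be scripted with EleutherAI's lm-evaluation-harness (this assumes a 0.4.x release of the harness; the checkpoint path and batch size are placeholders, not the settings used in the run above):

```python
# Minimal sketch: re-check MMLU on a given checkpoint.
# Assumes lm-evaluation-harness 0.4.x; path and batch size are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/checkpoint,dtype=bfloat16",
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=8,
)

# Print the aggregate MMLU block; the exact metric key (e.g. "acc,none")
# depends on the harness version.
print(results["results"]["mmlu"])
```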
Description
Instruction task (after a shorter pretrain, 3k steps). Training time: ~3,800 steps, ~3 hours on 8xH100.
Instruction task (after a longer pretrain, 9k steps). Training time: ~3,800 steps, ~3 hours on 8xH100.
Instruction & transcription task (after a longer pretrain, 9k steps). Training time: ~4,800 steps, ~4 hours on 8xH100.
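For bookkeeping, the three runs above can be summarized in a small config sketch (the run names are hypothetical; step counts and hardware are copied from the list above):

```python
# Summary of the three instruction-stage runs; names are hypothetical.
RUNS = {
    "instruct_after_short_pretrain": {
        "pretrain_steps": 3_000, "finetune_steps": 3_800,
        "tasks": ["instruction"], "hardware": "8xH100", "hours": 3,
    },
    "instruct_after_long_pretrain": {
        "pretrain_steps": 9_000, "finetune_steps": 3_800,
        "tasks": ["instruction"], "hardware": "8xH100", "hours": 3,
    },
    "instruct_and_transcription_after_long_pretrain": {
        "pretrain_steps": 9_000, "finetune_steps": 4_800,
        "tasks": ["instruction", "transcription"], "hardware": "8xH100", "hours": 4,
    },
}
```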
It seems that further pretraining slows down MMLU recovery, but not by much (55 vs. 51).
Per qualitative checking and scoring (Alpaca/OpenHermes audio), it's clear that further pretraining has a positive impact on the model's ability to converge on instruction following.
Adding the filtering token and transcription data in step 3 slows down MMLU recovery.
Hypothesis
We suspect that if we teach the model to do transcription first and then instruction later, the recovery speed will stay the same without mixing data, and we are still able to introduce the 1-1 mapping from transcription. A rough sketch of this staged schedule is below.
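The sketch below uses hypothetical stand-ins (`train_for_steps`, the dataset objects, and the step counts/learning rate are illustrative, not our actual training code):

```python
# Staged schedule: transcription only first, then instruction only,
# with no within-batch mixing. All names below are hypothetical stand-ins.
def staged_training(model, transcription_ds, instruction_ds, train_for_steps):
    # Stage 1: teach the 1-1 audio -> text mapping via transcription.
    model = train_for_steps(model, transcription_ds, num_steps=4_800, lr=2e-4)
    # Stage 2: switch entirely to sound-instruction data and check whether
    # MMLU recovers at the same speed as the unmixed baseline.
    model = train_for_steps(model, instruction_ds, num_steps=3_800, lr=2e-4)
    return model
```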
Results
Learnings