0xSage opened this issue 1 week ago
From Bach:
Data Sources: We gathered 2.42M English audio files (MLS train set) the same pretrain data as the previous run but recreated with the new WhisperVQ checkpoint. Futhermore, i collected more 1.3M audio for 7 languages from facebook/librispeech:
Max: 503 tokens
Average number of tokens:
Total number of tokens:
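The token statistics above (max / average / total) could be computed with a small sketch like this; `token_stats` is a hypothetical helper, assuming each example is already tokenized into a list of token IDs:

```python
def token_stats(tokenized_examples):
    """Compute max, average, and total token counts over a dataset.

    tokenized_examples: iterable of lists of token IDs (one per audio file).
    """
    lengths = [len(ex) for ex in tokenized_examples]
    return {
        "max": max(lengths),
        "average": sum(lengths) / len(lengths),
        "total": sum(lengths),
    }

# Toy usage with made-up lengths (not the real dataset):
stats = token_stats([[0] * 10, [0] * 20, [0] * 30])
# stats == {"max": 30, "average": 20.0, "total": 60}
```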
Training Config:
Data source: https://github.com/homebrewltd/llama3-s/issues/53
@bachvudinh mind updating results from phase 1 here when you have it? Thanks!
@0xSage added centralized data source for the epic
Loss: Converges to 1.9-2.0.
MMLU score:
The latest run's results are not good; we tried fine-tuning with a very low LoRA rank (r) to avoid degradation, but it still happens.
r=4, alpha=4, LR ~ 8e-6
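As a sanity check on the hyperparameters above, here is a minimal numpy sketch of what r=4, alpha=4 means for a LoRA update; the matrix dimensions are illustrative, not the actual model's:

```python
import numpy as np

# LoRA adapts a frozen weight W with a low-rank update:
#   W' = W + (alpha / r) * B @ A,   B: (d, r), A: (r, k)
# The run above used r=4, alpha=4, so the scaling alpha/r is 1.0.
r, alpha = 4, 4
d, k = 16, 16

rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))           # frozen base weight (toy size)
A = rng.normal(size=(r, k)) * 0.01    # trainable low-rank factor
B = np.zeros((d, r))                  # B starts at zero, so W' == W at init

delta = (alpha / r) * (B @ A)
W_prime = W + delta

# The update has rank at most r, i.e. it can move W in at most r
# directions -- which is why a very low r was tried to limit degradation.
print(np.linalg.matrix_rank(delta))  # 0 at init; at most r after training
```

Even with this constraint the run still degraded, which points at the data rather than the adapter capacity.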
Goal
Make v0.3 multilingual, accept longer questions, and include other data improvements.
Problem
Methodology
To solve the above-mentioned issues, this run focuses on data improvements.
Pipeline improvements:
Data Resources
Training Resources
Results