0xSage opened this issue 1 week ago
From Bach:
Data Sources: We gathered 2.42M English audio files (MLS train set) the same pretrain data as the previous run but recreated with the new WhisperVQ checkpoint. Futhermore, i collected more 1.3M audio for 7 languages from facebook/librispeech:
Max: 503 tokens
Average number of tokens:
Total number of tokens:
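The token statistics above (max / average / total) could be computed with a small sketch like this; `token_stats` is a hypothetical helper, assuming each example is already tokenized into a list of token IDs:

```python
def token_stats(tokenized_examples):
    """Compute max, average, and total token counts over a dataset.

    tokenized_examples: iterable of lists of token IDs (one per audio file).
    """
    lengths = [len(ex) for ex in tokenized_examples]
    return {
        "max": max(lengths),
        "average": sum(lengths) / len(lengths),
        "total": sum(lengths),
    }

# Toy usage with made-up lengths (not the real dataset):
stats = token_stats([[0] * 10, [0] * 20, [0] * 30])
# stats == {"max": 30, "average": 20.0, "total": 60}
```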
Training Config:
Data source: https://github.com/homebrewltd/llama3-s/issues/53
@bachvudinh mind updating results from phase 1 here when you have it? Thanks!
@0xSage added centralized data source for the epic
Loss: Converges to 1.9-2.0.
MMLU score:
The latest run's results are not good; we tried fine-tuning with a very low LoRA rank (r) to avoid degradation, but it still happens.
r=4, alpha=4, LR ~ 8e-6
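As a sanity check on the hyperparameters above, here is a minimal numpy sketch of what r=4, alpha=4 means for a LoRA update; the matrix dimensions are illustrative, not the actual model's:

```python
import numpy as np

# LoRA adapts a frozen weight W with a low-rank update:
#   W' = W + (alpha / r) * B @ A,   B: (d, r), A: (r, k)
# The run above used r=4, alpha=4, so the scaling alpha/r is 1.0.
r, alpha = 4, 4
d, k = 16, 16

rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))           # frozen base weight (toy size)
A = rng.normal(size=(r, k)) * 0.01    # trainable low-rank factor
B = np.zeros((d, r))                  # B starts at zero, so W' == W at init

delta = (alpha / r) * (B @ A)
W_prime = W + delta

# The update has rank at most r, i.e. it can move W in at most r
# directions -- which is why a very low r was tried to limit degradation.
print(np.linalg.matrix_rank(delta))  # 0 at init; at most r after training
```

Even with this constraint the run still degraded, which points at the data rather than the adapter capacity.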
Goal
Make v0.3 multilingual, accept longer questions, and include other data improvements.
Problem
Methodology
To solve the above-mentioned issues, this run focuses on data improvements.
Pipeline improvements:
Data Resources
Training Resources
Results