homebrewltd / ichigo

Llama3.1 learns to Listen

docs: changelog of v0.3 vs v0.2 #76

Open tikikun opened 4 days ago

tikikun commented 4 days ago

Problem Statement

Currently there is no changelog describing the differences between v0.3 and v0.2. We should have one to show what changed.

Idea

Create such a log (with a table comparing the differences one by one, per phase, etc.) to serve as a good checklist and to announce the release better.

bachvudinh commented 3 days ago

Version Comparison: v0.2 vs v0.3

Overall Comparison

| Phase | Aspect | v0.2 | v0.3 |
|---|---|---|---|
| Pretraining | Data Size | 2.42M | 3.87M |
| | Data Source | parler-tts/mls_eng_10k | facebook/multilingual_librispeech |
| | Data Synthetic Pipeline | Using WhisperVQ (old checkpoint: whisper-vq-stoks-medium-en+pl.model) to tokenize English-only audio. | Using the latest checkpoint, whisper-vq-stoks-v3-7lang.model, for 8-language audio. |
| | Epoch | 1 | 1 |
| | Global batch size | 480 | 480 |
| | Learning Rate | 2e-4 | 2e-4 |
| | Warmup Steps | 80 | 50 |
| | Weight Decay | 0.005 | 0.005 |
| | Max length | 512 | 512 |
| | Precision | bf16 | bf16 |
| Instruction Phase | Data Size | 929K | 1.89M + 165k (phase 3) |
| | Preprocessing | Using rule-based methods to remove all hard-to-pronounce prompts. | Using rule-based methods to filter out hard-to-pronounce prompts, and rephrasing certain LLM-generated responses to sound more natural and human-like. |
| | Data Synthetic Pipeline | Using the old text-to-speech checkpoint t2s-small-yt.model to generate audio, then whisper-vq-stoks-medium-en+pl.model to tokenize it. | Changed the t2s checkpoint to t2s-v1.1-small-en+pl.model and the WhisperVQ checkpoint to whisper-vq-stoks-v3-7lang.model. |
| | Epoch | 5 | 1 |
| | Global batch size | 128 | 256 |
| | Gradient Acc Step per device | 1 | 8 |
| | Learning Rate | 1e-4 | 7e-5, and 1.5e-5 for phase 3 |
| | Warmup Steps | 80 | 73, and 8 for phase 3 |
| | Weight Decay | 0.005 | 0.005 |
| | Max length | 1024 | 4096 |
| | Precision | bf16 | bf16 |
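
As a rough sanity check on the numbers above, the number of optimizer steps per run can be estimated from data size, epochs, and global batch size, and the warmup steps can then be expressed as a fraction of that total. This is only a sketch: it assumes each sample is one training sequence (no packing), that the rounded data sizes in the table are exact, and the names `RUNS`, `optimizer_steps`, and `summarize` are illustrative, not part of the ichigo codebase. Phase 3 is left out because the thread does not state its batch size separately.

```python
import math

# Values copied from the comparison table above. Data sizes are the
# rounded figures from the thread, so the step counts are estimates.
RUNS = {
    "pretrain v0.2": {"samples": 2_420_000, "epochs": 1, "global_batch": 480, "warmup": 80},
    "pretrain v0.3": {"samples": 3_870_000, "epochs": 1, "global_batch": 480, "warmup": 50},
    "instruct v0.2": {"samples": 929_000, "epochs": 5, "global_batch": 128, "warmup": 80},
    "instruct v0.3": {"samples": 1_890_000, "epochs": 1, "global_batch": 256, "warmup": 73},
}


def optimizer_steps(samples: int, epochs: int, global_batch: int) -> int:
    """Estimated optimizer steps, assuming one sample per sequence (no packing)."""
    return math.ceil(samples * epochs / global_batch)


def summarize(runs: dict) -> dict:
    """Map each run name to (estimated steps, warmup steps / estimated steps)."""
    summary = {}
    for name, run in runs.items():
        steps = optimizer_steps(run["samples"], run["epochs"], run["global_batch"])
        summary[name] = (steps, run["warmup"] / steps)
    return summary


if __name__ == "__main__":
    for name, (steps, warmup_ratio) in summarize(RUNS).items():
        print(f"{name}: ~{steps} steps, warmup ratio {warmup_ratio:.2%}")
```

Under these assumptions, the v0.3 instruction phase comes out at about 7.4k steps, so its 73 warmup steps are close to 1% of the run.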

Instruction Phase Data Task Types

| Task Type | v0.2 | v0.3 |
|---|---|---|
| Speech Multiturn | None | 150k (mostly 2 turns; around 10k with >=4 turns) |
| Speech QA | 679k samples | 1.332M samples |
| Transcription | 250k samples (using a special token to denote a transcription task) | 400k samples (using 6 different prompts) |
| Noise Audio | None | 8k samples (using Qwen2.5-72B to generate diverse synthetic answers for randomly generated sound tokens, with lengths matching the distribution of the Speech QA prompts) |
| Text-only | None | 150k samples, including 100k multiturn + 50k single turn |
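
The per-task counts can be cross-checked against the instruction-phase data sizes reported in the overall comparison. The sketch below simply sums the figures from the task-type table; the dictionary names and the `total` helper are illustrative, and the counts are the rounded values from the thread.

```python
# Per-task sample counts copied from the task-type table (rounded figures).
V02_TASKS = {
    "Speech QA": 679_000,
    "Transcription": 250_000,
}
V03_TASKS = {
    "Speech Multiturn": 150_000,
    "Speech QA": 1_332_000,
    "Transcription": 400_000,
    "Noise Audio": 8_000,
    "Text-only": 150_000,
}

# Instruction-phase data sizes from the overall comparison.
V02_REPORTED = 929_000
V03_REPORTED = 1_890_000 + 165_000  # main instruction data + phase 3


def total(tasks: dict) -> int:
    """Sum the sample counts across all task types."""
    return sum(tasks.values())


if __name__ == "__main__":
    print("v0.2:", total(V02_TASKS), "vs reported", V02_REPORTED)
    print("v0.3:", total(V03_TASKS), "vs reported", V03_REPORTED)
```

The v0.2 counts add up exactly to the reported 929K. The v0.3 counts add up to 2.04M against a reported 1.89M + 165k = 2.055M; the thread does not explain the 15k difference, which may simply come from rounding or from how phase 3 data is counted.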

Performance