Model showing signs of adapting sound tokens to its understanding
Previous assumption: The model maps sound tokens to word tokens on a one-to-one basis.
Actual result: The model uses the context of the conversation (about functions and function calling) to determine that it should output a function-like name, such as get_weather_info.
Model achieving a median (not mean) WER of around 0.1 on the transcription test.
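For reference, a minimal sketch of how a per-sample, median-aggregated WER can be computed; the `jiwer` library and the transcripts below are illustrative, not the exact evaluation harness used here:

```python
# Illustrative sketch only: per-sample WER, then the median (not the mean),
# so a few runaway outputs do not dominate the score.
from statistics import median

import jiwer  # pip install jiwer

def median_wer(references: list[str], hypotheses: list[str]) -> float:
    """Compute WER per sample and return the median across samples."""
    per_sample = [jiwer.wer(ref, hyp) for ref, hyp in zip(references, hypotheses)]
    return median(per_sample)

# Placeholder transcripts, only to show the call shape:
refs = ["turn on the lights in the kitchen", "what is the weather today"]
hyps = ["turn on the light in the kitchen", "what is the weather today"]
print(f"median WER: {median_wer(refs, hyps):.3f}")
```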
The model has some outlier cases: it will keep repeating itself if it stumbles across the word "something", but it turned out this issue also manifests in the original Llama 3.1 model.
Video of the same issue in llama3.1-instruct (not llama3-s)
https://github.com/user-attachments/assets/9725ec3e-50a3-4fda-ac12-8dd0cc4e550b
We will probably mitigate this temporarily with sampler configuration.
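As a hedged sketch of the kind of sampler settings that could damp such repetition loops (the model id and all values are placeholders, not the settings we will ship):

```python
# Sketch only: sampler settings that discourage repetition loops.
# Model name and values are placeholders, not tuned settings for llama3-s.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Tell me something about the weather.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,   # penalize tokens that were already generated
    no_repeat_ngram_size=4,   # disallow repeating any 4-gram
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```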
Model can understand context: there is no cue in the audio indicating whether it is a question, yet the transcription carries the question mark.
Updated @0xSage.
Goal
See: #56
Methodology
Hyperparams
Results
MMLU:
Audio-Bench:
- Checkpoint end epoch (step 7300): Alpaca: 3.6, Open-hermes: 3.31
- Checkpoint end epoch (step 7300): Score ~0.26
Learnings
Quicklinks
Data (a loading sketch follows this list):
Full text-only data: https://huggingface.co/datasets/homebrewltd/instruction-text-only-full.
Sound Instruct data: https://huggingface.co/datasets/homebrewltd/instruction-speech-whispervq-v3-subset-1 and https://huggingface.co/datasets/homebrewltd/instruction-speech-whispervq-v3-subset-2.
Rephrase data: https://huggingface.co/datasets/homebrewltd/prompt-voice-v1-repharase.
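A minimal sketch for pulling the instruction data listed above with the `datasets` library; the split name and column layout are assumptions, so check the dataset cards for the actual schema:

```python
# Sketch only: load the linked instruction datasets from the Hub.
# The "train" split and field names are assumptions; see the dataset cards.
from datasets import load_dataset

text_only = load_dataset("homebrewltd/instruction-text-only-full", split="train")
speech_v3_1 = load_dataset(
    "homebrewltd/instruction-speech-whispervq-v3-subset-1", split="train"
)

print(text_only)       # inspect features/columns before training
print(speech_v3_1[0])  # peek at one example
```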
Checkpoints: