homebrewltd / ichigo

Local realtime voice AI

run: v0.3 phase 2 instruct and transcription tuning on new data #60

Closed 0xSage closed 1 month ago

0xSage commented 2 months ago

Goal

See: #56

Methodology

Hyperparams

| Parameter | Value |
| --- | --- |
| Epochs | 1 |
| Global batch size | 256 |
| Learning rate | 7e-5 |
| LR scheduler | LambdaLR with warmup |
| Optimizer | AdamW (fused) |
| Warmup steps | 73 |
| Weight decay | 0.005 |
| Gradient checkpointing | Full |
| Max sequence length | 4096 |
| Precision | bf16 |
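
For reference, a minimal PyTorch sketch of how these hyperparameters could be wired together. The placeholder module, the post-warmup behavior (hold at peak LR), and the device handling are assumptions; the actual training loop, data pipeline, and any LR decay after warmup are not shown.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

device = "cuda" if torch.cuda.is_available() else "cpu"
# Placeholder module; stands in for the actual llama3-s checkpoint.
model = torch.nn.Linear(4096, 4096).to(device)

PEAK_LR = 7e-5
WARMUP_STEPS = 73
WEIGHT_DECAY = 0.005

# "AdamW Fused" from the table; the fused kernel needs CUDA parameters.
optimizer = AdamW(model.parameters(), lr=PEAK_LR,
                  weight_decay=WEIGHT_DECAY,
                  fused=(device == "cuda"))

def warmup_lambda(step: int) -> float:
    # Linear warmup to the peak LR over WARMUP_STEPS, then hold (assumed).
    if step < WARMUP_STEPS:
        return (step + 1) / WARMUP_STEPS
    return 1.0

scheduler = LambdaLR(optimizer, lr_lambda=warmup_lambda)
```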

Results

Learnings

Quicklinks

tikikun commented 2 months ago

Interesting observation

The model shows signs of adapting sound tokens to its contextual understanding.

Screenshot 2024-09-18 at 10 47 09

Previous assumption: The model maps sound tokens to word tokens on a one-to-one basis.
Actual result: The model uses the context of the conversation (about using functions and function calling) to determine that it should output a name resembling a function, such as get_weather_info.

tikikun commented 2 months ago

The model scores around 0.1 median (not mean) WER on the transcription test.
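
For clarity on the metric, here is a self-contained sketch of median (not mean) WER over per-sample scores; the `(reference, transcription)` pairs are placeholders, not the actual evaluation set.

```python
import statistics

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def median_wer(pairs):
    # Median over per-sample WER, as reported above (not the mean).
    return statistics.median(wer(r, h) for r, h in pairs)
```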

The model has some outlier cases: it will keep repeating itself if it unfortunately stumbles across the word "something", but it turned out this issue also manifests in the original Llama 3.1 model.

Video of the same issue in llama3.1-instruct (not llama3-s)

https://github.com/user-attachments/assets/9725ec3e-50a3-4fda-ac12-8dd0cc4e550b

We will probably mitigate this temporarily with a sampler configuration.
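
One hedged sketch of such a mitigation, assuming a Hugging Face transformers generation path; the penalty values are illustrative, not settings confirmed in this thread.

```python
from transformers import GenerationConfig

# Illustrative anti-repetition settings; the actual mitigation was not specified.
gen_config = GenerationConfig(
    max_new_tokens=256,
    do_sample=False,
    repetition_penalty=1.15,   # soften the degenerate repetition loop
    no_repeat_ngram_size=4,    # hard-block exact repeated 4-grams
)

# Usage (model/inputs assumed to exist):
# outputs = model.generate(**inputs, generation_config=gen_config)
```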

tikikun commented 2 months ago

The model can understand context: there is no cue in the audio itself indicating that the utterance is a question, yet the transcription carries the question mark.

Screenshot 2024-09-18 at 12 36 21
bachvudinh commented 2 months ago

Updated @0xSage.