Model showing signs of adapting sound tokens to its understanding
Previous assumption: The model maps sound tokens to word tokens on a one-to-one basis.
Actual result: The model uses the context of the conversation (about functions and function calling) to determine that it should output a function-like name, such as get_weather_info.
Model achieving a median (not mean) WER of around 0.1 on the transcription test.
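For reference, a minimal sketch of how a per-sample, median-aggregated WER can be computed; the `jiwer` library and the transcripts below are illustrative, not the exact evaluation harness used here:

```python
# Illustrative sketch only: per-sample WER, then the median (not the mean),
# so a few runaway outputs do not dominate the score.
from statistics import median

import jiwer  # pip install jiwer

def median_wer(references: list[str], hypotheses: list[str]) -> float:
    """Compute WER per sample and return the median across samples."""
    per_sample = [jiwer.wer(ref, hyp) for ref, hyp in zip(references, hypotheses)]
    return median(per_sample)

# Placeholder transcripts, only to show the call shape:
refs = ["turn on the lights in the kitchen", "what is the weather today"]
hyps = ["turn on the light in the kitchen", "what is the weather today"]
print(f"median WER: {median_wer(refs, hyps):.3f}")
```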
The model has some outlier cases: it will keep repeating itself if it stumbles across the word "something", but it turned out this issue also manifests in the original Llama 3.1 model.
Video of the same issue in llama3.1-instruct (not llama3-s)
https://github.com/user-attachments/assets/9725ec3e-50a3-4fda-ac12-8dd0cc4e550b
We will probably mitigate this temporarily with sampler configuration.
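As a hedged sketch of the kind of sampler settings that could damp such repetition loops (the model id and all values are placeholders, not the settings we will ship):

```python
# Sketch only: sampler settings that discourage repetition loops.
# Model name and values are placeholders, not tuned settings for llama3-s.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Tell me something about the weather.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,   # penalize tokens that were already generated
    no_repeat_ngram_size=4,   # disallow repeating any 4-gram
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```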
Model can understand context: there is no cue in the audio indicating whether it is a question, yet the transcription carries the question mark.
Updated @0xSage.
Goal
See: #56
Methodology
Hyperparams
Results
MMLU:
Audio-Bench:
- Checkpoint end epoch (step 7300): Alpaca: 3.6, Open-hermes: 3.31
- Checkpoint end epoch (step 7300): Score ~0.26
Learnings
Quicklinks
Data (a loading sketch follows this list):
Full text-only data: https://huggingface.co/datasets/homebrewltd/instruction-text-only-full.
Sound Instruct data: https://huggingface.co/datasets/homebrewltd/instruction-speech-whispervq-v3-subset-1 and https://huggingface.co/datasets/homebrewltd/instruction-speech-whispervq-v3-subset-2.
Rephrase data: https://huggingface.co/datasets/homebrewltd/prompt-voice-v1-repharase.
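A minimal sketch for pulling the instruction data listed above with the `datasets` library; the split name and column layout are assumptions, so check the dataset cards for the actual schema:

```python
# Sketch only: load the linked instruction datasets from the Hub.
# The "train" split and field names are assumptions; see the dataset cards.
from datasets import load_dataset

text_only = load_dataset("homebrewltd/instruction-text-only-full", split="train")
speech_v3_1 = load_dataset(
    "homebrewltd/instruction-speech-whispervq-v3-subset-1", split="train"
)

print(text_only)       # inspect features/columns before training
print(speech_v3_1[0])  # peek at one example
```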
Checkpoints: