Sing303 opened 1 month ago
Have you used any audio normalization? It is almost as important a step as the models themselves. Unfortunately, when I reviewed the papers and original repos (pyannote, whisper, etc.), I could not find a "general" normalization method that should be used to get the best results. What I did was mostly experimental.
Note that Vibe uses:

```rust
use std::path::PathBuf;
use std::process::{Command, Stdio};

// `Result` and `Context` come from the crate's error-handling library
// (anyhow/eyre-style); `find_ffmpeg_path` is defined elsewhere in the repo.
pub fn normalize(input: PathBuf, output: PathBuf) -> Result<()> {
    let ffmpeg_path = find_ffmpeg_path().context("ffmpeg not found")?;
    tracing::debug!("ffmpeg path is {}", ffmpeg_path.display());
    let mut cmd = Command::new(ffmpeg_path);
    let cmd = cmd.stderr(Stdio::piped()).args([
        "-i",
        input.to_str().context("tostr")?,
        "-ar",
        "16000", // resample to 16 kHz, the rate Whisper expects
        "-ac",
        "1", // downmix to mono
        "-c:a",
        "pcm_s16le", // 16-bit signed little-endian PCM
        "-af", // normalize loudness (EBU R128 `loudnorm` filter)
        "loudnorm=I=-16:TP=-1.5:LRA=11",
        output.to_str().context("tostr")?,
        "-hide_banner",
        "-y",
        "-loglevel",
        "error",
    ]);
    // The pasted snippet was truncated here; presumably the command
    // is then executed and its exit status checked, roughly:
    let out = cmd.output().context("failed to run ffmpeg")?;
    if !out.status.success() {
        tracing::error!("ffmpeg stderr: {}", String::from_utf8_lossy(&out.stderr));
    }
    Ok(())
}
```
What I learned from the speaker identification/Whisper process is that audio normalization plays a crucial part. I have no idea what the best normalization is; it's mostly experimental, and different normalizations can give different results in different situations. This is especially obvious with overlapping speech. For my tests, though, I generally use GStreamer's audio normalization. It works really nicely.
https://github.com/sdroege/gstreamer-rs/tree/main/gstreamer-audio
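For readers new to the topic, it may help to contrast the approaches above with the simplest possible form of normalization: peak normalization, which just scales the signal so the loudest sample hits a target. This is much cruder than the loudness-based EBU R128 processing that `loudnorm` or GStreamer perform; the sketch below is purely illustrative and is not taken from either project:

```rust
/// Scale samples so the largest absolute value equals `target_peak`.
/// A no-op for silent input, to avoid dividing by zero.
fn peak_normalize(samples: &mut [f32], target_peak: f32) {
    let peak = samples.iter().fold(0.0f32, |m, s| m.max(s.abs()));
    if peak > 0.0 {
        let gain = target_peak / peak;
        for s in samples.iter_mut() {
            *s *= gain;
        }
    }
}

fn main() {
    let mut samples = vec![0.1f32, -0.5, 0.25];
    peak_normalize(&mut samples, 1.0);
    println!("{:?}", samples); // [0.2, -1.0, 0.5]
}
```

Loudness normalization like `loudnorm` instead targets perceived loudness (LUFS with a true-peak ceiling), which usually matters more for speech models than raw peak level.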
Even after the same normalization, the results are still different :)

```shell
ffmpeg -i "6_speakers1.wav" -ar 16000 -ac 1 -c:a pcm_s16le -af "loudnorm=I=-16:TP=-1.5:LRA=11" "6_speakers.wav" -hide_banner -y -loglevel error
```
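One way to check quantitatively whether two "identically normalized" files really differ is to decode both to PCM and compare a simple level metric. Below is a minimal sketch of an RMS meter in dBFS over raw `f32` samples (decoding the WAVs, e.g. with the `hound` crate, is left out, and this is not the LUFS measure `loudnorm` itself uses):

```rust
/// Root-mean-square level of a buffer of f32 samples, in dBFS.
/// Returns negative infinity for empty or silent input.
fn rms_dbfs(samples: &[f32]) -> f32 {
    if samples.is_empty() {
        return f32::NEG_INFINITY;
    }
    let mean_sq: f32 =
        samples.iter().map(|s| s * s).sum::<f32>() / samples.len() as f32;
    if mean_sq == 0.0 {
        f32::NEG_INFINITY
    } else {
        10.0 * mean_sq.log10()
    }
}

fn main() {
    // A full-scale square wave has RMS 1.0, i.e. 0 dBFS.
    let square = vec![1.0f32, -1.0, 1.0, -1.0];
    println!("{:.1} dBFS", rms_dbfs(&square)); // 0.0 dBFS
}
```

If both outputs report the same RMS but the models still behave differently, the divergence is more likely in resampling or filter implementation details than in overall level.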
Strange. Which one is more accurate?
Vibe is more accurate.
@thewh1teagle Any idea what the difference is?
I compared the output of the library with the result of the Vibe application (which uses it). Why are their results different?
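To confirm that the two outputs genuinely diverge, one can also diff the PCM payloads directly. Here is a naive sketch that assumes both files are canonical 16-bit little-endian PCM WAVs with a 44-byte header; the filenames are placeholders, not paths from the thread:

```rust
/// Count how many 16-bit samples differ by more than `tol` between
/// two equally long little-endian PCM byte buffers.
fn count_diffs(a: &[u8], b: &[u8], tol: i16) -> usize {
    a.chunks_exact(2)
        .zip(b.chunks_exact(2))
        .filter(|(x, y)| {
            let xs = i16::from_le_bytes([x[0], x[1]]);
            let ys = i16::from_le_bytes([y[0], y[1]]);
            (xs as i32 - ys as i32).abs() > tol as i32
        })
        .count()
}

fn main() {
    // Hypothetical usage: read both normalized files and compare the
    // PCM payloads after the (assumed) 44-byte WAV header.
    let a = std::fs::read("vibe_out.wav").unwrap_or_default();
    let b = std::fs::read("lib_out.wav").unwrap_or_default();
    if a.len() > 44 && a.len() == b.len() {
        println!("{} samples differ", count_diffs(&a[44..], &b[44..], 0));
    } else {
        println!("files missing or sizes differ");
    }
}
```

A nonzero count with a small tolerance would show the two pipelines produce different audio before the models even run, narrowing the search to the normalization step rather than the diarization/transcription code.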
Vibe
This Lib