thewh1teagle / pyannote-rs

pyannote audio diarization in rust
http://crates.io/crates/pyannote-rs
MIT License

Why is the output different from Vibe's work? #9

Open Sing303 opened 1 month ago

Sing303 commented 1 month ago

I checked the output of the library and the result of the Vibe application (which uses it). Why are their results different?

Vibe: [image Fhli3JHyMk]

This lib: [image nxnN2EAiRU]

altunenes commented 1 month ago

Have you used any audio normalization? It is almost as important as the models themselves. Unfortunately, when I reviewed the papers and original repos (pyannote, whisper, etc.), I could not find a "general" normalization method that should be used to get the best results. What I did was mostly experimental.

Note that Vibe uses:

https://github.com/thewh1teagle/vibe/blob/276a6a20b711ecb6c1aa080d1906ab83269626a0/core/src/audio.rs#L54C1-L75C8

pub fn normalize(input: PathBuf, output: PathBuf) -> Result<()> {
    let ffmpeg_path = find_ffmpeg_path().context("ffmpeg not found")?;
    tracing::debug!("ffmpeg path is {}", ffmpeg_path.display());

    let mut cmd = Command::new(ffmpeg_path);
    let cmd = cmd.stderr(Stdio::piped()).args([
        "-i",
        input.to_str().context("tostr")?,
        "-ar",
        "16000",
        "-ac",
        "1",
        "-c:a",
        "pcm_s16le",
        "-af", // normalize loudness
        "loudnorm=I=-16:TP=-1.5:LRA=11",
        output.to_str().context("tostr")?,
        "-hide_banner",
        "-y",
        "-loglevel",
        "error",
    ]);

What I learned from working on speaker identification and whisper is that audio normalization plays a crucial part. I have no idea which normalization is best; it's mostly experimental, and different normalizations can give different results in different situations. This is especially obvious with parallel speech. For my own tests, though, I generally use gstreamer's audio normalization, which works really nicely.

https://github.com/sdroege/gstreamer-rs/tree/main/gstreamer-audio
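To make the "different normalizations give different results" point concrete, here is a minimal sketch of one very simple alternative: peak normalization. This is not what Vibe does (Vibe uses ffmpeg's loudnorm filter, as shown above), and it is not gstreamer's normalization either. The function name `peak_normalize` and the `target_peak` parameter are made up for illustration, and the samples are assumed to be mono `f32` values in the range [-1.0, 1.0].

```rust
/// Hypothetical helper (not part of pyannote-rs or Vibe):
/// scales all samples by one gain so that the loudest sample
/// ends up at `target_peak`. Assumes samples are in [-1.0, 1.0].
fn peak_normalize(samples: &mut [f32], target_peak: f32) {
    // Find the largest absolute sample value.
    let peak = samples.iter().fold(0.0_f32, |max, s| max.max(s.abs()));

    // Pure silence (or an empty buffer): nothing to scale.
    if peak == 0.0 {
        return;
    }

    // Apply the same gain to every sample.
    let gain = target_peak / peak;
    for s in samples.iter_mut() {
        *s *= gain;
    }
}
```

A peak normalizer only looks at the single loudest sample, while loudnorm targets perceived loudness over time, so the two can hand quite different signals to the models even for the same input file.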

Sing303 commented 1 month ago

After applying the same normalization, the results still differ :)

ffmpeg -i "6_speakers1.wav" -ar 16000 -ac 1 -c:a pcm_s16le -af "loudnorm=I=-16:TP=-1.5:LRA=11" "6_speakers.wav" -hide_banner -y -loglevel error

[image]
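For anyone who wants to reproduce the comparison from Rust instead of the shell, the command above could be sketched roughly as follows. The helper names `loudnorm_args` and `run_ffmpeg` are hypothetical, and the sketch assumes an `ffmpeg` binary is available on `PATH` (Vibe itself locates it through its own `find_ffmpeg_path`). The arguments are kept in the same order as in the command.

```rust
use std::io;
use std::path::Path;
use std::process::{Command, ExitStatus, Stdio};

/// Hypothetical helper: builds the same argument list as the
/// ffmpeg command above (16 kHz, mono, 16-bit PCM, loudnorm).
fn loudnorm_args(input: &Path, output: &Path) -> Vec<String> {
    let mut args: Vec<String> = Vec::new();
    args.push("-i".to_string());
    args.push(input.to_string_lossy().into_owned());
    args.extend(
        ["-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le"]
            .iter()
            .map(|s| s.to_string()),
    );
    // Loudness normalization filter, same parameters as above.
    args.push("-af".to_string());
    args.push("loudnorm=I=-16:TP=-1.5:LRA=11".to_string());
    args.push(output.to_string_lossy().into_owned());
    args.extend(
        ["-hide_banner", "-y", "-loglevel", "error"]
            .iter()
            .map(|s| s.to_string()),
    );
    args
}

/// Hypothetical helper: runs ffmpeg with those arguments.
/// Assumes `ffmpeg` can be found on PATH.
fn run_ffmpeg(input: &Path, output: &Path) -> io::Result<ExitStatus> {
    Command::new("ffmpeg")
        .args(loudnorm_args(input, output))
        .stderr(Stdio::inherit())
        .status()
}
```

Calling `run_ffmpeg(Path::new("6_speakers1.wav"), Path::new("6_speakers.wav"))` would then produce the normalized file, which can be fed to both the library and Vibe so that the input is known to be identical.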

altunenes commented 1 month ago

Strange. Which one is more accurate?

Sing303 commented 1 month ago

Vibe is more accurate.

Sing303 commented 1 month ago

@thewh1teagle Any idea what the difference is?