huggingface / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0

Absolute speaker diarization? #873

Closed flatsiedatsie closed 2 months ago

flatsiedatsie commented 3 months ago

Question

I've just managed to integrate the new speaker diarization feature into my project. Very cool stuff. My goal is to let people record meetings, summarize them, and then also list per-speaker tasks. This seems to be a popular feature.

One thing I'm running into is that I don't feed Whisper a single long audio file. Instead I use VAD to feed it small chunks of live audio whenever someone speaks.

However, as far as I can tell the speaker diarization only works "relatively", detecting speakers within a single audio file.

Is there a way to let it detect and 'sort' the correct speaker over multiple audio files? Perhaps it could remember the 'audio fingerprints' of the speakers somehow?

[Screenshot: record_meeting]

flatsiedatsie commented 3 months ago

Going through the source code a bit more, I found that there is already support for speaker verification.

My plan is to:

- create a voice 'fingerprint' (embedding) for each VAD chunk with the speaker verification model
- compare that fingerprint to the fingerprints of the speakers heard so far
- assign the chunk to an existing speaker if the similarity is high enough, or register a new speaker otherwise

Going through the code I also noticed that only up to three speakers can be separated with diarization. But with these short snippets of audio and the verification mechanism, that would no longer be a limitation. Win-win!
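
In rough pseudocode, the matching step could look something like this (just a sketch: identifySpeaker, known_speakers and the 0.90 threshold are names and values I made up, and cosinesim is a cosine similarity function over two embeddings):

// Sketch of cross-chunk speaker matching (made-up names and threshold)
const known_speakers = []; // [{ id, embedding }]

function identifySpeaker(chunk_embedding) {
    let best_id = null;
    let best_similarity = 0;
    for (const speaker of known_speakers) {
        const similarity = cosinesim(speaker.embedding, chunk_embedding);
        if (similarity > best_similarity) {
            best_similarity = similarity;
            best_id = speaker.id;
        }
    }
    // Similar enough: same person as in an earlier chunk
    if (best_similarity > 0.90) {
        return best_id;
    }
    // Otherwise: register a new speaker
    const id = known_speakers.length;
    known_speakers.push({ id: id, embedding: chunk_embedding });
    return id;
}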

flatsiedatsie commented 3 months ago

I've got it somewhat working!

I'm testing it now.

One odd thing that just happened is that all the chunks of a recording (me doing a chipmunk voice to pretend to be a second person) got the same timestamp (5.2). Screenshot:

[Screenshot 2024-07-31 at 18:10:51]
MatteoFasulo commented 2 months ago

This is very cool! Could be good also for extracting clips from podcasts or YouTube videos with many speakers 👍🏼

flatsiedatsie commented 2 months ago

It turned out even cooler than that :-)

I ask a speaker with a new fingerprint to first say "I consent to recording my voice". Only once they've said that will their contribution show up. Otherwise it just says "Redacted - no consent".

I also made it so that you can say "My name is X", and from then on it will preface your contribution with your name instead of "Speaker0", etc.
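
The logic behind it is just simple text matching on each transcribed chunk before it gets displayed. Roughly like this (a simplified sketch; the speakers dictionary and handleChunk are made-up names, not my actual implementation):

// Simplified sketch of the consent / naming logic (made-up names)
const speakers = {}; // keyed by fingerprint / speaker ID

function handleChunk(speaker_id, text) {
    const speaker = speakers[speaker_id] ??= { consented: false, name: 'Speaker' + speaker_id };

    if (/i consent to recording my voice/i.test(text)) {
        speaker.consented = true;
    }
    const name_match = text.match(/my name is (\w+)/i);
    if (name_match) {
        speaker.name = name_match[1];
    }
    return speaker.consented ? speaker.name + ': ' + text : speaker.name + ': Redacted - no consent';
}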

[Screenshot 2024-08-02 at 18:25:36]
eschmidbauer commented 2 months ago

Could you share how you implemented VAD?

flatsiedatsie commented 2 months ago

@eschmidbauer Have a look here for an easy-to-use one: https://github.com/ricky0123/vad
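
With its @ricky0123/vad-web package it boils down to something like this (transcribeChunk is just a placeholder for whatever you do with each chunk):

import { MicVAD } from "@ricky0123/vad-web";

// Each detected utterance arrives as a Float32Array of 16 kHz samples,
// which can be fed straight into the ASR / diarization pipeline.
const myvad = await MicVAD.new({
    onSpeechEnd: (audio) => {
        transcribeChunk(audio); // placeholder: hand the chunk to your worker
    },
});
myvad.start();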

flatsiedatsie commented 2 months ago

For people finding this thread: you may also want to look at the recently added wespeaker-voxceleb-resnet34 model, which is also designed to create audio fingerprints for voices. It says it only supports English and Chinese. I haven't tried it yet, but I'm curious to see how it would compare to wavlm-base-plus-sv, since wavlm-base-plus-sv isn't great. But I might just be using it wrong (feeding it too much or too little data, etc.).

Here's my final pipeline:

// Transformers.js v3 is assumed here; with v2 the package name is '@xenova/transformers' instead.
// PER_DEVICE_CONFIG and self.device are defined elsewhere in the worker.
import { pipeline, AutoProcessor, AutoModelForAudioFrameClassification, AutoModel } from '@huggingface/transformers';

class PipelineSingleton {
    static asr_model_id = 'onnx-community/whisper-base_timestamped';
    static instance = null;
    static asr_instance = null;

    static segmentation_model_id = 'onnx-community/pyannote-segmentation-3.0';
    static segmentation_instance = null;
    static segmentation_processor = null;

    static verification_model_id = 'Xenova/wavlm-base-plus-sv';
    static verification_instance = null;
    static verification_processor = null;

    static async getInstance(progress_callback = null, model_name = 'onnx-community/whisper-base_timestamped', preferences = {}) {
        console.log("Whisper_worker: Pipeline: getInstance: model_name, preferences: ", model_name, preferences);
        this.asr_model_id = model_name;

        // Merge the user preferences into the per-device config (note the spread operator)
        PER_DEVICE_CONFIG[self.device] = { ...PER_DEVICE_CONFIG[self.device], ...preferences };

        this.asr_instance ??= pipeline('automatic-speech-recognition', this.asr_model_id, {
            ...PER_DEVICE_CONFIG[self.device],
            progress_callback,
        });

        this.segmentation_processor ??= AutoProcessor.from_pretrained(this.segmentation_model_id, {
            ...preferences,
            progress_callback,
        });
        this.segmentation_instance ??= AutoModelForAudioFrameClassification.from_pretrained(this.segmentation_model_id, {
            // NOTE: WebGPU is not currently supported for this model
            // See https://github.com/microsoft/onnxruntime/issues/21386
            device: 'wasm',
            dtype: 'fp32',
            ...preferences,
            progress_callback,
        });

        this.verification_processor ??= AutoProcessor.from_pretrained(this.verification_model_id, {
            device: 'wasm',
            dtype: 'fp32',
            ...preferences,
            progress_callback,
        });

        this.verification_instance ??= AutoModel.from_pretrained(this.verification_model_id, {
            device: 'wasm',
            dtype: 'fp32',
            ...preferences,
            progress_callback,
        });

        return Promise.all([this.asr_instance, this.segmentation_processor, this.segmentation_instance, this.verification_processor, this.verification_instance]);
    }
}
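
And roughly how the instances get used for each chunk (simplified; the diarization post-processing call follows the pyannote-segmentation-3.0 model card, so double-check the exact names against your transformers.js version):

// `audio` is a Float32Array of 16 kHz samples (one VAD chunk)
const [transcriber, segmentation_processor, segmentation_model, verification_processor, verification_model] = await PipelineSingleton.getInstance();

// Transcribe with word-level timestamps
const transcript = await transcriber(audio, { return_timestamps: 'word' });

// Per-frame speaker logits -> segments like { id, start, end, confidence }
const seg_inputs = await segmentation_processor(audio);
const { logits } = await segmentation_model(seg_inputs);
const segments = segmentation_processor.post_process_speaker_diarization(logits, audio.length)[0];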

Be advised that you'll also need to write a lot of code to 'clean up' the output from the various models.

Here's some example code for how I 'cleaned up' the segments:

    let last_speaker_id = null;
    let joined_segment_end = null;
    let reached_zero = false; // becomes true once a segment that starts at 0 has been seen (used by the second pass below)

    for(let s = 0; s < segments.length; s++){
        segments[s]['original_id'] = segments[s].id;

        if(segments[s].id > 0 && segments[s].id < 4){
            last_speaker_id = segments[s].id;
        }

        // Sometimes there are weird, very short segments at the beginning, less than a tenth of a second long.
        // Snapping their start to 0 causes them to be pruned by the second pass below.
        if(s < 3 && segments[s].start < (s * 0.1)){
            segments[s].start = 0;
        }
        joined_segment_end = segments[s].end;
    }

    for(let s = segments.length - 1; s >= 0; --s){
        //console.log("segment: ", s);
        if(typeof segments[s] == 'undefined'){
            console.error("segment no longer existed at position: ", s);
            continue
        }

        if(typeof segments[s].id == 'number' && typeof segments[s].confidence == 'number' && typeof segments[s].start == 'number' && typeof segments[s].end == 'number'){

            // Only speaker IDs of 1, 2 or 3 refer to individual speakers. Zero means no speaker (silence), and 4 and above is for combinations of speakers (speaking at the same time).
            // TODO: this just steamrolls over mixed speakers, assigning the ID of the speaker that ends up speaking on their own afterwards.
            if( (segments[s].id == 0 || segments[s].id >= 4) && last_speaker_id != null){

                //console.log("changing a segment's ID.  old -> new, and duration: ", segments[s].id,last_speaker_id,segments[s].end - segments[s].start);
                segments[s].id = last_speaker_id;
            }

            if(segments[s].id > 0 && segments[s].id < 4){
                //console.log("segment has good single speaker ID: ", segments[s].id);

                if(segments[s].id != last_speaker_id){
                    //console.log("switching to another speaker");
                    last_speaker_id = segments[s].id;
                    joined_segment_end = segments[s].end;
                }
                else{
                    //console.log("still the same speaker speaking");
                    if(joined_segment_end != null){
                        if(joined_segment_end != segments[s].end){
                            segments[s].end = joined_segment_end;

                            if(typeof segments[s + 1] != 'undefined' && segments[s+1].id == segments[s].id && reached_zero == false){
                                //console.log("removing older segment with the same ID as this one");
                                segments.splice(s + 1, 1);
                            }
                        }

                    }

                }

            }
            // TODO: Could distinguish between silence and mixed speakers here
            else{
                console.error("segment has bad speaker ID: ", segments[s]);
            }

            // Remove very short segments from the beginning
            if(segments[s].start == 0 && reached_zero == true){
                segments.splice(s, 1);
            }
            else if(segments[s].start == 0 && reached_zero == false){
                reached_zero = true;
            }

            //console.log("reached_zero: ", reached_zero);
            //console.log("joined_segment_end: ", joined_segment_end);

        }
        else{
            console.error("segment was missing basic attributes: ", segments[s]);
        }

    }

I keep a dictionary of voice fingerprints related to speaker IDs.
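
The embedding for each segment (logit_embedding below) comes from running the verification model on that segment's audio. Roughly like this (a sketch; the embeddings output follows the wavlm-base-plus-sv model card, computeFingerprint is a made-up helper, and a 16 kHz sample rate is assumed):

// Sketch: compute a voice fingerprint for one diarization segment (assumes 16 kHz audio)
async function computeFingerprint(audio, segment) {
    const sample_rate = 16000;
    const segment_audio = audio.slice(
        Math.floor(segment.start * sample_rate),
        Math.floor(segment.end * sample_rate),
    );
    const inputs = await verification_processor(segment_audio);
    const { embeddings } = await verification_model(inputs); // x-vector style speaker embedding
    return Array.from(embeddings.data);
}

Each new embedding then gets compared against the stored fingerprints: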

// Assumed to be set up earlier for each segment that is being matched:
// - logit_embedding: the embedding of the current segment
// - fingerprints_to_skip: fingerprint indexes already matched to another segment in this chunk
let fingerprint_matches = [];
let highest_match = 0;
let found_id = null;

for(let f = 0; f < self.fingerprints.length; f++){

    if(fingerprints_to_skip.indexOf(f) != -1){
        continue; // skip, already used
    }
    if(typeof self.fingerprints[f].embedding == 'undefined'){
        continue; // no embedding stored for this fingerprint yet
    }

    //console.log("verify segment: comparing ", f, self.fingerprints[f].embedding, logit_embedding);
    try{
        const similarity = cosinesim(self.fingerprints[f].embedding, logit_embedding);
        console.log("verify segment: SIMILARITY: ", f, similarity);
        fingerprint_matches.push(similarity);
        if(similarity > highest_match){
            highest_match = similarity;
            if(similarity > 0.94){
                found_id = f; // similar enough to count as the same speaker
            }
        }
    }
    catch(err){
        console.error("verify segment: error doing similarity check: ", err);
    }
}

And here's the cosine similarity function I use to compare the voice fingerprints:

// Cosine similarity between two embedding vectors of equal length; closer to 1 means more similar.
function cosinesim(A, B){
    let dotproduct = 0;
    let mA = 0;
    let mB = 0;
    for(let i = 0; i < A.length; i++){
        dotproduct += A[i] * B[i];
        mA += A[i] * A[i];
        mB += B[i] * B[i];
    }
    mA = Math.sqrt(mA);
    mB = Math.sqrt(mB);
    return dotproduct / (mA * mB);
}

Since it's somewhat working now, I'll close this issue.