huggingface / speech-to-speech

Speech To Speech: an effort for an open-sourced and modular GPT4-o
Apache License 2.0

Feature Suggestion: Diarization #10

Open TheMattBin opened 1 month ago

TheMattBin commented 1 month ago

Is there any plan to add a diarization feature? Thanks for the great work!

thisdotmatt commented 1 month ago

This is certainly feasible, as WhisperX offers diarization through their Diarization Pipeline:

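    from whisperx.diarize import DiarizationPipeline, assign_word_speakers

    # Excerpt: hf_token, device, tmp_results, min_speakers and max_speakers are set by the caller.
    results = []
    diarize_model = DiarizationPipeline(use_auth_token=hf_token, device=device)
    for result, input_audio_path in tmp_results:
        # Find who spoke when, then attach the speaker labels to the transcript.
        diarize_segments = diarize_model(input_audio_path, min_speakers=min_speakers, max_speakers=max_speakers)
        result = assign_word_speakers(diarize_segments, result)
        results.append((result, input_audio_path))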

That said, the way an LLM would interpret multiple speakers is unclear to me.
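
One possible way to hand multiple speakers to an LLM is to flatten the diarized segments into a speaker-labelled transcript and pass that as plain text. The sketch below is only an illustration, not something this repo does: `format_diarized_transcript` is a hypothetical helper, and it assumes each segment is a dict with `speaker` and `text` keys, which may not match the exact structure a given diarization pipeline returns.

```python
def format_diarized_transcript(segments):
    """Turn speaker-labelled segments into 'SPEAKER: text' lines.

    Assumption: each segment is a dict with optional "speaker" and "text" keys.
    Consecutive segments from the same speaker are merged into one line.
    """
    lines = []
    prev_speaker = None
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")  # fall back when no label was assigned
        text = seg.get("text", "").strip()
        if not text:
            continue  # skip empty segments
        if speaker == prev_speaker:
            lines[-1] += " " + text  # same speaker keeps talking
        else:
            lines.append(f"{speaker}: {text}")
            prev_speaker = speaker
    return "\n".join(lines)


# Made-up example data, for illustration only.
segments = [
    {"speaker": "SPEAKER_00", "text": " Hi, can you hear me?"},
    {"speaker": "SPEAKER_00", "text": " I have a question."},
    {"speaker": "SPEAKER_01", "text": " Sure, go ahead."},
]
print(format_diarized_transcript(segments))
```

The resulting text could be used as the user message, with a system prompt explaining that each line is prefixed by a speaker label. Whether the LLM should then answer one speaker or the whole group is still an open design question.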