Feature Suggestion: Diarization

This is certainly feasible, since WhisperX already offers diarization through its DiarizationPipeline:

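    # Load the pyannote-based diarization pipeline (needs a Hugging Face token)
    diarize_model = DiarizationPipeline(use_auth_token=hf_token, device=device)
    for result, input_audio_path in tmp_results:
        # Detect who speaks when, optionally bounding the number of speakers
        diarize_segments = diarize_model(input_audio_path, min_speakers=min_speakers, max_speakers=max_speakers)
        # Attach a speaker label to each transcribed segment and word
        result = assign_word_speakers(diarize_segments, result)
        results.append((result, input_audio_path))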

That said, it is unclear to me how an LLM would interpret input from multiple speakers.
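
One possible approach, offered only as a sketch: flatten the diarized segments into a speaker-labeled transcript and pass that text to the LLM as its prompt. The helper below is hypothetical (format_speaker_transcript is not part of WhisperX or this repo). It assumes each segment is a dict with a "text" key and, where diarization succeeded, a "speaker" key such as "SPEAKER_00", which is the shape I would expect after assign_word_speakers. The sample segments are made up for illustration.

```python
from typing import Dict, List


def format_speaker_transcript(segments: List[Dict], unknown: str = "UNKNOWN") -> str:
    """Merge consecutive segments from the same speaker into labeled lines.

    Hypothetical helper: assumes each segment has a "text" key and an
    optional "speaker" key (segments without one fall back to `unknown`).
    """
    turns: List[List[str]] = []  # each entry is [speaker, accumulated text]
    for seg in segments:
        speaker = seg.get("speaker", unknown)
        text = seg.get("text", "").strip()
        if not text:
            continue  # skip empty segments
        if turns and turns[-1][0] == speaker:
            # Same speaker keeps talking: extend the current turn
            turns[-1][1] += " " + text
        else:
            # Speaker changed: start a new turn
            turns.append([speaker, text])
    return "\n".join(f"{speaker}: {text}" for speaker, text in turns)


# Made-up segments, for illustration only
example_segments = [
    {"speaker": "SPEAKER_00", "text": " Hi there."},
    {"speaker": "SPEAKER_00", "text": " How are you?"},
    {"speaker": "SPEAKER_01", "text": " Fine, thanks."},
    {"text": " (inaudible)"},
]

transcript = format_speaker_transcript(example_segments)
print(transcript)
```

Whether the LLM then addresses each speaker sensibly would still depend on the system prompt, so this only covers the formatting side of the question.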

huggingface/speech-to-speech, issue #10