1. Speaker diarization (SD) is performed using pyannote's pre-trained diarization pipeline, which returns timestamps for 'who spoke when'
2. Automatic speech recognition (ASR) is performed using the Hugging Face Transformers pipeline, which returns utterance-level transcriptions and their corresponding timestamps. Any pre-trained ASR checkpoint on the Hub can be specified for inference (e.g. Whisper tiny)
3. The SD and ASR timestamps are aligned to give speaker-segmented transcriptions
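The alignment in step 3 can be sketched as follows. This is an illustrative stand-in, not speechbox's actual implementation: the `align` helper and the segment dict format are assumptions made for the example. Each ASR chunk is assigned the speaker whose diarization segment overlaps it the most.

```python
def align(asr_chunks, diarization_segments):
    """Assign each ASR chunk the speaker whose diarization segment
    overlaps it the most (a simplified sketch of the alignment step)."""
    aligned = []
    for chunk in asr_chunks:
        start, end = chunk["timestamp"]
        best_speaker, best_overlap = None, 0.0
        for seg in diarization_segments:
            # temporal overlap between the ASR chunk and this speaker segment
            overlap = min(end, seg["end"]) - max(start, seg["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = seg["speaker"], overlap
        aligned.append({**chunk, "speaker": best_speaker})
    return aligned
```

In practice the diarization and transcription boundaries rarely coincide exactly, so an overlap-based (or nearest-boundary) heuristic like this is what reconciles the two sets of timestamps.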
Example:
import torch
from speechbox import ASRDiarizationPipeline
from datasets import load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipeline = ASRDiarizationPipeline.from_pretrained("openai/whisper-tiny", device=device)
# load dataset of concatenated LibriSpeech samples
concatenated_librispeech = load_dataset("sanchit-gandhi/concatenated_librispeech", split="train", streaming=True)
# get first sample
sample = next(iter(concatenated_librispeech))
out = pipeline(sample["audio"])
# format the transcriptions nicely for printout:
# speaker label, rounded (start, end) timestamps, then the text
def tuple_to_string(start_end_tuple, ndigits=1):
    return str(tuple(round(t, ndigits) for t in start_end_tuple))

print("\n\n".join(chunk["speaker"] + " " + tuple_to_string(chunk["timestamp"]) + chunk["text"] for chunk in out))
Print Output:
SPEAKER_01 (0.0, 15.0) Chapter 16 I might have told you of the beginning of this liaison in a few lines, but I wanted you to see every step by which we came. I to agree to whatever Mark Reid wished.
SPEAKER_00 (15.0, 22.0) He was in a fevered state of mind, owing to the blight his wife's action threatened to cast upon his entire future.
This PR adds a pipeline for automatic speech recognition (ASR) + speaker diarization (SD).