huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Speculative Decoding for chunked audios #31183

Open devansh-shah-11 opened 4 weeks ago

devansh-shah-11 commented 4 weeks ago

System Info

Transformers version: 4.42.0
Python version: 3.10.14

Who can help?

@sanchit-gandhi

Reproduction

I am testing speculative decoding on a sample audio file (https://drive.google.com/file/d/11LrhCfcltRbsxCh3iyGnZPI59qyLoqDv/view?usp=sharing) by passing the audio in chunks split on silence. However, inference fails with an EOFError.

Code used to convert each audio chunk and run the pipeline:

import numpy as np
import pydub

def np_to_mp3(f, sr, x, normalized=False):
    """Write a numpy audio array to an MP3 file."""
    channels = 2 if (x.ndim == 2 and x.shape[1] == 2) else 1
    if normalized:
        # float input in [-1, 1] -> int16
        y = np.int16(x * 2 ** 15)
    else:
        y = np.int16(x)
    song = pydub.AudioSegment(y.tobytes(), frame_rate=sr, sample_width=2, channels=channels)
    song.export(f, format="mp3", bitrate="320k")

self.audio = audio_array.astype(np.float32) / INT16_MAX_ABS_VALUE
audio_array = np.int16(audio * INT16_MAX_ABS_VALUE)
fname = "temp.mp3"
np_to_mp3(fname, 16000, audio_array)
result = pipe(fname)

Error Message

It raises an EOFError:

File "recorder.py", line 908, in transcribe status, result = self.parent_transcription_pipe.recv() File "miniconda3/envs/py310/lib/python3.10/multiprocessing/connection.py", line 250, in recv buf = self._recv_bytes() File "miniconda3/envs/py310/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes buf = self._recv(4) File "miniconda3/envs/py310/lib/python3.10/multiprocessing/connection.py", line 383, in _recv raise EOFError EOFError

Expected behavior

Speculative decoding should also work when the audio is passed in chunks.
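For example, I would expect to be able to pass each in-memory chunk straight to the pipeline, roughly as in the sketch below (assumed names: audio_chunk is a hypothetical mono float32 numpy array in [-1, 1] sampled at 16 kHz):

import numpy as np

# Hypothetical silence-based chunk; in practice this comes from the recorder.
audio_chunk = np.zeros(16000, dtype=np.float32)

# The ASR pipeline accepts in-memory audio as a dict with "raw" and "sampling_rate",
# so the MP3 round-trip above should not be necessary.
result = pipe({"raw": audio_chunk, "sampling_rate": 16000})
print(result["text"])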

pranav-bot commented 4 weeks ago

@sanchit-gandhi Can I take this up and have a go at it?

sanchit-gandhi commented 3 weeks ago

Hey @pranav-bot! Do you have a reproducible code-snippet that I can run end-to-end for replicating the error locally on my side?

The following code-snippet loads Whisper large-v3 as the main model, Distil-Whisper distil-large-v3 as the assistant model, and transcribes a long audio file using both chunking and speculative decoding:

from transformers import pipeline, AutoModelForCausalLM, AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

assistant_model_id = "distil-whisper/distil-large-v3"

assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    chunk_length_s=30,  # chunk audios into 30-second segments, and transcribe each using speculative decoding
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

Print Output:

 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similes drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Leighton's work is really Greek after all, and can discover in it but little of rocky Ithaca. Linnell's pictures are a sort of Upguards and Adam paintings, and Mason's exquisite idylls are as national as a jingo poem. Mr. Burkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth. And Mr. John Collier gives his sitter a cheerful slap on the back before he says, like a shampooer in a Turkish bath, Next man!

=> we see that the audio is successfully transcribed!

Note that speculative decoding is currently only compatible with batch size 1. Thus, depending on the length of your audio, you might be better off using just the main model, but batching a long audio file and transcribing the chunks in parallel:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

For longer audio files, this will be faster than speculative decoding: the ~2x speed-up from speculative decoding is outweighed by the speed-up from transcribing the chunks in parallel (up to 9x with large batch sizes).
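If it helps to quantify this trade-off on your own audio, here is a quick timing harness over the two pipelines above (pipe_spec and pipe_batched are hypothetical names for the speculative-decoding and batched pipelines from the two snippets):

import time

def time_pipeline(asr_pipe, audio, n_runs=3):
    # Return the best wall-clock time (in seconds) over a few runs.
    best = float("inf")
    for _ in range(n_runs):
        inputs = dict(audio) if isinstance(audio, dict) else audio  # defensive copy
        start = time.perf_counter()
        asr_pipe(inputs)
        best = min(best, time.perf_counter() - start)
    return best

# print(f"speculative decoding: {time_pipeline(pipe_spec, sample):.2f} s")
# print(f"batched (batch_size=16): {time_pipeline(pipe_batched, sample):.2f} s")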