Open devansh-shah-11 opened 4 weeks ago
@sanchit-gandhi Can I take this up and have a go at it?
Hey @pranav-bot! Do you have a reproducible code-snippet that I can run end-to-end for replicating the error locally on my side?
The following code-snippet loads Whisper large-v3 as the main model, Distil-Whisper distil-large-v3 as the assistant model, and transcribes a long audio file using both chunking and speculative decoding:
from transformers import pipeline, AutoModelForCausalLM, AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
from datasets import load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
assistant_model_id = "distil-whisper/distil-large-v3"
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    chunk_length_s=30,  # chunk audios into 30-second segments, and transcribe each using speculative decoding
    torch_dtype=torch_dtype,
    device=device,
)
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
Print Output:
Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similes drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Leighton's work is really Greek after all, and can discover in it but little of rocky Ithaca. Linnell's pictures are a sort of Upguards and Adam paintings, and Mason's exquisite idylls are as national as a jingo poem. Mr. Burkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth. And Mr. John Collier gives his sitter a cheerful slap on the back before he says, like a shampooer in a Turkish bath, Next man!
=> we see that the audio is successfully transcribed!
Note that speculative decoding is currently only compatible with batch size 1. Thus, depending on the length of your audio, you might be better off using just the main model, but batching a long audio file and transcribing the chunks in parallel:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
This is going to be faster than speculative decoding for longer audio files, where the 2x speed-up from speculative decoding is outweighed by the speed-up of transcribing chunks of audio files in parallel (up to 9x for large batch sizes).
System Info
Transformers Version: 4.42.0
Python environment: 3.10.14
Who can help?
@sanchit-gandhi
Reproduction
I am testing speculative decoding on an arbitrary audio file (https://drive.google.com/file/d/11LrhCfcltRbsxCh3iyGnZPI59qyLoqDv/view?usp=sharing), passing the audio as chunks split on silence. However, inference fails.
Code to load audio chunk
import numpy as np
import pydub

def np_to_mp3(f, sr, x, normalized=False):
    """numpy array to MP3"""
    # Treat the input as stereo only if it has shape (n_samples, 2); otherwise mono
    channels = 2 if (x.ndim == 2 and x.shape[1] == 2) else 1
    if normalized:
        # Input is float in [-1, 1]: scale up to the int16 range
        y = np.int16(x * 2 ** 15)
    else:
        y = np.int16(x)
    song = pydub.AudioSegment(y.tobytes(), frame_rate=sr, sample_width=2, channels=channels)
    song.export(f, format="mp3", bitrate="320k")

self.audio = audio_array.astype(np.float32) / INT16_MAX_ABS_VALUE
audio_array = np.int16(audio * INT16_MAX_ABS_VALUE)
fname = "temp.mp3"
np_to_mp3(fname, 16000, audio_array)
result = pipe(fname)
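As a side note, the MP3 round trip above may not be needed: the automatic-speech-recognition pipeline also accepts an in-memory dict with "raw" and "sampling_rate" keys. The sketch below only builds that input from an int16 chunk. The helper name `to_pipeline_input` is hypothetical, and the value 32768.0 for INT16_MAX_ABS_VALUE is an assumption, since the issue does not show how that constant is defined.

```python
import numpy as np

# Assumption: 32768.0 mirrors the usual int16 scaling; the issue never shows
# how INT16_MAX_ABS_VALUE is actually defined.
INT16_MAX_ABS_VALUE = 32768.0


def to_pipeline_input(chunk_int16, sampling_rate=16000):
    """Hypothetical helper: wrap a mono int16 chunk as an ASR pipeline input dict."""
    # Convert to float32 in [-1, 1], the range the feature extractor works with
    audio = np.asarray(chunk_int16, dtype=np.int16).astype(np.float32) / INT16_MAX_ABS_VALUE
    return {"raw": audio, "sampling_rate": sampling_rate}


chunk = np.array([0, 16384, -32768], dtype=np.int16)
inputs = to_pipeline_input(chunk)
# result = pipe(inputs)  # `pipe` is the pipeline built earlier in this thread
```

Passing the array directly avoids lossy MP3 encoding and the temporary file, which also removes one possible failure point when debugging.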
Error Message
It raises an EOFError:
  File "recorder.py", line 908, in transcribe
    status, result = self.parent_transcription_pipe.recv()
  File "miniconda3/envs/py310/lib/python3.10/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "miniconda3/envs/py310/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "miniconda3/envs/py310/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
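For context on the traceback: it ends inside multiprocessing's Connection.recv, which raises EOFError when the other end of the pipe has been closed and nothing is left to read. So the EOFError is probably a symptom of the transcription worker process exiting before it replied, and the underlying exception would be in that worker's output. The sketch below reproduces the behaviour in a single process; `recv_or_report` is a hypothetical helper, not part of recorder.py.

```python
from multiprocessing import Pipe


def recv_or_report(conn):
    """Hypothetical helper: receive one message, or report that the peer is gone."""
    try:
        return ("ok", conn.recv())
    except EOFError:
        # Raised when the other end was closed and nothing is left to read,
        # e.g. because the worker process exited before replying.
        return ("worker_gone", None)


parent_conn, child_conn = Pipe()

# Normal case: the peer sends a (status, result) tuple and we receive it.
child_conn.send(("success", "hello"))
first = recv_or_report(parent_conn)

# Failure case: closing the child end stands in for a worker that died.
child_conn.close()
second = recv_or_report(parent_conn)
```

Catching the EOFError in the parent does not fix the failure, but it makes it clear that the worker's own logs are the place to look.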
Expected behavior
Speculative decoding should work when the audio is passed as chunks as well.