Closed juansebashr closed 9 months ago
FYI, I was running a comparison with the base model of OpenAI Whisper, and that model works out just fine.
Hey @juansebashr - could you share the end-to-end script you're using for this benchmark? At first glance, I would check that the tokenizer you're using is the correct one for this model, e.g. if benchmarking distil-whisper/distil-medium.en, that you are loading the tokenizer that corresponds to this checkpoint.
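To illustrate why the pairing matters, here is a toy example with made-up vocabularies (not the real Whisper ones): decoding one model's generated ids with another model's tokenizer maps the ids through the wrong vocabulary and scrambles the text.

```python
# Toy illustration (hypothetical vocabularies): decoding ids with a
# mismatched tokenizer's vocabulary produces scrambled text.
vocab_a = {0: "the", 1: "cat", 2: "sat"}   # vocabulary of checkpoint A (made up)
vocab_b = {0: "sat", 1: "the", 2: "cat"}   # vocabulary of checkpoint B (made up)

def decode(ids, vocab):
    # map each generated id back to its token string
    return " ".join(vocab[i] for i in ids)

ids = [0, 1, 2]                      # ids generated by checkpoint A
print(decode(ids, vocab_a))          # matching tokenizer   -> "the cat sat"
print(decode(ids, vocab_b))          # mismatched tokenizer -> "sat the cat"
```

The ids themselves are fine in both cases; only the id-to-token mapping differs, which is why mismatched decoding looks like garbage rather than an outright error.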
Of course @sanchit-gandhi, here is the script; it's the same as the Colab:
```python
# -*- coding: utf-8 -*-
"""Distil_Whisper_Benchmark.ipynb

## Benchmarking

Great, now that we've understood why Distil-Whisper should be faster in theory, let's see if it holds true in practice.

To begin with, we install `transformers`, `accelerate`, and `datasets`.

In this notebook, we use an A100 GPU that is available through a Colab Pro subscription, as this is the device we used for benchmarking in the [Distil-Whisper paper](https://huggingface.co/papers/2311.00430). Other GPUs will most likely lead to different speed-ups, but they should be in the same ballpark range:
"""

#!pip install --upgrade --quiet transformers accelerate datasets

"""In addition, we will make use of [Flash Attention 2](), as it saves
a lot of memory and speeds up large matmul operations.
"""

#!pip install --quiet flash-attn --no-build-isolation

"""To begin with, let's load the dataset that we will use for benchmarking. We'll load a small dataset consisting of 73 samples from the [LibriSpeech ASR](https://huggingface.co/datasets/librispeech_asr) validation-clean dataset. This amounts to ~9MB of data, so it's very lightweight and quick to download on device:"""

from datasets import load_dataset

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

"""We start by benchmarking [Whisper large-v2](https://huggingface.co/openai/whisper-large-v2) to get our baseline number. We'll load the model in `float16` precision and make sure that loading takes as little time as possible by passing `low_cpu_mem_usage=True`. In addition, we want to make sure that the model is loaded in [`safetensors`](https://github.com/huggingface/safetensors) format by passing `use_safetensors=True`:"""

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch

device = "cuda:0"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-base"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, use_flash_attention_2=False
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

"""Great! For the benchmark, we will only measure the generation time (encoder + decoder), so let's write a short helper function that measures this step:"""

import time

def generate_with_time(model, inputs):
    start_time = time.time()
    outputs = model.generate(**inputs)
    generation_time = time.time() - start_time
    return outputs, generation_time

"""This function will return both the decoded tokens as well as the time
it took to run the model.

We now iterate over the audio samples and sum up the generation time.
"""

from tqdm import tqdm

all_time = 0
for sample in tqdm(dataset):
    audio = sample["audio"]
    inputs = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt")
    inputs = inputs.to(device=device, dtype=torch.float16)
    output, gen_time = generate_with_time(model, inputs)
    all_time += gen_time
    print(processor.batch_decode(output, skip_special_tokens=True))

print(all_time)

"""Alright! In total it took roughly 63 seconds to transcribe 73 audio samples.

Next, let's see how much time it takes with [Distil-Whisper](https://huggingface.co/distil-whisper/distil-large-v2):
"""

model_id = "distil-whisper/distil-medium.en"

distil_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, use_flash_attention_2=False
)
distil_model = distil_model.to(device)

"""We run the same benchmarking loop:"""

all_time = 0
for sample in tqdm(dataset):
    audio = sample["audio"]
    inputs = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt")
    inputs = inputs.to(device=device, dtype=torch.float16)
    output, gen_time = generate_with_time(distil_model, inputs)
    all_time += gen_time
    print(processor.batch_decode(output, skip_special_tokens=True))

print(all_time)

"""Only 10 seconds - that amounts to a 6x speed-up!

## Memory

In addition to being significantly faster, Distil-Whisper also has fewer parameters. Let's have a look at exactly how many fewer.
"""

distil_model.num_parameters() / model.num_parameters() * 100

"""Distil-Whisper is 49% of the size of Whisper. Note that this ratio is much lower if we compare only the size of the decoder:"""

distil_model.model.decoder.num_parameters() / model.model.decoder.num_parameters() * 100

"""As expected, the decoder is much smaller. One might have guessed that it should be even less, around 2/32 (or 6%), but we can't forget that the decoder has a very large word embedding that requires a lot of parameters.

## Next steps

Hopefully this notebook shed some light on the motivation behind Distil-Whisper! For now, we've measured Distil-Whisper mainly on GPU, but we are actively looking into collaborating to release code showing how to effectively accelerate Distil-Whisper on CPU as well. Updates will be posted on the Distil-Whisper [repository](https://github.com/huggingface/distil-whisper).

Another key application of Distil-Whisper is *speculative decoding*. In speculative decoding, we can use Distil-Whisper as an *assistant model* to Whisper large-v2 to reach a 2x speed-up without **any** loss in performance. More on that in a follow-up notebook coming out soon!
"""
```
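As a rough sanity check of the decoder-size discussion in the notebook above: a decoder's parameter count is approximately `n_layers * per_layer + vocab * d_model` (the last term being the token embedding), so pruning 32 layers down to 2 does not shrink it to 2/32. The per-layer estimate below is a crude assumption (ignoring layer norms and biases); the vocabulary size and width are Whisper large-v2's public config values.

```python
# Crude arithmetic sketch (assumed per-layer size; not exact Whisper numbers):
# decoder params ~= n_layers * per_layer + vocab * d_model (token embedding)
vocab, d_model = 51865, 1280          # Whisper large-v2 vocabulary and width
per_layer = 16 * d_model ** 2         # rough estimate: self-attn + cross-attn + FFN

def decoder_params(n_layers):
    return n_layers * per_layer + vocab * d_model

ratio = decoder_params(2) / decoder_params(32)
print(f"{ratio * 100:.1f}%")          # -> 13.1%, well above 2/32 ~ 6.3%
```

Even with these rough numbers, the embedding term keeps the 2-layer decoder at roughly double the naive 2/32 estimate, matching the notebook's observation.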
Yes, indeed it's the tokenizer that's the issue - the checkpoint openai/whisper-base uses a different tokenizer to distil-whisper/distil-medium.en. You need to load the tokenizer for distil-whisper/distil-medium.en to decode the generated ids from the distilled model. See the diff below; the line in green (prefixed with `+`) is the additional line you need to add:
```diff
 model_id = "distil-whisper/distil-medium.en"

 distil_model = AutoModelForSpeechSeq2Seq.from_pretrained(
     model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, use_flash_attention_2=False
 )
 distil_model = distil_model.to(device)
+processor = AutoProcessor.from_pretrained(model_id)
```
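The *speculative decoding* mentioned in the notebook's next steps can be sketched as a draft-and-verify loop. This is a toy pure-Python illustration with made-up "models" (the real mechanism in recent `transformers` versions is `model.generate(..., assistant_model=...)`): a fast draft model proposes tokens, and the slow model only verifies them, accepting matches for free.

```python
# Toy sketch of speculative decoding (illustrative only, not the real API).
def draft_next_tokens(prefix):
    # fast assistant model: propose two tokens ahead (made-up rule)
    return [len(prefix), len(prefix) + 1]

def slow_next_token(prefix):
    # slow reference model: the token it would generate next (same made-up rule)
    return len(prefix)

def speculative_step(prefix):
    accepted = []
    for tok in draft_next_tokens(prefix):
        expected = slow_next_token(prefix + accepted)
        if tok != expected:        # first disagreement: take the slow model's token
            accepted.append(expected)
            break
        accepted.append(tok)       # agreement: draft token accepted for free
    return prefix + accepted

print(speculative_step([10, 11]))  # both draft tokens verified -> [10, 11, 2, 3]
```

Because both toy models here implement the same rule, every draft token is accepted; in the real setting, the draft model agrees with the large model often enough that most proposals are accepted, which is where the speed-up comes from.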
Yeah! It worked like a charm, thank you! PS: it might be a good idea to modify the Colab or put a warning in the markdown cells.
Excellent - glad to hear that @juansebashr! Closing as complete - feel free to open a new issue if you encounter any other problems!
Hi! I was running the Colab code on a Jetson Xavier platform with CUDA 10.8 and a custom-compiled torch 1.8. We can't update JetPack/CUDA right now due to other limitations, but I managed to run the model on GPU. However, I got these translations:
etc...
Does anyone have an idea of what's happening, and where I should start looking to fix it?
Thank you so much :D