[Closed] kmn1024 closed this issue 6 months ago
Could this be because:
Hey @kmn1024! Thanks for opening this super interesting issue.

`distil-medium.en` uses 24 encoder layers, as opposed to just 12 in `distil-small.en`, so I'd expect the memory overhead to come on the encoder side. We decided not to release VRAM (memory) numbers in our benchmarks, since they're very dependent on hardware, CUDA version, and PyTorch version. But we record some of these numbers ourselves. In my provisional benchmark, averaging over 100 samples of the LibriSpeech dataset, I got:
- `distil-small.en`: 1.95GB
- `distil-medium.en`: 2.79GB

=> so quite convincingly, `distil-small.en` was lower memory than `distil-medium.en`.
One reason for higher memory could be more decoding steps in `distil-small.en` vs `distil-medium.en`, possibly because of a hallucination; this would increase the memory of the k/v cache. It could be a good idea to average over a number of different samples, e.g. as per:
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from transformers.models.whisper.english_normalizer import EnglishTextNormalizer
from datasets import load_dataset
from evaluate import load
import torch
from tqdm import tqdm
# define our torch configuration
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "distil-whisper/distil-small.en"
# load the model + processor
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, use_safetensors=True, low_cpu_mem_usage=True)
model = model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
# load a small evaluation split of the dataset
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
# define the evaluation metric
wer_metric = load("wer")
normalizer = EnglishTextNormalizer(processor.tokenizer.english_spelling_normalizer)
def inference(batch):
    # 1. Pre-process the audio data to log-mel spectrogram inputs
    audio = [sample["array"] for sample in batch["audio"]]
    input_features = processor(audio, sampling_rate=batch["audio"][0]["sampling_rate"], return_tensors="pt").input_features
    input_features = input_features.to(device, dtype=torch_dtype)
    # 2. Auto-regressively generate the predicted token ids
    pred_ids = model.generate(input_features, max_new_tokens=128)
    # 3. Decode the token ids to the final transcription
    batch["transcription"] = processor.batch_decode(pred_ids, skip_special_tokens=True)
    batch["reference"] = batch["text"]
    return batch
dataset = dataset.map(function=inference, batched=True, batch_size=16)
all_transcriptions = []
all_references = []
# iterate over the dataset and run inference
for result in tqdm(dataset, desc="Evaluating..."):
all_transcriptions.append(result["transcription"])
all_references.append(result["reference"])
# normalize predictions and references
all_transcriptions = [normalizer(transcription) for transcription in all_transcriptions]
all_references = [normalizer(reference) for reference in all_references]
# compute the WER metric
wer = 100 * wer_metric.compute(predictions=all_transcriptions, references=all_references)
print(wer)
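The k/v-cache effect mentioned above is easy to size with back-of-envelope arithmetic. Below is a minimal sketch, assuming an fp16 cache; the decoder shape used here (4 layers, 12 heads, head dim 64) is illustrative, not taken from the actual model config:

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, bytes_per_elem=2):
    # One key and one value tensor per decoder self-attention layer,
    # each of shape (seq_len, num_heads, head_dim); fp16 = 2 bytes/elem.
    return 2 * num_layers * seq_len * num_heads * head_dim * bytes_per_elem

# Illustrative shape (assumed, check the model config for real values):
print(kv_cache_bytes(num_layers=4, num_heads=12, head_dim=64, seq_len=128) / 2**20, "MiB")
```

The cache grows linearly with `seq_len`, so a hallucinating sample that decodes all the way to `max_new_tokens` holds proportionally more cache memory than one that stops early.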
Hi @kmn1024 👋 I did the conversions to ONNX, so I might have an explanation for this. I believe this is due to the additional output nodes, corresponding to the computed attentions. The reason I exported with these outputs is so that users can generate word-level timestamps with these models (and this might not have been the case for the previous medium models).
If this is something you will not need, you can do the conversions yourself with Optimum:
optimum-cli export onnx -m distil-whisper/distil-small.en output
Thanks all! I can confirm that converting and quantizing from scratch works. The numbers are now:

| Model | RTF | Mem Used |
|---|---|---|
| distil-medium-en | 0.5082902994641076 | 1878.0 MiB |
| distil-small-en | 0.3782055584150302 | 912.6875 MiB |
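For reference, the RTF (real-time factor) column is processing time divided by audio duration, so values below 1 mean faster than real time. A minimal sketch (the timings here are made up, not from the benchmark):

```python
def real_time_factor(processing_seconds, audio_seconds):
    # RTF < 1.0 => the model transcribes faster than real time.
    return processing_seconds / audio_seconds

print(real_time_factor(3.0, 10.0))  # 3 s to transcribe 10 s of audio -> 0.3
```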
P.S. The Optimum quantization command doesn't work out of the box; I had to skip Conv nodes as suggested in https://github.com/microsoft/onnxruntime/issues/15888.
Thanks for the great explanation @xenova!
Setup

CUDA 12.2, GTX 1080. Copied all ONNX quantized models and required config JSONs to their required locations.
Code
Results