huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
MIT License

initial_prompt support? #20

Open silvacarl2 opened 11 months ago

silvacarl2 commented 11 months ago

Does it have initial_prompt support?

We use this a lot.

silvacarl2 commented 11 months ago

like this: https://github.com/openai/whisper/discussions/963

$ whisper --help
optional arguments:
  --initial_prompt INITIAL_PROMPT
                        optional text to provide as a prompt for the first window. (default: None)

$ whisper-ctranslate2 --help
optional arguments:
  --initial_prompt INITIAL_PROMPT
                        optional text to provide as a prompt for the first window. (default: None)

sanchit-gandhi commented 10 months ago

Yes, currently for batch size 1:

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
input_speech = dataset[3]["audio"]["array"]

processor = WhisperProcessor.from_pretrained("distil-whisper/distil-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("distil-whisper/distil-large-v2")
# LibriSpeech audio is sampled at 16 kHz; passing sampling_rate avoids a warning
input_features = processor(input_speech, sampling_rate=16000, return_tensors="pt").input_features

# --- Without prompt ---
output_without_prompt = model.generate(input_features)
print(processor.decode(output_without_prompt[0]))
# <|startoftranscript|><|en|><|transcribe|><|notimestamps|> He has grave doubts whether Sir Frederick Leighton's work is really Greek after all, and can discover in it but little of Rocky Ithaca.<|endoftext|>

# --- With prompt ---
# Let's change the spelling of "Leighton" -> "Layton" by passing it as a prompt
prompt_ids = processor.get_prompt_ids("Layton")
output_with_prompt = model.generate(input_features, prompt_ids=prompt_ids)
print(processor.decode(output_with_prompt[0]))
# <|startofprev|> Layton<|startoftranscript|><|en|><|transcribe|><|notimestamps|> He has grave doubts whether Sir Frederick Layton's work is really Greek after all, and can discover in it but little of Rocky Ithaca.<|endoftext|>

I'll generalise this for batch size N upstream in Transformers!

silvacarl2 commented 10 months ago

THIS IS AWESOME!!!!!!!!!!!!!!!!!!! YOU ROCK!!!!!!!!!!!!!!!!!!!!

SO PERFECT!!!!!!!!!!!!!!!!!!!

sanchit-gandhi commented 10 months ago

Sorry about the delay 😅 Hoping we have this fixed for bs=N very shortly!

silvacarl2 commented 10 months ago

Take your time. This is so cool, we will start testing with it now.

silvacarl2 commented 10 months ago

Question: for larger audio files, do we need to split them up into 30-second chunks ourselves?
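(For context: Whisper's feature extractor operates on 30-second windows, and the Transformers ASR pipeline can handle long audio automatically via its `chunk_length_s` argument. If you do split manually, the usual approach is fixed-length chunks with a small overlap so words cut at a boundary appear in both neighbours. Below is a minimal sketch; `chunk_audio` and its parameters are hypothetical names for illustration, not part of the library.)

```python
import numpy as np

def chunk_audio(audio, sampling_rate=16000, chunk_s=30.0, stride_s=5.0):
    """Split a 1-D audio array into chunk_s-second chunks that overlap
    by stride_s seconds, so boundary words land in two chunks."""
    chunk_len = int(chunk_s * sampling_rate)
    stride_len = int(stride_s * sampling_rate)
    step = chunk_len - stride_len  # advance less than a full chunk
    return [audio[i : i + chunk_len] for i in range(0, len(audio), step)]

# 90 seconds of audio at 16 kHz -> 4 overlapping 30-second chunks
chunks = chunk_audio(np.zeros(90 * 16000))
print(len(chunks))  # 4
```

Each chunk can then be fed through the processor/model as in the snippet above, and the overlapping text merged afterwards.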