huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

AutomaticSpeechRecognition pipeline cannot predict WORD timestamps for Whisper models finetuned without timestamps prediction #30148

Closed Hubert-Bonisseur closed 1 week ago

Hubert-Bonisseur commented 3 months ago

System Info

At present, the AutomaticSpeechRecognition pipeline can predict timestamps either at the word level through cross-attention or by using the timestamp tokens predicted by Whisper. The concern is that opting for word-level prediction also activates timestamp-token prediction, which cannot be disabled as far as I know. This can reduce timestamp accuracy or make word-timestamp prediction fail entirely, particularly for models fine-tuned without timestamp prediction.

Other frameworks can predict word timestamps correctly in this case with the right arguments, for instance OpenAI's whisper package:

import whisper

whisper_model = whisper.load_model("large-v3")
# word_timestamps=True gives word-level timings while without_timestamps=True
# keeps timestamp tokens out of the decoded sequence
result = whisper_model.transcribe("long_audio.wav", without_timestamps=True, word_timestamps=True)
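
For comparison, here is a minimal sketch of the current transformers pipeline call (using openai/whisper-large-v3 as a stand-in model): return_timestamps="word" requests word-level timestamps, but there is no equivalent of without_timestamps=True to keep timestamp-token prediction disabled.

from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
# No argument here lets us keep word timestamps while disabling timestamp tokens
result = pipe("long_audio.wav", return_timestamps="word")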

Who can help?

@sanchit-gandhi

Information

Tasks

Reproduction

Take any model fine-tuned without timestamps and you get bogus word timestamps, because the cross-attention is all over the place.

Expected behavior

It should be possible to disable timestamp generation when requesting word timestamps.

amyeroberts commented 3 months ago

cc @ylacombe too

Hubert-Bonisseur commented 3 months ago

I investigated a bit and got a more precise idea of the issue. There are actually two bugs with the pipeline, I think:

  1. num_frames is not passed to the generate method, which makes the timestamps wrong:

With the pipeline:

from datasets import load_dataset, Audio
from transformers import AutomaticSpeechRecognitionPipeline, WhisperForConditionalGeneration
from transformers import AutoTokenizer, AutoFeatureExtractor

model_path = "BrunoHays/whisper-large-v3-french-illuin"
tokenizer = AutoTokenizer.from_pretrained(model_path)

model = WhisperForConditionalGeneration.from_pretrained(model_path)
processor = AutoFeatureExtractor.from_pretrained(model_path)

ds = load_dataset("mozilla-foundation/common_voice_13_0", "fr", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
item = next(iter(ds["test"]))["audio"]
audio = item["array"]
sr = item["sampling_rate"]

pipe = AutomaticSpeechRecognitionPipeline(model=model, feature_extractor=processor, tokenizer=tokenizer)
transcript = pipe(audio, return_timestamps="word", generate_kwargs={})
print(transcript)

All the word timestamps are set to 29.98s

Without using the pipeline:

features = processor(audio, return_tensors="pt",
                     truncation=False, sampling_rate=sr,
                     return_attention_mask=True)
generated = model.generate(features.input_features,
                           return_timestamps="word",
                           task="transcribe",
                           language="fr",
                           return_token_timestamps=True,
                           num_frames=int(len(audio) / processor.hop_length),  # number of feature frames; word timestamps are wrong without this
                           is_multilingual=True)
print(generated["token_timestamps"])

The word timestamps are now appropriate.

  2. Long-form generation always enables timestamps. I think this is linked to this PR from @patrickvonplaten. I added print(decoder_input_ids) to the forward method of Whisper to check the input tokens of the decoder, and these are the first tokens fed to the forward method: [50258, 50265, 50360, 50365] --> notice that the no_timestamps token is missing and the 0.0 timestamp was generated.
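
For reference, a quick way to check what those IDs correspond to is to decode them with the tokenizer loaded above (a sketch, assuming the large-v3 vocabulary used by this checkpoint):

# There should be no <|notimestamps|> token in the prompt, and the last ID
# should decode to the 0.0 timestamp token
print(tokenizer.convert_ids_to_tokens([50258, 50265, 50360, 50365]))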

Code to reproduce:

from datasets import load_dataset, Audio
from transformers import AutomaticSpeechRecognitionPipeline, WhisperForConditionalGeneration
from transformers import AutoTokenizer, AutoFeatureExtractor

model_path = "BrunoHays/whisper-large-v3-french-illuin"
tokenizer = AutoTokenizer.from_pretrained(model_path)

model = WhisperForConditionalGeneration.from_pretrained(model_path)
processor = AutoFeatureExtractor.from_pretrained(model_path)
ds = load_dataset("BrunoHays/multilingual-TEDX-fr", "max", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
item = next(iter(ds["test"]))["audio"]
audio = item["array"]
sr = item["sampling_rate"]

pipe = AutomaticSpeechRecognitionPipeline(model=model, feature_extractor=processor, tokenizer=tokenizer)
transcript = pipe(audio, return_timestamps=False) # No timestamps  
print(transcript)

I think the audio-chunking approach to long-form transcription should be used when timestamps are deactivated.

amyeroberts commented 2 months ago

Gentle ping @ylacombe

kamilakesbi commented 2 months ago

Hi @Hubert-Bonisseur,

Thanks for sharing this issue!

sanchit-gandhi commented 1 month ago

Agreed @Hubert-Bonisseur and @kamilakesbi! If you don't want to train another model, the only option for long-form transcription is chunked inference:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=25,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

If you're happy to train another model but don't have timestamps in your data, you can try training with LoRA. Using LoRA reduces the amount of catastrophic forgetting, so even though there are no timestamps in the fine-tuning data, the model remembers how to make timestamped predictions. You can see a guide on LoRA fine-tuning using the PEFT library here. Note that you want to run inference in half/full precision (not 8-bit), as outlined here.
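
For the LoRA route, here is a minimal sketch of wrapping Whisper with PEFT adapters before fine-tuning (the target modules and hyperparameters are only illustrative):

from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

base_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# Only the small adapter weights are trained; the frozen base weights keep their
# timestamp-prediction ability, which limits catastrophic forgetting
lora_config = LoraConfig(r=32, lora_alpha=64, target_modules=["q_proj", "v_proj"], lora_dropout=0.05)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()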

Hubert-Bonisseur commented 1 month ago

I missed the notifications, sorry. Thanks for your answers!

If I understand correctly, @sanchit-gandhi, chunked transcription is activated once we pass the chunk_length_s argument? And if it is left at the default, we use the fancy long-form algorithm? Makes sense!

kamilakesbi commented 2 weeks ago

Hi @Hubert-Bonisseur,

You can indeed use chunked transcription by passing the chunk_length_s argument to the pipeline as shown in @sanchit-gandhi's script.

The default behavior is to use the long-form algorithm as it is more efficient :)
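
To make the distinction concrete, a small sketch (assuming a pipeline built without chunk_length_s, unlike the script above; chunk_length_s can also be passed at call time):

from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

# Default: the sequential long-form algorithm, which relies on timestamp tokens
result_long_form = asr("long_audio.wav")

# Chunked transcription: the audio is split into fixed windows and merged afterwards
result_chunked = asr("long_audio.wav", chunk_length_s=30)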

Hope this helps!