[Whisper] Attention mask not detected in `Whisper.generate()`

benniekiss commented 1 month ago

System Info

transformers version: 4.44.0.dev0
Platform: Linux-6.9.7-201.fsync.fc40.x86_64-x86_64-with-glibc2.35
Python version: 3.12.4
Huggingface_hub version: 0.24.1
Safetensors version: 0.4.3
Accelerate version: 0.31.0
Accelerate config: not found
PyTorch version (GPU?): 2.3.1+cu121 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?:
Using GPU in script?:
GPU type: NVIDIA GeForce RTX 3080

Who can help?

@sanchit-gandhi

Information

[X] The official example scripts
[ ] My own modified scripts

Tasks

[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

import torch
from transformers import AutoProcessor, WhisperForConditionalGeneration
from datasets import load_dataset, Audio

processor = AutoProcessor.from_pretrained("openai/whisper-tiny.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")
model.cuda()
# load audios > 30 seconds
ds = load_dataset("distil-whisper/meanwhile", "default")["test"]
# resample to 16kHz
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
# take first 8 audios and retrieve array
audio = ds[:8]["audio"]
audio = [x["array"] for x in audio]

# make sure to NOT truncate the input audio, to return the `attention_mask` and to pad to the longest audio
inputs = processor(audio, return_tensors="pt", truncation=False, padding="longest", return_attention_mask=True, sampling_rate=16_000)
inputs = inputs.to("cuda", torch.float32)

# transcribe audio to ids; `inputs` already contains the `attention_mask` returned by the processor
generated_ids = model.generate(**inputs)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(transcription[0])

Expected behavior

When an attention_mask is passed to generate(), no warning about a missing attention mask should appear. Instead, the following warning is emitted, claiming that the attention_mask was not set:

The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

I think this is because `attention_mask` is never passed down to `generate_with_fallback`, so it never reaches the underlying `super().generate()` call.
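
To illustrate the kind of bug I suspect, here is a toy sketch. It is not the actual transformers code: `BaseGenerator`, `DroppingGenerator`, `ForwardingGenerator`, `_generate_with_fallback`, and `warns` are hypothetical stand-ins, and the warning text is shortened. It only shows how a wrapper that accepts `attention_mask` but does not forward it makes the parent `generate()` believe no mask was given.

```python
import warnings


class BaseGenerator:
    """Hypothetical stand-in for the parent class that owns the real generate()."""

    def generate(self, input_features, attention_mask=None):
        # The parent can only complain about what it actually receives.
        if attention_mask is None:
            warnings.warn("The attention mask is not set.")
        return [len(x) for x in input_features]


class DroppingGenerator(BaseGenerator):
    """Accepts attention_mask in generate() but never forwards it."""

    def generate(self, input_features, attention_mask=None):
        # Suspected bug pattern: the mask stops here.
        return self._generate_with_fallback(input_features)

    def _generate_with_fallback(self, input_features, attention_mask=None):
        return super().generate(input_features, attention_mask=attention_mask)


class ForwardingGenerator(DroppingGenerator):
    """Same wrapper, but the mask is threaded through to the parent call."""

    def generate(self, input_features, attention_mask=None):
        return self._generate_with_fallback(
            input_features, attention_mask=attention_mask
        )


def warns(generator):
    """Return True if calling generate() WITH a mask still triggers a warning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        generator.generate([[0.1, 0.2]], attention_mask=[[1, 1]])
    return len(caught) > 0
```

In this sketch, `warns(DroppingGenerator())` reproduces the symptom (a warning despite a mask being passed), while `warns(ForwardingGenerator())` does not. If my reading is right, the real fix would be analogous: forward the mask through the fallback helper to the parent call.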

ArthurZucker commented 2 weeks ago

Would you like to open a PR for a fix? cc @ylacombe as well!

benniekiss commented 2 weeks ago

I would be happy to!

huggingface/transformers issue #32228