huggingface / transformers

šŸ¤— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

ASR pipeline long-form audio processing requires `return_timestamps=True` #34192

Open as-suvorov opened 1 week ago

as-suvorov commented 1 week ago

System Info

Who can help?

@gante @Rocketknight1

Information

Tasks

Reproduction

reproducer.py

from transformers import pipeline
import datasets
import typing

def get_sample_from_dataset():
    # Stream a single long-form (> 30 s) sample from the meanwhile dataset.
    ds = datasets.load_dataset(
        "distil-whisper/meanwhile",
        split="test",
        streaming=True,
        trust_remote_code=True,
    )

    ds = typing.cast(datasets.IterableDataset, ds)
    # Resample to the 16 kHz sampling rate Whisper expects.
    ds = ds.cast_column("audio", datasets.Audio(sampling_rate=16000))
    ds = ds.take(1)

    return [x["audio"] for x in ds]

sample = get_sample_from_dataset()

whisper = pipeline("automatic-speech-recognition", "openai/whisper-tiny")

transcription = whisper(sample)

print(transcription)

Steps to reproduce:

  1. pip install datasets transformers==4.44.2
  2. python reproducer.py. Actual behavior: the pipeline completes successfully.
  3. pip install transformers==4.45.0
  4. python reproducer.py. Actual behavior: the pipeline fails with the error:
    ValueError: You have passed more than 3000 mel input features (> 30 seconds) which automatically enables long-form generation which requires the model to predict timestamp tokens. Please either pass `return_timestamps=True` or make sure to pass no more than 3000 mel input features.
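The error corresponds to Whisper's 30-second short-form limit: 3000 mel frames at a 10 ms hop is 30 seconds of audio. A minimal sketch of that guard, with illustrative names that are not the transformers internals:

```python
# Illustrative sketch of the long-form guard behind the ValueError above.
# The constant and function names are hypothetical, not transformers code.
# 3000 mel frames * 10 ms hop = 30 seconds of audio.
MAX_SHORT_FORM_FRAMES = 3000

def needs_timestamps(num_mel_frames: int, return_timestamps: bool) -> bool:
    """Return True when long-form input requires `return_timestamps=True`."""
    return num_mel_frames > MAX_SHORT_FORM_FRAMES and not return_timestamps
```

Inputs at or under 3000 frames never trip the check; longer inputs do unless timestamps are requested.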

Expected behavior

There is a change in ASR pipeline behavior between transformers versions 4.44.2 and 4.45.0. Exact PR: Pipeline: no side-effects on model.config and model.generation_config.

With transformers 4.44.2, long-form processing doesn't require return_timestamps=True and completes successfully. Version 4.45.0 requires return_timestamps=True and fails otherwise.

Is this an intended change in behavior?

Rocketknight1 commented 1 week ago

cc @gante I expect that was probably a regression - I have capacity to take this one, but if you think you can fix it quickly, feel free to grab it!

gante commented 1 week ago

@Rocketknight1 please go ahead with the fix šŸ™

as-suvorov commented 1 week ago

@Rocketknight1 , @gante thank you for analysis!

Rocketknight1 commented 1 week ago

On investigation, this isn't really a bug. The whisper models set return_timestamps=False in their generation_config.json, which means it will be used as the default value. The only change is that this value is now being correctly loaded by the pipelines.

@gante I think the new behaviour is probably more correct, and users should just set return_timestamps=True to override the JSON config when they need to?
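Under that reading, the fix on the user side is to pass return_timestamps=True at call time, e.g. whisper(sample, return_timestamps=True), which overrides the False stored in generation_config.json. A minimal sketch of that precedence, with a hypothetical helper name rather than pipeline internals:

```python
# Illustrative precedence rule: an explicit call-time argument wins over the
# default loaded from the model's generation_config.json. The function name
# is hypothetical, not part of the transformers API.
def resolve_return_timestamps(call_value, config_default):
    """Return the explicit `return_timestamps=` call argument if given,
    otherwise fall back to the generation-config default."""
    return config_default if call_value is None else call_value
```

With whisper-tiny's JSON default of False, passing return_timestamps=True at call time resolves to True, so long-form generation can proceed.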