huggingface / transformers

šŸ¤— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

ASR pipeline long-form audio processing requires `return_timestamps=True` #34192

Open as-suvorov opened 1 week ago

as-suvorov commented 1 week ago

System Info

Who can help?

@gante @Rocketknight1

Information

Tasks

Reproduction

reproducer.py

from transformers import pipeline
import datasets
import typing

def get_sample_from_dataset():
    # Stream a single long-form (> 30 s) sample from the meanwhile dataset.
    ds = datasets.load_dataset(
        "distil-whisper/meanwhile",
        split="test",
        streaming=True,
        trust_remote_code=True,
    )

    ds = typing.cast(datasets.IterableDataset, ds)
    # Resample to the 16 kHz sampling rate Whisper expects.
    ds = ds.cast_column("audio", datasets.Audio(sampling_rate=16000))
    ds = ds.take(1)

    return [x["audio"] for x in ds]

sample = get_sample_from_dataset()

whisper = pipeline("automatic-speech-recognition", "openai/whisper-tiny")

transcription = whisper(sample)

print(transcription)

Steps to reproduce:

  1. pip install datasets transformers==4.44.2
  2. python reproducer.py. Actual behavior: the pipeline completes successfully.
  3. pip install transformers==4.45.0
  4. python reproducer.py. Actual behavior: the pipeline fails with the error:
    ValueError: You have passed more than 3000 mel input features (> 30 seconds) which automatically enables long-form generation which requires the model to predict timestamp tokens. Please either pass `return_timestamps=True` or make sure to pass no more than 3000 mel input features.
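The error corresponds to Whisper's 30-second short-form limit: 3000 mel frames at a 10 ms hop is 30 seconds of audio. A minimal sketch of that guard, with illustrative names that are not the transformers internals:

```python
# Illustrative sketch of the long-form guard behind the ValueError above.
# The constant and function names are hypothetical, not transformers code.
# 3000 mel frames * 10 ms hop = 30 seconds of audio.
MAX_SHORT_FORM_FRAMES = 3000

def needs_timestamps(num_mel_frames: int, return_timestamps: bool) -> bool:
    """Return True when long-form input requires `return_timestamps=True`."""
    return num_mel_frames > MAX_SHORT_FORM_FRAMES and not return_timestamps
```

Inputs at or under 3000 frames never trip the check; longer inputs do unless timestamps are requested.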

Expected behavior

There is a change in ASR pipeline behavior between transformers versions 4.44.2 and 4.45.0. Exact PR: Pipeline: no side-effects on model.config and model.generation_config.

With transformers 4.44.2, long-form processing doesn't require return_timestamps=True and completes successfully. Version 4.45.0 requires return_timestamps=True and fails otherwise.

Is this an intended change in behavior?

Rocketknight1 commented 1 week ago

cc @gante I expect that was probably a regression - I have capacity to take this one, but if you think you can fix it quickly, feel free to grab it!

gante commented 1 week ago

@Rocketknight1 please go ahead with the fix šŸ™

as-suvorov commented 1 week ago

@Rocketknight1 , @gante thank you for analysis!

Rocketknight1 commented 1 week ago

On investigation, this isn't really a bug. The whisper models set return_timestamps=False in their generation_config.json, which means it will be used as the default value. The only change is that this value is now being correctly loaded by the pipelines.

@gante I think the new behaviour is probably more correct, and users should just set return_timestamps=True to override the JSON config when they need to?
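Under that reading, the fix on the user side is to pass return_timestamps=True at call time, e.g. whisper(sample, return_timestamps=True), which overrides the False stored in generation_config.json. A minimal sketch of that precedence, with a hypothetical helper name rather than pipeline internals:

```python
# Illustrative precedence rule: an explicit call-time argument wins over the
# default loaded from the model's generation_config.json. The function name
# is hypothetical, not part of the transformers API.
def resolve_return_timestamps(call_value, config_default):
    """Return the explicit `return_timestamps=` call argument if given,
    otherwise fall back to the generation-config default."""
    return config_default if call_value is None else call_value
```

With whisper-tiny's JSON default of False, passing return_timestamps=True at call time resolves to True, so long-form generation can proceed.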