huggingface / transformers

đŸ¤— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Whisper pipeline max_new_tokens generation parameter question #34490

Open as-suvorov opened 4 weeks ago

as-suvorov commented 4 weeks ago

System Info

Who can help?

@Rocketknight1 @ylacombe

Information

Tasks

Reproduction

reproducer.py

from transformers import pipeline
import datasets
import typing

def get_sample_from_dataset():
    # Stream a single long-form audio sample from the
    # distil-whisper/meanwhile test split, resampled to 16 kHz.
    ds = datasets.load_dataset(
        "distil-whisper/meanwhile",
        split="test",
        streaming=True,
        trust_remote_code=True,
    )

    ds = typing.cast(datasets.IterableDataset, ds)
    ds = ds.cast_column("audio", datasets.Audio(sampling_rate=16000))
    ds = ds.take(1)

    return next(iter(ds))["audio"]

sample = get_sample_from_dataset()

whisper = pipeline("automatic-speech-recognition", "openai/whisper-tiny")

# First run: no max_new_tokens limit
transcription = whisper(
    sample.copy(),
    return_timestamps=True,
)

print(transcription['text'])

#  Folks, if you watch the show, you know, I spent a lot of time right over there. Patiently and astutely scrutinizing the boxwood and mahogany chest set of the days, big stories, developing the central headline pawns, definitely maneuvering an OSO topical night to F6, faming of classic Sicilian, named or variation on the news, all the while seeing eight moves deep and patiently marshalling the latest press releases into a Fisher shows in lip nitsky attack that culminates in the The elegant lethal slow played all-pass on checkmate that is my nightly monologue, but sometimes sometimes folks I sometimes I start a little wake upside down in the monkey bars of a condemned playground on a super fun site. Get all hepped up on goofballs, rummage that were discarded tag bag of defective toys. Yank out a fistball of disembodied doll limbs, toss them on a stained kid's place, mad from a defunct denies, set up a table inside a rusty cargo container down by the warf, and challenged toothless drifters to the godless, bug house blitz of tournament that is my segment. Me and Wild.

# Second run: cap generation at 10 new tokens
transcription = whisper(
    sample.copy(),
    max_new_tokens=10,
    return_timestamps=True,
)

print(transcription['text'])

#  Folks, if you watch the show, you that is my nightly monologue, but Let's have tournament that is my segment.

# Outputs from both runs
#  [1st chunk                           ][2nd chunk                           ][3rd chunk                                 ]
#  Folks, if you watch the show, you ... that is my nightly monologue, but ... tournament that is my segment. Me and Wild.
#  Folks, if you watch the show, you that is my nightly monologue, but Let's have tournament that is my segment.

Steps to reproduce:

  1. pip install datasets transformers==4.44.2
  2. python reproducer.py

Expected behavior

I have a question about max_new_tokens for long-form audio sequential processing. It seems that max_new_tokens is currently applied to each processed chunk rather than to the whole output sequence. From the reproducer it looks like the pipeline keeps the first 10 tokens generated for each chunk and then concatenates them into the output. Is this the expected behavior?
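
To illustrate the behavior I'm describing, here is a minimal, self-contained sketch (decode_chunk and transcribe_long_form are hypothetical stand-ins, not the actual pipeline internals): when the budget is applied per chunk, the concatenated transcript can exceed max_new_tokens overall.

# Illustrative sketch only: decode_chunk stands in for one generate() call
# on a single audio chunk; it is not the real pipeline implementation.
def decode_chunk(chunk_token_ids, max_new_tokens):
    return chunk_token_ids[:max_new_tokens]

def transcribe_long_form(chunks, max_new_tokens=10):
    output = []
    for chunk in chunks:
        # The budget is applied per chunk, so the total keeps growing.
        output.extend(decode_chunk(chunk, max_new_tokens))
    return output

# Three chunks with a 10-token budget each -> 30 tokens in total.
print(len(transcribe_long_form([list(range(40))] * 3, max_new_tokens=10)))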

Rocketknight1 commented 4 weeks ago

I think this is expected behaviour! cc @ylacombe if you think it isn't and I can make a patch
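
If a global cap on the transcript is needed in the meantime, one caller-side workaround could be to run the pipeline without max_new_tokens and truncate the concatenated text afterwards. A rough sketch, assuming the openai/whisper-tiny tokenizer and the whisper/sample objects from the reproducer above (truncate_to_token_budget is a hypothetical helper, not a pipeline feature):

from transformers import AutoTokenizer

def truncate_to_token_budget(text, budget):
    # Hypothetical helper: re-tokenize the full transcript and keep only
    # the first `budget` tokens.
    tokenizer = AutoTokenizer.from_pretrained("openai/whisper-tiny")
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    return tokenizer.decode(token_ids[:budget], skip_special_tokens=True)

full = whisper(sample.copy(), return_timestamps=True)
print(truncate_to_token_budget(full["text"], budget=10))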