huggingface / transformers

đŸ¤— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Whisper pipeline max_new_tokens generation parameter question #34490

Open as-suvorov opened 4 weeks ago

as-suvorov commented 4 weeks ago

System Info

Who can help?

@Rocketknight1 @ylacombe

Information

Tasks

Reproduction

reproducer.py

from transformers import pipeline
import datasets
import typing

def get_sample_from_dataset():
    # Stream a single long-form audio sample from the
    # distil-whisper/meanwhile test split, resampled to 16 kHz.
    ds = datasets.load_dataset(
        "distil-whisper/meanwhile",
        split="test",
        streaming=True,
        trust_remote_code=True,
    )

    ds = typing.cast(datasets.IterableDataset, ds)
    ds = ds.cast_column("audio", datasets.Audio(sampling_rate=16000))
    ds = ds.take(1)

    return next(iter(ds))["audio"]

sample = get_sample_from_dataset()

whisper = pipeline("automatic-speech-recognition", "openai/whisper-tiny")

# First run: no max_new_tokens limit
transcription = whisper(
    sample.copy(),
    return_timestamps=True,
)

print(transcription['text'])

#  Folks, if you watch the show, you know, I spent a lot of time right over there. Patiently and astutely scrutinizing the boxwood and mahogany chest set of the days, big stories, developing the central headline pawns, definitely maneuvering an OSO topical night to F6, faming of classic Sicilian, named or variation on the news, all the while seeing eight moves deep and patiently marshalling the latest press releases into a Fisher shows in lip nitsky attack that culminates in the The elegant lethal slow played all-pass on checkmate that is my nightly monologue, but sometimes sometimes folks I sometimes I start a little wake upside down in the monkey bars of a condemned playground on a super fun site. Get all hepped up on goofballs, rummage that were discarded tag bag of defective toys. Yank out a fistball of disembodied doll limbs, toss them on a stained kid's place, mad from a defunct denies, set up a table inside a rusty cargo container down by the warf, and challenged toothless drifters to the godless, bug house blitz of tournament that is my segment. Me and Wild.

# Second run: cap generation at 10 new tokens
transcription = whisper(
    sample.copy(),
    max_new_tokens=10,
    return_timestamps=True,
)

print(transcription['text'])

#  Folks, if you watch the show, you that is my nightly monologue, but Let's have tournament that is my segment.

# Outputs from both runs
#  [1st chunk                           ][2nd chunk                           ][3rd chunk                                 ]
#  Folks, if you watch the show, you ... that is my nightly monologue, but ... tournament that is my segment. Me and Wild.
#  Folks, if you watch the show, you that is my nightly monologue, but Let's have tournament that is my segment.

Steps to reproduce:

  1. pip install datasets transformers==4.44.2
  2. python reproducer.py

Expected behavior

I have a question about max_new_tokens for long-form audio sequential processing. It seems that max_new_tokens is currently applied to each processed chunk rather than to the whole output sequence. From the reproducer it looks like the pipeline keeps the first 10 tokens generated for each chunk and then concatenates them into the output. Is this the expected behavior?
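
To illustrate the behavior I'm describing, here is a minimal, self-contained sketch (decode_chunk and transcribe_long_form are hypothetical stand-ins, not the actual pipeline internals): when the budget is applied per chunk, the concatenated transcript can exceed max_new_tokens overall.

# Illustrative sketch only: decode_chunk stands in for one generate() call
# on a single audio chunk; it is not the real pipeline implementation.
def decode_chunk(chunk_token_ids, max_new_tokens):
    return chunk_token_ids[:max_new_tokens]

def transcribe_long_form(chunks, max_new_tokens=10):
    output = []
    for chunk in chunks:
        # The budget is applied per chunk, so the total keeps growing.
        output.extend(decode_chunk(chunk, max_new_tokens))
    return output

# Three chunks with a 10-token budget each -> 30 tokens in total.
print(len(transcribe_long_form([list(range(40))] * 3, max_new_tokens=10)))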

Rocketknight1 commented 4 weeks ago

I think this is expected behaviour! cc @ylacombe if you think it isn't and I can make a patch
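
If a global cap on the transcript is needed in the meantime, one caller-side workaround could be to run the pipeline without max_new_tokens and truncate the concatenated text afterwards. A rough sketch, assuming the openai/whisper-tiny tokenizer and the whisper/sample objects from the reproducer above (truncate_to_token_budget is a hypothetical helper, not a pipeline feature):

from transformers import AutoTokenizer

def truncate_to_token_budget(text, budget):
    # Hypothetical helper: re-tokenize the full transcript and keep only
    # the first `budget` tokens.
    tokenizer = AutoTokenizer.from_pretrained("openai/whisper-tiny")
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    return tokenizer.decode(token_ids[:budget], skip_special_tokens=True)

full = whisper(sample.copy(), return_timestamps=True)
print(truncate_to_token_budget(full["text"], budget=10))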