Using distributed or parallel set-up in script?: no
Who can help?
@Rocketknight1 @yla
Information
[ ] The official example scripts
[X] My own modified scripts
Tasks
[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)
Reproduction
reproducer.py
```python
from transformers import pipeline
import datasets
import typing


def get_sample_from_dataset():
    ds = datasets.load_dataset(
        "distil-whisper/meanwhile",
        split="test",
        streaming=True,
        trust_remote_code=True,
    )
    ds = typing.cast(datasets.IterableDataset, ds)
    ds = ds.cast_column("audio", datasets.Audio(sampling_rate=16000))
    ds = ds.take(1)
    return next(iter(ds))["audio"]


sample = get_sample_from_dataset()
whisper = pipeline("automatic-speech-recognition", "openai/whisper-tiny")

transcription = whisper(
    sample.copy(),
    return_timestamps=True,
)
print(transcription["text"])
# Folks, if you watch the show, you know, I spent a lot of time right over there. Patiently and astutely scrutinizing the boxwood and mahogany chest set of the days, big stories, developing the central headline pawns, definitely maneuvering an OSO topical night to F6, faming of classic Sicilian, named or variation on the news, all the while seeing eight moves deep and patiently marshalling the latest press releases into a Fisher shows in lip nitsky attack that culminates in the The elegant lethal slow played all-pass on checkmate that is my nightly monologue, but sometimes sometimes folks I sometimes I start a little wake upside down in the monkey bars of a condemned playground on a super fun site. Get all hepped up on goofballs, rummage that were discarded tag bag of defective toys. Yank out a fistball of disembodied doll limbs, toss them on a stained kid's place, mad from a defunct denies, set up a table inside a rusty cargo container down by the warf, and challenged toothless drifters to the godless, bug house blitz of tournament that is my segment. Me and Wild.

transcription = whisper(
    sample.copy(),
    max_new_tokens=10,
    return_timestamps=True,
)
print(transcription["text"])
# Folks, if you watch the show, you that is my nightly monologue, but Let's have tournament that is my segment.

# Outputs from both runs
# [1st chunk                        ][2nd chunk                          ][3rd chunk                     ]
# Folks, if you watch the show, you ... that is my nightly monologue, but ... tournament that is my segment. Me and Wild.
# Folks, if you watch the show, you that is my nightly monologue, but Let's have tournament that is my segment.
```
Steps to reproduce:
```shell
pip install datasets transformers==4.44.2
python reproducer.py
```
Expected behavior
There is a question about `max_new_tokens` for long-form audio sequential processing. It seems `max_new_tokens` is currently applied to each processed chunk rather than to the whole output sequence. From the reproducer, it looks like the pipeline keeps the first 10 tokens from each chunk and then concatenates them into the output.
Is this the expected behavior?
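The suspected behavior can be sketched in plain Python. This is only an illustrative model of the per-chunk truncation observed above, not the actual `transformers` sequential decoding code; `transcribe_chunks` and the fake token lists are made up for the sketch:

```python
# Illustrative sketch (NOT the real pipeline code): max_new_tokens appears to
# cap each chunk's decoded tokens independently, and the truncated per-chunk
# outputs are concatenated into the final transcription.

def transcribe_chunks(chunk_tokens, max_new_tokens=None):
    """Concatenate per-chunk token lists, truncating each chunk separately."""
    out = []
    for tokens in chunk_tokens:
        if max_new_tokens is not None:
            tokens = tokens[:max_new_tokens]  # per-chunk cap, not a global one
        out.extend(tokens)
    return out

# Three fake chunks of 15 "tokens" each, standing in for Whisper decoder output.
chunks = [[f"c{i}_t{j}" for j in range(15)] for i in range(3)]

full = transcribe_chunks(chunks)                       # no cap: 3 * 15 = 45 tokens
capped = transcribe_chunks(chunks, max_new_tokens=10)  # 10 per chunk: 30 tokens
print(len(full), len(capped))  # 45 30
```

Under this model, `max_new_tokens=10` yields roughly 10 tokens per chunk rather than 10 tokens total, which matches the truncated-but-multi-chunk output in the reproducer.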
System Info
transformers version: 4.45.2