[whisper] transcription is different from hf & openai

jsoto-gladia commented 1 month ago

System Info

transformers version: 4.40.2
Platform: Linux-5.15.0-118-generic-x86_64-with-glibc2.35
Python version: 3.10.14
Huggingface_hub version: 0.21.4
Safetensors version: 0.4.2
Accelerate version: 0.24.1
Accelerate config: not found
PyTorch version (GPU?): 2.1.0+cu121 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?:
Using distributed or parallel set-up in script?:

Who can help?

No response

Information

[ ] The official example scripts
[ ] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

The wav file in silence-middle.wav.zip produces the following:

with hf whisper repo (tiny model): Split infinity, and a time when less is more. Where too much is never
with openai whisper repo (tiny model): Split infinity, and a time when less is more.
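
The report does not include the script used, so the following is only a sketch of how the two outputs might be reproduced. It assumes the archive is unzipped to `silence-middle.wav`, that "hf whisper (tiny)" means the `openai/whisper-tiny` checkpoint loaded through the `transformers` ASR pipeline, and that ffmpeg is available for audio loading. The function names are made up for illustration.

```python
def transcribe_with_transformers(path: str) -> str:
    """Transcribe with the Hugging Face pipeline (assumed setup, see lead-in)."""
    # Imported lazily so the helper can be defined without the package installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
    return asr(path)["text"]


def transcribe_with_openai_whisper(path: str) -> str:
    """Transcribe with the openai-whisper package (assumed setup, see lead-in)."""
    import whisper

    model = whisper.load_model("tiny")
    return model.transcribe(path)["text"]


# Example usage, assuming the file sits in the working directory:
#   print("hf:    ", transcribe_with_transformers("silence-middle.wav"))
#   print("openai:", transcribe_with_openai_whisper("silence-middle.wav"))
```

Default decoding options are used on both sides; the original run may have used different ones.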

Expected behavior

I would expect the result to be Split infinity, and a time when less is more. Where too much is never

jsoto-gladia commented 1 month ago

You are missing the re-encoding mechanism that is triggered when EOS is reached within a 30 s segment.
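
To make the idea concrete, here is a toy sketch in plain Python, not code from either repository. It assumes a hypothetical decoder (`toy_decode`) that stops with EOS at the first long silence after speech, and it uses invented timings for the two phrases. It compares stepping through the audio in fixed 30 s chunks with re-encoding a new window that starts at the last decoded timestamp.

```python
from typing import List, Tuple

Phrase = Tuple[float, float, str]  # (start_s, end_s, text)
WINDOW_S = 30.0  # length of one decoding window, in seconds

# Hypothetical timings for illustration only; not measured from the real file.
PHRASES: List[Phrase] = [
    (0.5, 4.0, "Split infinity, and a time when less is more."),
    (10.0, 13.0, "Where too much is never"),
]
DURATION_S = 14.0


def toy_decode(phrases: List[Phrase], window_start: float,
               max_gap_s: float = 2.0) -> List[Phrase]:
    """Hypothetical decoder: returns the phrases inside the window, but stops
    (simulated EOS) at the first silence longer than max_gap_s after speech."""
    window_end = window_start + WINDOW_S
    decoded: List[Phrase] = []
    for start, end, text in phrases:
        if start < window_start or end > window_end:
            continue  # phrase is not fully inside this window
        if decoded and start - decoded[-1][1] > max_gap_s:
            break  # long silence after speech: simulated early EOS
        decoded.append((start, end, text))
    return decoded


def transcribe_fixed_chunks(phrases: List[Phrase], duration_s: float) -> str:
    """Decode each 30 s chunk once; anything after an early EOS is lost."""
    texts: List[str] = []
    start = 0.0
    while start < duration_s:
        texts.extend(text for _, _, text in toy_decode(phrases, start))
        start += WINDOW_S
    return " ".join(texts)


def transcribe_with_seek(phrases: List[Phrase], duration_s: float) -> str:
    """After each decode, re-encode a window starting at the last timestamp."""
    texts: List[str] = []
    seek = 0.0
    while seek < duration_s:
        decoded = toy_decode(phrases, seek)
        if decoded:
            texts.extend(text for _, _, text in decoded)
            seek = decoded[-1][1]  # resume right after the last decoded phrase
        else:
            seek += WINDOW_S  # nothing decoded: skip the whole window
    return " ".join(texts)


print(transcribe_fixed_chunks(PHRASES, DURATION_S))
print(transcribe_with_seek(PHRASES, DURATION_S))
```

In this toy model the fixed-chunk loop returns only the first phrase, while the seek-based loop also recovers the phrase after the silence.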

amyeroberts commented 1 month ago

cc @ylacombe @sanchit-gandhi

ylacombe commented 2 weeks ago

Hey @jsoto-gladia, many thanks for opening this issue!

This looks like an interesting finding. Do you think you could provide code snippets (both in transformers and in the whisper repo) to allow us to reproduce it?
