huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Whisper language is not returned when using word timestamps #29520

Closed: robinderat closed this issue 2 months ago

robinderat commented 6 months ago

Who can help?

@sanchit-gandhi @ArthurZucker

Reproduction

I am using the automatic speech recognition pipeline with Whisper large v3. I want to get both the language as detected by Whisper and the timestamps for the individual words. Both behaviors are supported individually (return_language=True, return_timestamps="word"); however, when combined, the language is no longer returned.

This is the code I am using:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, use_safetensors=True, torch_dtype=torch.float16, low_cpu_mem_usage=True,
)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    torch_dtype=torch.float16,
    return_language=True,  # request the detected language in the output
)

# NOTE: audio_filepath is assumed to point to a local audio file (not defined above)
result = pipe(audio_filepath, return_timestamps="word")
print(result)
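
For illustration, with return_language=True alone each chunk carries the detected language, so the output looks roughly like this (values are placeholders):

{"text": "FULL TEXT", "chunks": [{"language": "english", "timestamp": (0.0, 5.0), "text": "CHUNK TEXT"}]}

As soon as return_timestamps="word" is added, the "language" key disappears from the chunks.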

I have done some digging in the source code and I believe I have found the problem in tokenization_whisper.py.

Here, when using word timestamps, the existing chunks containing the language are ignored and only the words in each chunk are returned:

if return_timestamps or return_language:
    for chunk in chunks:
        if not return_timestamps:
            chunk.pop("timestamp")
        else:
            chunk["timestamp"] = tuple(chunk["timestamp"])
        if not return_language:
            chunk.pop("language")

    if return_timestamps == "word":
        new_chunks = []
        for chunk in chunks:
            # the word dicts replace the chunk dicts here, so the
            # chunk-level "language" entry is silently dropped
            new_chunks.extend(chunk["words"])
        optional = {"chunks": new_chunks}
    else:
        optional = {"chunks": chunks}
else:
    optional = {}
return full_text, optional

Expected behavior

I expect to get both the language and the word timestamps when using return_language=True and return_timestamps="word".

Considering the source code mentioned above, I see two possible ways of achieving this.

1) Add an extra list to the chunks when using return_timestamps='word'

This would result in the format

{
  "text": "FULL TEXT",
  "chunks": [
    {
      "language": "LANGUAGE",
      "timestamp": TIMESTAMP,
      "text": "CHUNK_TEXT",
      "words": [
        {"text": "WORD", "timestamp": WORD_TIMESTAMP}
      ]
    }
  ]
}

2) Add the language information to the words

This would maintain the current structure and simply add the language of the chunk to its words, giving the following format

{
  "text": "FULL TEXT",
  "chunks": [
    {
      "language": "LANGUAGE",
      "timestamp": WORD_TIMESTAMP,
      "text": "WORD"
    }
  ]
}

I believe solution 1 could be achieved simply by removing the if return_timestamps == "word": branch completely, so the chunks keep their nested structure. I believe solution 2 could be achieved by replacing new_chunks.extend(chunk["words"]) with:

for word in chunk["words"]:
    word["language"] = chunk["language"]
    new_chunks.append(word)
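
Applied in context, the word-timestamps branch of the snippet above would then look roughly like this (a sketch, not an exact patch; the return_language guard accounts for "language" having already been popped when it is not requested):

if return_timestamps == "word":
    new_chunks = []
    for chunk in chunks:
        for word in chunk["words"]:
            # propagate the chunk-level language down to each word
            if return_language:
                word["language"] = chunk["language"]
            new_chunks.append(word)
    optional = {"chunks": new_chunks}
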
ArthurZucker commented 6 months ago

That is expected IMO: how do you choose which language to return? Do you return a language for each chunk? For each text segment between timestamps? Do you return only the language that is used the most? I think these were the considerations. Also, getting the language first should be pretty easy.

cc @ylacombe as well

robinderat commented 5 months ago

I have implemented the 2nd approach I outlined in a fork of the project (which is not ideal) and it's working just fine. In my opinion, if you can predict the language for a chunk, there is no harm in saying that all words in that chunk belong to that language.

If there is an easy way to get the language first, that would be fine too, but I don't see how besides running the model twice, which is slow and wasteful.
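
(For reference, one rough way to get the language up front without a full second transcription pass is to score a single decoder step over the language tokens. A minimal sketch, assuming a 16 kHz mono waveform array and a hand-picked candidate set; this is not an official API for language detection:)

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# waveform: a 16 kHz mono audio array, assumed to be loaded elsewhere;
# note this only looks at the first 30 s window
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

# Whisper predicts the language token right after <|startoftranscript|>
sot_id = processor.tokenizer.convert_tokens_to_ids("<|startoftranscript|>")
decoder_input_ids = torch.tensor([[sot_id]])

with torch.no_grad():
    logits = model(inputs.input_features, decoder_input_ids=decoder_input_ids).logits[0, -1]

# score only a candidate set of language tokens, e.g. <|en|>, <|de|>, ...
candidates = ["en", "de", "fr", "es"]
lang_ids = processor.tokenizer.convert_tokens_to_ids([f"<|{c}|>" for c in candidates])
best = max(zip(candidates, lang_ids), key=lambda pair: logits[pair[1]].item())
print(best[0])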

kamilakesbi commented 4 months ago

Hi @robinderat,

Thank you so much for this question!

This is indeed a bug, and fixing it would be a very nice contribution. The second approach you suggest looks good as it would keep the structure of the output and just add the missing attributes.

Would you like to open a PR for this fix, given that you have implemented it in a forked project?

This would be a really valuable fix for the Whisper community :)