How to set language in Whisper pipeline for audio transcription?

melihogutcen commented 1 year ago

Problem

Hello,

I followed this notebook for Whisper pipelines. https://colab.research.google.com/drive/1rS1L4YSJqKUH_3YxIQHBI982zso23wor?usp=sharing#scrollTo=Ca4YYdtATxzo

Here, I want to transcribe speech with the openai/whisper-large-v2 model using the pipeline. With WhisperProcessor we can set the language, but this approach has a drawback for audio files longer than 30 seconds. I used the code below, where I can set the language.

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
device = "cuda" if torch.cuda.is_available() else "cpu"
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2").to(device)
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
inputs = processor.feature_extractor(speech_data, return_tensors="pt", sampling_rate=16_000).input_features.to(device)
generate_ids = model.generate(inputs, max_length=480_000, language="<|tr|>", task="transcribe", return_timestamps=True)
results = processor.tokenizer.decode(generate_ids[0], decode_with_timestamps=True, output_offsets=True)

Long audio files can be processed in the pipeline by setting chunk_length_s as below. But in the pipeline, I couldn't set the language, so I get English output for my Turkish speech data.

from transformers import pipeline
MODEL_NAME = "openai/whisper-large-v2"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    device='cpu')

pipe(speech_file, return_timestamps=True, chunk_length_s=30, stride_length_s=[6,0], batch_size=32)

Is there a way to set the language?

System Info

docker image:

pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime

Transformers Version: transformers==v4.27dev

Who can help?

@sanchit-gandhi @Narsil

Information

[X] The official example scripts
[ ] My own modified scripts

Tasks

[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

from transformers import pipeline
MODEL_NAME = "openai/whisper-large-v2"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    device='cpu')

pipe(speech_file, return_timestamps=True, chunk_length_s=30, stride_length_s=[6,0], batch_size=32)

Expected behavior

Label: "Bazı Türkçe kelimeler."
Prediction: "Some Turkish words."

Narsil commented 1 year ago

@ArthurZucker

AhmedIdr commented 1 year ago

You can add generate_kwargs = {"language":"<|tr|>","task": "transcribe"} to your pipeline initialization and it should work.

ArthurZucker commented 1 year ago

Updated the notebook with the following new line :

pipe(speech_file, generate_kwargs = {"task":"transcribe", "language":"<|fr|>"} )

melihogutcen commented 1 year ago

Voila! I am able to set the language by using generate_kwargs = {"language":"<|tr|>","task": "transcribe"} in pipeline initialization. Thanks.
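
For reference, the approach confirmed above can be sketched as follows. This is a minimal sketch rather than the notebook's code: `make_generate_kwargs` and `transcribe_long_audio` are hypothetical helper names, and the `<|xx|>` token form assumes the Transformers version discussed at this point in the thread. The `transformers` import is deferred into the function so the small helper can be checked without loading a model.

```python
import re


def make_generate_kwargs(lang_code, task="transcribe"):
    """Build the generate_kwargs dict passed to the ASR pipeline.

    lang_code is a short Whisper language code such as "tr" or "fr";
    it is wrapped into the "<|tr|>" token form used in this thread.
    """
    if not re.fullmatch(r"[a-z]{2,3}", lang_code):
        raise ValueError(f"expected a short language code like 'tr', got {lang_code!r}")
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    return {"language": f"<|{lang_code}|>", "task": task}


def transcribe_long_audio(speech_file, lang_code, model_name="openai/whisper-large-v2"):
    """Chunked transcription with an explicit language (downloads the model)."""
    from transformers import pipeline  # imported lazily; requires transformers

    pipe = pipeline(
        task="automatic-speech-recognition",
        model=model_name,
        chunk_length_s=30,  # split long audio into 30-second windows
    )
    return pipe(
        speech_file,
        return_timestamps=True,
        generate_kwargs=make_generate_kwargs(lang_code),
    )


print(make_generate_kwargs("tr"))  # {'language': '<|tr|>', 'task': 'transcribe'}
```

Passing `generate_kwargs` at call time, as in ArthurZucker's line above, keeps one pipeline object reusable across recordings in different languages.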

AnestLarry commented 1 year ago

Hello, I have the same problem, but generate_kwargs = {"language":"<|tr|>","task": "transcribe"} does not work for me.

ValueError: The following `model_kwargs` are not used by the model: ['task', 'language'] (note: typos in the generate arguments will also show up in this list)

Here is the code:

from transformers import WhisperProcessor,WhisperForConditionalGeneration
import whisper
from transformers import pipeline
model = WhisperForConditionalGeneration.from_pretrained("./whisper_tiny_pytorch_model.bin",config="./config.json").to("cuda:0")
processor = WhisperProcessor.from_pretrained("./")
audio = whisper.load_audio("./a.flac")
i = processor(audio,return_tensors="pt").input_features.to("cuda:0")
pipe = pipeline(
  "automatic-speech-recognition",
  model=model,
  tokenizer=processor.tokenizer,
  feature_extractor=processor.feature_extractor,
  chunk_length_s=30,
  device="cuda:0",
)
r = pipe(audio, generate_kwargs  = {"task":"transcribe", "language":"japanese"})

Could you help me?

Env: pytorch==2.1.0.dev20230302+cu117, transformers==4.26.1. The Whisper model was downloaded from Hugging Face.

ArthurZucker commented 1 year ago

Hey @AnestLarry, the language tag that you are using is wrong! As you can see in the generation_config.json, the lang_to_id defines the mapping from language token to the actual input ids. What you should be using (and there is an example of this in the notebook) is the following:

...
pipe(audio, generate_kwargs={"language": "<|ja|>"})

AnestLarry commented 1 year ago

Hey @ArthurZucker ,

r = pipe(audio, generate_kwargs  = {"language":"<|ja|>"})

ValueError: The following `model_kwargs` are not used by the model: ['language'] (note: typos in the generate arguments will also show up in this list)

I still get the same error. When I pass "<|ja|>" to get_decoder_prompt_ids (as a more direct route through model.generate), I get an error telling me to change the argument.

processor.get_decoder_prompt_ids(language="<|ja|>",task="transcribe")

ValueError: Unsupported language: <|ja|>. Language should be one of: ['english', 'chinese', 'german', 'spanish', 'russian', 'korean', 'french', 'japanese', 'portuguese', 'turkish', 'polish', 'catalan', 'dutch', 'arabic', 'swedish', 'italian', 'indonesian', 'hindi', 'finnish', 'vietnamese', 'hebrew', 'ukrainian', 'greek', 'malay', 'czech', 'romanian', 'danish', 'hungarian', 'tamil', 'norwegian', 'thai', 'urdu', 'croatian', 'bulgarian', 'lithuanian', 'latin', 'maori', 'malayalam', 'welsh', 'slovak', 'telugu', 'persian', 'latvian', 'bengali', 'serbian', 'azerbaijani', 'slovenian', 'kannada', 'estonian', 'macedonian', 'breton', 'basque', 'icelandic', 'armenian', 'nepali', 'mongolian', 'bosnian', 'kazakh', 'albanian', 'swahili', 'galician', 'marathi', 'punjabi', 'sinhala', 'khmer', 'shona', 'yoruba', 'somali', 'afrikaans', 'occitan', 'georgian', 'belarusian', 'tajik', 'sindhi', 'gujarati', 'amharic', 'yiddish', 'lao', 'uzbek', 'faroese', 'haitian creole', 'pashto', 'turkmen', 'nynorsk', 'maltese', 'sanskrit', 'luxembourgish', 'myanmar', 'tibetan', 'tagalog', 'malagasy', 'assamese', 'tatar', 'hawaiian', 'lingala', 'hausa', 'bashkir', 'javanese', 'sundanese', 'burmese', 'valencian', 'flemish', 'haitian', 'letzeburgesch', 'pushto', 'panjabi', 'moldavian', 'moldovan', 'sinhalese', 'castilian'].

And I can get a valid result with model.generate.

forced_decoder_ids = processor.get_decoder_prompt_ids(language="japanese",task="transcribe")
r = model.generate(i,forced_decoder_ids = forced_decoder_ids)

out: ['<|startoftranscript|><|ja|><|transcribe|><|notimestamps|>夜が開き出し...<|endoftext|>']
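
The two errors above come from two spellings of the same language: at that version, `get_decoder_prompt_ids` expected a name such as `japanese`, while the suggested `generate_kwargs` used the token `<|ja|>`. A small normalizer can make the difference explicit. This is an illustrative sketch only: `LANGUAGE_CODES` is a hand-written subset, not the library's table, and the function names are hypothetical.

```python
# Hand-written subset for illustration; the full table lives in the library.
LANGUAGE_CODES = {
    "english": "en",
    "french": "fr",
    "japanese": "ja",
    "turkish": "tr",
    "urdu": "ur",
}
CODE_TO_NAME = {code: name for name, code in LANGUAGE_CODES.items()}


def to_language_code(language):
    """Accept "japanese", "ja" or "<|ja|>" and return the bare code "ja"."""
    value = language.strip().lower()
    if value.startswith("<|") and value.endswith("|>"):
        value = value[2:-2]  # strip the special-token delimiters
    if value in LANGUAGE_CODES:
        return LANGUAGE_CODES[value]
    if value in CODE_TO_NAME:
        return value
    raise ValueError(f"unsupported language: {language!r}")


def to_language_token(language):
    """Return the special-token form, e.g. "<|ja|>"."""
    return f"<|{to_language_code(language)}|>"


def to_language_name(language):
    """Return the full lowercase name, e.g. "japanese"."""
    return CODE_TO_NAME[to_language_code(language)]


print(to_language_token("japanese"), to_language_name("<|ja|>"))  # <|ja|> japanese
```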

ArthurZucker commented 1 year ago

Sorry, I guess I should have been clearer: pipe(audio, generate_kwargs={"language": "<|ja|>", "task": "transcribe"}) (I was just showing how to fix the language). Moreover, this is not in the latest release; as the notebook mentions, you have to use the main branch.

AnestLarry commented 1 year ago

Thank you for pointing out the version problem, which I had overlooked. After installing the main branch the code runs without an error message, but setting the language still does not work.

model = WhisperForConditionalGeneration.from_pretrained("./whisper_tiny_pytorch_model.bin",config="./config.json").to("cuda:0")
processor = WhisperProcessor.from_pretrained("./")
audio = whisper.load_audio("./a.mp3")
i = processor(audio,return_tensors="pt").input_features.to("cuda:0")
pipe = pipeline(
  "automatic-speech-recognition",
  model=model,
  tokenizer=processor.tokenizer,
  feature_extractor=processor.feature_extractor,
  chunk_length_s=30,
  device="cuda:0",
)

r = pipe(audio, generate_kwargs = {"language":"<|ja|>","task":"transcribe"})
{'text': " I'm not going bit ...}

I set the language to ja but got an English result (the audio is a Japanese song). Is the code wrong?

ArthurZucker commented 1 year ago

Try using the notebook I provided; your custom model might not be working and I can't debug it for you 😅 Could you try using the openai/whisper-small model as shown in the notebook? Then you can compare the configuration file and generation config.

AnestLarry commented 1 year ago

Thank you very much. My model was downloaded from Hugging Face and I have not changed anything. I just used openai/whisper and it completed the task successfully. I also found that the model file name seems to affect the result 😅 After renaming the model file from whisper_tiny_pytorch_model.bin to pytorch_model.bin, there is no problem.
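
The renaming fix suggests that loading from local files depends on conventional file names. A quick pre-flight check could look like the sketch below. The helper name `missing_checkpoint_files` is hypothetical, and the expected names (`config.json` plus `pytorch_model.bin` or `model.safetensors`) are an assumption based on the names mentioned in this thread and common Hub layouts, not an exhaustive list.

```python
from pathlib import Path

# Assumed conventional weight file names; other layouts (e.g. sharded
# checkpoints) are not covered by this sketch.
WEIGHT_FILE_NAMES = ("pytorch_model.bin", "model.safetensors")


def missing_checkpoint_files(model_dir):
    """Return the names that appear to be missing from a local model directory."""
    root = Path(model_dir)
    missing = []
    if not (root / "config.json").is_file():
        missing.append("config.json")
    if not any((root / name).is_file() for name in WEIGHT_FILE_NAMES):
        missing.append(" or ".join(WEIGHT_FILE_NAMES))
    return missing
```

Calling `missing_checkpoint_files("./")` before `from_pretrained("./")` would have flagged the custom `whisper_tiny_pytorch_model.bin` name in the example above.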

ArthurZucker commented 1 year ago

Great that you no longer have an issue! Thanks for bearing with me 🤗

peregilk commented 1 year ago

After installing the newest Transformers, I now get the following error when setting the language in the pipeline:

  File "/Users/me/miniconda3/envs/torch-gpu/lib/python3.10/site-packages/transformers/models/whisper/modeling_whisper.py", line 1570, in generate
    if generation_config.language in generation_config.lang_to_id.keys():
AttributeError: 'GenerationConfig' object has no attribute 'lang_to_id'

Lauler commented 1 year ago

I had this same issue with our finetuned whisper-large-rixvox @peregilk .

I think what happens is that finetuned Whisper models typically are already configured to predict a specific language during finetuning. When the people who train these models save a checkpoint, there is no "GenerationConfig" generated, as the model is still hardcoded to predict a specific language.

E.g. see generation_config.json from OpenAI/whisper-large-v2 and compare against a finetuned version of whisper where generation_config.json is missing.

If the person who trains a finetuned whisper follows Huggingface's finetuning instructions, there will be no GenerationConfig for the model.

Perhaps there should be a better error message for this @ArthurZucker .

The solution is simply to not specify generate_kwargs at all for any finetuned model where generation_config.json is missing. The finetuned model will predict in the language it was finetuned on without the generate_kwargs.
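
The rule of thumb above can be sketched as a guard around `generate_kwargs`. The helper `safe_generate_kwargs` is hypothetical: it reads `generation_config.json` from a local checkpoint directory and only returns language settings when a `lang_to_id` mapping containing the requested token is present. For Hub-hosted models the file would have to be fetched first, which this sketch does not attempt.

```python
import json
from pathlib import Path


def safe_generate_kwargs(model_dir, language_token, task="transcribe"):
    """Return generate_kwargs only if the checkpoint can honour them.

    Falls back to an empty dict when generation_config.json is missing or
    has no lang_to_id entry for language_token (e.g. "<|tr|>").
    """
    config_path = Path(model_dir) / "generation_config.json"
    if not config_path.is_file():
        return {}
    with config_path.open(encoding="utf-8") as handle:
        config = json.load(handle)
    if language_token not in config.get("lang_to_id", {}):
        return {}
    return {"language": language_token, "task": task}
```

With a fine-tuned checkpoint that ships no generation config, the helper returns `{}`, so the pipeline call falls back to whatever language the model was fine-tuned on.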

sanchit-gandhi commented 1 year ago

Thanks for reporting @peregilk and @Lauler! This is probably quite a good fix right @ArthurZucker? We don't use any of the generation_config logic unless generation_config.json is present on the Hub?

sanchit-gandhi commented 1 year ago

I believe the current workaround is to update the generation config according to this comment: https://github.com/huggingface/transformers/issues/21878#issuecomment-1451902363

This should fix both issues described above. It's cumbersome though and ideally we'd have a way of handling it in transformers!

kadek66 commented 1 year ago

Detecting language using up to the first 30 seconds. Use --language to specify the language
Detected language: Javanese

Hello, I'm using Whisper to translate. How do I change the detected language? What is the code? Thanks in advance.

kamalojasv181 commented 11 months ago

@ArthurZucker @sanchit-gandhi thanks, this worked. However, I would expect model.config.suppress_tokens = [50290] to work as well when I do not want to use the pipeline (50290 is the index of "<|ur|>"; I wanted to suppress Urdu), but I still get the transcription in Urdu. What worked for me in this case was model.config.forced_decoder_ids = processor.tokenizer.get_decoder_prompt_ids(language="english", task="transcribe"). Just curious what is going on behind the scenes. Thanks

sanchit-gandhi commented 11 months ago

Hey @kamalojasv181 - could you try updating the generation_config, since it receives priority over the config:

model.generation_config.suppress_tokens.append(50290)

=> this should set the probability of the <|ur|> token to zero during generation.
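
As a rough illustration of why suppression yields zero probability: a common way to implement it is to set the logits of the suppressed ids to negative infinity before the softmax. The toy below uses plain Python and a five-token vocabulary; it is not the library's implementation, and `suppress_and_softmax` is a hypothetical name.

```python
import math


def suppress_and_softmax(logits, suppress_ids):
    """Mask the given token ids with -inf, then normalise with softmax."""
    masked = [
        -math.inf if index in suppress_ids else value
        for index, value in enumerate(logits)
    ]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in masked]
    total = sum(exps)
    return [value / total for value in exps]


# Token 2 has the highest raw logit but is suppressed.
probs = suppress_and_softmax([2.0, 0.5, 3.0, -1.0, 1.0], suppress_ids={2})
print(probs[2])  # 0.0, since exp(-inf) is exactly zero
```

The sketch assumes at least one token is left unsuppressed; otherwise the normalisation is undefined.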

The recommended API is now to pass language=..., task=... directly to generate. This takes precedence over all generation config / config attributes, and is far easier to set: https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperForConditionalGeneration.generate.language

E.g. see how we set the language="french" and task="transcribe" for this French speech transcription example:

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import Audio, load_dataset

# load model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# load streaming dataset and read first audio sample
ds = load_dataset("common_voice", "fr", split="test", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
input_speech = next(iter(ds))["audio"]

# pre-process audio sample to log-mel spectrogram
input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features

# generate token ids
predicted_ids = model.generate(input_features, language="french", task="transcribe")

# decode token ids to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)

This does the same thing as the forced decoder ids under the hood, setting the task/language token for Whisper: https://huggingface.co/openai/whisper-large-v2#usage
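
To make "setting the task/language token" concrete, the decoder prompt seen earlier in the thread (`<|startoftranscript|><|ja|><|transcribe|><|notimestamps|>`) can be assembled as below. This is a sketch on token strings only; the real model works with token ids, and `build_decoder_prompt` is a hypothetical helper, not a library function.

```python
def build_decoder_prompt(lang_code, task="transcribe", timestamps=False):
    """Return the special tokens that prefix Whisper's decoder input."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    tokens = ["<|startoftranscript|>", f"<|{lang_code}|>", f"<|{task}|>"]
    if not timestamps:
        # Without timestamps, the prompt ends with the no-timestamps marker.
        tokens.append("<|notimestamps|>")
    return tokens


print("".join(build_decoder_prompt("ja")))
# <|startoftranscript|><|ja|><|transcribe|><|notimestamps|>
```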

kamalojasv181 commented 11 months ago

Thanks

guimingyue commented 6 months ago

This helped me just now.
