huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Whisper - get probability of detected language #29293

Open antoinethl opened 6 months ago

antoinethl commented 6 months ago

System Info

Who can help?

@sanchit-gandhi, I guess, since he is the one who provided the answer in the previous GitHub issue.

Information

Tasks

Reproduction

Following #25138, @sanchit-gandhi provided an answer for retrieving the language with the Whisper model and processor (since Whisper's conditional tokens include the language token). He later provided a small adaptation to also get the probability of the language, which is a nice possibility. However, with the latest version of transformers this no longer seems possible (that's why I file it as a bug, though it could also be a feature request).

Quick example to check:

import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# load the model and processor used for language identification
language_identification = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to("cuda:0")
lid_processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# load the audio at Whisper's expected 16 kHz sampling rate
audio, _ = librosa.load(<my_file>, sr=16000)

# extract the log-mel input features
lid = lid_processor(audio, sampling_rate=16000, return_tensors="pt", truncation=True)
input_features = lid.input_features.to("cuda:0", torch.float32)

# generate a single token and ask for the per-step scores
outputs = language_identification.generate(
    input_features,
    output_scores=True,
    return_dict_in_generate=True,
    max_new_tokens=1,
)

pred_text = lid_processor.batch_decode(outputs.sequences, skip_special_tokens=False)
pred_text

pred_text is:

['<|startoftranscript|><|en|><|transcribe|><|notimestamps|> 80']

Here we see the conditional tokens as well as my single transcription token 80 (because of max_new_tokens=1). The issue is that the outputs.scores object (which holds the scores of each generated token; its shape is (N_TOKEN, 1, 51865), where 51865 is Whisper's vocabulary size) only contains the scores for the tokens generated after the conditional tokens. That is, outputs.scores has a length of only 1 because I asked for only 1 generated token (if I had asked for 5, I would have got a length of 5).

This means that computing the transition scores as follows:

transition_scores = language_identification.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)

will only produce the scores for the tokens generated after the special tokens SoT, language, task and notimestamps (when timestamps are not requested).
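
For reference, here is a minimal sketch (continuing the snippet above) that makes the mismatch visible; the lengths in the comments assume the default whisper-small generation config:

# outputs.scores holds one entry per *generated* token, so its length is 1 here,
# while the decoded sequence also contains the forced prompt tokens
print(len(outputs.scores))            # 1 -> only the transcription token "80"
print(outputs.sequences.shape[-1])    # 5 -> <|startoftranscript|>, <|en|>, <|transcribe|>, <|notimestamps|>, "80"

# consequently the transition scores only cover the tokens after the prompt
print(transition_scores.shape)        # (1, 1)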

I also tried without asking for timestamps, because my guess was that since the notimestamps token comes after the language and task tokens, having it injected manually might make the code fall into a special if branch where the scores of the previous tokens (language and task) are somehow ignored.

Expected behavior

I would have expected outputs.scores to include the score for the language token (if the language isn't forced, obviously), as was apparently the case according to the answer in #25138.

With that, we could easily read off the score for the detected language, and maybe even build a ranking (like EN with a score of 0.8, FR with a score of 0.1, and so on), as sketched below.
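
For illustration, a rough sketch of the ranking I have in mind, assuming a hypothetical lang_logits tensor of shape (1, vocab_size) holding the model's logits at the language-token position (exactly what the current generate output no longer exposes), and assuming the generation config exposes a lang_to_id mapping from tokens such as "<|en|>" to their ids:

# hypothetical: lang_logits is the (1, vocab_size) logits at the language position
lang_to_id = language_identification.generation_config.lang_to_id
lang_ids = torch.tensor(list(lang_to_id.values()), device=lang_logits.device)

# restrict the softmax to the language tokens and rank them by probability
lang_probs = lang_logits[0, lang_ids].softmax(dim=-1)
ranking = sorted(zip(lang_to_id.keys(), lang_probs.tolist()), key=lambda x: x[1], reverse=True)
print(ranking[:5])  # e.g. [('<|en|>', 0.8), ('<|fr|>', 0.1), ...]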

antoinethl commented 6 months ago

I checked with a previous version of transformers (4.28.1) and it seems to work: with max_new_tokens=1, only the language token is generated, so the associated score is indeed the one of the language token.

So this looks like a regression / bug in the new version of transformers.

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ArthurZucker commented 5 months ago

Hey @antoinethl. Sorry for the delay. When you tried with the older version of transformers, are you sure that the decoder_input_ids were not just 2 tokens? This could just mean that the generation config was changed (lots of updates), and by default it adds notimestamps and the predicted language token.

I am not the best person to talk about this, as I missed a few issues, but it sounds like it's not a regression, maybe a feature request!

sanchit-gandhi commented 5 months ago

Hey @antoinethl, sorry for the delay here. Previously, we computed the log-probs for the language and task tokens, even if these were implicitly specified by the user. Since https://github.com/huggingface/transformers/pull/28687, we now pass the language and task tokens as decoder input ids to the model. This saves two forward passes of the decoder per decoding loop, since we no longer have to run a forward pass for these tokens, but we lose the log-prob computation.

If you're happy doing an extra forward pass of the encoder and decoder, you can compute the language probability scores as follows:

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset, Audio
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny", low_cpu_mem_usage=True)
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model.to(device)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(16_000))
sample = next(iter(dataset))

# pre-process the audio inputs for sequential long form generation
inputs = processor([sample["audio"]["array"], sample["audio"]["array"]], padding=True, truncation=False, return_attention_mask=True, return_tensors="pt", sampling_rate=16_000).to(device)

input_stride = model.model.encoder.conv1.stride[0] * model.model.encoder.conv2.stride[0]
num_segment_frames = input_stride * model.config.max_source_positions
batch_size = inputs.input_features.shape[0]

# predict the language from the first 30-second chunk
decoder_input_ids = (torch.ones((batch_size, 1), device=device, dtype=torch.long) * model.generation_config.decoder_start_token_id)
input_features = inputs.input_features[:, :, :num_segment_frames]

with torch.no_grad():
    logits = model(input_features, decoder_input_ids=decoder_input_ids).logits[:, -1]

# auto-regressively generate
pred_ids = model.generate(**inputs)
pred_text = processor.batch_decode(pred_ids)

# raw (unnormalized) logit of the predicted language token for each sample
language_probs = torch.gather(logits, 1, pred_ids[:, 1:2]).squeeze(1)
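
If an actual probability is wanted rather than the raw logit, a softmax over the vocabulary dimension can be applied before gathering (a small possible addition to the snippet above, not part of the original example):

# normalize the logits over the vocabulary so that the gathered value is a probability
probs = logits.softmax(dim=-1)
language_probs = torch.gather(probs, 1, pred_ids[:, 1:2]).squeeze(1)
print(language_probs)  # one probability per sample for the predicted language token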

We use a similar logic in the generation code in Whisper: https://github.com/huggingface/transformers/blob/b7d002bdff3646cfd55f120b2b9e1b065d54fae5/src/transformers/models/whisper/generation_whisper.py#L1210

If you feel strongly that the language prob should also be part of the generation output, this is definitely something we can discuss. It's the first time I've seen this requested since we did the refactoring of Whisper generate, so to me it looks like solving it with an extra few lines of code and doing an extra forward pass might be the easiest solution here.

cc @kamilakesbi

amyeroberts commented 4 months ago

Gentle ping @kamilakesbi

sanchit-gandhi commented 4 months ago

For now I haven't seen any further requests to integrate the language probability as part of the Whisper output. In the interest of keeping the outputs from generate consistent with other models, I suggest we leave the generation code as is, and encourage users to run an extra encoder + decoder forward pass should they need the language probs.

sanchit-gandhi commented 3 months ago

Note that the return_language argument is available when using the pipeline API. You can use it as follows, @antoinethl:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=256,
    chunk_length_s=30.0,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample, return_language=True)
print(result)

Which gives the predicted language for each chunk:

{'text': ' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.',
 'chunks': [{'language': 'english',
   'timestamp': (0.0, 5.86),
   'text': ' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.'}]}

hanif-rt commented 2 months ago

return_language doesn't seem to work with word-level timestamps.

kamilakesbi commented 2 months ago

Hi @hanif-rt, this should be solved with PR #31572 :)

dengchengxifrank commented 2 weeks ago

@sanchit-gandhi Hi, I wonder how I can directly get the logits from the "generate" method. Thanks

ArthurZucker commented 2 weeks ago

You should use output_scores=True, return_dict_in_generate=True when calling generate.
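
For illustration, a minimal sketch of that call, assuming model and input_features are set up as in the earlier snippet in this thread:

# ask generate to return a dict that includes the per-step scores
outputs = model.generate(
    input_features,
    output_scores=True,
    return_dict_in_generate=True,
)

# outputs.sequences holds the generated token ids; outputs.scores is a tuple with
# one (batch_size, vocab_size) tensor of processed logits per generated token
print(outputs.sequences.shape)
print(len(outputs.scores), outputs.scores[0].shape)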

dengchengxifrank commented 2 weeks ago

@ArthurZucker Thanks!!! I am also wondering how I can check the original generate function code. In generation_whisper.py I can see super().generate(), but I don't know where I can find the parent class's implementation.

ArthurZucker commented 2 weeks ago

https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L1588 🤗

dengchengxifrank commented 2 weeks ago

@ArthurZucker Thanks! 🤗

vchagari commented 1 week ago

@sanchit-gandhi: Setting the "return_language" flag to true is not helping for the multi-lingual use case. The model returns only one language even though there are multiple languages in the given audio, and for English it returns the language id as None.

Test case and results: the audio has the following contents: "Hello, how are you? Hola, como estas? Bonjour, como se va?". The model gave the following result: {'text': 'Hello, how are you? Hola, como estas? Bonjour, como se va?', 'chunks': [{'language': None, 'text': 'Hello, how are you? Hola, como estas? Bonjour, como se va?'}]}

Ask: is there a way to get all the language ids and their probabilities via the pipeline interface?

ArthurZucker commented 1 week ago

That is why this is a feature request! As for getting all the predicted language ids, I am not entirely sure; I have not dug into the generate code in a while, but the model itself cannot switch languages within a 30-second window; it is only possible between 30-second segments.
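
As a possible workaround (a rough sketch only, reusing the single-forward-pass idea from earlier in this thread; model, processor and device are assumed to be set up as in that snippet, and lang_to_id is assumed to be available in the generation config), the audio could be split into 30-second chunks and language detection run once per chunk:

import torch

def detect_language_per_chunk(audio_array, sampling_rate=16_000, chunk_s=30.0):
    """Return (language_token, probability) for each 30-second chunk of the audio."""
    lang_to_id = model.generation_config.lang_to_id
    lang_tokens = list(lang_to_id.keys())
    lang_ids = torch.tensor(list(lang_to_id.values()), device=device)

    chunk_len = int(chunk_s * sampling_rate)
    results = []
    for start in range(0, len(audio_array), chunk_len):
        chunk = audio_array[start : start + chunk_len]
        inputs = processor(chunk, sampling_rate=sampling_rate, return_tensors="pt").to(device)

        # one decoder step from <|startoftranscript|>: the next token is the language token
        decoder_input_ids = torch.tensor(
            [[model.generation_config.decoder_start_token_id]], device=device
        )
        with torch.no_grad():
            logits = model(inputs.input_features, decoder_input_ids=decoder_input_ids).logits[:, -1]

        # restrict to the language tokens and keep the most probable one for this chunk
        lang_probs = logits[0, lang_ids].softmax(dim=-1)
        best = lang_probs.argmax().item()
        results.append((lang_tokens[best], lang_probs[best].item()))
    return results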

pranavchaturved commented 15 hours ago

+1 to register the request for integrating the language probability as part of the Whisper output.

Suggestion: when return_language_prob is passed to pipe(), return language_prob for the language with the maximum probability.

@sanchit-gandhi