huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Rescale layer in whisper processor #19888

Closed JeffreyWardman closed 1 year ago

JeffreyWardman commented 1 year ago

Feature request

The Whisper processor does not currently rescale audio inputs to the [-1, 1) range that the model expects.

Motivation

Consistency between model processor layers.

Your contribution

-

sgugger commented 1 year ago

Please provide a code reproducer for the bug you are experiencing or there is nothing we can do to help.

JeffreyWardman commented 1 year ago
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCTC,
    AutoProcessor,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

def inference(audio_input, processor, model):
    output = processor(audio_input, sampling_rate=16000, return_tensors="pt")

    if "whisper" in processor.tokenizer_class.lower():
        # Whisper: the processor returns log-Mel features; generate() returns token ids
        input_features = output.input_features
        with torch.no_grad():
            predicted_ids = model.generate(input_features)
        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True, output_word_offsets=True)[0]
    else:
        # Wav2Vec2 (CTC): the processor returns waveform values; decode the argmax of the logits
        input_features = output.input_values
        with torch.no_grad():
            logits = model(input_features).logits[0]
            predicted_ids = torch.argmax(logits, dim=-1)
        transcription = processor.decode(predicted_ids, output_word_offsets=True)
    return transcription

def get_transcript(audio, model, processor):
    audio_scaled = ((audio - audio.min()) / (audio.max() - audio.min())) * (2) - 1
    scaled_transcription = inference(audio_scaled, processor, model)
    unscaled_transcription = inference(audio, processor, model)
    return {"scaled": scaled_transcription, "unscaled": unscaled_transcription}

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio = ds[0]["audio"]["array"]
audio = ((audio - audio.min()) / (audio.max() - audio.min())) * 65535  # Rescale to [0, 65535] to show issue

whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-base.en")
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base.en").to("cpu")

wav2vec_processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
wav2vec_model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h")

whisper_transcripts = get_transcript(audio, whisper_model, whisper_processor)
wav2vec_transcripts = get_transcript(audio, wav2vec_model, wav2vec_processor)
print(f"WHISPER: {whisper_transcripts}")
print(f"WAV2VEC: {wav2vec_transcripts}")

Output:

WHISPER: {'scaled': ' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.', 
'unscaled': ' I'}

WAV2VEC: {'scaled': Wav2Vec2CTCTokenizerOutput(text='MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL', char_offsets=None, word_offsets=[{'word': 'MISTER', 'start_offset': 28, 'end_offset': 40}, {'word': 'QUILTER', 'start_offset': 43, 'end_offset': 60}, {'word': 'IS', 'start_offset': 66, 'end_offset': 69}, {'word': 'THE', 'start_offset': 72, 'end_offset': 76}, {'word': 'APOSTLE', 'start_offset': 80, 'end_offset': 103}, {'word': 'OF', 'start_offset': 109, 'end_offset': 111}, {'word': 'THE', 'start_offset': 115, 'end_offset': 118}, {'word': 'MIDDLE', 'start_offset': 120, 'end_offset': 131}, {'word': 'CLASSES', 'start_offset': 133, 'end_offset': 156}, {'word': 'AND', 'start_offset': 168, 'end_offset': 172}, {'word': 'WE', 'start_offset': 174, 'end_offset': 178}, {'word': 'ARE', 'start_offset': 181, 'end_offset': 185}, {'word': 'GLAD', 'start_offset': 187, 'end_offset': 200}, {'word': 'TO', 'start_offset': 205, 'end_offset': 209}, {'word': 'WELCOME', 'start_offset': 212, 'end_offset': 229}, {'word': 'HIS', 'start_offset': 234, 'end_offset': 240}, {'word': 'GOSPEL', 'start_offset': 245, 'end_offset': 267}]),
 'unscaled': Wav2Vec2CTCTokenizerOutput(text='MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL', char_offsets=None, word_offsets=[{'word': 'MISTER', 'start_offset': 28, 'end_offset': 40}, {'word': 'QUILTER', 'start_offset': 43, 'end_offset': 60}, {'word': 'IS', 'start_offset': 66, 'end_offset': 69}, {'word': 'THE', 'start_offset': 72, 'end_offset': 76}, {'word': 'APOSTLE', 'start_offset': 80, 'end_offset': 103}, {'word': 'OF', 'start_offset': 109, 'end_offset': 111}, {'word': 'THE', 'start_offset': 115, 'end_offset': 118}, {'word': 'MIDDLE', 'start_offset': 120, 'end_offset': 131}, {'word': 'CLASSES', 'start_offset': 133, 'end_offset': 156}, {'word': 'AND', 'start_offset': 168, 'end_offset': 172}, {'word': 'WE', 'start_offset': 174, 'end_offset': 178}, {'word': 'ARE', 'start_offset': 181, 'end_offset': 185}, {'word': 'GLAD', 'start_offset': 187, 'end_offset': 200}, {'word': 'TO', 'start_offset': 205, 'end_offset': 209}, {'word': 'WELCOME', 'start_offset': 212, 'end_offset': 229}, {'word': 'HIS', 'start_offset': 234, 'end_offset': 240}, {'word': 'GOSPEL', 'start_offset': 245, 'end_offset': 267}])}
JeffreyWardman commented 1 year ago

You can see above that the transcript is gibberish when the unscaled audio is passed to the Whisper model. This is because the model receives input in the range [0, 65535] rather than the expected [-1, 1].
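
As a stopgap, the waveform can be rescaled to [-1, 1] manually before it is handed to the processor. A minimal sketch, reusing audio and whisper_processor from the snippet above:

import numpy as np

def rescale_to_unit_range(audio: np.ndarray) -> np.ndarray:
    # map arbitrary-range audio (here [0, 65535]) linearly onto [-1, 1]
    audio = audio.astype(np.float32)
    return 2 * (audio - audio.min()) / (audio.max() - audio.min()) - 1

output = whisper_processor(rescale_to_unit_range(audio), sampling_rate=16000, return_tensors="pt")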

sgugger commented 1 year ago

Thanks! cc @sanchit-gandhi and @ArthurZucker

sanchit-gandhi commented 1 year ago

Hey @JeffreyWardman, this is a really interesting issue! I've chosen not to compare Whisper to Wav2Vec2 in my analysis, as these two systems are intrinsically different in how they process the audio inputs:

With Wav2Vec2, we first normalise the raw audio inputs to (mean, std) = (0, 1). We then pass the normalised audio inputs to the model (as you have done in your code example). In this way, Wav2Vec2 takes the normalised audio values directly as its model inputs.

This is exactly the operation that the Wav2Vec2 feature extractor performs for us:

normalised_audio = wav2vec_processor.feature_extractor(audio).input_values
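
For reference, a rough sketch of the zero-mean, unit-variance normalisation the Wav2Vec2 feature extractor applies (assuming default settings and no padding; reusing audio and wav2vec_processor from the reproducer above):

import numpy as np

def zero_mean_unit_var(audio: np.ndarray, eps: float = 1e-7) -> np.ndarray:
    # normalise the raw waveform to (mean, std) ~ (0, 1)
    return (audio - audio.mean()) / np.sqrt(audio.var() + eps)

manual = zero_mean_unit_var(audio)
from_extractor = wav2vec_processor.feature_extractor(audio, sampling_rate=16000).input_values[0]
print(np.allclose(manual, from_extractor, atol=1e-4))  # expected: True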

With Whisper, we first convert the raw audio inputs to a log-Mel spectrogram, and then feed this spectrogram to the Whisper model. In contrast to Wav2Vec2, Whisper takes the log-Mel features as inputs to the model (rather than audio values).

The audio -> log-Mel conversion is exactly the operation that the Whisper feature extractor performs for us:

logmel_features = whisper_processor.feature_extractor(audio).input_features
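
A quick sanity check, reusing audio and whisper_processor from above, that the extractor works directly on the raw waveform, so its output depends on the input scale:

import numpy as np

raw_features = np.array(whisper_processor.feature_extractor(audio, sampling_rate=16000).input_features)
rescaled = 2 * (audio - audio.min()) / (audio.max() - audio.min()) - 1
rescaled_features = np.array(whisper_processor.feature_extractor(rescaled, sampling_rate=16000).input_features)

print(raw_features.shape)                            # (1, 80, 3000): 80 Mel bins over 30 s of padded audio
print(np.allclose(raw_features, rescaled_features))  # expected: False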

I've had a dig through the original Whisper codebase and compared it to the paper - it seems as though they perform the feature normalisation in the log-Mel space (c.f. Section 2.2 of the paper):

[Screenshot: excerpt from Section 2.2 of the Whisper paper describing the log-Mel feature normalisation]
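
For reference, that normalisation happens after the Mel filterbank, roughly along these lines (paraphrasing whisper/audio.py; not a verbatim copy):

import torch

def normalise_log_mel(mel_spec: torch.Tensor) -> torch.Tensor:
    log_spec = torch.clamp(mel_spec, min=1e-10).log10()
    # clamp the dynamic range to 8 orders of magnitude below the per-clip maximum
    log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)
    # shift and rescale so the values land roughly in [-1, 1]
    return (log_spec + 4.0) / 4.0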

To check whether we missed something with our implementation, I ran your code example on the original Whisper repo. To reproduce this, first install the original (OpenAI) version of the model from https://github.com/openai/whisper:

pip install git+https://github.com/openai/whisper.git

I then tweaked your code snippet to make it compatible with the OpenAI model, following the "official" example provided in https://colab.research.google.com/github/openai/whisper/blob/master/notebooks/LibriSpeech.ipynb:

import torch
import whisper
from datasets import load_dataset

device = "cuda" if torch.cuda.is_available() else "cpu"

model = whisper.load_model("base.en")
model.to(device)

# define the decoding options
options = whisper.DecodingOptions(language="en", without_timestamps=True)

# load audio sample as before
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio = ds[0]["audio"]["array"]
audio = ((audio - audio.min()) / (audio.max() - audio.min())) * 65535  # Rescale to [0, 65535] to show issue

def inference(audio):
  # whisper pre-processor expects torch tensors (not np.arrays or lists)
  audio = torch.tensor(audio).float()  # cast to float32 for the STFT / Mel filterbank
  audio = whisper.pad_or_trim(audio.flatten()).to(device)
  mel = whisper.log_mel_spectrogram(audio)

  results = model.decode(mel, options)
  return results.text

def get_transcript(audio):
  audio_scaled = ((audio - audio.min()) / (audio.max() - audio.min())) * (2) - 1
  scaled_transcription = inference(audio_scaled)
  unscaled_transcription = inference(audio)
  return {"scaled": scaled_transcription, "unscaled": unscaled_transcription}

original_transcripts = get_transcript(audio)
print("ORIGINAL OpenAI: \n", original_transcripts)

Print output:

ORIGINAL OpenAI:  
{'scaled': 'Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.',
'unscaled': 'I'}

This is the same output that we got with Transformers Whisper, so we can be sure that the Transformers implementation matches the official OpenAI one ✅ That means this is an intrinsic property of the Whisper model (rather than a Transformers implementation issue). I think it comes down to the fact that the Whisper model does not normalise the audio inputs before computing the log-Mel spectrogram.

In Transformers, we aim to provide a matching implementation to the original model. In that regard, I don't think that we can currently change the codebase for the Transformers Whisper model to normalise audio samples before computing the log-Mel spectrogram features, since this is an inherent limitation of the Whisper model. Instead, what I'll do is post this issue on the original codebase and ask the authors whether this behaviour is expected. If they update their codebase to normalise the inputs, we can do the same in Transformers 🤗

Hope that makes sense and thank you for the great issue!

(edit: opened a discussion thread on the original OpenAI repo, awaiting the author's response https://github.com/openai/whisper/discussions/428#discussion-4510905)

ArthurZucker commented 1 year ago

Thanks a lot @sanchit-gandhi 💯, totally agree with you. Also, in the various tests that I ran during the integration, I did not really have any issues with custom inputs, so I am also wondering if there are any potential applications for this feature request. If yes, we could definitely add an optional argument, but otherwise I am happy to keep it close to the original codebase! 👍🏻

sanchit-gandhi commented 1 year ago

I think it makes sense to offer an (optional) argument to the feature extractor indicating whether the audio inputs should be normalised in the audio space.

This would look something along the lines of:

from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base.en")
# don't normalise
input_features = feature_extractor(audio, do_normalise=False).input_features[0]
# do normalise
input_features = feature_extractor(audio, do_normalise=True).input_features[0]

-> we can add this quite easily for more control over inference

c.f. https://github.com/openai/whisper/discussions/428#discussioncomment-4057857
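
A rough sketch of what such a flag could gate (the argument name and the compute_log_mel helper are illustrative only, not the final API): normalise the waveform to zero mean and unit variance before the log-Mel features are computed.

import numpy as np

def extract_features(audio: np.ndarray, do_normalise: bool = False) -> np.ndarray:
    if do_normalise:
        # optional zero-mean, unit-variance normalisation in the audio space
        audio = (audio - audio.mean()) / np.sqrt(audio.var() + 1e-7)
    return compute_log_mel(audio)  # hypothetical helper standing in for the existing log-Mel extraction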

ArthurZucker commented 1 year ago

Adding it to my Whisper to-do list.