huggingface / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js

[Feature request] Whisper Language Detection #302

Open · FelippeChemello opened 1 year ago

FelippeChemello commented 1 year ago

Name of the feature Language Detection with the Whisper Model in Transformers.js

Reason for request The original Whisper model includes a dedicated function for language detection. It would be awesome to have a similar capability within Transformers.js. In my current application, Whisper models run efficiently in the browser, but for language detection I find myself repeatedly making requests to my backend service.

Additional context Is it possible to implement language detection as an event returned from the pipeline?

xenova commented 1 year ago

Hi there! 👋 Definitely a possibility, I'd say! I assume the original (openai) library analyses the scores (logits) over the different language tokens and picks the most likely one. Do you perhaps have example code for how to achieve this with the python transformers library?

FelippeChemello commented 1 year ago

Hi! I found this code in a discussion thread on Hugging Face. It's not part of the default implementation of the transformers library, but perhaps it could be added to the library to make it easier to obtain this information.

from typing import Collection, Dict, List, Optional

import torch
from transformers import WhisperForConditionalGeneration, WhisperTokenizer


def detect_language(model: WhisperForConditionalGeneration, tokenizer: WhisperTokenizer, input_features,
                    possible_languages: Optional[Collection[str]] = None) -> List[Dict[str, float]]:
    # hacky, but all language tokens and only language tokens are 6 characters long
    language_tokens = [t for t in tokenizer.additional_special_tokens if len(t) == 6]
    if possible_languages is not None:
        language_tokens = [t for t in language_tokens if t[2:-2] in possible_languages]
        if len(language_tokens) < len(possible_languages):
            raise RuntimeError(f'Some languages in {possible_languages} did not have associated language tokens')

    language_token_ids = tokenizer.convert_tokens_to_ids(language_tokens)

    # 50258 is the token id of <|startoftranscript|>; the model predicts the
    # language token directly after it
    logits = model(input_features,
                   decoder_input_ids=torch.tensor([[50258] for _ in range(input_features.shape[0])])).logits

    # mask out everything except the language tokens before taking the softmax
    mask = torch.ones(logits.shape[-1], dtype=torch.bool)
    mask[language_token_ids] = False
    logits[:, :, mask] = -float('inf')

    output_probs = logits.softmax(dim=-1).cpu()
    return [
        {
            lang: output_probs[input_idx, 0, token_id].item()
            for token_id, lang in zip(language_token_ids, language_tokens)
        }
        for input_idx in range(logits.shape[0])
    ]
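
For reference, here is a minimal sketch of how the helper above could be called with the python transformers library (the openai/whisper-tiny checkpoint and the audio file name are just placeholder assumptions):

import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Placeholder checkpoint; any Whisper checkpoint should work
model = WhisperForConditionalGeneration.from_pretrained('openai/whisper-tiny')
processor = WhisperProcessor.from_pretrained('openai/whisper-tiny')

# Whisper expects 16 kHz mono audio
audio, _ = librosa.load('audio.wav', sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors='pt').input_features

probs = detect_language(model, processor.tokenizer, input_features)[0]
print(max(probs, key=probs.get))  # e.g. '<|en|>'
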
xenova commented 1 year ago

Oh, that looks much simpler than I was expecting. I might need to make some modifications to the forward function of WhisperForConditionalGeneration in transformers.js, but the majority of the functionality needed is already in place.

Could you provide some example input and output so that I can make sure my implementation matches your example?

FelippeChemello commented 1 year ago

Is it possible to simply output the detected language as an event from the main pipeline?

The following output sample is different from the output of the previous code, but I believe it would be sufficient since it provides the probability of each language:

{"<LANG1>": 0.9, "<LANG2>": 0.05, "<LANG3>": 0.05}

Furthermore, an important point to consider is whether to pass only a chunk of audio to this function to make detection faster (see the sketch below), or to pass every chunk and return the language probabilities along with each chunk_callback.

What are your thoughts on this structure? Is it possible?
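
To illustrate the first option: detection could run on just the first 30-second window (Whisper's native input size) before transcription proceeds as usual. A sketch, reusing the model, processor, and audio variables from the earlier example:

# Run detection on only the first 30 s (Whisper's native window size);
# `audio`, `processor`, and `model` are assumed from the example above.
first_chunk = audio[:16000 * 30]  # 16 kHz mono samples * 30 seconds
features = processor(first_chunk, sampling_rate=16000, return_tensors='pt').input_features
probs = detect_language(model, processor.tokenizer, features)[0]
detected = max(probs, key=probs.get)  # e.g. '<|pt|>'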