FelippeChemello opened 1 year ago
Hi there! 👋 Definitely a possibility I'd say! I assume the original (openai) library analyses attention scores across the different language tokens and picks the most likely one. Do you perhaps have example code for how to achieve this with the Python `transformers` library?
Hi,

I found this code in a thread discussion on Hugging Face. It's not part of the default implementation of the `transformers` library, but perhaps it could be added to the library to make it easier to obtain this information.
```python
from typing import Collection, Dict, List, Optional

import torch
from transformers import WhisperForConditionalGeneration, WhisperTokenizer


def detect_language(model: WhisperForConditionalGeneration, tokenizer: WhisperTokenizer, input_features,
                    possible_languages: Optional[Collection[str]] = None) -> List[Dict[str, float]]:
    # hacky, but all language tokens and only language tokens are 6 characters long
    language_tokens = [t for t in tokenizer.additional_special_tokens if len(t) == 6]
    if possible_languages is not None:
        language_tokens = [t for t in language_tokens if t[2:-2] in possible_languages]
        if len(language_tokens) < len(possible_languages):
            raise RuntimeError(f'Some languages in {possible_languages} did not have associated language tokens')

    language_token_ids = tokenizer.convert_tokens_to_ids(language_tokens)

    # 50258 is the <|startoftranscript|> token; the model predicts the language token directly after it
    logits = model(input_features,
                   decoder_input_ids=torch.tensor([[50258] for _ in range(input_features.shape[0])])).logits

    # mask out everything except the language tokens, then softmax over what remains
    mask = torch.ones(logits.shape[-1], dtype=torch.bool)
    mask[language_token_ids] = False
    logits[:, :, mask] = -float('inf')

    output_probs = logits.softmax(dim=-1).cpu()
    return [
        {
            lang: output_probs[input_idx, 0, token_id].item()
            for token_id, lang in zip(language_token_ids, language_tokens)
        }
        for input_idx in range(logits.shape[0])
    ]
```
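For reference, the token-filtering trick in that function can be illustrated without loading a model. The token list below is a hand-written assumption for illustration (Whisper's multilingual tokenizer exposes language tokens of the form `<|en|>` among its special tokens):

```python
# Sketch of the filtering logic used in detect_language, on a made-up token
# list rather than a real tokenizer's additional_special_tokens.
additional_special_tokens = ["<|en|>", "<|pt|>", "<|fr|>", "<|translate|>", "<|transcribe|>"]

# All language tokens (and only language tokens) are 6 characters long: "<|" + 2 letters + "|>"
language_tokens = [t for t in additional_special_tokens if len(t) == 6]
print(language_tokens)  # ['<|en|>', '<|pt|>', '<|fr|>']

# t[2:-2] strips the "<|" and "|>" markers, leaving the bare language code
language_codes = [t[2:-2] for t in language_tokens]
print(language_codes)  # ['en', 'pt', 'fr']
```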
Oh, that looks much simpler than what I was expecting. I might need to make some modifications to the forward function for `WhisperForConditionalGeneration` in transformers.js, but the majority of the functionality needed is already done.
Could you provide some example input and output so that I can make sure my implementation matches your example?
Is it possible to simply output the detected language as an event from the main pipeline?
The following output sample differs from the output of the code above, but I believe it would be sufficient, since it provides the probability of each language:
{"<LANG1>": 0.9, "<LANG2>": 0.05, "<LANG3>": 0.05}
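Given a probability map in that shape, the consumer can pick the detected language with a simple argmax. A minimal sketch in Python (the probability values are made up, not real model output):

```python
# Pick the most probable language from a {token: probability} map,
# e.g. one of the per-input dicts returned by the detect_language snippet above.
probs = {"<|en|>": 0.9, "<|pt|>": 0.05, "<|fr|>": 0.05}  # example values only

best_token = max(probs, key=probs.get)  # key with the highest probability
print(best_token)        # <|en|>
print(best_token[2:-2])  # en
```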
Furthermore, an important point to consider is whether to run this function on a single chunk of audio (to make it faster) or on every chunk, returning the language probabilities along with each chunk_callback.
What are your thoughts on this structure? Is it possible?
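On the single-chunk option: running detection only on the leading audio is cheap, because Whisper's feature extractor operates on a 30-second window anyway. A minimal sketch, assuming 16 kHz mono audio as a raw sample sequence (the helper name is made up for illustration):

```python
SAMPLE_RATE = 16_000   # Whisper models expect 16 kHz mono audio
WINDOW_SECONDS = 30    # Whisper's feature extractor uses 30 s windows

def leading_window(waveform):
    """Return only the first 30 s of audio: enough for a single
    language-detection pass, instead of processing the whole file."""
    return waveform[: SAMPLE_RATE * WINDOW_SECONDS]

# 2 minutes of (dummy) audio -> only the first 30 s would be fed to detection
dummy_audio = [0.0] * (SAMPLE_RATE * 120)
print(len(leading_window(dummy_audio)) / SAMPLE_RATE)  # 30.0
```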
**Name of the feature**
Language Detection with the Whisper Model in Transformers.js

**Reason for request**
The original Whisper model includes a dedicated function for language detection. It would be awesome to have a similar capability within Transformers.js. In my current application, Whisper models run efficiently in the browser. However, for language detection, I find myself repeatedly making requests to my backend service.

**Additional context**
Is it possible to implement language detection as an event returned from the pipeline?