neuml / txtai

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
https://neuml.github.io/txtai
Apache License 2.0

Issue with Language Specific Transcription Using txtai and Whisper #593

Closed Nondzu closed 2 months ago

Nondzu commented 1 year ago

Environment

Description

I'm attempting to transcribe Polish audio using the Whisper model within txtai, and while I am able to get transcriptions, they appear to be in English rather than the native language of the audio.

Here's a snippet of the code I'm using:

from txtai.pipeline import Transcription

# files: paths of the audio files to transcribe (defined elsewhere)
transcribe = Transcription("openai/whisper-large-v2")
for text in transcribe(files):
    print(text)

Questions

  1. Does txtai's transcription feature automatically translate the text to English, or is it supposed to return text in the language of the audio?
  2. How can I disable any automatic translation feature or specify the language of the audio to ensure that the transcription is in Polish?

Any guidance or suggestions on this matter would be greatly appreciated.

Thank you!

Nondzu commented 1 year ago

[image attachment]

davidmezzetti commented 1 year ago

It's possible Whisper runs the translation task by default. Here's an idea to test out using code from the model page.

from transformers import WhisperProcessor
from txtai.pipeline import Transcription

model = "openai/whisper-large-v2"
transcribe = Transcription(model)

# get_decoder_prompt_ids is an instance method, so load the processor first
processor = WhisperProcessor.from_pretrained(model)

# Test transcribe only
transcribe.pipeline.model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="polish", task="transcribe")

for text in transcribe(files):
    print(text)

If that works, I can add in a change that makes this more streamlined.
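
For background, the value assigned to forced_decoder_ids is a list of (position, token ID) pairs that pins the language and task tokens at the start of the decoder output. The sketch below illustrates only that shape. The decoder_prompt_ids helper and the TOY_VOCAB table are hypothetical, and the token IDs are made up; they are not the real Whisper vocabulary IDs or the transformers implementation.

```python
# Toy special-token table. These IDs are invented for illustration and are
# NOT the real Whisper vocabulary IDs.
TOY_VOCAB = {
    "<|startoftranscript|>": 0,
    "<|en|>": 1,
    "<|pl|>": 2,
    "<|translate|>": 3,
    "<|transcribe|>": 4,
    "<|notimestamps|>": 5,
}


def decoder_prompt_ids(language_code, task, vocab=TOY_VOCAB):
    """Return (position, token_id) pairs for the tokens that follow
    the start-of-transcript token at position 0."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task}")

    tokens = [f"<|{language_code}|>", f"<|{task}|>", "<|notimestamps|>"]

    # Positions start at 1 because position 0 holds <|startoftranscript|>
    return [(position, vocab[token]) for position, token in enumerate(tokens, start=1)]


polish_transcribe = decoder_prompt_ids("pl", "transcribe")
polish_translate = decoder_prompt_ids("pl", "translate")
```

In this toy setup the two prompts differ only in the task token at position 2, which is the difference between getting Polish text back and getting an English translation.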

Nondzu commented 1 year ago

@davidmezzetti thank you for the help. After a small modification, this code works fine:

from transformers import WhisperProcessor
from txtai.pipeline import Transcription  # was: from txtai.transcription import Transcription

# model = "openai/whisper-large-v2"
model = "bardsai/whisper-large-v2-pl-v2"
transcribe = Transcription(model)
processor = WhisperProcessor.from_pretrained(model)

# Test transcribe only
transcribe.pipeline.model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="polish", task="transcribe")

for text in transcribe(files):
    print(text)

[image attachment]

davidmezzetti commented 1 year ago

Thanks for confirming. I'll keep this issue open and add an argument to disable automatic translation.
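
The workaround confirmed in this thread can be wrapped in a small helper. This is a minimal sketch, not txtai's API: force_language and polish_transcriber are hypothetical names, and it assumes the Transcription pipeline exposes the underlying model at transcribe.pipeline.model, as the snippets above do. Behavior of forced_decoder_ids may differ between transformers versions.

```python
def force_language(transcribe, processor, language, task="transcribe"):
    """Pin the Whisper decoder prompt on a txtai Transcription pipeline.

    Hypothetical helper: it applies the same assignment used in this thread.
    """
    ids = processor.get_decoder_prompt_ids(language=language, task=task)
    transcribe.pipeline.model.config.forced_decoder_ids = ids
    return ids


def polish_transcriber(model="bardsai/whisper-large-v2-pl-v2"):
    """Build a Transcription pipeline that transcribes Polish audio as Polish.

    Imports are local so that force_language can be used without loading
    transformers or txtai. Calling this function downloads the model.
    """
    from transformers import WhisperProcessor
    from txtai.pipeline import Transcription

    transcribe = Transcription(model)
    processor = WhisperProcessor.from_pretrained(model)
    force_language(transcribe, processor, "polish")
    return transcribe
```

With this in place, the loop from the thread becomes `for text in polish_transcriber()(files): print(text)`, where files is the same list of audio paths used above.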