m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

WhisperX returns translated output instead of a normal transcription #849

Open BankNatchapol opened 3 months ago

BankNatchapol commented 3 months ago

I tried to use a fine-tuned model with WhisperX, so I first converted the model using this code:

import ctranslate2

# Convert the fine-tuned Hugging Face model to CTranslate2 format.

model_path = "biodatlab/whisper-th-large-v3"
output_dir = ""

converter = ctranslate2.converters.TransformersConverter(
    model_name_or_path=model_path,
    load_as_float16=None
)

converter.convert(output_dir=output_dir, quantization="float16", force=True)
print(f"Model successfully converted to CTranslate2 format at {output_dir}")
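
One thing that may be worth checking after a conversion like this is which files actually ended up in the output directory. The sketch below is only a diagnostic aid: the helper name `missing_model_files` and the `EXPECTED_FILES` list are hypothetical, and the list reflects an assumption about what a faster-whisper style model directory usually contains, not something stated in this report.

```python
from pathlib import Path

# Assumed list of files a converted Whisper model directory usually holds.
# This list is illustrative and is not taken from the issue.
EXPECTED_FILES = (
    "model.bin",
    "config.json",
    "tokenizer.json",
    "preprocessor_config.json",
)


def missing_model_files(model_dir):
    """Return the names from EXPECTED_FILES that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_model_files(output_dir)` after the conversion lists anything that was not written. If `tokenizer.json` turns out to be missing, the `copy_files` argument of `TransformersConverter` (for example `copy_files=["tokenizer.json", "preprocessor_config.json"]`) may be worth trying, provided the source model repository ships those files.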

Then I ran the transcription:

import whisperx

lang = "th"
device = "cuda"

# WhisperX settings
batch_size_x = 8  # reduce if low on GPU memory
compute_type_x = "float16"

asr_options = {
    "max_new_tokens": None,
    "clip_timestamps": None,
    "hallucination_silence_threshold": None,
}

model_x = whisperx.load_model(
    "",
    device,
    compute_type=compute_type_x,
    language=lang,
    asr_options=asr_options,
)

print(model_x.transcribe("test.wav", language=lang, task="transcribe"))

The output should be in Thai ('th'), but it is mostly English ('en'):

..., {'text': ' When you try, you will find a gap in between. When you miss, you will sit on the hard floor. It makes you want to get up and fight. But when you fight and you start to get something, you will find a gap. This gap is a trap. Some people are good at it, but they miss it. Some people are good at it, but they miss it. Some people are good at it, but they miss it. Some people are good at it, but they miss it. Some people are good at it, but they miss it.', 'start': 1120.026, 'end': 1140.435}, {'text': '       ', 'start': 1140.435, 'end': 1159.053}, {'text': '          ', 'start': 1159.053, 'end': 1185.282}, {'text': ' The best in Thailand, the first in the Olympic life, has been reading books all the time. He has won the Olympic gold medal.', 'start': 1185.282, 'end': 1203.08}], 'language': 'th'}

The speaker in the wav file is speaking Thai ('th'), but the transcription comes back as an English translation of the speech.
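
A possible explanation, not confirmed in this report, is a tokenizer mismatch. In the Whisper vocabularies the task tokens come right after the language tokens, and large-v3 defines one more language token than the earlier multilingual models. If the converted model directory has no `tokenizer.json` and the loader falls back to an older multilingual tokenizer (an assumption about the loading path), the id sent for `<|transcribe|>` would be the id that a large-v3 based model reads as `<|translate|>`. The sketch below only illustrates that arithmetic; the function name `task_token_ids` is hypothetical.

```python
# In the multilingual Whisper vocabularies, language tokens sit between
# <|startoftranscript|> and the task tokens, so the task ids depend on
# how many language tokens the vocabulary defines.
SOT_ID = 50258  # id of <|startoftranscript|> in the multilingual vocabularies


def task_token_ids(num_languages):
    """Return the ids of the two task tokens for a given language count."""
    translate = SOT_ID + num_languages + 1
    transcribe = translate + 1
    return {"translate": translate, "transcribe": transcribe}


older = task_token_ids(99)      # multilingual vocabulary before large-v3
large_v3 = task_token_ids(100)  # large-v3 adds one language token

print(older)     # {'translate': 50358, 'transcribe': 50359}
print(large_v3)  # {'translate': 50359, 'transcribe': 50360}

# The older "transcribe" id equals the large-v3 "translate" id.
print(older["transcribe"] == large_v3["translate"])  # True
```

If this is what is happening here, making sure the tokenizer that ships with the fine-tuned model is the one loaded at inference time would be the thing to verify first.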

Is there any way to fix this? Thank you in advance.