m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

Is there a way to not load model between each transcription? #528

Open victor-upmeet opened 1 year ago

victor-upmeet commented 1 year ago

If I load the model, transcribe a first audio file, and then transcribe a second one, I run into a problem when the two files are not in the same language: the detected language is somehow attached to the model, and the second file is transcribed on the assumption that it is in the same language as the first. As I understand it, I currently need to reload the model before each new file. The model loads in approximately 1.5 seconds on my server, so is there a way to load it only once and then transcribe audio files as they come in? That would speed up transcription considerably when there are many small files to process.

RaulKite commented 1 year ago

Use Python instead of the command line. But you have to program it yourself.

victor-upmeet commented 1 year ago

I do use Python, and currently have the problem mentioned:

import whisperx

first_audio_file = "first.wav"    # placeholder paths
second_audio_file = "second.wav"

device = "cuda"           # or "cpu"
compute_type = "float16"  # e.g. "int8" for CPU inference
batch_size = 16

# Load the model once
model = whisperx.load_model("large-v2", device, compute_type=compute_type)

# Load and transcribe a first audio file
first_audio = whisperx.load_audio(first_audio_file)
first_result = model.transcribe(first_audio, batch_size=batch_size)

# At this point the language has been detected (since it was not specified
# in load_model), so the following call transcribes the second file on the
# assumption that it is in the same language as the first
second_audio = whisperx.load_audio(second_audio_file)
second_result = model.transcribe(second_audio, batch_size=batch_size)

I have tried this code snippet with a first audio file in English and a second audio file in French. For the second file, the language is not re-detected; it is assumed to be English.

Therefore, if I want correct results, I need to call load_model each time I have a new audio file to transcribe.

My question is: why does it work this way (OpenAI's Whisper and faster-whisper do not have this issue; you can call load_model once and then transcribe multiple audio files in different languages), and is there a way to change this behaviour to save compute/inference time?
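For reference, here is a minimal sketch of the faster-whisper behaviour referred to above: language detection runs independently on every transcribe() call, so one model instance handles files in different languages (the file names are placeholders):

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# Each call detects the language independently
segments, info = model.transcribe("audio_en.wav")  # placeholder path
print(info.language)  # e.g. "en"

segments, info = model.transcribe("audio_fr.wav")  # placeholder path
print(info.language)  # e.g. "fr", re-detected for this file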

sorgfresser commented 1 year ago

You actually just have to replace the tokenizer; see this line

e.g. model.tokenizer = ... and adjust model.preset_language
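A minimal sketch of that suggestion, assuming the whisperX pipeline caches the language in the model.tokenizer and model.preset_language attributes mentioned above; clearing both should force transcribe() to re-run language detection on the next file (the helper name and file paths are made up for illustration):

import whisperx

def transcribe_fresh(model, audio_file, batch_size=16):
    # Hypothetical helper: drop the tokenizer built for the previous file's
    # language so that transcribe() detects the language again and builds
    # a new tokenizer for this file.
    audio = whisperx.load_audio(audio_file)
    model.tokenizer = None
    model.preset_language = None
    return model.transcribe(audio, batch_size=batch_size)

model = whisperx.load_model("large-v2", "cuda", compute_type="float16")
first_result = transcribe_fresh(model, "audio_en.wav")   # placeholder paths
second_result = transcribe_fresh(model, "audio_fr.wav")

Alternatively, if the language of each file is known in advance, a tokenizer for that language can be assigned to model.tokenizer directly, which skips detection entirely.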