Closed: BBC-Esq closed this issue 9 months ago
Hi @BBC-Esq, if you plan to only use the CTranslate2 backend, you can also skip nvidia-ml-py, accelerate, and optimum (I guess transformers as well). Normally torch should not be an issue for the CTranslate2 backend. PyTorch is used for the jitted VAD model and for handling CUDA tensors in this package, and the PyTorch team generally ensures backward compatibility of jitted models, so normally there shouldn't be any issue whichever torch version you use.
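For reference, a minimal dependency sketch for that CTranslate2-only setup, based purely on the advice above; the package set is illustrative (and assumes the PyPI name is whisper-s2t), not copied from the repo's requirements.txt:
# minimal requirements sketch (illustrative only, not from the repo)
torch          # any reasonably recent release, per the note above
ctranslate2
whisper-s2t    # nvidia-ml-py, accelerate, optimum, and transformers
               # omitted per the advice above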
Thanks. I can't remember if ctranslate2 requires transformers or not... either way, my other libraries require it, so it stays.
Can you do me another favor? I'm looking at model.py and the __init__.py (which defines the WhisperModel(ABC) class), trying to understand them as a non-programmer by trade. I'm pulling out the parameters so I can more easily set them from a config.yaml file, which I'll integrate into a GUI. Here's the relevant portion:
import whisper_s2t

model_kwargs = {
    'compute_type': 'float16',
    # Decoding options passed through to the backend
    'asr_options': {
        "beam_size": 5,
        "best_of": 1,
        "patience": 2,
        "length_penalty": 1,
        "repetition_penalty": 1.01,
        "no_repeat_ngram_size": 0,
        "compression_ratio_threshold": 2.4,
        "log_prob_threshold": -1.0,
        "no_speech_threshold": 0.5,
        "prefix": None,
        "suppress_blank": True,
        "suppress_tokens": [-1],
        "without_timestamps": True,
        "max_initial_timestamp": 1.0,
        "word_timestamps": False,
        "sampling_temperature": 1.0,
        "return_scores": True,
        "return_no_speech_prob": True,
        "word_aligner_model": 'tiny',
        # "max_length": 256,
        # "max_text_token_len": 1024,
    },
    'model_identifier': "large-v2",
    'backend': 'CTranslate2',
    # "device": "cuda",
    # "device_index": 0,
    # "cpu_threads": 4,
    # "num_workers": 1,
}

model = whisper_s2t.load_model(**model_kwargs)
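Since the goal is driving these from a config.yaml, here is a minimal sketch of loading the same kwargs from YAML, assuming PyYAML is installed and the YAML file mirrors the dict above (note that Python's None is spelled null in YAML):
import yaml  # PyYAML
import whisper_s2t

# Load the same kwargs structure from a YAML file and pass it through.
with open('config.yaml', 'r', encoding='utf-8') as f:
    model_kwargs = yaml.safe_load(f)

model = whisper_s2t.load_model(**model_kwargs)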
I think I've gotten all of the parameters in one place... Now I'm working on using the transcribe method instead of transcribe_with_vad. Here's my current call:
out = model.transcribe_with_vad(files,
                                lang_codes=lang_codes,
                                tasks=tasks,
                                initial_prompts=initial_prompts,
                                batch_size=48)
And here's what I tried:
out = model.transcribe(files,
                       lang_codes=lang_codes,
                       tasks=tasks,
                       initial_prompts=initial_prompts,
                       batch_size=48)
But I got this error:
File "C:\PATH\Scripts\test-whisper_s2t\Lib\site-packages\whisper_s2t\backends\__init__.py", line 117, in transcribe
res = self.generate_segment_batched(mels.to(self.device), prompts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: WhisperModelCT2.generate_segment_batched() missing 2 required positional arguments: 'seq_lens' and 'seg_metadata'
Obviously I need to better understand the different requirements of the two methods...any advice to speed me along?
Also, I was curious why the transcribe_with_vad method can use tqdm while the transcribe method does not. Something to do with VAD itself? Thanks again for the great repository.
I saw the recent pull requests and replaced my pip-installed source code with the latest from the repository... and it worked. However, I'm still wondering how to get timestamps... I've tried toggling True and False, etc.
Generally transcribe_with_vad will give better accuracy for languages supported by the VAD model. For some languages the VAD model can give poor performance; for those cases, it's better to use model.transcribe.
For word timestamps: https://github.com/shashikg/WhisperS2T/blob/main/docs.md#return-word-alignments
Parameters look good to me. You can cross-check here for the CTranslate2 backend: https://github.com/shashikg/WhisperS2T/blob/main/whisper_s2t/backends/ctranslate2/model.py#L61 https://github.com/shashikg/WhisperS2T/blob/main/whisper_s2t/backends/ctranslate2/model.py#L14
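A minimal sketch of that decision in code, where the set of codes to route through plain transcribe is a hypothetical placeholder you would fill in from your own testing, not a list from the repo:
# Hypothetical: language codes where the VAD model underperforms for you.
VAD_POOR_LANGS = {"xx", "yy"}  # placeholder codes, not real guidance

def transcribe_auto(model, files, lang_codes, tasks, initial_prompts, batch_size=48):
    # Fall back to plain transcribe() if any requested language is on the
    # list; otherwise prefer transcribe_with_vad() for its better accuracy.
    method = (model.transcribe
              if any(code in VAD_POOR_LANGS for code in lang_codes)
              else model.transcribe_with_vad)
    return method(files,
                  lang_codes=lang_codes,
                  tasks=tasks,
                  initial_prompts=initial_prompts,
                  batch_size=batch_size)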
Seems like I only need to set word_timestamps to True in my script... but when I do, I still get a .txt file with no timestamps. Here's a small portion:
We have been a misunderstood and badly mocked org for a long time. When we started, we announced the org at the end of 2015 and said we were going to work on AGI, people thought we were batshit insane. I remember at the time, an eminent AI scientist at a large industrial AI lab was DMing individual reporters...
Is there some kind of parsing I have to do? Sorry, I'm just used to faster-whisper, where you can specify timestamps and it puts each segment on a single line with the starting time for that segment at the beginning... Not sure if I have to parse this...
Ok, GPT-4 helped a little after being fed your examples. I set without_timestamps to False and word_timestamps to True and used the following modification:
# Concatenate the text from all utterances
transcription = " ".join([utt['text'] for utt in out[0]]).strip()

# Collect word-level timestamps
timestamps = []
for item in out[0]:  # focusing on the first file
    for word_info in item.get('word_timestamps', []):
        timestamps.append(f"{word_info['word']} ({word_info['start']}-{word_info['end']})")

# Combine transcription and timestamps
transcription_with_timestamps = transcription + "\n\nTimestamps:\n" + " ".join(timestamps)

# Save to file
with open('transcription_with_timestamps.txt', 'w') as f:
    f.write(transcription_with_timestamps)
However, it just piled the words and their timestamps at the end of the transcription.
Word, or even segment, timestamps aren't that crucial for my program... Is there a way to have it automatically start a line for each segment with just the start time, like faster-whisper or whisperx, or would I have to parse this data and construct the timestamp formatting myself? If the latter, I'll just go with the non-timestamped transcription. Thanks!
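For what it's worth, a rough sketch of doing that formatting by hand, assuming each utterance dict in out[0] carries 'start_time' and 'text' keys (the key names are an assumption; inspect your actual output before relying on them):
def seconds_to_hms(seconds):
    # Format a float number of seconds as HH:MM:SS.
    m, s = divmod(int(seconds), 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

with open('transcription_segments.txt', 'w', encoding='utf-8') as f:
    for utt in out[0]:  # first input file
        start = utt.get('start_time', 0.0)  # 'start_time' is assumed, not verified
        f.write(f"[{seconds_to_hms(start)}] {utt['text'].strip()}\n")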
You can use the following:
whisper_s2t.write_outputs(out, format='txt', ip_files=files, save_dir="./save_dir")
whisper_s2t.write_outputs(out, format='txt', ip_files=files, save_dir="./save_dir", single_sentence_in_one_utterance=True, end_punct_marks=["?", "."]) # if word alignment enabled
More: https://github.com/shashikg/WhisperS2T/blob/main/docs.md#write-transcripts-to-a-file
I tried...
Traceback (most recent call last):
File "C:\PATH\Scripts\test-whisper_s2t\test_whisper_s2t.py", line 49, in <module>
whisper_s2t.write_outputs(out, format='txt', ip_files=files, save_dir="./save_dir")
File "C:\PATH\Scripts\test-whisper_s2t\Lib\site-packages\whisper_s2t\utils.py", line 114, in write_outputs
TranscriptExporter[format](transcript, file_name)
~~~~~~~~~~~~~~~~~~^^^^^^^^
KeyError: 'txt'
out = model.transcribe(files,
                       lang_codes=lang_codes,
                       tasks=tasks,
                       initial_prompts=initial_prompts,
                       batch_size=48)
whisper_s2t.write_outputs(out, format='txt', ip_files=files, save_dir="./save_dir")
whisper_s2t.write_outputs(out, format='txt', ip_files=files, save_dir="./save_dir", single_sentence_in_one_utterance=True, end_punct_marks=["?", "."]) # if word alignment enabled
Please pull the latest commits from the main branch. https://github.com/shashikg/WhisperS2T/pull/34
All questions/concerns addressed - closing.
In an effort to incorporate your awesome library into my program, I'm trying to make sure that all the versions of my dependencies will work with yours. Would you mind specifying which versions your program requires, as well as the supported ranges? For example, you just state "torch," "accelerate," and "transformers" without specifying a release version (or a range).
For example, I'm currently using faster-whisper==0.10.0. This version only supports CUDA 11.8 because the maximum version of ctranslate2 it supports is 3.24. CTranslate2 4.0 has come out, which supports CUDA 12. Finally, faster-whisper released version 1.0 today, which can use ctranslate2 4.0, but it's not on pypi.org yet...
Rather than wait for it to be uploaded to pypi.org, I'd like to switch to your program instead... it's faster anyway...
Here are my planned dependencies (including your library, of course); if you could please let me know of any conflicts with your library, I'd appreciate it:
torch==2.2.0+cu121
torchvision==0.17.0+cu121
torchaudio==2.2.0+cu121
accelerate==0.25.0
optimum==1.15.0
numpy==1.26.4
tokenizers==0.15.2
huggingface_hub==0.20.3
transformers==4.37.2
openai==1.12.0 (not openai-whisper)
nvidia-ml-py==12.535.133
My program of course has other dependencies that are installed (i.e., dependencies of dependencies), but these are all of the ones that are also listed as dependencies in your requirements.txt file. PLEASE keep in mind that I would solely be using the ctranslate2 backend in your program. Thus, I would not need flash attention 2, for example, since ctranslate2 doesn't use it like I'm assuming your huggingface backend does. Any advice is much appreciated. Thanks.
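As a quick way to cross-check, here's a small helper (my own, not part of WhisperS2T) that prints the installed versions of the packages listed above so they can be compared against the repo's requirements.txt:
from importlib.metadata import version, PackageNotFoundError

# Packages from the pinned list above; compare the printed versions
# against whisper_s2t's requirements.txt before switching over.
packages = [
    "torch", "torchvision", "torchaudio", "accelerate", "optimum",
    "numpy", "tokenizers", "huggingface_hub", "transformers",
    "openai", "nvidia-ml-py",
]

for name in packages:
    try:
        print(f"{name}=={version(name)}")
    except PackageNotFoundError:
        print(f"{name}: not installed")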