shashikg / WhisperS2T

An Optimized Speech-to-Text Pipeline for the Whisper Model Supporting Multiple Inference Engines
MIT License

dependency conflicts, please help me use your library! #29

Closed. BBC-Esq closed this issue 9 months ago.

BBC-Esq commented 9 months ago

In an effort to incorporate your awesome library into my program, I'm trying to make sure that all the versions of my dependencies will work with yours. Would you mind specifying which versions (or version ranges) your program requires? For example, you just state "torch," "accelerate," and "transformers" without specifying a release version or a range.

For example, I'm currently using faster-whisper==0.10.0. That version only supports CUDA 11.8 because the highest ctranslate2 release it supports is 3.24.

CTranslate2 4.0 has since come out, which supports CUDA 12. Finally, faster-whisper released version 1.0 today, which can use ctranslate2 4.0, but it's not on pypi.org yet...

Rather than wait for it to be uploaded to pypi.org, I'd like to switch to your program instead...it's faster anyways...

Here are my planned dependencies (including your library, of course); if you could please let me know of any conflicts with your library, I'd appreciate it:

torch==2.2.0+cu121
torchvision==0.17.0+cu121
torchaudio==2.2.0+cu121
accelerate==0.25.0
optimum==1.15.0
numpy==1.26.4
tokenizers==0.15.2
huggingface_hub==0.20.3
transformers==4.37.2
openai==1.12.0 (not openai-whisper)
nvidia-ml-py==12.535.133

My program of course has other dependencies that are installed (i.e. dependencies of dependencies), but these are all of the ones that are also listed as dependencies in your requirements.txt file.

PLEASE keep in mind that I would solely be using the ctranslate2 backend in your program. Thus, I would not need flash attention 2, for example, since ctranslate2 doesn't use it, whereas I assume your huggingface backend does. Any advice is much appreciated. Thanks.

shashikg commented 9 months ago

Hi @BBC-Esq, if you plan to only use the CTranslate2 backend, you can also skip nvidia-ml-py, accelerate, and optimum (I guess transformers as well). Normally torch should not be an issue for the CTranslate2 backend. PyTorch is used for the jitted VAD model and for handling CUDA tensors in this package, and the PyTorch people generally ensure backward compatibility of jitted models, so there shouldn't be any issue whatever torch version you use.
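
For reference, a minimal sketch of what that trimmed-down install might look like (pins taken from the question above; whisper-s2t as the PyPI name is an assumption, and this is not a maintainer-tested list):

# hypothetical minimal requirements for a CTranslate2-only setup
whisper-s2t            # assumed PyPI name for this repo
ctranslate2>=4.0       # per this thread, 4.x adds CUDA 12 support
torch==2.2.0+cu121     # only needed for the jitted VAD model / CUDA tensors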

BBC-Esq commented 9 months ago

> Hi @BBC-Esq, if you plan to only use the CTranslate2 backend, you can also skip nvidia-ml-py, accelerate, and optimum (I guess transformers as well). Normally torch should not be an issue for the CTranslate2 backend. PyTorch is used for the jitted VAD model and for handling CUDA tensors in this package, and the PyTorch people generally ensure backward compatibility of jitted models, so there shouldn't be any issue whatever torch version you use.

Thanks, I can't remember whether ctranslate2 requires transformers or not...either way, my other libraries require it...

Can you do me another favor? I'm looking at model.py and the __init__.py (which defines the WhisperModel(ABC) class)...trying to understand them as a non-programmer by trade.

I'm pulling out the parameters so I can more easily set them from a config.yaml file...will integrate into a GUI...here's the relevant portion:

import whisper_s2t

model_kwargs = {
    'compute_type': 'float16',
    'asr_options': {
        "beam_size": 5,
        "best_of": 1,
        "patience": 2,
        "length_penalty": 1,
        "repetition_penalty": 1.01,
        "no_repeat_ngram_size": 0,
        "compression_ratio_threshold": 2.4,
        "log_prob_threshold": -1.0,
        "no_speech_threshold": 0.5,
        "prefix": None,
        "suppress_blank": True,
        "suppress_tokens": [-1],
        "without_timestamps": True,
        "max_initial_timestamp": 1.0,
        "word_timestamps": False,
        "sampling_temperature": 1.0,
        "return_scores": True,
        "return_no_speech_prob": True,
        "word_aligner_model": 'tiny',
        # "max_length": 256,
        # "max_text_token_len": 1024,
    },
    'model_identifier': "large-v2",
    'backend': 'CTranslate2',
    # "device": "cuda",
    # "device_index": 0,
    # "cpu_threads": 4,
    # "num_workers": 1,
}

model = whisper_s2t.load_model(**model_kwargs)
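
For the config.yaml step, here's a minimal sketch (assuming PyYAML; the YAML file simply mirrors the model_kwargs dict above, and the file name is illustrative):

import yaml
import whisper_s2t

# Load the same kwargs from a YAML file that mirrors the dict above.
with open('config.yaml') as f:
    model_kwargs = yaml.safe_load(f)

model = whisper_s2t.load_model(**model_kwargs)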

I think I've gotten all of the parameters in one place...Now I'm working on using the "transcribe" method instead of the "transcribe_with_vad".

Here's my current:

out = model.transcribe_with_vad(files,
                                lang_codes=lang_codes,
                                tasks=tasks,
                                initial_prompts=initial_prompts,
                                batch_size=48)

And here's what I tried:

out = model.transcribe(files,
                       lang_codes=lang_codes,
                       tasks=tasks,
                       initial_prompts=initial_prompts,
                       batch_size=48)

But I got this error:

  File "C:\PATH\Scripts\test-whisper_s2t\Lib\site-packages\whisper_s2t\backends\__init__.py", line 117, in transcribe
    res = self.generate_segment_batched(mels.to(self.device), prompts)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: WhisperModelCT2.generate_segment_batched() missing 2 required positional arguments: 'seq_lens' and 'seg_metadata'

Obviously I need to better understand the different requirements of the two methods...any advice to speed me along?

Also, I was curious why the transcribe_with_vad method shows a tqdm progress bar while the transcribe method does not? Something to do with VAD itself? Thanks again for the great repository.

BBC-Esq commented 9 months ago

I saw the recent pull requests and replaced my pip-installed source code with the latest from the repository...and it worked. However, I'm still wondering how to get timestamps...I've tried toggling the various True/False options etc...

shashikg commented 9 months ago

Generally, transcribe_with_vad will give better accuracy for languages supported by the VAD model. For some languages the VAD model can give poor performance; for those cases, it's better to use model.transcribe.

For word timestamps: https://github.com/shashikg/WhisperS2T/blob/main/docs.md#return-word-alignments
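
A minimal sketch of that language-based selection (VAD_FRIENDLY_LANGS is an assumed, hand-maintained set, not something the library exposes):

# Hypothetical routing based on the advice above; the language set
# is an assumption you'd maintain yourself.
VAD_FRIENDLY_LANGS = {"en", "fr", "de", "es"}

if all(code in VAD_FRIENDLY_LANGS for code in lang_codes):
    out = model.transcribe_with_vad(files,
                                    lang_codes=lang_codes,
                                    tasks=tasks,
                                    initial_prompts=initial_prompts,
                                    batch_size=48)
else:
    out = model.transcribe(files,
                           lang_codes=lang_codes,
                           tasks=tasks,
                           initial_prompts=initial_prompts,
                           batch_size=48)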

shashikg commented 9 months ago

Parameters look good to me. You can cross-check here for the CTranslate2 backend: https://github.com/shashikg/WhisperS2T/blob/main/whisper_s2t/backends/ctranslate2/model.py#L61 https://github.com/shashikg/WhisperS2T/blob/main/whisper_s2t/backends/ctranslate2/model.py#L14

BBC-Esq commented 9 months ago

> Generally, transcribe_with_vad will give better accuracy for languages supported by the VAD model. For some languages the VAD model can give poor performance; for those cases, it's better to use model.transcribe.
>
> For word timestamps: https://github.com/shashikg/WhisperS2T/blob/main/docs.md#return-word-alignments

Seems like I only need to change word_timestamps to True in my script...but when I do, I still get a .txt file with no timestamps. Here's a small portion:

We have been a misunderstood and badly mocked org for a long time. When we started, we announced the org at the end of 2015 and said we were going to work on AGI, people thought we were batshit insane. I remember at the time, an eminent AI scientist at a large industrial AI lab was DMing individual reporters...

Is there some kind of parsing I have to do? Sorry, I'm just used to faster-whisper where you can specify timestamps and it puts each segment on a single line with the starting time for that segment at the beginning...Not sure if I have to parse this...


BBC-Esq commented 9 months ago

Ok, GPT-4 helped a little after being fed your examples. I set without_timestamps to False and word_timestamps to True, and used the following modification:

# Concatenate the text from all utterances
transcription = " ".join([_['text'] for _ in out[0]]).strip()

# Process timestamps
timestamps = []
for item in out[0]:  # Assuming you're focusing on the first file
    for word_info in item.get('word_timestamps', []):
        timestamps.append(f"{word_info['word']} ({word_info['start']}-{word_info['end']})")

# Combine transcription and timestamps
transcription_with_timestamps = transcription + "\n\nTimestamps:\n" + " ".join(timestamps)

# Save to file
with open('transcription_with_timestamps.txt', 'w') as f:
    f.write(transcription_with_timestamps)

However, it just piled the words and their timestamps at the end of the transcription.

Word, or even segment, timestamps aren't that crucial for my program...Is there a way to have it automatically start a new line for each segment with just the start time, like faster-whisper or whisperx...or would I have to parse this data and construct the timestamp formatting myself? If the latter, I'll just go with the non-timestamped transcription. Thanks!
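
For reference, a minimal parsing sketch for that faster-whisper-style layout (assuming each utterance dict in out[0] carries a start time in seconds under a 'start_time' key; the key name is an assumption, so check your version's output):

# One line per utterance, prefixed with its start time.
# 'start_time' is an assumed key name; verify it against your output dicts.
lines = []
for utt in out[0]:
    start = utt.get('start_time', 0.0)
    lines.append(f"[{start:.2f}] {utt['text'].strip()}")

with open('transcription_segments.txt', 'w') as f:
    f.write("\n".join(lines))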

shashikg commented 9 months ago

You can use the following:

whisper_s2t.write_outputs(out, format='txt', ip_files=files, save_dir="./save_dir")

whisper_s2t.write_outputs(out, format='txt', ip_files=files, save_dir="./save_dir", single_sentence_in_one_utterance=True, end_punct_marks=["?", "."]) # if word alignment enabled

More: https://github.com/shashikg/WhisperS2T/blob/main/docs.md#write-transcripts-to-a-file

BBC-Esq commented 9 months ago

I tried...

Traceback (most recent call last):
  File "C:\PATH\Scripts\test-whisper_s2t\test_whisper_s2t.py", line 49, in <module>
    whisper_s2t.write_outputs(out, format='txt', ip_files=files, save_dir="./save_dir")
  File "C:\PATH\Scripts\test-whisper_s2t\Lib\site-packages\whisper_s2t\utils.py", line 114, in write_outputs
    TranscriptExporter[format](transcript, file_name)
    ~~~~~~~~~~~~~~~~~~^^^^^^^^
KeyError: 'txt'

Here's the relevant portion of my script:

out = model.transcribe(files,
                       lang_codes=lang_codes,
                       tasks=tasks,
                       initial_prompts=initial_prompts,
                       batch_size=48)

whisper_s2t.write_outputs(out, format='txt', ip_files=files, save_dir="./save_dir")

whisper_s2t.write_outputs(out, format='txt', ip_files=files, save_dir="./save_dir", single_sentence_in_one_utterance=True, end_punct_marks=["?", "."]) # if word alignment enabled

shashikg commented 9 months ago

Please pull the latest commits from the main branch. https://github.com/shashikg/WhisperS2T/pull/34

BBC-Esq commented 9 months ago

All questions/concerns addressed - closing.