linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence
GNU Affero General Public License v3.0

Consider Supporting CTranslate2 for faster inference #40

Open kamranjon opened 1 year ago

kamranjon commented 1 year ago

I recently learned about faster-whisper, which uses the CTranslate2 library for faster inference. It seems you need to convert the whisper models first, but it claims the same accuracy with a 4x speed improvement and reduced memory usage on both CPU and GPU.

I'm not sure if it would be feasible to support this but wanted to bring it up in case it was of interest. Feel free to close this issue if it is not possible.
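For reference, faster-whisper's transcription call looks roughly like this (a minimal sketch based on its README-style usage; model size, device and file name are just placeholders):

from faster_whisper import WhisperModel

# Load a CTranslate2-converted Whisper model (see faster-whisper's docs for conversion).
model = WhisperModel("small", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.wav", beam_size=5)
print("Detected language:", info.language, info.language_probability)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")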

Jeronymous commented 1 year ago

Thank you @kamranjon for letting us know :+1: I knew about whisper.cpp (which unfortunately does not work on GPU), but I did not yet know about faster-whisper. It's definitely worth having a look.

If the model has the same interface as Whisper model, it's actually straightforward to test with whisper_timestamped.
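For reference, the entry point takes the loaded model object directly, which is what would make a drop-in replacement easy to try. A minimal sketch of the usual call (model size and file name are placeholders):

import whisper_timestamped as whisper

# Any object exposing the same interface as openai-whisper's model could,
# in principle, be passed here instead.
model = whisper.load_model("tiny", device="cpu")

audio = whisper.load_audio("audio.wav")
result = whisper.transcribe(model, audio)

for segment in result["segments"]:
    for word in segment["words"]:
        print(word["text"], word["start"], word["end"], word["confidence"])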

I'm just a bit worried because this repo seems to re-implement the decoding from whisper, so it won't follow further improvements of whisper (I know they are still fixing some possible bugs, like an infinite loop when the timestamp prediction gets stuck on <|0.00|>). Also the code is surprisingly short. So I am wondering: do you know if it gives the same results as whisper (up to some random seeding issues...)?

RaulKite commented 1 year ago

Have you seen this way to speedup inference through huggingface new method?

Automatic speech recognition pipeline 🚀 The prediction of timestamps is also available as part of the pipeline. It comes with a new feature: batched prediction. Long audio files can now be processed in a batched manner. This is made available by the _find_timestamp_sequence function, which is able to merge chunks of audio together based on timing information and timestamp prediction. In order to run the pipeline in batches, you must enable chunking by setting chunk_length_s = 30 as well as decide on a batch_size. This should allow for significant performance gains, with little loss in WER, depending on the hyperparameters you define. The recommended parameters are chunk_length_s=30, stride_length_s=[6,0]. If you want to learn more about how these parameters can affect the final results, feel free to refer to the blog post on chunking for ASR.
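In code, the batched/chunked pipeline described above looks roughly like this (a sketch; the model name, file name and batch size are illustrative):

from transformers import pipeline

# Chunking must be enabled for long audio; the recommended settings are
# chunk_length_s=30 and stride_length_s=[6, 0].
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,
    stride_length_s=[6, 0],
)

# batch_size controls how many 30-second chunks are decoded in parallel.
output = asr("long_audio.wav", batch_size=8, return_timestamps=True)
print(output["text"])
print(output["chunks"])  # [{"text": ..., "timestamp": (start, end)}, ...]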

Jeronymous commented 1 year ago

So many things happening around Whisper, it becomes hard to follow :sweat_smile:

Thank you for coming in @RaulKite . Do you have a link?

I've played a bit with HuggingFace's transformers.Whisper* classes, but it was hard to recover the same accuracy as the OpenAI implementation. I mean I could not get a comparable WER with a simple implementation using whisper in transformers (although it unlocks batch processing, yes). But it was only a quick try, maybe I missed something.

RaulKite commented 1 year ago

https://colab.research.google.com/drive/1rS1L4YSJqKUH_3YxIQHBI982zso23wor

ronyfadel commented 1 year ago

I'm just a bit worried because this repo seems to re-implement the decoding from whisper, so it won't follow further improvement of whisper

cc @guillaumekln

guillaumekln commented 1 year ago

Hello,

Yes, faster-whisper is a complete reimplementation of the Whisper model and transcription loop. That is the reason it can be much more efficient (while giving the same results in most cases). We are closely watching the main repo and will port any new improvements.

However, it is currently not compatible with extensions such as whisper-timestamped. As far as I understand, whisper-timestamped requires access to some model layers to get the attention weights or output logits. These outputs are currently not exposed to Python since most of the execution is happening in CTranslate2 which is a C++ library. Some additional work is needed to return all these intermediate values but it is not possible at the moment.

ronyfadel commented 1 year ago

Would it be possible to run the transcription through faster-whisper, and do all the post-processing that whisper-timestamped is doing using the regular whisper model? I reckon it'd still be faster than using vanilla whisper.

kamranjon commented 1 year ago

@ronyfadel unfortunately it seems the answer is no: it requires information that CTranslate2 does not surface, so additional inference would have to be run on the regular whisper model to obtain that information, and overall it would take more time. In the future, if CTranslate2 surfaces some of these outputs through its Python API, it might be possible, but for now this is not feasible.

ronyfadel commented 1 year ago

What's the information that CTranslate2 doesn't surface, so that I understand better?

kamranjon commented 1 year ago

@ronyfadel

As far as I understand, whisper-timestamped requires access to some model layers to get the attention weights or output logits. These outputs are currently not exposed to Python since most of the execution is happening in CTranslate2 which is a C++ library. Some additional work is needed to return all these intermediate values but it is not possible at the moment.

ronyfadel commented 1 year ago

@ronyfadel

As far as I understand, whisper-timestamped requires access to some model layers to get the attention weights or output logits. These outputs are currently not exposed to Python since most of the execution is happening in CTranslate2 which is a C++ library. Some additional work is needed to return all these intermediate values but it is not possible at the moment.

You missed my comment.

I'm asking if the post-processing can be based on the vanilla whisper weights. Meaning: fast transcription using faster-whisper and slow alignment based on vanilla whisper.
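Roughly, a skeleton of that two-stage idea (hypothetical, not an existing feature; the alignment stage is only indicated by a comment, since that is the part whisper-timestamped would have to provide):

import whisper
from faster_whisper import WhisperModel

# Stage 1: fast transcription with the CTranslate2 backend.
fast_model = WhisperModel("base")
segments, _ = fast_model.transcribe("audio.wav")
texts = [segment.text for segment in segments]

# Stage 2: load the vanilla openai-whisper model only for the alignment and
# confidence post-processing; whisper-timestamped would hook its decoder
# cross-attention here instead of running a second full decoding pass.
vanilla_model = whisper.load_model("base")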

Jeronymous commented 1 year ago

Yes @ronyfadel, that's a good suggestion. I think that with a little modification, whisper-timestamped can decouple the transcription part from the alignment part. I'll look into that, with faster-whisper in mind.

What's the information that CTranslate2 doesn't surface, so that I understand better?

The most critical piece seems to be the cross-attention weights, which need to be accessed to do the alignment.

ronyfadel commented 1 year ago

@Jeronymous bingo! (I'm still catching up and diving into the codebase).

all_hooks = []
# Capture the mel/MFCC features at the encoder input.
all_hooks.append(model.encoder.conv1.register_forward_hook(hook_mfcc))
# Capture the input tokens as they are embedded by the decoder.
all_hooks.append(model.decoder.token_embedding.register_forward_hook(hook_input_tokens))
nblocks = len(model.decoder.blocks)
j = 0
for i, block in enumerate(model.decoder.blocks):
    # Only hook the top-most decoder blocks used for word alignment.
    if i < nblocks - word_alignement_most_top_layers:
        continue
    # Capture the cross-attention weights needed for the word/time alignment.
    all_hooks.append(
        block.cross_attn.register_forward_hook(
            lambda layer, ins, outs, index=j: hook_attention_weights(layer, ins, outs, index))
    )
    j += 1
if compute_word_confidence or no_speech_threshold is not None:
    # Capture the decoder output for word confidence and no-speech detection.
    all_hooks.append(model.decoder.ln.register_forward_hook(hook_output_logits))

Without these hooks in CTranslate2 (and without it exposing the cross-attention weights), I'm not sure how I can move forward :)

guillaumekln commented 1 year ago

While I don't plan on making the library compatible with these hooks, I'm working on exposing an align method which can return the text/time alignments as implemented in openai/whisper:

https://github.com/OpenNMT/CTranslate2/pull/1120

I also have an experimental integration in faster-whisper that enables word-level timestamps. Follow these installation instructions if you want to try it out.

EDIT: word-level timestamps are now available on the master branch of faster-whisper.
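A minimal usage sketch (model size and file name are placeholders):

from faster_whisper import WhisperModel

model = WhisperModel("small")
segments, _ = model.transcribe("audio.wav", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print(f"[{word.start:.2f}s -> {word.end:.2f}s] {word.word} ({word.probability:.2f})")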

erturkdotgg commented 9 months ago

No, no, and please NO. CTranslate2 requires Nvidia cards and it doesn't have ROCm (AMD) support. This is the only modification that I can use with my AMD card, so please do not bring in CTranslate2 support.