linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence
GNU Affero General Public License v3.0

Regular Whisper model is still downloaded when using Hugging Face models #70

Closed blueskyleaf closed 1 year ago

blueskyleaf commented 1 year ago

Hello! Thank you for adding support for Hugging Face Whisper models, like the NbAiLab fine-tuned large-v2 model.

I would like to share a couple of observations I have made.

  1. When I load the "NbAiLab/whisper-large-v2-nob" model from Hugging Face, it downloads that fine-tuned model and converts it, but then it also downloads Whisper's regular large-v2 model.

  2. It uses about 3GB more GPU RAM than if I use "large-v2" (Whisper's original large model), I guess because it loads both models. It goes from using 10.7GB while transcribing (with plain Whisper large-v2) to 13.2GB (with the Hugging Face model). This causes an issue for me when transcribing in Google Colab: sometimes a sentence gets repeated many times in the resulting transcription, which skews the timestamps for the rest of the transcription as well. I guess this is because the average GPU RAM load of 13.2GB is too close to the 15GB limit of free Google Colab. This doesn't happen with regular Whisper large-v2.

  3. If I try to load a local ".pt" file after the Hugging Face support update, I get an error:

    "RuntimeError: Original error: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/content/large-v2.pt'. Use `repo_type` argument if needed.
    Could not find model /content/large-v2.pt from HuggingFace nor local folders."

    Whereas before this update, it accepted the local .pt file.
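To make points (1) and (3) concrete: a loader has to decide whether the string it is given is a local OpenAI-format checkpoint or a Hugging Face repo id before it touches the network. The helper below is a hypothetical sketch of that check (it is not code from whisper-timestamped; the model names are the ones from this report):

```python
import os

def looks_like_local_checkpoint(name_or_path: str) -> bool:
    # Hypothetical helper: an existing .pt file should be loaded as a local
    # OpenAI-format checkpoint, and only otherwise should the string be
    # interpreted as a Hugging Face repo id such as
    # "NbAiLab/whisper-large-v2-nob".
    return os.path.isfile(name_or_path) and name_or_path.endswith(".pt")

# A plain repo id is not a file on disk, so it goes down the HF path:
print(looks_like_local_checkpoint("NbAiLab/whisper-large-v2-nob"))  # False
```

Without a check of this kind, a path like "/content/large-v2.pt" falls through to the Hugging Face loader, which rejects it because slashes in repo ids are only allowed between namespace and repo name.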

Can these things be improved? Anyway, thank you!

blueskyleaf commented 1 year ago

I wonder if reversing the code here could help: https://github.com/huggingface/transformers/blob/main/src/transformers/models/whisper/convert_openai_to_hf.py
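The linked script maps OpenAI parameter names onto Hugging Face ones, so "reversing" it essentially means applying the inverse renaming to the HF state dict. A minimal sketch of that idea (the rule subset below is illustrative, inferred from the two architectures' naming conventions, not copied from the script, which handles more cases):

```python
# Illustrative subset of HF -> OpenAI Whisper state-dict key renames.
# Order matters: longer, more specific patterns must come first.
HF_TO_OPENAI = [
    ("model.", ""),                                   # strip WhisperForConditionalGeneration prefix
    ("layers", "blocks"),
    ("fc1", "mlp.0"),
    ("fc2", "mlp.2"),
    ("final_layer_norm", "mlp_ln"),
    ("encoder_attn_layer_norm", "cross_attn_ln"),
    ("self_attn_layer_norm", "attn_ln"),
    ("encoder_attn", "cross_attn"),
    ("self_attn", "attn"),
    ("q_proj", "query"),
    ("k_proj", "key"),
    ("v_proj", "value"),
    ("out_proj", "out"),
    ("embed_tokens", "token_embedding"),
    ("embed_positions.weight", "positional_embedding"),
]

def hf_key_to_openai(key: str) -> str:
    """Rename one Hugging Face Whisper parameter key to OpenAI naming."""
    for hf_name, openai_name in HF_TO_OPENAI:
        key = key.replace(hf_name, openai_name)
    return key

print(hf_key_to_openai("model.encoder.layers.0.self_attn.k_proj.weight"))
# encoder.blocks.0.attn.key.weight
```

Applying such a renaming to every entry of the HF state dict would let the converted weights be loaded directly into the OpenAI Whisper architecture, without also fetching the original large-v2 checkpoint.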

Jeronymous commented 1 year ago

Thank you @blueskyleaf for pointing out these problems with the early integration of HF models.

The memory consumption (2) should be an easy fix. However, I doubt there is a relation between this and the repeated sentences you see. That might just come from the fine-tuned models: from what I've seen, the scripts used to fine-tune these models do not use timestamp prediction, and they are usually fine-tuned on short speech extracts. So I suspect these models "forget" their ability to predict the end of each speech segment, which can have bad side effects when transcribing audio longer than 30 seconds.
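The memory fix presumably amounts to not keeping two copies of the weights alive at once. A hypothetical sketch of the pattern (stand-in dictionaries instead of real models, so it runs anywhere; not the repository's actual code):

```python
import gc

# Stand-in for hf_model.state_dict() of the fine-tuned Hugging Face model:
hf_state_dict = {"dummy.weight": [0.0] * 4}

# Stand-in for converting the HF weights to OpenAI Whisper naming:
converted = dict(hf_state_dict)

# Drop the duplicate copy as soon as the conversion is done, so only one
# set of large-v2-sized weights stays resident:
hf_state_dict = None
gc.collect()

# On GPU, one would additionally call torch.cuda.empty_cache() to return
# PyTorch's cached blocks to the driver.
```

Keeping both the HF model and the converted OpenAI-architecture model loaded would roughly account for the extra ~3GB reported above.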

(3) should also be easy to fix.

For (1) (avoiding downloading the regular Whisper model), it's not straightforward...

Jeronymous commented 1 year ago

All above-mentioned issues should be fixed now.

Thanks again @blueskyleaf for the very useful and complete issue description!