sandrohanea / whisper.net

Whisper.net. Speech to text made simple using Whisper Models
MIT License

Inaccurate Transcription and Language Model Issues #143

Closed romanokeser closed 10 months ago

romanokeser commented 10 months ago

I have encountered significant challenges using Whisper for speech-to-text conversion in the Croatian language. Unlike with English, the system consistently produces inaccurate transcriptions for Croatian audio inputs. There is no difference in output between CreateBuilder().WithLanguage("auto") and CreateBuilder().WithLanguage("Croatian").

Any suggestions? Thanks!

gorokhovskiy commented 10 months ago

This project is a .NET wrapper around functionality developed by OpenAI. The developers of Whisper.NET have no control over the quality of the speech recognition. I would suggest asking the participants of the following discussion for help: https://github.com/openai/whisper/discussions/16

sandrohanea commented 10 months ago

Hello @romanokeser ,

Indeed, @gorokhovskiy is right: this is just a wrapper around whisper.cpp, which is a C++ port of OpenAI Whisper, and the models come from OpenAI.

Besides the issue with the Serbian-Croatian languages, I can offer some additional ideas on how to improve the quality of transcripts (for any language):

  1. Use a larger model. Of course, this will require more memory and transcription will be slower, but the quality will improve.
  2. Fine-tune your own model. You can also fine-tune a model for a specific language, but that's a little harder:

    1. Find a Croatian dataset with labels.
    2. Follow: https://huggingface.co/blog/fine-tune-whisper
    3. With the resulting model, follow: https://github.com/ggerganov/whisper.cpp/blob/master/models/README.md#fine-tuned-models to convert it to ggml format
    4. Use that ggml model with Whisper.net
    5. Optional but nice => share the finetuned model with the community.
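
Step 3 above can be sketched roughly as follows. This is only a sketch, assuming the converter script described in the whisper.cpp models README (models/convert-h5-to-ggml.py, taking the fine-tuned model directory, a local clone of the openai/whisper repository, and an output directory, in that order). The directory names used here ("whisper-hr", "whisper", "out") are hypothetical placeholders, so please check the README for the current script name and arguments before relying on it.

```python
import subprocess
from pathlib import Path


def build_convert_command(model_dir, whisper_repo, output_dir,
                          whisper_cpp_repo="whisper.cpp"):
    """Build the argv list for whisper.cpp's Hugging Face -> ggml converter.

    Assumption: the script path and argument order follow the whisper.cpp
    models README (model dir, openai/whisper clone, output dir).
    """
    script = Path(whisper_cpp_repo) / "models" / "convert-h5-to-ggml.py"
    return ["python3", script.as_posix(),
            str(model_dir), str(whisper_repo), str(output_dir)]


def convert(model_dir, whisper_repo, output_dir, whisper_cpp_repo="whisper.cpp"):
    """Run the converter; raises CalledProcessError if the script fails."""
    cmd = build_convert_command(model_dir, whisper_repo, output_dir,
                                whisper_cpp_repo)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical directory names; only prints the command, does not run it.
    print(" ".join(build_convert_command("whisper-hr", "whisper", "out")))
```

The ggml file written to the output directory is what you would then load in Whisper.net (step 4).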

To answer your question about WithLanguage("auto") vs WithLanguage("Croatian"): auto first runs language identification to detect the language of your audio. After that phase, transcription (or translation) is identical.

In short: auto only makes it a little slower to get the first results; the quality will be the same.

Note: it can be worse if auto detects a different language by mistake, e.g. another Slavic language.
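
To make the difference concrete, here is a toy model of the selection logic described above. It is not Whisper's or whisper.cpp's actual implementation; the function name and the probability values are made up for illustration, and "hr", "sr" and "sl" are the language codes for Croatian, Serbian and Slovenian.

```python
def choose_language(requested, detection_probs):
    """Return the language code the decoder would be conditioned on.

    requested: "auto" or an explicit language code such as "hr".
    detection_probs: hypothetical output of a language identification pass,
    mapping language code -> probability.
    """
    if requested != "auto":
        # Explicit language: the identification pass is skipped entirely.
        return requested
    # "auto": pick the most probable language, which may be the wrong one.
    return max(detection_probs, key=detection_probs.get)


# Made-up detection results for a clean and a noisy Croatian recording.
clear_audio = {"hr": 0.80, "sr": 0.12, "sl": 0.08}
noisy_audio = {"hr": 0.35, "sr": 0.45, "sl": 0.20}

if __name__ == "__main__":
    for name, probs in (("clear", clear_audio), ("noisy", noisy_audio)):
        print(name, "auto ->", choose_language("auto", probs),
              "| explicit ->", choose_language("hr", probs))
```

When detection is right, both settings condition the decoder on the same language, so the output is the same. Only a misdetection (the "noisy" case here) makes auto worse than setting the language explicitly.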