ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Transcript returns a copyright string from a movie when specific tones are played #887

Open servin opened 1 year ago

servin commented 1 year ago

I'm transcribing audio files from recorded calls, and whenever there's a pause of slightly noisy "silence", the transcript prints: Subtítulos realizados por la comunidad de Amara.org ("Subtitles made by the Amara.org community").

I'm using the large model at normal speed.

As I understand it, this comes directly from the model's training data, but I don't know whether anyone has ideas for avoiding it.

    "timestamps": { "from": "00:01:54,000", "to": "00:01:57,000" },
    "offsets": { "from": 114000, "to": 117000 },
    "text": " Subtítulos realizados por la comunidad de Amara.org"
},
{
    "timestamps": { "from": "00:01:57,000", "to": "00:01:59,000" },

trholding commented 1 year ago

Seems like an issue with the model itself:

https://github.com/openai/whisper/discussions/928

I think that to prevent such hallucinations and biases, one could use a good VAD (voice activity detector) such as https://github.com/snakers4/silero-vad .

A workaround is to write a Python script that passes the audio through Silero VAD, sox, or even Audacity to cut out the non-speech parts, and then use the output files as input for whisper.cpp . Because audio has been removed, the resulting timestamps will not match those of the original audio files.
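To make the idea concrete, here is a rough, stdlib-only sketch of that kind of preprocessing. Note that it swaps in a plain RMS energy threshold instead of Silero VAD, so it is only a stand-in for a real VAD. All function names, the frame size, the threshold, and the gap length are assumptions made up for illustration, and samples are assumed to be floats in [-1, 1]. It also keeps a small lookup so a timestamp in the trimmed audio can be mapped back to the original.

```python
import math
from typing import List, Sequence, Tuple

Region = Tuple[int, int]  # (start_sample, end_sample) in the ORIGINAL audio


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame (0.0 for an empty frame)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame)) if frame else 0.0


def find_loud_regions(samples: Sequence[float], sample_rate: int,
                      frame_ms: int = 30, threshold: float = 0.02,
                      max_gap_ms: int = 300) -> List[Region]:
    """Return the regions whose frames reach the (assumed) energy threshold.

    Quiet gaps no longer than max_gap_ms are bridged so that short pauses
    inside a sentence are kept.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    max_gap = sample_rate * max_gap_ms // 1000
    regions: List[Region] = []
    for start in range(0, len(samples), frame_len):
        end = min(start + frame_len, len(samples))
        if frame_rms(samples[start:end]) < threshold:
            continue  # treat this frame as (noisy) silence
        if regions and start - regions[-1][1] <= max_gap:
            regions[-1] = (regions[-1][0], end)  # extend the previous region
        else:
            regions.append((start, end))
    return regions


def cut_regions(samples: Sequence[float],
                regions: List[Region]) -> Tuple[List[float], List[int]]:
    """Concatenate the kept regions.

    Returns the trimmed samples plus, for each region, the sample index at
    which it starts inside the trimmed audio.
    """
    trimmed: List[float] = []
    trimmed_starts: List[int] = []
    for start, end in regions:
        trimmed_starts.append(len(trimmed))
        trimmed.extend(samples[start:end])
    return trimmed, trimmed_starts


def to_original_ms(trimmed_ms: int, regions: List[Region],
                   trimmed_starts: List[int], sample_rate: int) -> int:
    """Map a millisecond offset in the trimmed audio back to the original."""
    pos = trimmed_ms * sample_rate // 1000
    # Walk backwards to find the last region that starts at or before pos.
    for (start, _end), t_start in zip(reversed(regions), reversed(trimmed_starts)):
        if pos >= t_start:
            return (start + pos - t_start) * 1000 // sample_rate
    return trimmed_ms  # nothing was kept, so there is nothing to remap
```

The trimmed samples would then be written to a WAV file and passed to whisper.cpp, and the "offsets" values in its output could be run through `to_original_ms` to recover positions in the original recording. A real VAD should do a much better job than a fixed threshold on noisy call audio.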

kgrusha commented 1 year ago

I might be wrong about this, but aren't both whisper.cpp and silero-vad MIT-licensed? What specifically makes them incompatible?

trholding commented 1 year ago

> I might be wrong about this, but aren't both whisper.cpp and silero-vad MIT-licensed? What specifically makes them incompatible?

Apologies, mixed it up with their other stuff which is GPL. Edited.