shun-liang opened 1 week ago
Loading local video and audio files, piping them into Whisper, and running all the downstream processing would be trivial.
However, to generate more accurate, readable Whisper transcriptions (e.g. with punctuation), Whisper needs context about the video or audio, supplied through its prompt. For online video and audio sources (e.g. YouTube, Apple Podcasts), yt2doc sanitizes the title and the description and feeds them to Whisper as the prompt. I don't think local files have that amount of context available.
One possible solution is to send the file name of the local video or audio to a local LLM and ask it to generate a few fictitious sentences, as the OpenAI Whisper cookbook suggests. Will experiment with that idea.
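A minimal sketch of that idea, assuming a local LLM with an OpenAI-style chat API (the helper names `filename_to_hint` and `build_llm_request` are hypothetical, not part of yt2doc):

```python
import re
from pathlib import Path


def filename_to_hint(path: str) -> str:
    """Turn a local media filename into a rough textual hint.

    Hypothetical helper: strips the extension and replaces common
    separators so the LLM sees something closer to a human-readable title.
    """
    stem = Path(path).stem
    hint = re.sub(r"[_\-.]+", " ", stem)
    return re.sub(r"\s+", " ", hint).strip()


def build_llm_request(path: str) -> list:
    """Build chat messages asking a local LLM for fictitious,
    well-punctuated sentences in the style of the Whisper cookbook's
    fictitious-prompt trick."""
    hint = filename_to_hint(path)
    return [
        {"role": "system",
         "content": "You write short, well-punctuated English sentences."},
        {"role": "user",
         "content": (f"A media file is titled '{hint}'. Write two fictitious "
                     "sentences that could plausibly open its transcript.")},
    ]


# The messages would then be sent to the local LLM, and its reply passed
# to Whisper as the initial prompt, e.g. (sketch, not tested):
#   result = model.transcribe(path, initial_prompt=llm_reply)
```

The point of generating fictitious sentences rather than passing the raw filename is that Whisper's prompt conditions the decoder on style (punctuation, casing) as much as on vocabulary, so fluent sentences make a better prompt than a bare title.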
https://www.reddit.com/r/youtubedl/comments/1g574wr/comment/lsc7j6y/?context=3
https://www.reddit.com/r/DataHoarder/comments/1g4342q/comment/lsc8v1h/?context=3