Open shun-liang opened 1 month ago
Loading local video and audio files, piping them to Whisper, and running all the downstream processing would be trivial.
However, to generate more accurate and readable Whisper transcriptions (e.g. with punctuation), Whisper needs to be given prompts that carry context about the video or audio. For online video and audio sources (e.g. YouTube, Apple Podcasts), yt2doc sanitizes the title and the description and feeds them into Whisper as prompts. I don't think local files have that amount of context available.
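For context, Whisper accepts an initial prompt that biases decoding toward the prompt's vocabulary and punctuation style. A minimal sketch of the idea using faster-whisper as the backend (the sanitization step and metadata strings here are illustrative, not yt2doc's actual code):

```python
# Sketch: pass sanitized video metadata to Whisper as a prompt.
from faster_whisper import WhisperModel


def sanitize(text: str) -> str:
    # Collapse newlines and repeated whitespace into a single-line prompt.
    return " ".join(text.split())


model = WhisperModel("base")

title = "Attention Is All You Need - Paper Walkthrough"  # assumed metadata
description = "We read through the Transformer paper and explain self-attention."

# Whisper conditions its decoder on the prompt, nudging the output toward
# matching vocabulary, casing, and punctuation.
segments, _info = model.transcribe(
    "talk.mp4",
    initial_prompt=sanitize(f"{title}. {description}"),
)
for segment in segments:
    print(segment.text)
```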
I think there is a possible solution: send the file names of the local videos or audios to a local LLM and ask it to generate a few fictitious sentences, as the OpenAI Whisper cookbook suggests. Will experiment with that idea.
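A rough sketch of that idea, assuming a local LLM served through Ollama's OpenAI-compatible endpoint (the endpoint, model name, and prompt wording are all assumptions):

```python
# Sketch: ask a local LLM to turn a filename into a few fictitious sentences
# to use as a Whisper prompt, per the OpenAI Whisper cookbook's suggestion.
# Assumes Ollama is running locally with its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")


def prompt_from_filename(filename: str) -> str:
    response = client.chat.completions.create(
        model="llama3.1",  # assumed local model
        messages=[
            {
                "role": "user",
                "content": (
                    f"This is a video file named '{filename}'. Write two or "
                    "three plausible sentences, with normal punctuation, "
                    "that could appear in its transcript."
                ),
            }
        ],
    )
    return response.choices[0].message.content


print(prompt_from_filename("2024-03-12_team_standup_recording.mp4"))
```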
Hello, an idea to experiment with for providing context for offline videos could be the description pages belonging to the video(s). For example, courses usually have a sales page, a course description, or at least a syllabus attached (think Udemy). The same goes for most videos: work trainings always have surrounding text, and so do lectures. What do you think?
Otherwise, getting context from a descriptive filename, and possibly asking the user to describe in just a few words what the video is about, could prove helpful. I'm not completely sure, but throwing it out there. A sketch of the page-scraping idea is below.
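One way this could look, assuming a plain requests + BeautifulSoup scrape of a hypothetical description URL. Whisper only keeps roughly the last 224 tokens of a prompt, so the page text needs trimming anyway:

```python
# Sketch: scrape a course/description page and boil it down to a short
# Whisper prompt. The URL is hypothetical; real pages would need their own
# selectors and sanitization.
import requests
from bs4 import BeautifulSoup


def prompt_from_description_page(url: str, max_chars: int = 800) -> str:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Drop script/style noise, then flatten the visible text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = " ".join(soup.get_text(separator=" ").split())
    return text[:max_chars]


print(prompt_from_description_page("https://example.com/course/intro-to-nlp"))
```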
> description pages belonging to the video(s).
Potentially. Though I am not sure yet what the interface should look like to support this without massively bloating the scope of this project. We could open a command line option or a parameter `whisper_prompt` (as yt2doc can technically be used as a Python library), and the logic of crawling and parsing the web page into a Whisper prompt could be done somewhere else as a separate project.
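A hypothetical sketch of what that option could look like; none of these names are yt2doc's real interface, and the wiring is illustrative only:

```python
# Hypothetical --whisper-prompt option: yt2doc stays a thin pass-through,
# and any crawling/parsing lives in a separate tool that produces the string.
import argparse

from faster_whisper import WhisperModel


def main() -> None:
    parser = argparse.ArgumentParser(prog="yt2doc")
    parser.add_argument("--audio", required=True, help="Path to a local file")
    parser.add_argument(
        "--whisper-prompt",
        default=None,
        help="Context passed straight through to Whisper as initial_prompt",
    )
    args = parser.parse_args()

    model = WhisperModel("base")
    segments, _ = model.transcribe(args.audio, initial_prompt=args.whisper_prompt)
    for segment in segments:
        print(segment.text)


if __name__ == "__main__":
    main()
```

Usage would then be something like `yt2doc --audio lecture.mp4 --whisper-prompt "A lecture on distributed systems, covering Raft and Paxos."`, with the prompt string produced however the user likes.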
A "whisper prompt" passed as an argument sounds great. Max results for least effort.
https://www.reddit.com/r/youtubedl/comments/1g574wr/comment/lsc7j6y/?context=3
https://www.reddit.com/r/DataHoarder/comments/1g4342q/comment/lsc8v1h/?context=3