shun-liang opened 1 week ago
Loading local video and audio files, piping them into Whisper, and running all the downstream processing would be trivial.
However, to generate more accurate, readable Whisper transcriptions (e.g. with punctuation), Whisper needs context about the video or audio, supplied through its prompt. For online video and audio sources (e.g. YouTube, Apple Podcasts), yt2doc sanitizes the title and the description and feeds them to Whisper as the prompt. I don't think local files have that amount of context available.
One possible solution is to send the file name of the local video or audio to a local LLM and ask it to generate a few fictitious sentences, as the OpenAI Whisper cookbook suggests. Will experiment with that idea.
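A minimal sketch of that idea, assuming a local LLM with an OpenAI-style chat API (the helper names `filename_to_hint` and `build_llm_request` are hypothetical, not part of yt2doc):

```python
import re
from pathlib import Path


def filename_to_hint(path: str) -> str:
    """Turn a local media filename into a rough textual hint.

    Hypothetical helper: strips the extension and replaces common
    separators so the LLM sees something closer to a human-readable title.
    """
    stem = Path(path).stem
    hint = re.sub(r"[_\-.]+", " ", stem)
    return re.sub(r"\s+", " ", hint).strip()


def build_llm_request(path: str) -> list:
    """Build chat messages asking a local LLM for fictitious,
    well-punctuated sentences in the style of the Whisper cookbook's
    fictitious-prompt trick."""
    hint = filename_to_hint(path)
    return [
        {"role": "system",
         "content": "You write short, well-punctuated English sentences."},
        {"role": "user",
         "content": (f"A media file is titled '{hint}'. Write two fictitious "
                     "sentences that could plausibly open its transcript.")},
    ]


# The messages would then be sent to the local LLM, and its reply passed
# to Whisper as the initial prompt, e.g. (sketch, not tested):
#   result = model.transcribe(path, initial_prompt=llm_reply)
```

The point of generating fictitious sentences rather than passing the raw filename is that Whisper's prompt conditions the decoder on style (punctuation, casing) as much as on vocabulary, so fluent sentences make a better prompt than a bare title.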
https://www.reddit.com/r/youtubedl/comments/1g574wr/comment/lsc7j6y/?context=3
https://www.reddit.com/r/DataHoarder/comments/1g4342q/comment/lsc8v1h/?context=3