shun-liang / yt2doc

YouTube, Apple Podcast (and more) to readable Markdown.
MIT License

Support local audio and video files #29

Open shun-liang opened 1 week ago

shun-liang commented 1 week ago

https://www.reddit.com/r/youtubedl/comments/1g574wr/comment/lsc7j6y/?context=3

It would be great if you could also point it at a local directory and transcribe into this nice format. I've been slowly working on something similar: I've got a directory of videos, and I've downloaded Whisper and started building a Docker container to process them.

https://www.reddit.com/r/DataHoarder/comments/1g4342q/comment/lsc8v1h/?context=3

Can I ask if there is any interest or plans to implement your program with local folders of stuff?

https://postimg.cc/xqg6Y11D

For additional context, I have folders of video lecture series. The videos do have subtitles but they are low-effort auto-generated and are wrong regularly on tech-related words. Whisper, on the other hand, works far better.

shun-liang commented 1 week ago

Loading local video and audio files and piping them to Whisper and the rest of the downstream processing would be trivial.
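
As a rough illustration of the loading step, here is a minimal sketch that collects media files from a local directory. The function name `find_media_files` and the extension set are hypothetical, not part of yt2doc; the set should match whatever the audio decoder can actually read.

```python
from pathlib import Path

# Hypothetical extension list (an assumption, not taken from yt2doc).
MEDIA_EXTENSIONS = {".mp3", ".m4a", ".wav", ".flac", ".mp4", ".mkv", ".webm", ".mov"}


def find_media_files(root: Path) -> list[Path]:
    """Return all audio/video files under root, sorted for a stable order."""
    return sorted(
        path
        for path in root.rglob("*")  # walk the directory tree recursively
        if path.is_file() and path.suffix.lower() in MEDIA_EXTENSIONS
    )
```

Each returned path could then be handed to the same transcription pipeline that handles downloaded audio.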

However, to get more accurate and readable Whisper transcriptions (e.g. with punctuation), Whisper needs to be prompted with context about the video or audio. For online sources (e.g. YouTube, Apple Podcasts), yt2doc sanitizes the title and the description and feeds them into Whisper as the prompt. I don't think local files have that amount of context available.
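
To make the prompting idea concrete, here is a minimal sketch of turning a title and description into a single prompt string. This is not yt2doc's actual sanitizer; the function name `build_whisper_prompt`, the URL-stripping rule, and the `max_chars` cutoff are all assumptions for illustration.

```python
import re


def build_whisper_prompt(title: str, description: str, max_chars: int = 600) -> str:
    """Combine title and description into one punctuated prompt string."""
    # End the title with a single period so the prompt models punctuated text.
    text = f"{title.strip().rstrip('.')}. {description}"
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs, which add no spoken context
    text = re.sub(r"\s+", " ", text).strip()   # collapse newlines and repeated spaces
    return text[:max_chars]                    # hypothetical length cap (assumption)
```

The resulting string would be passed as the `initial_prompt` argument of `transcribe` in openai-whisper or faster-whisper. A well-punctuated prompt nudges Whisper toward producing punctuated output in the same style.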

I think there is a possible solution: send the file names of the local video or audio files to a local LLM and ask it to generate a few fictitious sentences, as the OpenAI Whisper cookbook suggests. I will experiment with that idea.
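
A minimal sketch of what the file-name idea might look like, stopping short of the LLM call itself. Both helper names and the wording of the instruction are hypothetical; how the instruction is sent to a local LLM depends on the runtime and is left out.

```python
import re
from pathlib import Path


def title_from_filename(path: str) -> str:
    """Turn a file name like '03_intro-to_kubernetes.mp4' into a readable title."""
    stem = Path(path).stem                   # drop directories and the extension
    stem = re.sub(r"[._\-]+", " ", stem)     # treat separators as spaces
    stem = re.sub(r"^\d+\s+", "", stem)      # drop a leading ordinal such as "03 "
    return re.sub(r"\s+", " ", stem).strip()


def build_llm_instruction(path: str) -> str:
    """Build the instruction asking an LLM for fictitious, punctuated sentences."""
    title = title_from_filename(path)
    return (
        "Write three short, fully punctuated sentences that could plausibly "
        f'open a recording titled "{title}". Output only the sentences.'
    )
```

The sentences the LLM returns would then serve as the Whisper prompt in place of the missing title and description.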