240db opened 5 days ago
That's intriguing, thanks for sharing.
What would be the benefit of using Whisper + yt-dlp versus simply taking the pre-generated transcript from YouTube? I'm concerned the proposed solution would add too much latency and require additional API keys / cost.
Would appreciate if you could elaborate on your thoughts.
Thanks!
It's no match for videos where subtitles are already available, but those tend to be older videos; not every video has subtitles, especially newer ones. For longer videos, say a Fed or central bank speech, you can transcribe with faster tools, but Whisper is a bit more robust, even more robust than YouTube's auto-generated subtitles. So you would be able to generate podcasts about more recent content: the audio/video could be transcribed as it's released, even if no subtitles were provided.
YouTube's subtitles are great for cutting overhead, and they also come in multilingual versions, which is great; the Whisper solution misses that translation part. On the other hand, YouTube's transcription might not be as accurate as whisper large.
Anyway, this is only for videos that have no subtitles, or for when you want to try a higher-quality transcript. Not that it will matter too much for Gemini, but it can help get more immediate content into the pipeline.
Awesome, this makes a lot of sense and sounds like an interesting enhancement!
Cool! Also, yt-dlp allows one to download virtually any video, from any website really; it supports downloading videos from Instagram Reels and other third-party sites.
As for Whisper: if users have an MP4 source or any other external media, it could be pre-processed with ffmpeg if needed and then transcribed with Whisper to produce a transcript.txt.
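A minimal sketch of that flow, assuming a local `openai-whisper` install and `ffmpeg` on the PATH; the function names here are hypothetical, not podcastfy's actual API:

```python
import subprocess
from pathlib import Path

def audio_path_for(media_path: str) -> str:
    # Derive the intermediate WAV path next to the source file
    return str(Path(media_path).with_suffix(".wav"))

def transcribe_media(media_path: str, model_name: str = "base") -> str:
    """Convert an ffmpeg-readable file to WAV, transcribe it with Whisper,
    and write the text next to the source as transcript.txt."""
    import whisper  # heavy dependency, imported lazily (pip install openai-whisper)

    wav_path = audio_path_for(media_path)
    # 16 kHz mono is the sample rate Whisper resamples to internally anyway
    subprocess.run(
        ["ffmpeg", "-y", "-i", media_path, "-ar", "16000", "-ac", "1", wav_path],
        check=True,
    )
    result = whisper.load_model(model_name).transcribe(wav_path)
    out_path = Path(media_path).with_name("transcript.txt")
    out_path.write_text(result["text"])
    return str(out_path)
```

Note that Whisper's `transcribe()` already accepts any ffmpeg-readable file directly, so the explicit conversion step is optional and shown here only to mirror the flow described above.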
v0.2.1 makes podcastfy multimodal: images + text for now, but with a pathway to any modality. Having said that, in case a YouTube video does not have captions, we should simply download the video and pass it to the LLM the same way I'm passing images and text today. In addition to implementing the video-download feature, I'd need to add support for video in the LLM.
Inspiration
So there is a Gradio Space, https://huggingface.co/spaces/hf-audio/whisper-large-v3, that uses Whisper via the Hugging Face API. But you can also run Whisper locally; in that case, just include yt-dlp and whisper as dependencies if you want to do this with open-source alternatives. Here is a sketch for replacing the current youtube_transcriber.py with one that uses Whisper and yt-dlp instead:
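A minimal sketch of what such a replacement might look like, assuming the `yt-dlp` and `openai-whisper` packages; the function name and output filenames are placeholders, not the repo's actual interface:

```python
def ydl_audio_opts(out_tmpl: str) -> dict:
    # Best audio-only stream, predictable output template, no console noise
    return {"format": "bestaudio/best", "outtmpl": out_tmpl, "quiet": True}

def get_transcript(url: str, model_name: str = "large-v3") -> str:
    """Hypothetical youtube_transcriber replacement: fetch audio with yt-dlp,
    convert it with ffmpeg, then transcribe it with Whisper."""
    import subprocess
    import whisper  # pip install openai-whisper
    import yt_dlp   # pip install yt-dlp

    with yt_dlp.YoutubeDL(ydl_audio_opts("audio.%(ext)s")) as ydl:
        info = ydl.extract_info(url, download=True)
        audio_file = ydl.prepare_filename(info)

    # Re-encode to 16 kHz mono WAV before handing it to Whisper
    subprocess.run(
        ["ffmpeg", "-y", "-i", audio_file, "-ar", "16000", "-ac", "1", "audio.wav"],
        check=True,
    )
    return whisper.load_model(model_name).transcribe("audio.wav")["text"]
```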
I think the sketch is still calling ffmpeg to convert to audio, which is not necessary, since yt-dlp can output almost any audio format itself. I will edit this further.
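For illustration, one way to drop the explicit ffmpeg call is to let yt-dlp run its own `FFmpegExtractAudio` postprocessor (yt-dlp still invokes ffmpeg internally, but your code never shells out to it); the option values here are assumptions about the desired output:

```python
def audio_only_opts(out_tmpl: str, codec: str = "mp3") -> dict:
    """yt-dlp options that delegate audio extraction to yt-dlp's own
    FFmpegExtractAudio postprocessor, so no separate ffmpeg step is
    needed in the calling code."""
    return {
        "format": "bestaudio/best",
        "outtmpl": out_tmpl,
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": codec},
        ],
    }

# Usage (network required; URL is a placeholder):
# import yt_dlp
# with yt_dlp.YoutubeDL(audio_only_opts("audio.%(ext)s", "wav")) as ydl:
#     ydl.download(["https://www.youtube.com/watch?v=..."])
```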