open-source-ideas / ideas

💡 Looking for inspiration for your next open source project? Or perhaps you've got a brilliant idea you can't wait to share with others? Open Source Ideas is a community built specifically for this! 👋

Youtube Subtitle Search #88

Open dufferzafar opened 6 years ago

dufferzafar commented 6 years ago

Jump to the exact point in a YouTube video where something was said.

Relevant Technology

Use youtube-dl to download only the subtitles of videos.

Use Xapian / Elasticsearch for searching the entire corpus of subtitles. Or any other good text search DB that supports wildcards, regexes etc.

I don't really know what sort of UI would make the most sense: a native desktop app, a website, or a CLI tool.
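As a rough sketch of the "download only the subtitles" step, the snippet below builds a youtube-dl command line from Python. The flags (`--skip-download`, `--write-sub`, `--write-auto-sub`, `--sub-lang`, `--sub-format`, `-o`) are standard youtube-dl options. The function names, the default language and the output template are just illustrative assumptions, not part of any existing tool.

```python
import subprocess
from typing import List


def subtitle_only_command(video_url: str, lang: str = "en") -> List[str]:
    """Build a youtube-dl invocation that fetches subtitles but no video.

    Hypothetical helper: the name and defaults are assumptions for this sketch.
    """
    return [
        "youtube-dl",
        "--skip-download",       # do not download the video itself
        "--write-sub",           # uploaded subtitles, if the video has any
        "--write-auto-sub",      # also accept auto-generated captions
        "--sub-lang", lang,      # which subtitle language to fetch
        "--sub-format", "vtt",   # ask for WebVTT so cues carry timestamps
        "-o", "%(id)s.%(ext)s",  # name output files after the video id
        video_url,
    ]


def download_subtitles(video_url: str, lang: str = "en") -> None:
    """Run the command. Needs youtube-dl on PATH and network access."""
    subprocess.run(subtitle_only_command(video_url, lang), check=True)
```

Calling `download_subtitles(url)` for each video in a channel or playlist would give you the raw corpus to feed into whatever search backend you pick.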

Existing tools

You could also just download the subtitles and feed them to a tool like recoll, which already has a GUI: http://www.lesbonscomptes.com/recoll/

But this would only allow you to find the video itself, and not really "jump to it".


WorldBrain.io is a browser extension that searches your bookmarks & history.

This idea would make a fantastic extension to WorldBrain itself.


A chrome extension: https://chrome.google.com/webstore/detail/youtube-subtitles-search/aijacjhncladoajlfgeffggcbbopcljc


A website: http://www.tubequizard.com/search.php?pattern=!T3BlbkNvdXJzZVdhcmU
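The "jump to it" part that a plain desktop search tool lacks is small once you have a timestamp for the matching subtitle line: YouTube watch URLs accept a `t` query parameter with a start offset in seconds. A minimal sketch (the function name `jump_url` is made up for this example):

```python
from urllib.parse import urlencode


def jump_url(video_id: str, start_seconds: float) -> str:
    """Return a watch URL that starts playback at start_seconds.

    Fractions of a second are dropped, since the offset is given
    in whole seconds here.
    """
    query = urlencode({"v": video_id, "t": f"{int(start_seconds)}s"})
    return f"https://www.youtube.com/watch?{query}"


print(jump_url("VIDEO_ID", 83.5))
# https://www.youtube.com/watch?v=VIDEO_ID&t=83s
```

Any UI (desktop, web or CLI) could then just open that URL for the selected search hit.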


Who is this for

Depending on how much complexity you add through your choice of stack, this would be a moderate to advanced project.

Complexity

Required time (ETA)

KOLANICH commented 6 years ago

Doesn't YT have this feature out of the box? If I remember right, the page of each video has the transcribed subtitles in text form, with timestamps.

dufferzafar commented 6 years ago

Not sure if it does. How do you propose we test this?

KOLANICH commented 6 years ago

Just go to the page of any video with English text clearly spoken in it and check. I don't use the YT web interface (it requires JS), so I'm not going to check it myself. If I remember right, there was a dropdown menu with the transcript.

darshkpatel commented 6 years ago

I know it's a long shot, but can we create a database of searchable video subtitles?

It might even act as a nice database for training various machine learning models.



Haroenv commented 6 years ago

We actually made this as a feature for playlists for talks at Algolia, more info on the site: https://community.algolia.com/talksearch/

cc @pixelastic

pixelastic commented 6 years ago

Thanks for the ping @Haroenv.

You can see it live here, on all the Google I/O videos.

https://community.algolia.com/talksearch/demos/googleio/

It works by calling the YouTube API with some playlist ids, getting the list of videos of those playlists, and extracting the captions and other data, then pushing it to Algolia. The front-end is then some HTML/CSS/JS mixed with calls to the Algolia API.

The code of the crawler is open-source and available here: https://github.com/algolia/talksearch-scraper if you want to have a look. Issues, PRs and forks are welcome :)

dufferzafar commented 6 years ago

I feel the true potential of this idea isn't over ALL YouTube videos (unless you're Google, of course) but over restricted subsets of videos, like conference talks (which is what TalkSearch does).

What other groups of videos exist?

What else?

pixelastic commented 6 years ago

The thing is, even Google can't provide search in all subtitles of all videos in all languages. Even for them, it's too much data to process. And YouTube's philosophy is to not release a feature that would only work for a subset of users, only features that work for all of them. So there is very little chance of YouTube implementing this on their own any time soon.

Now, the first idea you describe (pasting a playlist link and having it instantly searchable) is already doable by hosting the TalkSearch code yourself. It currently only has a command-line tool to index new playlists, but one could easily plug it into a website. The only limitation would be the Algolia plan size. Algolia free accounts can host up to 10k records; anything above that requires a paid plan. And as it creates one record per line of transcript (in order to accurately jump to the right moment in the video), it quickly creates a lot of records.
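To make the "one record per line of transcript" point concrete, here is a rough sketch that turns a WebVTT caption file into one record per cue. This is not the TalkSearch code: the record fields (`video_id`, `start`, `text`) and the function name are assumptions for illustration, and the parser ignores WebVTT features such as cue settings, styling and inline tags.

```python
import re
from typing import Dict, List

# Start timestamp of a cue timing line, e.g. "00:01:23.500 --> 00:01:27.000".
# The hours part is optional in WebVTT.
CUE_TIMING = re.compile(r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->")


def vtt_to_records(video_id: str, vtt_text: str) -> List[Dict[str, object]]:
    """Return one record per caption cue, keyed by its start time in seconds."""
    records: List[Dict[str, object]] = []
    start = None            # start time of the cue being read, if any
    lines: List[str] = []   # text lines of the cue being read
    for raw in vtt_text.splitlines() + [""]:  # trailing "" flushes the last cue
        line = raw.strip()
        match = CUE_TIMING.match(line)
        if match:
            h, m, s, ms = match.groups()
            start = int(h or 0) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000
            lines = []
        elif line and start is not None:
            lines.append(line)
        elif not line and start is not None:
            if lines:
                records.append(
                    {"video_id": video_id, "start": start, "text": " ".join(lines)}
                )
            start = None
    return records


sample = """WEBVTT

00:00:01.000 --> 00:00:04.000
Welcome to the talk.

00:01:23.500 --> 00:01:27.000
Search is
hard.
"""
records = vtt_to_records("VIDEO_ID", sample)
print(len(records))  # 2
print(records[1])    # {'video_id': 'VIDEO_ID', 'start': 83.5, 'text': 'Search is hard.'}
```

Since every cue becomes its own record, a long talk easily produces hundreds of records, which is why the record count grows so fast.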

For example, the above Google I/O example is close to 475k records. Other smaller conferences are between 80k and 200k records. This would put the price tag at something like $60-$100 per month. Of course, the TalkSearch version hosted on our side will be kept free forever (but will be limited to tech conferences).

As for watched videos, I guess the code could be updated to index videos directly instead of playlists, with the extension extracting the videoIds and sending them to a kind of TalkSearch REST API that would do the indexing in the background. This is entirely possible, but once again it will be limited by the Algolia plan you have.

Of course, you're free to fork the TalkSearch code, keep the caption extraction logic and push it to another backend if you want. That's the beauty of Open-Source :)
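As an illustration of what "another backend" could look like at its very simplest, here is a toy in-memory inverted index over caption records. It is only a sketch under my own assumptions (the class name, the record fields and the all-words-must-match rule are made up for this example). A real setup would more likely use something like Elasticsearch or Xapian, as suggested earlier in the thread.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

TOKEN = re.compile(r"[a-z0-9']+")


class CaptionIndex:
    """Toy search backend: one record per caption line, AND-matching on words."""

    def __init__(self) -> None:
        self.records: List[Dict[str, object]] = []
        self.postings: Dict[str, Set[int]] = defaultdict(set)  # word -> record ids

    def add(self, video_id: str, start: float, text: str) -> None:
        record_id = len(self.records)
        self.records.append({"video_id": video_id, "start": start, "text": text})
        for token in TOKEN.findall(text.lower()):
            self.postings[token].add(record_id)

    def search(self, query: str) -> List[Dict[str, object]]:
        """Return records containing every word of the query, in insertion order."""
        tokens = TOKEN.findall(query.lower())
        if not tokens:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return [self.records[i] for i in sorted(ids)]


index = CaptionIndex()
index.add("talk-a", 1.0, "Welcome to the talk")
index.add("talk-a", 83.5, "Search is hard")
index.add("talk-b", 10.0, "Hard problems in search")

for hit in index.search("search hard"):
    print(hit["video_id"], hit["start"])
# talk-a 83.5
# talk-b 10.0
```

Each hit carries the video id and start time, which is all you need to jump to the right moment in the video.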

iamsoorena commented 6 years ago

I just wanted to say that this idea is awesome and makes me so excited. Why? Because it can help so many people who are ESL, like me, by giving them something like a very big video dictionary for finding real-life usages of English words in proper context. I'm in for this project and whatever helps it on its path. 😊