transitive-bullshit / yt-semantic-search

OpenAI-powered semantic search for any YouTube playlist – featuring the All-In Podcast. 💪
https://all-in-on-ai.vercel.app
MIT License

Search the All-In Podcast using AI

YouTube Semantic Search

OpenAI-powered semantic search for any YouTube playlist — featuring the All-In Podcast 🔥


Intro

I love the All-In Podcast. But search and discovery with podcasts can be really challenging.

I built this project to solve this problem... and I also wanted to play around with cool AI stuff. 😂

This project uses the latest models from OpenAI to build a semantic search index across every episode of the Pod. It allows you to find your favorite moments with Google-level accuracy and rewatch the exact clips you're interested in.

You can use it to power advanced search across any YouTube channel or playlist. The demo uses the All-In Podcast because it's my favorite 💕, but it's designed to work with any playlist.

How to get started

Note that a few episodes may not have automated English transcriptions available. The project currently fetches transcripts with a hacky HTML scraping solution, so a better approach would be to use Whisper to transcribe each episode's audio. The project also supports sorting results by recency or relevancy.

Example Queries

Screenshots

[Screenshots: desktop light mode, desktop dark mode]

How It Works

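Under the hood, it uses OpenAI's text-embedding-ada-002 model for embeddings, Pinecone for vector search, and a Next.js webapp deployed to Vercel for the frontend.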

We use Node.js and the YouTube API v3 to fetch the videos of our target playlist. In this case, we're focused on the All-In Podcast Episodes Playlist, which contains 108 videos at the time of writing.

npx tsx src/bin/resolve-yt-playlist.ts
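As a rough illustration of this step, here is a minimal sketch of paging through a playlist with the `playlistItems.list` endpoint of the YouTube Data API v3. It is not the project's actual script: the names `PlaylistVideo`, `buildPlaylistItemsUrl`, `listPlaylistVideos`, and the injected `fetchJson` helper are all hypothetical. The API returns at most 50 items per page, so a 108-video playlist takes three requests.

```typescript
// Hypothetical sketch, not the code in src/bin/resolve-yt-playlist.ts.

interface PlaylistVideo {
  videoId: string
  title: string
}

// Only the response fields this sketch reads.
interface PlaylistItemsPage {
  items: Array<{
    snippet: { title: string }
    contentDetails: { videoId: string }
  }>
  nextPageToken?: string
}

// Builds a playlistItems.list request URL (the API caps maxResults at 50).
function buildPlaylistItemsUrl(
  playlistId: string,
  apiKey: string,
  pageToken?: string
): string {
  const params: Array<[string, string]> = [
    ['part', 'snippet,contentDetails'],
    ['playlistId', playlistId],
    ['maxResults', '50'],
    ['key', apiKey]
  ]
  if (pageToken) params.push(['pageToken', pageToken])

  const query = params
    .map(([k, v]) => `${encodeURIComponent(k)}=${encodeURIComponent(v)}`)
    .join('&')
  return `https://www.googleapis.com/youtube/v3/playlistItems?${query}`
}

// Follows nextPageToken until the API stops returning one.
// fetchJson is injected (an assumption) so the loop can run without network access.
async function listPlaylistVideos(
  playlistId: string,
  apiKey: string,
  fetchJson: (url: string) => Promise<PlaylistItemsPage>
): Promise<PlaylistVideo[]> {
  const videos: PlaylistVideo[] = []
  let pageToken: string | undefined

  do {
    const page = await fetchJson(
      buildPlaylistItemsUrl(playlistId, apiKey, pageToken)
    )
    for (const item of page.items) {
      videos.push({
        videoId: item.contentDetails.videoId,
        title: item.snippet.title
      })
    }
    pageToken = page.nextPageToken
  } while (pageToken)

  return videos
}
```

On Node.js 18+, `fetchJson` could be something like `(url) => fetch(url).then((res) => res.json())`.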

We download the English transcripts for each episode using a hacky HTML scraping solution, since the YouTube API doesn't allow non-OAuth access to captions. Note that a few episodes don't have automated English transcriptions available, so we're just skipping them at the moment. A better solution would be to use Whisper to transcribe each episode's audio.

Once we have all of the transcripts and metadata downloaded locally, we pre-process each video's transcript, breaking it up into reasonably sized chunks of ~100 tokens and fetching each chunk's text-embedding-ada-002 embedding from OpenAI. This results in ~200 embeddings per episode.
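The chunking step could look roughly like the sketch below. It is an illustration under assumptions, not the project's implementation: `TranscriptSegment`, `TranscriptChunk`, and `chunkTranscript` are hypothetical names, and whitespace-separated words stand in for tokens, whereas a real pipeline would count tokens with an actual tokenizer. Each chunk keeps the start time of its first segment so a search hit can link back to the right moment in the video.

```typescript
// Hypothetical sketch: groups caption segments into chunks of at most ~maxTokens.

interface TranscriptSegment {
  start: number // offset into the video, in seconds
  text: string
}

interface TranscriptChunk {
  start: number // start time of the first segment in the chunk
  text: string
  tokenCount: number // approximated by word count (assumption)
}

function chunkTranscript(
  segments: TranscriptSegment[],
  maxTokens = 100
): TranscriptChunk[] {
  const chunks: TranscriptChunk[] = []
  let words: string[] = []
  let start = 0

  // Emits the words collected so far as one chunk.
  const flush = () => {
    if (words.length === 0) return
    chunks.push({ start, text: words.join(' '), tokenCount: words.length })
    words = []
  }

  for (const segment of segments) {
    const segmentWords = segment.text.split(/\s+/).filter((w) => w.length > 0)
    if (segmentWords.length === 0) continue

    // Close the current chunk if this segment would push it past the budget.
    // A single segment longer than maxTokens still becomes one oversized chunk.
    if (words.length > 0 && words.length + segmentWords.length > maxTokens) {
      flush()
    }
    if (words.length === 0) start = segment.start
    words.push(...segmentWords)
  }

  flush()
  return chunks
}
```

Each chunk's `text` would then be sent to OpenAI's embeddings endpoint with the model set to `text-embedding-ada-002`.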

All of these embeddings are then upserted into a Pinecone search index with a dimensionality of 1536. There are ~17,575 embeddings in total across ~108 episodes of the All-In Podcast.

npx tsx src/bin/process-yt-playlist.ts
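To make the upsert step more concrete, here is a small sketch of shaping embeddings into records for a vector index and splitting them into batches. The `id` / `values` / `metadata` record shape follows Pinecone's vector format, but everything else is an assumption for illustration: the `videoId:chunkIndex` id scheme, the metadata fields, the helper names, and the batch size of 100. The actual upsert call is left out because it depends on the Pinecone client version in use.

```typescript
// Hypothetical sketch: prepares records for upserting, without calling Pinecone.

const EMBEDDING_DIMENSIONS = 1536 // dimensionality of text-embedding-ada-002

interface VectorRecord {
  id: string
  values: number[]
  metadata: { videoId: string; start: number; text: string }
}

// Wraps one chunk's embedding in a record, rejecting vectors of the wrong size.
function toVectorRecord(
  videoId: string,
  chunkIndex: number,
  start: number,
  text: string,
  embedding: number[]
): VectorRecord {
  if (embedding.length !== EMBEDDING_DIMENSIONS) {
    throw new Error(
      `Expected ${EMBEDDING_DIMENSIONS} dimensions, got ${embedding.length}`
    )
  }
  return {
    id: `${videoId}:${chunkIndex}`, // assumed id scheme
    values: embedding,
    metadata: { videoId, start, text }
  }
}

// Splits records into fixed-size batches so each upsert request stays small.
function toBatches<T>(items: T[], batchSize = 100): T[][] {
  const batches: T[][] = []
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize))
  }
  return batches
}
```

With these assumptions, 17,575 records would be sent as 176 batches, the last one holding 75 records.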

Once our Pinecone search index is set up, we can start querying it either via the webapp or via the example CLI:

npx tsx src/bin/query.ts
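Conceptually, a query embeds the search text with the same model and asks the index for the nearest vectors. The sketch below imitates that with an in-memory brute-force scan using cosine similarity. It is only meant to show the idea and is not how Pinecone or `src/bin/query.ts` works internally. `IndexedChunk`, `SearchHit`, `search`, and the `publishedAt` metadata field are hypothetical, and the `recency` option is one assumed way to re-order the top matches by episode date.

```typescript
// Hypothetical sketch: brute-force nearest-neighbour search over embeddings.

interface IndexedChunk {
  id: string
  values: number[]
  publishedAt: string // ISO date of the episode (assumed metadata field)
}

interface SearchHit {
  id: string
  score: number
  publishedAt: string
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length')
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Returns the topK most similar chunks, optionally re-sorted newest first.
function search(
  query: number[],
  index: IndexedChunk[],
  topK = 5,
  sortBy: 'relevancy' | 'recency' = 'relevancy'
): SearchHit[] {
  const hits = index
    .map((chunk) => ({
      id: chunk.id,
      score: cosineSimilarity(query, chunk.values),
      publishedAt: chunk.publishedAt
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)

  if (sortBy === 'recency') {
    // ISO date strings sort chronologically when compared as text.
    hits.sort((x, y) =>
      x.publishedAt < y.publishedAt ? 1 : x.publishedAt > y.publishedAt ? -1 : 0
    )
  }
  return hits
}
```

A hosted index replaces the linear scan with an approximate nearest-neighbour structure, which is what keeps queries fast across thousands of embeddings.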

We also support generating timestamp-based thumbnails of every YouTube video in the playlist. Thumbnails are generated using headless Puppeteer and are uploaded to Google Cloud Storage. We also post-process each thumbnail with lqip-modern to generate nice preview placeholder images.

If you want to generate thumbnails (optional), run:

npx tsx src/bin/generate-thumbnails.ts

Note that thumbnail generation takes ~2 hours and requires a pretty stable internet connection.

The frontend is a Next.js webapp deployed to Vercel that uses our Pinecone index as a primary data store.

TODO

Feedback

Have an idea on how this webapp could be improved? Find a particularly fun search query?

Feel free to send me feedback, either on GitHub or Twitter. 💯

Credit

License

MIT © Travis Fischer

If you found this project interesting, please consider sponsoring me or following me on Twitter.

The API and server costs add up over time, so if you can spare it, sponsoring on GitHub is greatly appreciated. 💕