r-three / common-pile

Repo to hold code and track issues for the collection of permissively licensed data
MIT License

Internet Archive Podcasts #54

Open nkandpa2 opened 6 months ago

nkandpa2 commented 6 months ago

There are many podcasts published on Internet Archive under permissive licenses that can be transcribed with an ASR system like whisperX.

Below are the Internet Archive search queries for English podcasts under different licenses (see IA Search Guide for how to filter by license):

I have not tested and cannot verify whisperX transcription quality on non-English audio, but here are the search queries for permissively licensed podcasts in all languages:

With these search queries we can follow these instructions to bulk download the podcasts and then pass them through an ASR system.
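A minimal sketch of the bulk-download step, assuming the `internetarchive` Python package is used (the linked instructions may describe a different tool). The function name `bulk_download`, the `*.mp3` file pattern, and the `limit`, `search`, and `fetch` parameters are all hypothetical, introduced only for illustration. The query string is passed in as-is.

```python
from itertools import islice
from typing import Any, Callable, Iterable, List, Mapping, Optional


def bulk_download(
    query: str,
    destdir: str,
    limit: Optional[int] = None,
    search: Optional[Callable[[str], Iterable[Mapping[str, Any]]]] = None,
    fetch: Optional[Callable[..., Any]] = None,
) -> List[str]:
    """Download audio files for every item matching an Internet Archive query.

    `search` and `fetch` default to `internetarchive.search_items` and
    `internetarchive.download`; they are injectable so the loop can be
    exercised without network access.
    """
    if search is None or fetch is None:
        # Imported lazily so the module loads even without the package installed.
        import internetarchive

        search = search or internetarchive.search_items
        fetch = fetch or internetarchive.download

    downloaded: List[str] = []
    # islice with limit=None iterates over every search result.
    for result in islice(search(query), limit):
        identifier = result["identifier"]
        # Assumption: only MP3 files are wanted; adjust the pattern as needed.
        fetch(
            identifier,
            destdir=destdir,
            glob_pattern="*.mp3",
            ignore_existing=True,
        )
        downloaded.append(identifier)
    return downloaded
```

Calling `bulk_download(query, "podcasts/", limit=10)` with one of the search queries would fetch the first ten matching items, which is a cheap way to check audio formats and transcription quality before committing to a full download.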

craffel commented 2 months ago

Napkin math: 20k podcasts × ~30 minutes/podcast × 150 tokens/minute ≈ 90M tokens? Not sure we should put a ton of effort into this, but if it's easy to reuse the YouTube pipeline then it wouldn't hurt.
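The estimate above can be reproduced with a few lines, which makes it easy to plug in different assumptions. The function name `estimate_tokens` is hypothetical; the three inputs are the rough figures from the comment, not measured values.

```python
def estimate_tokens(
    num_podcasts: int, minutes_per_podcast: float, tokens_per_minute: float
) -> float:
    """Rough token count: episodes x minutes per episode x tokens per minute."""
    return num_podcasts * minutes_per_podcast * tokens_per_minute


total = estimate_tokens(num_podcasts=20_000, minutes_per_podcast=30, tokens_per_minute=150)
print(f"{total / 1e6:.0f}M tokens")  # prints "90M tokens"
```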