Napkin math: 20k podcasts × ~30 minutes/podcast × 150 tokens/minute ≈ 90M tokens. Not sure we should put a ton of effort into this, but if it's easy to reuse the YouTube pipeline then it wouldn't hurt.
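For a quick sanity check on that estimate (all three inputs are rough ballpark figures, not measured values):

```python
# Back-of-the-envelope token estimate for the podcast subset.
num_podcasts = 20_000        # rough count of matching items
minutes_per_podcast = 30     # rough average episode length
tokens_per_minute = 150      # rough token rate for transcribed speech

total_tokens = num_podcasts * minutes_per_podcast * tokens_per_minute
print(f"~{total_tokens / 1e6:.0f}M tokens")  # ~90M tokens
```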
There are many podcasts published on Internet Archive under permissive licenses that can be transcribed with an ASR system like whisperX.
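For reference, a minimal transcription sketch with whisperX for a single downloaded episode (the file path, model size, and batch size below are placeholders, not tested settings):

```python
import whisperx

device = "cuda"              # or "cpu"
audio_file = "episode.mp3"   # placeholder path to a downloaded episode

# Load the ASR model and run batched transcription.
model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)

# Join the segment texts into one transcript string.
transcript = " ".join(seg["text"].strip() for seg in result["segments"])
print(transcript)
```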
Below are the Internet Archive search queries for English podcasts under different licenses (see IA Search Guide for how to filter by license):
I have not tested and cannot verify whisperX transcription quality on non-English audio, but here are the search queries for permissively licensed podcasts in all languages:
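I haven't run these exact filters, but as a sketch, queries like the ones linked above can also be issued programmatically with the `internetarchive` Python package (the license and language clauses below are illustrative, not the actual queries):

```python
from internetarchive import search_items

# Illustrative filters only: audio items under a CC BY license.
# Add a language clause (e.g. 'AND language:(English)') for the
# English-only list, or leave it off to cover all languages.
query = (
    "mediatype:audio "
    "AND licenseurl:*creativecommons.org/licenses/by*"
)

for result in search_items(query, fields=["identifier", "licenseurl", "language"]):
    print(result["identifier"], result.get("licenseurl"), result.get("language"))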
With these search queries, we can follow these instructions to bulk download the podcasts and then pass them through an ASR system.
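As a sketch of that step using the `internetarchive` Python package rather than whatever the linked instructions use (again with a placeholder query; the real ones are linked above):

```python
from internetarchive import download, search_items

query = "mediatype:audio AND licenseurl:*creativecommons.org/licenses/by*"  # placeholder

for result in search_items(query, fields=["identifier"]):
    # Fetch only the audio files for each matching item into ./podcasts/<identifier>/.
    # Some items may only carry ogg/flac, so the glob may need widening.
    download(
        result["identifier"],
        glob_pattern="*.mp3",
        destdir="podcasts",
        verbose=True,
    )
```

The downloaded files can then be fed to the whisperX call sketched earlier.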