Internet Archive Podcasts

The Internet Archive hosts many podcasts published under permissive licenses, and these can be transcribed with an ASR system like whisperX.

Below are the Internet Archive search queries for English podcasts under different licenses (see IA Search Guide for how to filter by license):

Public Domain - 8,131 results
CC-BY-SA - 10,061 results
CC-BY - 4,530 results
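
As a rough illustration of how such per-license queries might be assembled programmatically, here is a minimal sketch. The field names (`mediatype`, `subject`, `licenseurl`, `language`) and the license URL patterns are my assumptions and are not necessarily the exact filters behind the counts above; check them against the IA Search Guide before relying on the results. The helper names `build_query` and `build_search_url` are hypothetical.

```python
from urllib.parse import urlencode

# ASSUMPTION: wildcard patterns matched against the item's license URL.
# They are illustrative and unverified; slashes may need escaping in
# some query contexts.
LICENSE_PATTERNS = {
    "public-domain": "*publicdomain*",
    "cc-by-sa": "*licenses/by-sa/*",
    "cc-by": "*licenses/by/*",
}


def build_query(license_key, language=None):
    """Return a Lucene-style query string for podcasts under one license.

    Pass language=None to search across all languages.
    """
    clauses = [
        "mediatype:audio",
        "subject:podcast",  # ASSUMPTION: podcasts are tagged with this subject
        f"licenseurl:{LICENSE_PATTERNS[license_key]}",
    ]
    if language is not None:
        clauses.append(f"language:{language}")
    return " AND ".join(clauses)


def build_search_url(query, rows=100, page=1):
    """Return an advancedsearch URL that asks for item identifiers as JSON."""
    params = [
        ("q", query),
        ("fl[]", "identifier"),
        ("rows", rows),
        ("page", page),
        ("output", "json"),
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


if __name__ == "__main__":
    for key in LICENSE_PATTERNS:
        print(key, build_search_url(build_query(key, language="eng")))
```

Dropping the `language` argument gives the all-languages variant of the same query.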

I have not tested and cannot verify whisperX transcription quality on non-English audio, but here are the search queries for permissively licensed podcasts in all languages:

Public Domain - 33,826 results
CC-BY-SA - 18,270 results
CC-BY - 17,720 results

With these search queries, we can follow these instructions to bulk download the podcasts and then pass the audio through an ASR system.
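
One possible shape for that pipeline is sketched below. It is not necessarily the procedure in the linked instructions: it assumes the `internetarchive` Python package for downloading and the `whisperx` command-line tool for transcription, both installed separately. The helper names, the `*.mp3` glob, the `max_items` cap, and the `large-v2` model choice are all hypothetical placeholders.

```python
import subprocess
from pathlib import Path

# ASSUMPTION: file extensions treated as podcast audio.
AUDIO_SUFFIXES = {".mp3", ".ogg", ".m4a", ".wav", ".flac"}


def whisperx_command(audio_path, out_dir, model="large-v2"):
    """Build the whisperX CLI invocation for one audio file."""
    return ["whisperx", str(audio_path), "--model", model, "--output_dir", str(out_dir)]


def download_items(query, dest_dir, max_items=10):
    """Download MP3 files for the first `max_items` search results."""
    # Imported lazily so the other helpers work without the package installed.
    from internetarchive import download, search_items

    for n, result in enumerate(search_items(query)):
        if n >= max_items:
            break
        # ASSUMPTION: only MP3 derivatives are wanted.
        download(result["identifier"], destdir=str(dest_dir), glob_pattern="*.mp3")


def find_audio(root):
    """Return all audio files under `root`, sorted by path."""
    return sorted(p for p in Path(root).rglob("*") if p.suffix.lower() in AUDIO_SUFFIXES)


def transcribe_all(audio_dir, out_dir):
    """Run whisperX on every downloaded audio file, one process per file."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for audio in find_audio(audio_dir):
        subprocess.run(whisperx_command(audio, out_dir), check=True)
```

Keeping download and transcription as separate steps means a failed transcription run can be restarted without fetching the audio again.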
