phubbard / tgn-whisperer

Automate transcription of entire podcasts using WhisperX
https://www.phfactor.net/tgn/
GNU General Public License v3.0

Introduction

After discovering the whisper.cpp project, I had the idea of transcribing a podcast made by some friends of mine, The Grey Nato. I have since added the 40 and 20 podcast, which I also enjoy.

It's running on my trusty M1 Mac Mini and the results (static websites) are deployed to

Take a look! This code and the sites are provided free of charge as a public service to fellow fans, listeners and those who find the results useful.

For a year or so we used OctoAI's paid service, but as of 11/1/2024 they have been acquired and shut down. So now I'm spinning up a Flask wrapper for WhisperX on my compute server.

This repo is the code and some notes for myself and others. As of 10/9/2023, the code handles two podcasts and is working well.

Goals

  1. Simple as possible - use existing tools whenever possible
  2. Incremental - be able to add new episodes easily and without reworking previous ones

Workflow and requirements

  1. Download the RSS file (process.py, using Requests)
  2. Parse it for the episode MP3 files (xmltodict)
  3. Call WhisperX on each (POST to Flask)
  4. Collation and speaker attribution (episode.py)
  5. Export text into markdown files (to_markdown.py)
  6. Generate a site with mkdocs
  7. Publish (rsync)

All of these steps are run and orchestrated by two Makefiles. The setup is robust and portable, deletes outputs if interrupted, and is working pretty well.
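As a rough illustration of steps 1 and 2, here is a minimal sketch of pulling the MP3 URLs out of a feed. It is not the code in process.py: it swaps in the standard library's xml.etree.ElementTree for xmltodict so that it runs without extra packages, it works on an already-downloaded feed string rather than calling Requests, and the sample feed and the list_enclosures name are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, invented for illustration only.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Podcast</title>
    <item>
      <title>Episode 2 - Second Show</title>
      <enclosure url="https://example.com/audio/ep2.mp3" type="audio/mpeg" length="456"/>
    </item>
    <item>
      <title>Episode 1 - First Show</title>
      <enclosure url="https://example.com/audio/ep1.mp3" type="audio/mpeg" length="123"/>
    </item>
    <item>
      <title>Text-only announcement</title>
    </item>
  </channel>
</rss>
"""


def list_enclosures(rss_text):
    """Return (title, mp3_url) pairs for every item that has an enclosure."""
    root = ET.fromstring(rss_text)
    episodes = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is None:
            # Items without audio (announcements, etc.) are skipped.
            continue
        title = item.findtext("title", default="").strip()
        episodes.append((title, enclosure.get("url")))
    return episodes


if __name__ == "__main__":
    for title, url in list_enclosures(SAMPLE_RSS):
        print(title, url)
```

Each URL returned this way is what would then be handed to the transcription step.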

Makefiles are tricky to write and debug. I might need remake at some point. The makefile tutorial here was essential at several points - suffix rewriting, basename built-in, phony, etc. You can do a lot with a Makefile very concisely, and the result is robust, portable and durable. And fast.

Another good tutorial (via Lobste.rs) https://makefiletutorial.com/#top

Directory list from StackOverflow ... as one does.

The curse of URL shorteners and bit.ly in particular

For a while, the TGN podcast shared episode URLs via bit.ly. There are good reasons for this, but now when I want to retrieve pages sequentially, bit.ly rate-limits the requests, and I see no reason to risk errors for readers. So I've built a manual process:

Episode numbers and URLs

For a project like this, you want a primary index / key / way to refer to an episode. The natural choice is "episode number". This is a field in the RSS XML:

itunes:episode

However! TGN was bad and didn't include this. What's more, they had episodes in between episodes. The episode_number function in process.py handles this with a combination of techniques:

  1. Try the itunes:episode key
  2. Check the list of exceptions, keyed by string title
  3. Try to parse an integer from the title
  4. Starting at 2100, assign a number
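The fallback chain above can be sketched roughly as follows. This is an illustration, not the actual episode_number function from process.py: the dictionary-shaped entry, the make_episode_numberer helper, the sample exceptions table, and the use of a fractional number for an in-between episode are all assumptions made for the example. Only the order of the four steps and the 2100 starting point come from the list above.

```python
import re

# Hypothetical exceptions table: exact title -> hand-assigned number.
# The fractional value is an assumption about how an "in between" episode
# might be numbered.
TITLE_EXCEPTIONS = {
    "A Special Bonus Episode": 14.5,
}

FALLBACK_START = 2100  # unnumbered episodes are assigned numbers from here


def make_episode_numberer(exceptions=TITLE_EXCEPTIONS, start=FALLBACK_START):
    """Build an episode_number(entry) function with its own fallback counter."""
    next_fallback = start

    def episode_number(entry):
        nonlocal next_fallback

        # 1. Try the itunes:episode key.
        raw = entry.get("itunes:episode")
        if raw is not None:
            try:
                return int(raw)
            except (TypeError, ValueError):
                pass  # present but unusable; fall through to the next step

        title = entry.get("title", "")

        # 2. Check the list of exceptions, keyed by string title.
        if title in exceptions:
            return exceptions[title]

        # 3. Try to parse an integer from the title (first run of digits).
        match = re.search(r"\d+", title)
        if match:
            return int(match.group())

        # 4. Assign the next number from the fallback range.
        number = next_fallback
        next_fallback += 1
        return number

    return episode_number
```

Keeping the counter inside a closure means each feed can get its own numberer without sharing global state.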

The story is very similar for per-episode URLs. They should be there, are often missing, and can sometimes be parsed out of the description.

40 & 20 has clean metadata, so this was a ton easier for their feed.
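For the per-episode URL, a similar fallback might look like the sketch below. The entry layout, the link and description keys, and the episode_url name are assumptions for illustration; the real code may differ.

```python
import re

# Matches an http(s) URL up to the next whitespace, quote, or angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def episode_url(entry):
    """Return the episode page URL, or None if none can be found."""
    # Prefer an explicit link field when the feed provides one.
    link = entry.get("link")
    if link:
        return link

    # Otherwise take the first URL that appears in the description.
    description = entry.get("description") or ""
    match = URL_PATTERN.search(description)
    if match:
        # Drop trailing punctuation that belongs to the sentence, not the URL.
        return match.group().rstrip(".,;)")

    return None
```

Returning None rather than guessing leaves room for a manual override when neither source has a usable URL.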

Optional - wordcloud

I was curious as to how this'd look, so I used the Python wordcloud tool. It was a bit fussy to get working with my Python 3.11 install:

 python -m pip install -e git+https://github.com/amueller/word_cloud#egg=wordcloud
 cat tgn/*.txt > alltext
 wordcloud_cli --text alltext --imagefile wordcloud.png --width 1600 --height 1200

[Image: wordcloud, generated from the TGN transcripts]

40 & 20, run Sep 24 2023 - fun to see the overlaps.

[Image: wordcloud_wcl, generated from the 40 & 20 transcripts]