When I discovered the whisper.cpp project, I had the idea of transcribing The Grey Nato, a podcast some friends of mine make, and later also the 40 and 20 podcast, which I also enjoy.
It's running on my trusty M1 Mac Mini and the results (static websites) are deployed to
Take a look! This code and the sites are provided free of charge as a public service to fellow fans, listeners and those who find the results useful.
For a year or so we used OctoAI's paid service, but as of 11/1/2024 they've been acquired and shut down. So now I'm spinning up a Flask wrapper for WhisperX on my compute server.
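To make the switch concrete, here is a minimal sketch of what such a wrapper can look like. The endpoint name, model choice and parameters are placeholders for illustration, not necessarily what actually runs on my server.

```python
# Illustrative sketch only: endpoint name, model size and options are
# assumptions, not necessarily what the real wrapper uses.
import tempfile

import whisperx
from flask import Flask, jsonify, request

app = Flask(__name__)
# Load the model once at startup; "cuda" assumes a GPU on the compute server.
model = whisperx.load_model("large-v2", device="cuda", compute_type="float16")

@app.post("/transcribe")
def transcribe():
    upload = request.files["audio"]  # audio file sent as a multipart POST
    with tempfile.NamedTemporaryFile(suffix=".mp3") as tmp:
        upload.save(tmp.name)
        audio = whisperx.load_audio(tmp.name)
    result = model.transcribe(audio, batch_size=16)
    return jsonify(result["segments"])

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```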
This repo is the code and some notes for myself and others. As of 10/9/2023, the code handles two podcasts and is working well.
All of this is run and orchestrated by two Makefiles. Robust, portable, they delete partial outputs if a step fails or is interrupted, and they're working pretty well.
Makefiles are tricky to write and debug; I might need remake at some point. The Makefile tutorial here was essential at several points: suffix rewriting, the basename built-in, .PHONY targets, etc. You can do a lot with a Makefile very concisely, and the result is robust, portable and durable. And fast.
Another good tutorial (via Lobste.rs): https://makefiletutorial.com/#top
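For flavor, here's a toy Makefile in the same spirit. The target names and paths are made up, not the repo's actual rules, but it shows the pattern-rule suffix rewriting, .PHONY, and the partial-output cleanup mentioned above.

```make
# Toy example, not the real Makefiles: one transcript per mp3.
MP3S := $(wildcard audio/*.mp3)
TXTS := $(patsubst audio/%.mp3,transcripts/%.txt,$(MP3S))

.PHONY: all clean
.DELETE_ON_ERROR:    # throw away partial outputs when a recipe fails

all: $(TXTS)

# The pattern rule does the suffix rewriting; the hypothetical
# whisper-transcribe command stands in for the real transcription step.
transcripts/%.txt: audio/%.mp3
	whisper-transcribe $< > $@

clean:
	rm -f transcripts/*.txt
```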
The directory-listing code is from StackOverflow ... as one does.
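It amounts to something like this (a paraphrase, not the exact snippet):

```python
# Paraphrase of the StackOverflow-style directory listing, not the exact code.
from pathlib import Path

def list_audio_files(directory: str) -> list[Path]:
    """Return the .mp3 files under a directory, sorted by name."""
    return sorted(Path(directory).glob("*.mp3"))
```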
For a while, the TGN podcast shared episode URLs via bit.ly. There are good reasons for this, but now when I want to retrieve those pages sequentially, bit.ly throws rate-limit errors, and I see no reason to risk errors for readers. So I've built a manual process:
For a project like this, you want a primary index / key / way to refer to an episode. The natural choice is "episode number". This is a field in the RSS XML:
itunes:episode
However! TGN was bad and didn't include this. What's more, they published episodes in between the numbered episodes. The episode_number function in process.py handles this with a combination of techniques (a rough sketch appears below).
The story is very similar for per-episode URLs: they should be there, they're often missing, and they can sometimes be parsed out of the description.
40 & 20 has clean metadata, so this was a ton easier for their feed.
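For the curious, here is roughly the shape of that fallback logic. This is a hypothetical sketch, not the actual process.py code; the override table and the regexes are invented for illustration.

```python
# Hypothetical sketch of the metadata fallbacks; the real process.py differs.
import re
import xml.etree.ElementTree as ET

ITUNES = "{http://www.itunes.com/dtds/podcast-1.0.dtd}"

# Hand-maintained overrides for the episodes-between-episodes (made-up values).
OVERRIDES = {"a bonus episode title": "112.5"}

def episode_number(item: ET.Element) -> str | None:
    title = item.findtext("title", default="")
    # 1. Prefer the itunes:episode tag when the feed provides it.
    tag = item.findtext(f"{ITUNES}episode")
    if tag:
        return tag
    # 2. Fall back to pulling a number out of the title ("Episode 42 - ...").
    match = re.search(r"(?:episode|ep\.?)\s*(\d+)", title, re.IGNORECASE)
    if match:
        return match.group(1)
    # 3. Last resort: the manual override table.
    return OVERRIDES.get(title)

def episode_link(item: ET.Element) -> str | None:
    # Prefer the <link> element; otherwise fish a URL out of the description.
    link = item.findtext("link")
    if link:
        return link
    description = item.findtext("description", default="")
    match = re.search(r"https?://\S+", description)
    return match.group(0) if match else None
```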
I was curious what a word cloud of the transcripts would look like, so I used the Python wordcloud tool. It was a bit fussy to get working with my Python 3.11 install:
python -m pip install -e git+https://github.com/amueller/word_cloud#egg=wordcloud
cat tgn/*.txt > alltext
wordcloud_cli --text alltext --imagefile wordcloud.png --width 1600 --height 1200
40 & 20, run Sep 24 2023 - fun to see the overlaps.