
transcription-bot


This tool creates wiki transcript pages for episodes of the podcast "The Skeptic's Guide to the Universe".
This tool is a fan creation and is neither endorsed by nor associated with the creators of the podcast.

How it works

This explanation is targeted toward those who have no experience writing or reading code; the short code sketches along the way are optional illustrations and can be skipped.

Transcription

We create a transcription of an episode. The transcription contains the words that were said as well as the time at which each word was spoken.
That looks a bit like this:

0:01.234 - And
0:01.450 - that's
0:01.750 - why
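
As a rough illustration (not necessarily the library this project uses), the open-source whisper package can produce word-level timestamps like the ones above:

import whisper

# Load a speech-to-text model and transcribe with per-word timestamps.
model = whisper.load_model("base")
result = model.transcribe("episode.mp3", word_timestamps=True)

for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['start']:.3f} - {word['word'].strip()}")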

For the sake of this demo, here are the extracted words without their timestamps.

And that's why taxes are so fascinating I'm not sure I agree alright, time for a quickie with Bob yes folks we're going to talk about lasers the pew pew kind?

The next layer is diarization, which tells us when a different person is speaking.

SPEAKER_01: 0:01-0:03
SPEAKER_04: 0:04-0:06
SPEAKER_03: 0:07-0:09
SPEAKER_02: 0:10-0:14
SPEAKER_05: 0:14-0:15

We merge the transcription and the diarization:

SPEAKER_01: And that's why taxes are so fascinating.
SPEAKER_04: I'm not sure I agree.
SPEAKER_03: Alright, time for a quickie with Bob.
SPEAKER_02: Yes folks we're going to talk about lasers.
SPEAKER_05: The pew pew kind?
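
Conceptually, the merge assigns each timestamped word to the diarization turn that contains it. A minimal sketch, with made-up data structures rather than the bot's actual internals:

# Timestamped words from the transcription, speaker turns from diarization.
words = [(1.234, "And"), (1.450, "that's"), (1.750, "why")]
turns = [("SPEAKER_01", 1.0, 3.0), ("SPEAKER_04", 4.0, 6.0)]

def speaker_at(time: float) -> str:
    # Find the diarization turn whose time window contains this word.
    for speaker, start, end in turns:
        if start <= time <= end:
            return speaker
    return "UNKNOWN"

merged = [(speaker_at(t), word) for t, word in words]
# [('SPEAKER_01', 'And'), ('SPEAKER_01', "that's"), ('SPEAKER_01', 'why')]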

Then we apply the voiceprints we have on file to identify the speakers.

Evan: And that's why taxes are so fascinating.
Cara: I'm not sure I agree.
Steve: Alright, time for a quickie with Bob.
Bob: Yes folks we're going to talk about lasers.
Jay: The pew pew kind?

At this point, the transcription is complete and we have what is internally called a "diarized transcript".
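
Once the voiceprints have been matched, applying them is effectively a relabeling step; a toy sketch with an illustrative mapping:

# Hypothetical mapping from diarization labels to identified speakers.
identified = {"SPEAKER_01": "Evan", "SPEAKER_04": "Cara", "SPEAKER_03": "Steve",
              "SPEAKER_02": "Bob", "SPEAKER_05": "Jay"}

diarized = [("SPEAKER_03", "Alright, time for a quickie with Bob.")]
named = [(identified.get(label, label), text) for label, text in diarized]
# [('Steve', 'Alright, time for a quickie with Bob.')]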

Segment Data Gathering

The bot has information about all the recurring segment types, but it needs to know which segments a particular episode contains.

To figure this out, we need data. We draw on two sources: the show notes web page and the lyrics embedded in the episode mp3 file.

By combining the data from those two sources, we know which segments the episode contains and the order they appear in.
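
As an example of the second source, embedded lyrics can be read from the mp3's ID3 tags; a sketch using the mutagen library (exactly how the show stores its lyrics is an assumption here):

from mutagen.id3 import ID3

# USLT frames hold "unsynchronized lyrics" embedded in an mp3's ID3 tags.
tags = ID3("episode.mp3")
for frame in tags.getall("USLT"):
    print(frame.text)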

Segmenting the Transcript

To continue the example from above, let's say we know that this episode has a "Quickie" segment.
The bot is programmed to look for the words "quickie with" to find the transition point into the segment.
This lets us break the full transcript into the episode's segments.

Cara: I'm not sure I agree.

== Quickie with Bob: Lasers ==

Steve: Alright, time for a quickie with Bob.
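
A minimal sketch of that keyword search (the transcript structure here is made up for illustration):

# Hypothetical diarized transcript: (speaker, text) pairs in order.
transcript = [
    ("Cara", "I'm not sure I agree."),
    ("Steve", "Alright, time for a quickie with Bob."),
    ("Bob", "Yes folks we're going to talk about lasers."),
]

def find_transition(transcript, keyword="quickie with"):
    # Return the index of the first line containing the keyword.
    for i, (_, text) in enumerate(transcript):
        if keyword in text.lower():
            return i
    return None

find_transition(transcript)  # 1 -- the segment heading goes before this line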

We use templates to ensure that we match the desired formatting for the wiki.

When Segmenting is Tricky

It's tricky to identify transitions into news segments: there are no keywords that reliably tell us when a transition is happening.
So for this case, and as a fallback for all segment types when heuristics don't work, we send a chunk of transcript to GPT and ask it to identify the transition.
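
A sketch of that fallback using the openai client (the model name and prompt are placeholders, not the bot's actual ones):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY etc. from the environment

def find_transition_with_gpt(chunk: str) -> str:
    # Ask the model to point at the line where the segment begins.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "Identify the line where the news segment begins."},
            {"role": "user", "content": chunk},
        ],
    )
    return response.choices[0].message.content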

Other Odds and Ends

We download the image from the show notes page and upload it to the wiki. We add a caption that is generated by GPT (this results in something pretty bland and not specific to the episode).
We fetch the linked articles to extract their titles, which are used in the references at the bottom of the wiki pages.
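
Extracting an article title can be as simple as reading the page's <title> element; a sketch using requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup

def article_title(url: str) -> str:
    # Fetch the page and read its <title> element, falling back to the URL.
    page = requests.get(url, timeout=30)
    soup = BeautifulSoup(page.text, "html.parser")
    return soup.title.string.strip() if soup.title else url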

Development

The project uses Python 3.11 because many of the ML libraries have not yet adopted 3.12.
Poetry is used to manage dependencies. poetry install will get you set up.

There are a number of environment variables required to run the tool. dotenv is set up, so you can place your variables in a .env file.

PYANNOTE_TOKEN is a token for pyannote.ai's services, which is what we use to handle diarization and speaker identification.
NGROK_TOKEN is also required for pyannote.ai as they return results via a webhook/callback.
WIKI_USERNAME and WIKI_PASSWORD are credentials for your bot account. You can create bot credentials at https://www.sgutranscripts.org/wiki/Special:BotPasswords.
OPENAI_API_KEY, OPENAI_ORGANIZATION, and OPENAI_PROJECT are all used for calls to GPT.
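
A .env file with placeholder values looks like this:

PYANNOTE_TOKEN=your-pyannote-token
NGROK_TOKEN=your-ngrok-token
WIKI_USERNAME=YourUsername@your-bot-name
WIKI_PASSWORD=your-bot-password
OPENAI_API_KEY=your-openai-key
OPENAI_ORGANIZATION=your-openai-org
OPENAI_PROJECT=your-openai-project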

Ruff and Pyright should be used for linting and type checking.