swyxio / swyxdotio

This is the repo for swyx's blog - Blog content is created in github issues, then posted on swyx.io as blog pages! Comment/watch to follow along my blog within GitHub
https://swyx.io
MIT License

How to transcribe podcast audio (WhisperX with speaker diarization) #470


swyxio commented 1 year ago

category: tutorial
slug: transcribe-podcasts-with-whisper
tag: podcasts, ai, whisper
cover_image: https://user-images.githubusercontent.com/6764957/221219413-e83cec72-3164-40ad-bd48-0ca41616f224.png

Note: sometimes WhisperX is WAAYYYY too slow so I often end up using https://github.com/ggerganov/whisper.cpp which somehow runs much faster.

I do a lot of podcast transcription work and needed it again today. The Hugging Face Spaces (like this one: https://huggingface.co/spaces/vumichien/whisper-speaker-diarization) always error out, so they aren't very useful.

This is the one that worked for me.

Note: if you run into the error "'soundfile' backend is not available", run conda install -c conda-forge libsndfile to fix it.
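A quick way to check for this before starting a long run is to try importing the soundfile Python package. This is only a sketch: it assumes the backend error comes from that package (or the libsndfile library underneath it) being unavailable, which is one likely cause but not the only possible one. The function name soundfile_available is made up for illustration.

```python
def soundfile_available() -> bool:
    """Return True if the `soundfile` Python package imports cleanly.

    Assumption: the "'soundfile' backend is not available" error is caused by
    this package, or the libsndfile C library it wraps, being missing.
    """
    try:
        # Importing soundfile raises OSError when libsndfile cannot be found,
        # and ImportError when the package itself is not installed.
        import soundfile  # noqa: F401
    except (ImportError, OSError):
        return False
    return True


if soundfile_available():
    print("soundfile imports fine")
else:
    print("soundfile is not usable; try: conda install -c conda-forge libsndfile")
```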

  1. Make sure you have a .wav of your podcast audio. You can use QuickTime or Audacity to convert it. This process doesn't work for mp3.
  2. Run pip3 install git+https://github.com/m-bain/whisperx.git (this will take a couple of minutes). Meanwhile...
  3. Read https://github.com/m-bain/whisperX#voice-activity-detection-filtering--diarization. To enable VAD filtering and diarization, pass your Hugging Face access token (generated in your Hugging Face account settings) after the --hf_token argument, and accept the user agreement for the following models: Segmentation, Voice Activity Detection (VAD), and Speaker Diarization. Make sure to accept them all in your Hugging Face account.
  4. Run whisperx YOUR_AUDIO_FILE.wav --hf_token YOUR_HF_TOKEN_HERE --vad_filter --diarize --min_speakers 3 --max_speakers 3 --language en for 3 speakers in English. Remember, it must be a .wav file.
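If you run this often, the command in step 4 can be wrapped in a small Python helper. This is a minimal sketch, not part of WhisperX: the function names are hypothetical, and it only uses the flags from the command above, which may have changed in newer WhisperX releases. It rejects non-.wav input up front so you don't wait on a run that will fail.

```python
import subprocess
from pathlib import Path


def build_whisperx_command(audio_path, hf_token, num_speakers, language="en"):
    """Build the step 4 command as an argument list (hypothetical helper)."""
    audio = Path(audio_path)
    if audio.suffix.lower() != ".wav":
        # The tutorial notes that this process does not work for mp3.
        raise ValueError(f"expected a .wav file, got {audio.name!r}")
    return [
        "whisperx", str(audio),
        "--hf_token", hf_token,
        "--vad_filter",
        "--diarize",
        # Same value for min and max pins the speaker count, as in step 4.
        "--min_speakers", str(num_speakers),
        "--max_speakers", str(num_speakers),
        "--language", language,
    ]


def transcribe(audio_path, hf_token, num_speakers, language="en"):
    """Run whisperx and raise CalledProcessError if it exits with an error."""
    cmd = build_whisperx_command(audio_path, hf_token, num_speakers, language)
    subprocess.run(cmd, check=True)
```

For example, transcribe("YOUR_AUDIO_FILE.wav", "YOUR_HF_TOKEN_HERE", 3) would run the same command as step 4. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.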


Transcription runs at roughly real time: it takes about 30 seconds to transcribe 30 seconds of audio, so expect it to take about as long as the podcast itself.
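To get a rough wait time before you start, you can read the duration from the .wav header and scale it. This sketch uses only Python's standard wave module. The default factor of 1.0 is an assumption taken from the "30 seconds per 30 seconds" observation above, and your machine may be faster or slower. The function names are made up for illustration.

```python
import wave


def wav_duration_seconds(path):
    """Return the length of a PCM .wav file in seconds."""
    # Note: the standard wave module reads plain PCM WAV files; other WAV
    # encodings may raise wave.Error.
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def estimate_transcription_seconds(path, realtime_factor=1.0):
    """Estimate processing time as audio duration times a speed factor.

    realtime_factor=1.0 is an assumption (about real time); tune it after
    timing one run on your own hardware.
    """
    return wav_duration_seconds(path) * realtime_factor
```

For a one-hour episode this returns about 3600 seconds with the default factor.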


swyxio commented 1 year ago

Related reads:

cmuhire commented 1 year ago

@sw-yx after reading the above I think you'll appreciate this (think DHH's famous "blog in 15 minutes with Rails" but for embedding real-time Whisper audio transcription in a Phoenix app) 🙂 https://www.youtube.com/watch?v=Yd220Te8cHc

Most of the ML community has not yet caught up with the incredible tooling that's been quietly brewing in the Elixir community through projects like the Nx ecosystem and Livebook in just the last 2 (!) years. I expect this to change soon as things come together into a coherent story that showcases the platform's unique value prop (in no small part thanks to the unique guarantees of the BEAM) compared to incumbent stacks like Python/R.

swyxio commented 1 year ago

v cool. will keep a look out!