# tap

`tap` is an audio transcriber for web radio (so far just BBC).
## Requirements

The pre-requisites for installation are:

- `tensorflow-gpu`, or `tensorflow` for CPU-only (required for speech segmentation; recommended to install via pip)

Dependencies are specified in `requirements.txt`:

- `sidekit`
- `matplotlib<3.3.0`, due to the deprecation of the `warn` argument to `matplotlib.use`

Optional additional dependency (for my use): quill.

## Installation

To set up the conda environment and install:
```sh
conda create -y -n tap
conda activate tap
#conda install "cudatoolkit>=11.0,<11.0.221" -c conda-forge
conda install "cudatoolkit<11.2" -c conda-forge
conda install pytorch torchaudio -c pytorch
pip install -r requirements.txt
pip install -e . # or `pip install .` for a fixed installation
# Also run `pip install -e .` for quill if using
```
To install CUDA with conda (and get a managed CUDNN):
```sh
conda create -y -n tap
conda activate tap
conda install cudnn "cudatoolkit<11.2" -c conda-forge
conda install pytorch torchaudio -c pytorch
pip install -r requirements.txt
pip install -e . # or `pip install .` for a fixed installation
# Also run `pip install -e .` for quill if using
```
If using CUDA 11.1 for RTX, you'll also need to hardlink the shared library for `libcusolver` for TensorFlow to work (for the InaSpeechSegmenter step), as documented here:

```sh
cd $CONDA_PREFIX/lib
sudo ln libcusolver.so.11 libcusolver.so.10 # hard link
cd -
```
## Usage

For a given programme, we can make a `Stream` object with its URLs for the day's episode, download ("pull") and segment ("preprocess") the audio ready for transcription, which we kick off immediately:
```py
from tap.stream import load_stream

stream = load_stream(transcribe=True)
```
- Current default programme for `load_stream` is the BBC R4 Today programme.
- Current default value of the `transcribe` argument for `load_stream` is `False`. Setting it to `True` will initiate the transcription immediately upon creating the stream object.
- This step automatically calculates the URLs of the MP4 segments by obtaining the episode ID from the list of available episodes for the stored programme (series) ID.
- In future, when adding a new programme, it will be possible to search for the programme ID, or it could be obtained as the parent ID (given in the episode metadata JSON) and then stored within the module entry for that programme upon its creation in `tap.data.store`.
- To get the URLs for the day before yesterday, pass the `ymd_ago` argument (a tuple), e.g. `load_stream(ymd_ago=(0,0,-2))`, or pass the `ymd` argument (either a `datetime.date` or an integer tuple `(y,m,d)`) for an absolute date, e.g. `load_stream(ymd=(2021,2,8))` (see the sketch after this list).
- The value for `max_s` is crucial to avoiding an out-of-memory error when running the model: the audio file is first split up based on pauses between speakers, but the `max_s` value (a float) sets the maximum number of seconds between the segments (i.e. the maximum duration of audio clips to be transcribed). Default is 50 seconds, based on my experience.
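For example (a minimal sketch combining the arguments above; it assumes `max_s` is accepted by `load_stream` as a keyword, per the description):

```py
from tap.stream import load_stream

# Relative date: the day before yesterday
stream = load_stream(ymd_ago=(0, 0, -2))

# Absolute date, as a (y, m, d) tuple
stream = load_stream(ymd=(2021, 2, 8))

# Cap transcribed clips at 30 seconds rather than the default 50
# (illustrative value; assumes max_s is passed through load_stream)
stream = load_stream(ymd=(2021, 2, 8), max_s=30.0)
```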
The `load_stream` function initialises a `Stream` object, and upon doing so the `Stream.pull()`, `Stream.preprocess()`, and `Stream.transcribe()` methods are called in sequence, to pull the MP4 stream from its URLs, concatenate it into a single output, and convert it to WAV at 16 kHz.
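To make the sequence explicit, here is a sketch based on the method names above (it assumes `pull()` and `preprocess()` run on initialisation and that `transcribe()` can be invoked manually afterwards):

```py
from tap.stream import load_stream

# With the default transcribe=False, pull() and preprocess() still run,
# so transcription can be kicked off separately:
stream = load_stream()
stream.transcribe()  # transcribe the prepared audio segments
```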
After this has been done, the transcript timings for each of the segments are stored in a TSV so that they can be reloaded without having to recompute each time. To reload a stream that's already been transcribed, use `load_stream(reload=True)` (which will reload the segmented audio clips, and the transcripts too if they exist), e.g. for the episode 5 days ago:
```py
from tap.stream import load_stream

stream = load_stream(ymd_ago=(0, 0, -5), reload=True)
```
To summarise the transcripts, we can't just merge them all (due to the token limits of the language models which do the summarisation). To merge the first two transcripts from a stream, pass them to `tap.precis.summarise`:
```py
from tap.stream import load_stream
from tap.precis import summarise

stream = load_stream(ymd_ago=(0, 0, -5), reload=True)
all_transcripts = stream.transcript_timings.transcripts.tolist()
some_transcripts = " ".join(all_transcripts[:2])
summary = summarise(some_transcripts)
```
To process an entire stream, then, we must summarise it in chunks:
```py
from tap.stream import load_stream
from tap.precis import summarise_in_chunks

stream = load_stream(ymd=(2021, 2, 17), reload=True)
all_transcripts = stream.transcript_timings.transcripts.tolist()
summaries, chunk_sizes = summarise_in_chunks(all_transcripts)
```
This is facilitated as a pipeline, writing to a specified output directory:
```py
from tap.stream import load_stream

stream = load_stream(ymd=(2021, 2, 17), reload=True)
stream.export_transcripts(format="txt", out_dir="/path/to/output/")
```
For my personal use, I combine this with quill to build a website:
```py
from tap.stream import load_stream

stream = load_stream(ymd=(2021, 2, 17), reload=True)
stream.export_transcripts(out_format="mmd", domain="poll", single_file=True)
```
The `single_file` option defaults to `False`, but that creates many files (one per transcript, each derived from a chunk of one or more audio segments; around 60 per 3-hour programme). With `single_file=True`, a single file, `transcript_summaries.mmd`, is generated for the web (i.e. a single web page).
There's also a pipeline API version (which loads a 1.22GB DistilBART model, `sshleifer/distilbart-cnn-12-6`).
You can also use T5, which works best up to 512 tokens, even though it won't complain until hitting OOM at 1024 tokens (source).
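For reference, a rough sketch of the underlying HuggingFace call (not necessarily tap's exact wrapper; `all_transcripts` is reused from the examples above):

```py
from transformers import pipeline

# Downloads ~1.22GB of model weights on first use
summariser = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
text = " ".join(all_transcripts[:2])
result = summariser(text, truncation=True)
print(result[0]["summary_text"])
```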
In the final step of preprocessing, the audio is chopped up ("segmented") at 'gaps' (typically, pauses between speech). This is obtained via the INA speech segmenter
First, the audio is labelled as speech/noise/music (by default it will also annotate gender, which in my experience gives more accurate speaker segmentation). While gender assignment is not necessary if we are solely interested in the blanks (annotated as `noEnergy`), obtaining it now means it's unnecessary to recompute later. This creates a TSV something like this:

```csv
labels    start  stop
male      0.0    1.72
noEnergy  1.72   2.32
male      2.32   19.32
noEnergy  19.32  19.78
male      19.78  38.44
noEnergy  38.44  38.82
male      38.82  39.92
noEnergy  39.92  40.5
male      40.5   59.96
```

The benefit of calculating this once on the entire programme is that it's less likely to assign the "no energy" label to the speech immediately at the beginning of an arbitrarily segmented audio clip (e.g. previously I split the programme into 60 second breaks).

Given a minimum window (e.g. 10 seconds), we can segment on these "no energy" pauses. Any segments smaller than this simply get fused together (see the sketch below).

Lastly, a Wav2Vec2 model trained for 960h is loaded from the HuggingFace Hub, and the text produced is annotated onto each segment in the `Stream.transcripts` attribute (which when set adds a column to the `Stream.transcript_timings` DataFrame).
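A minimal sketch of the labelling and segmenting steps, using inaSpeechSegmenter's documented interface (the exact code in tap may differ; the file name and threshold are illustrative):

```py
from inaSpeechSegmenter import Segmenter

# Label the whole programme once: returns (label, start, stop) tuples,
# e.g. ('male', 0.0, 1.72), ('noEnergy', 1.72, 2.32), ...
seg = Segmenter()  # default mode also annotates gender
segmentation = seg("episode.wav")

# Cut at 'noEnergy' pauses, fusing any span shorter than the minimum window
MIN_WINDOW_S = 10.0
pauses = [stop for label, start, stop in segmentation if label == "noEnergy"]
clips, last_cut = [], 0.0
for cut in pauses:
    if cut - last_cut >= MIN_WINDOW_S:  # shorter spans get fused together
        clips.append((last_cut, cut))
        last_cut = cut
# (any audio remaining after the final cut would form one last clip)
```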
The namespace of the channels provides an inventory, so running:

```py
from tap.data.store import channels
```

and typing `channels.` then Tab will enumerate the available channels (i.e. those already stored).
For a given channel, tab completion (in future: command line tab completion) will give the path to a channel » station » programme, e.g. `channels.bbc.r4.today`. The `tap.data.store.channels.bbc.r4.today` module is provided as an example of how to set up a programme for tap to download into.
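As a rough illustration (the attribute names here are assumptions, not tap's actual schema), the key piece such a module stores is the programme (series) ID used to look up the day's episode:

```py
# Hypothetical sketch of tap/data/store/channels/bbc/r4/today.py
channel = "bbc"
station = "r4"
programme = "today"
series_id = "<BBC series PID>"  # placeholder: used to list available episodes
```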