
# dex ⠶ tap

`tap` is an audio transcriber for web radio (so far just BBC).

## Requirements

The pre-requisites for installation (covered by the conda setup below) are PyTorch and torchaudio, along with the CUDA toolkit (and optionally CUDNN) for GPU support.

Dependencies are specified in requirements.txt.

Optional additional dependency (for my use): quill.

### Suggested conda setup

```sh
conda create -y -n tap
conda activate tap
#conda install "cudatoolkit>=11.0,<11.0.221" -c conda-forge
conda install "cudatoolkit<11.2" -c conda-forge
conda install pytorch torchaudio -c pytorch
pip install -r requirements.txt
pip install -e .  # or `pip install .` for a fixed installation
# Also run `pip install -e .` for quill if using
```

To install CUDA with conda (and get a managed CUDNN):

```sh
conda create -y -n tap
conda activate tap
conda install cudnn "cudatoolkit<11.2" -c conda-forge
conda install pytorch torchaudio -c pytorch
pip install -r requirements.txt
pip install -e .  # or `pip install .` for a fixed installation
# Also run `pip install -e .` for quill if using
```

If using CUDA 11.1 for RTX, you'll also need to hard-link the shared library for libcusolver for TensorFlow to work (for the InaSpeechSegmenter step), as documented here:

```sh
cd $CONDA_PREFIX/lib
sudo ln libcusolver.so.11 libcusolver.so.10  # hard link
cd -
```

## Usage

### Stream downloading and reloading from disk

For a given programme, we can make a `Stream` object with the URLs for the day's episode, download ("pull") the audio, and segment ("preprocess") it ready for transcription, which here we kick off immediately:

```py
from tap.stream import load_stream

stream = load_stream(transcribe=True)
```
**More details**

- Current default programme for `load_stream` is the BBC R4 Today programme.
- Current default value of the `transcribe` argument for `load_stream` is `False`. Setting it to `True` will initiate the transcription immediately upon creating the stream object.
- This step automatically calculates the URLs of the MP4 segments by obtaining the episode ID from the list of available episodes for the stored programme (series) ID.
- In future, when adding a new programme, it will be possible to search for the programme ID, or it could be obtained as the parent ID (given in the episode metadata JSON) and then stored within the module entry for that programme upon its creation in `tap.data.store`.
- To get the URLs for the day before yesterday, pass the `ymd_ago` argument (a tuple), e.g. `load_stream(ymd_ago=(0,0,-2))`, or pass the `ymd` argument (either a `datetime.date` or an integer tuple `(y,m,d)`) for an absolute date, e.g. `load_stream(ymd=(2021,2,8))` (see the sketch after this list).
- The value of `max_s` is crucial to avoiding an out-of-memory error when running the model: the audio file is first split up based on pauses between speakers, but the `max_s` value (a float) sets the maximum number of seconds per segment (i.e. the maximum duration of audio clips to be transcribed). The default is 50 seconds, based on my experience.
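
Putting those arguments together, a minimal sketch (note that passing `max_s` as a keyword argument to `load_stream` is an assumption based on the notes above):

```py
from tap.stream import load_stream

# Relative date: a (years, months, days) offset, here 2 days ago
stream = load_stream(ymd_ago=(0, 0, -2))

# Absolute date, capping transcribed clips at 30 seconds
# (`max_s` as a keyword to `load_stream` is an assumption)
stream = load_stream(ymd=(2021, 2, 8), max_s=30.0)
```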

The `load_stream` function initialises a `Stream` object, and upon doing so the `Stream.pull()`, `Stream.preprocess()`, and `Stream.transcribe()` methods are called in sequence, to pull the MP4 stream from its URLs, concatenate the parts into a single output, convert to WAV at 16 kHz, and transcribe the resulting segments.
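
In other words, `load_stream(transcribe=True)` behaves roughly like the following sketch (the `Stream` constructor arguments are elided here):

```py
from tap.stream import Stream

stream = Stream(...)  # URLs of the episode's MP4 segments are calculated
stream.pull()         # download the MP4 stream
stream.preprocess()   # concatenate, convert to 16 kHz WAV, and segment
stream.transcribe()   # transcribe each audio segment
```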

After this has been done, the transcript timings for each of the segments are stored in a TSV so that they can be reloaded without having to recompute each time. To reload a stream that's already been transcribed, use `load_stream(reload=True)` (which will reload the segmented audio clips, and the transcripts too if they exist), e.g. for the episode 5 days ago:

```py
from tap.stream import load_stream

stream = load_stream(ymd_ago=(0,0,-5), reload=True)
```

To summarise the transcripts, we can't just merge them all (due to the token limits of the language models which do the summarisation). To summarise the first two transcripts from a stream, merge them and pass the result to `tap.precis.summarise`:

```py
from tap.stream import load_stream
from tap.precis import summarise

stream = load_stream(ymd_ago=(0,0,-5), reload=True)
all_transcripts = stream.transcript_timings.transcripts.tolist()
some_transcripts = " ".join(all_transcripts[:2])
summary = summarise(some_transcripts)
```

To process an entire stream, then, we must summarise it in chunks:

```py
from tap.stream import load_stream
from tap.precis import summarise_in_chunks

stream = load_stream(ymd=(2021,2,17), reload=True)
all_transcripts = stream.transcript_timings.transcripts.tolist()
summaries, chunk_sizes = summarise_in_chunks(all_transcripts)
```
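
To review the result, assuming `summarise_in_chunks` returns one summary per chunk alongside the number of transcripts that went into each chunk (as the names suggest):

```py
for size, summary in zip(chunk_sizes, summaries):
    print(f"[{size} transcripts] {summary}")
```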

This is facilitated as a pipeline, writing to a specified output directory:

```py
from tap.stream import load_stream

stream = load_stream(ymd=(2021,2,17), reload=True)
stream.export_transcripts(format="txt", out_dir="/path/to/output/")
```

For my personal use, I combine this with quill to build a website:

```py
from tap.stream import load_stream

stream = load_stream(ymd=(2021,2,17), reload=True)
stream.export_transcripts(out_format="mmd", domain="poll", single_file=True)
```

## Preprocessing details

In the final step of preprocessing, the audio is chopped up ("segmented") at 'gaps' (typically, pauses between speech). These gaps are obtained via the INA speech segmenter.
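
For reference, a minimal sketch of the underlying `inaSpeechSegmenter` call (the input filename here is hypothetical):

```py
from inaSpeechSegmenter import Segmenter

# Gender detection is on by default; the call returns a list of
# (label, start_seconds, stop_seconds) tuples
segmenter = Segmenter()
segmentation = segmenter("episode.wav")  # hypothetical input file
```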

**More details**

First, the audio is labelled as speech/noise/music (by default it will also annotate gender, which in my experience gives more accurate speaker segmentation). While gender assignment is not necessary if we are solely interested in the blanks (annotated as `noEnergy`), obtaining it now means it's unnecessary to recompute later.

This creates a TSV something like this:

```csv
labels    start   stop
male      0.0     1.72
noEnergy  1.72    2.32
male      2.32    19.32
noEnergy  19.32   19.78
male      19.78   38.44
noEnergy  38.44   38.82
male      38.82   39.92
noEnergy  39.92   40.5
male      40.5    59.96
```

The benefit of calculating this once on the entire programme is that it's less likely to assign the "no energy" label to the speech immediately at the beginning of an arbitrarily segmented audio clip (e.g. previously I split the programme into 60 second chunks).

Given a minimum window (e.g. 10 seconds) we can segment on these "no energy" pauses. Any segments smaller than this simply get fused together.

Lastly, a Wav2Vec2 model trained for 960h is loaded from the HuggingFace Hub, and the text produced is annotated onto each segment in the `Stream.transcripts` attribute (which when set adds a column to the `Stream.transcript_timings` DataFrame).
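
To illustrate the segmentation rule described above, here is a minimal sketch (a hypothetical helper, not tap's actual implementation) that splits on the `noEnergy` rows of such a table and fuses any segment shorter than the minimum window into its neighbour:

```py
import pandas as pd

def segment_on_pauses(labels: pd.DataFrame, min_s: float = 10.0) -> list:
    """Split at `noEnergy` rows, fusing segments shorter than `min_s` (hypothetical helper)."""
    segments, start = [], 0.0
    for row in labels.itertuples():
        if row.labels == "noEnergy":
            if row.start > start:
                segments.append((start, row.start))  # close the current speech segment
            start = row.stop  # the next segment begins after the pause
    if start < labels.stop.iloc[-1]:
        segments.append((start, labels.stop.iloc[-1]))  # trailing speech
    fused = []
    for seg in segments:
        if fused and (fused[-1][1] - fused[-1][0]) < min_s:
            fused[-1] = (fused[-1][0], seg[1])  # fuse the short segment forwards
        else:
            fused.append(seg)
    return fused

# With the example table above, this gives:
# [(0.0, 19.32), (19.78, 38.44), (38.82, 59.96)]
```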

## Catalogue exploration

The namespace of the channels provides an inventory, so running:

```py
from tap.data.store import channels
```

and typing `channels.<Tab>` will enumerate the available channels (i.e. those already stored).

For a given channel, tab completion (in future: command line tab completion) will give the path to a channel » station » programme, e.g.

```py
channels.bbc.r4.today
```