nypublicradio / audiogram

Turn audio into a shareable video.
MIT License

Better captioning #8

Open veltman opened 8 years ago

veltman commented 8 years ago

Have a mostly-working branch that allows for entering and positioning multiple captions, but the manual entry/interface is a real drag, especially for a long video. Worth exploring some improvements.

Forced aligners?

Using a forced aligner like Gentle to take a bulk transcript and automatically time it to the audio would help - then you could type in the whole thing (or paste from a transcript) and it could automatically break it into chunks.

Pros: Much faster if you have a full transcript already (paste the whole thing rather than pasting line-by-line and tweaking the timing).

Cons: Not much faster if you don't have a transcript. A lot more code complexity (all the OSS aligners seem to be Python). Would probably still need to tweak the captions into sensible breaks (e.g. avoid orphan words).
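As a rough illustration of the "break it into chunks" step, here is a minimal sketch that packs already-timed words into captions and pulls one word forward when the last caption would otherwise be an orphan. It assumes the aligner has produced `(text, start, end)` tuples in seconds; the names `chunk_words`, `Caption`, `max_chars` and `min_words` are hypothetical and not part of Audiogram or any aligner.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed input shape: (text, start_seconds, end_seconds) per word.
Word = Tuple[str, float, float]


@dataclass
class Caption:
    text: str
    start: float
    end: float


def _joined(words: List[Word]) -> str:
    return " ".join(w[0] for w in words)


def chunk_words(words: List[Word], max_chars: int = 40, min_words: int = 2) -> List[Caption]:
    """Greedily pack timed words into captions of at most max_chars characters."""
    groups: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        # Start a new caption when adding this word would overflow the current one.
        if current and len(_joined(current + [word])) > max_chars:
            groups.append(current)
            current = [word]
        else:
            current.append(word)
    if current:
        groups.append(current)

    # Orphan fix: if the final caption is too short, borrow the last word of
    # the previous caption, provided both captions stay within sensible bounds.
    if len(groups) >= 2 and len(groups[-1]) < min_words and len(groups[-2]) > min_words:
        borrowed = groups[-2][-1]
        if len(_joined([borrowed] + groups[-1])) <= max_chars:
            groups[-1].insert(0, groups[-2].pop())

    # Each caption spans from its first word's start to its last word's end.
    return [Caption(_joined(g), g[0][1], g[-1][2]) for g in groups]
```

A real version would probably also want to break on punctuation and pauses rather than on character count alone.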

Auto transcribe

Use some sort of speech-to-text to take a first pass at transcribing the audio. In-browser options include PocketSphinx and the Web Speech API in certain browsers. Server-side options include normal Sphinx or the Watson API.

Pros: Great when it works.

Cons: Doesn't always work, especially for non-English languages or clips with music, background noise, etc. Still doesn't work out timing. If it's server-side, would require a second round-trip before the form submission. Could take a long time for long pieces of audio.

Parse timestamped transcripts?

Could allow people to upload an SRT or some other timecoded transcript format in the editor. The parsing wouldn't be that hard, but it's unclear how often audio orgs use these.
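To give a sense of how small the parsing is, here is a minimal sketch of an SRT reader. It assumes the common SRT layout (a numeric index line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, then one or more text lines, with blank lines between cues); `parse_srt` and `Cue` are hypothetical names, and the sketch ignores styling tags and positioning hints.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    index: int
    start: float  # seconds
    end: float    # seconds
    text: str


# Matches "00:00:02,500 --> 00:00:05,000" (a "." before the millis is tolerated).
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(source: str) -> List[Cue]:
    """Parse SRT text into cues, skipping blocks that do not look like cues."""
    cues: List[Cue] = []
    # Normalise Windows line endings, then split on blank lines between cues.
    blocks = re.split(r"\n\s*\n", source.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.split("\n")
        if len(lines) < 2 or not lines[0].strip().isdigit():
            continue
        match = TIMING.search(lines[1])
        if not match:
            continue
        g = match.groups()
        # Multi-line cue text is joined into a single line.
        text = " ".join(line.strip() for line in lines[2:])
        cues.append(Cue(int(lines[0]), _seconds(*g[:4]), _seconds(*g[4:]), text))
    return cues
```

The resulting cues could then be mapped onto whatever caption structure the editor uses.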

veltman commented 8 years ago

Looks like the Web Speech API doesn't provide any way to connect it to a non-mic source, but PocketSphinx does (with some fiddling).

kookster commented 8 years ago

you could also use other APIs like speechmatics (https://speechmatics.com/), or https://cloud.google.com/speech/ ?

veltman commented 8 years ago

Yup, true - though I'm a little reluctant to rely on an external API rather than something that can be bundled (ditto Watson).

pietrop commented 8 years ago

Hey @veltman, Gentle could be modified to generate a transcription when the text is not available. This already works in the REST API: in the curl example, if you don't pass the text file, it returns a transcription. It doesn't work from the Python terminal command, though. The code would need to be modified accordingly, which is something I am looking into.

I also played around with PocketSphinx, packaging it as a Node module: https://github.com/OpenNewsLabs/offline_speech_to_text. I extracted it from the videogrep Electron app.

iankevinmcdonald commented 7 years ago

Considering that the effective maximum on social media is 30s, I think that expecting users to supply a transcript is absolutely fine.

It doesn't scale to generating complete videos from long-form shows, but I think that's acceptable - it's still a big benefit for most uses.

I'm a one-person band working on my own community/radio niche narrative history series, and I've used SRT, using a free online manual transcriber (called, originally enough, "Transcriber"). Though I'm about as unrepresentative as you can possibly get.

pietrop commented 7 years ago

For the SRT option, I've written an SRT parser/composer that is also on npm.

It can be used to parse the SRT into word-accurate JSON (the original code to make it word-accurate comes from the popcorn.js SRT parsing module, also on GitHub). With that, it is possible to make a "hyper transcript" where the user can make word-accurate selections. I've done something similar in quickQuote (now refactored in Node and in autoEdit), inspired by the hyperaudio project.

pettarin commented 7 years ago

Shameless plug, I hope you find it informative.

I maintain a Python/C forced aligner called aeneas ( http://www.readbeyond.it/aeneas/ and https://github.com/readbeyond/aeneas/ ). Unlike Gentle and basically all other forced aligners out there, its approach is not based on speech recognition but on an older technique known as Dynamic Time Warping. It works decently well (and much faster) if you align text at sentence/phrase level, but it is worse at word level. Its real-time factor (ratio between processing time and real audio length) is between 0.005 and 0.02, depending on the parameters and machine CPU, since all the computational parts are written in C.
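For readers unfamiliar with Dynamic Time Warping, here is a minimal textbook sketch of the idea. It is not aeneas's implementation: it aligns two toy sequences of single numbers rather than real audio features, and `dtw_path` is a hypothetical name.

```python
from typing import List, Sequence, Tuple


def dtw_path(a: Sequence[float], b: Sequence[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the minimal alignment cost and the index pairs of the warping path."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest way to align the first i items of a with the first j of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            distance = abs(a[i - 1] - b[j - 1])
            cost[i][j] = distance + min(
                cost[i - 1][j - 1],  # advance in both sequences
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
            )

    # Walk back from the end, always taking the cheapest predecessor.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path
```

In a forced-alignment setting the two sequences would be frames of audio features instead of plain numbers, and the warping path is what maps positions in one sequence to times in the other.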

(In theory, one can port the core of aeneas to C, and from there to JS, via emscripten. It is a huge task, but it would enable decently fast alignment in JS land. Unfortunately, I have not had time/resources to do it.)

BTW, I maintain a list of forced aligners here: https://github.com/pettarin/forced-alignment-tools

pietrop commented 6 years ago

In case anyone is still looking into this: it turns out that @martymcguire did a write-up describing how he modified the BBC News Labs fork of Audiogram to work with Gentle Speech To Text Forced Aligner output. See his repo here.
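As a rough sketch of what consuming aligner output could look like (this is not taken from that repo), the snippet below pulls word timings out of a Gentle-style JSON document. It assumes the output has a top-level `words` list whose aligned entries carry `word`, `start`, `end` and `case == "success"`; treat that shape as an assumption to check against the Gentle version you run. `timed_words` is a hypothetical name.

```python
import json
from typing import List, Tuple


def timed_words(gentle_json: str) -> List[Tuple[str, float, float]]:
    """Extract (word, start, end) for every word the aligner managed to place."""
    payload = json.loads(gentle_json)
    timed: List[Tuple[str, float, float]] = []
    for entry in payload.get("words", []):
        # Assumed shape: unaligned words have a different "case" and no timings,
        # so they are skipped here rather than given guessed timestamps.
        if entry.get("case") == "success" and "start" in entry and "end" in entry:
            timed.append((entry["word"], float(entry["start"]), float(entry["end"])))
    return timed
```

Words the aligner could not place would still need a fallback, for example attaching them to the nearest timed neighbour.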