openstenoproject / plover

Open source stenotype engine
http://opensteno.org/plover
GNU General Public License v2.0

YouTube live captioning plugin #902

morinted opened this issue 6 years ago (status: Open)

morinted commented 6 years ago

A Plover plugin could be made to interface with YouTube live streams through their captions API available here:

https://support.google.com/youtube/answer/6077032?&ref_topic=2853697#
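For reference, here is a minimal sketch of what a caption POST might look like, assuming (as the support page's examples suggest) a body consisting of a millisecond-precision timestamp line followed by the caption text, plus an incrementing `seq` query parameter. The helper names and exact body layout are assumptions to verify against the documentation:

```python
# Sketch of posting one caption to a YouTube live-caption ingestion URL.
# Assumed body format: one ISO-style timestamp line, then the caption text.
from datetime import datetime, timezone
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def format_caption_body(text, when=None):
    """Build a POST body: a timestamp line, then the caption text."""
    when = when or datetime.now(timezone.utc)
    stamp = when.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3]  # millisecond precision
    return "{}\n{}\n".format(stamp, text)


def post_caption(ingestion_url, seq, text):
    """POST one caption; returns the HTTP status code."""
    url = "{}&{}".format(ingestion_url, urlencode({"seq": seq}))
    body = format_caption_body(text).encode("utf-8")  # docs say UTF-8
    request = Request(url, data=body, headers={"Content-Type": "text/plain"})
    with urlopen(request) as response:
        return response.status
```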

Some high-level acceptance criteria:

A good implementation would likely be a Tool plugin. For an example of a Tool plugin, see the Plover WPM Meter.
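Independent of the GUI side, the core of such a plugin would be mirroring Plover's text output into a caption buffer. A rough sketch, assuming Plover's `send_string`/`send_backspaces` engine hooks (the hook names and wiring should be checked against the plugin documentation):

```python
# Output-tracking core for a hypothetical captioning Tool plugin.
# A real plugin would connect these methods to the engine, e.g. something
# like engine.hook_connect('send_string', buf.send_string) -- an assumption
# to verify against Plover's plugin docs.
class CaptionBuffer:
    SENTENCE_ENDINGS = ".?!"

    def __init__(self, on_sentence):
        self._text = ""
        self._on_sentence = on_sentence  # called with each finished sentence

    def send_string(self, s):
        """Mirror text Plover sends to the OS."""
        self._text += s
        self._flush_sentences()

    def send_backspaces(self, n):
        """Mirror corrections (asterisk strokes) as deletions."""
        if n:
            self._text = self._text[:-n]

    def _flush_sentences(self):
        # Emit everything up to the last sentence-ending punctuation mark.
        cut = max(self._text.rfind(c) for c in self.SENTENCE_ENDINGS)
        if cut >= 0:
            self._on_sentence(self._text[:cut + 1].strip())
            self._text = self._text[cut + 1:]
```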

SeaLiteral commented 6 years ago

How about also being able to read files and cue pre-written sentences when the stream has scripted parts?

SeaLiteral commented 6 years ago

I was going to put this in the chat, but I just kept on writing, so it got long and it's barely edited. I think the redundancy may actually make it easier to understand, though:

There's also the question of how to synchronize the captions with the video in a sensible way. For offline captions these things are easiest to do manually, but for something live, I think you need rules like these:

- Show one sentence at a time, but merge short sentences into longer captions when they fit in two lines of fewer than N (typically 40) characters each.
- Simulate roll-up captions if you think that will let the captions lag less behind the video.
- Maybe even take hints from commas as to where to place line breaks.
- Some words look odd at the start of lines and others look odd at the end. It's usually bad to end a line with a preposition (at, to, in, on...) or an article (a, an, the, these, those, this, that, when followed by something other than a punctuation mark).
- If a first name (John) ends one line, you don't want the next line to start with the last name (Smith).
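The merging and line-break heuristics above could be sketched roughly like this (the word lists are illustrative, not exhaustive):

```python
# Greedily fill caption lines of at most MAX_CHARS characters, but avoid
# ending a line on a preposition or article by carrying such words down to
# the next line.
MAX_CHARS = 40
BAD_LINE_ENDERS = {"a", "an", "the", "at", "to", "in", "on",
                   "this", "that", "these", "those"}


def break_into_lines(text, max_chars=MAX_CHARS):
    lines = []
    current = []
    for word in text.split():
        if current and len(" ".join(current + [word])) > max_chars:
            carried = []
            # Pull trailing prepositions/articles down to the next line.
            while len(current) > 1 and current[-1].lower() in BAD_LINE_ENDERS:
                carried.insert(0, current.pop())
            lines.append(" ".join(current))
            current = carried
        current.append(word)
    if current:
        lines.append(" ".join(current))
    return lines
```

Rules about names (John / Smith) would need a lexicon or capitalization heuristics, so they are left out of this sketch.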

Also, if some part of the video is prerecorded (as sometimes happens on TV), maybe we could play back prewritten captions from a file, fetching timecodes from it as well.
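A minimal sketch of that playback idea, assuming a made-up cue file format of one "MM:SS.mmm text" line per caption (a real implementation might read SRT or WebVTT instead):

```python
# Parse timecoded cue lines and look up which captions are due at a given
# point in the stream. The cue format here is invented for illustration.
def parse_cues(lines):
    """Parse lines like '00:03.500 Hello there' into (seconds, text) pairs."""
    cues = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        stamp, text = line.split(None, 1)
        minutes, seconds = stamp.split(":")
        cues.append((int(minutes) * 60 + float(seconds), text))
    return sorted(cues)


def due_captions(cues, elapsed_seconds):
    """Return the captions whose timecodes have already passed."""
    return [text for at, text in cues if at <= elapsed_seconds]
```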

SeaLiteral commented 6 years ago

And do we know what to put in those region fields? The most obvious meaning would be simultaneously sending out captions in several languages, but the example provided says "reg1cue1", which doesn't really look like a language thing. Of course, "region" could be an odd way of naming different parts of the screen, in case you want to show some captions at the top of the video rather than at the bottom, but if that were the case, they'd probably have documented it. The region field is optional, but if it could do any of those things, some people might find it useful.

SeaLiteral commented 6 years ago

Do programmers understand everything on that page? As a translator I probably know more about subtitles for translation than about captions for accessibility, but I can tell it wasn't written by a captioner. By that I mean it describes what the caption "signal" looks like but says too little about what it actually means: each caption has a time, which I assume is a start time. So how long do captions stay on screen? That's not explained, and if we look at live captions on TV, that varies a lot between countries.

The examples have the caption text in all caps and, most of the time, one caption per word. But the page says captions should be UTF-8 encoded (which means there should be no technical reason to uppercase everything), so maybe that's just to make the text stand out from the protocol-specific parts. And there are a lot of situations where sending out one word at a time would seem like a bad idea: just imagine the stream has some scripted parts and you can't have pre-made captions for those parts. And if they're showing these as roll-up captions, they should either say so or change the first example to "I'm sending several segments". Otherwise it looks like some captions are shown for less than a second, and I don't think people would find that comfortable to read. Also, they do mention newlines in the captions, and those are much more useful for pop-on captions than for roll-up ones.

And it's obviously further complicated by the fact that it's technically possible to send one type of caption through a protocol designed for the other: you can send pop-on captions over a roll-up protocol by starting each caption with two newline characters (it might even be possible to blank the screen with a newlines-only caption), and you can do the reverse by sending one caption per segment ("This is//This is an//This is an example" and so on).
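Both tricks are easy to sketch:

```python
# Simulating pop-on captions over a roll-up protocol, and roll-up captions
# over a pop-on protocol, as described above.
def popon_over_rollup(caption):
    """Push earlier roll-up lines off screen by leading with two newlines."""
    return "\n\n" + caption


def rollup_over_popon(words):
    """Cumulative segments: each caption repeats everything sent so far."""
    return [" ".join(words[:i]) for i in range(1, len(words) + 1)]
```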

nsmarkop commented 6 years ago

I imagine most of these things will be cleared up once someone takes the time to start implementing this functionality. How long captions stay on screen, how subsequent POSTs are merged for display, what the region fields do if you put different data in them, etc. are all things we can find out through experimentation, so having them listed here will be a good reference for whoever digs in. I imagine YouTube implements a lot of its own guidelines and structure for how everything works, so we don't need to worry too much about the fine details, but we'll have to see.

If we end up not being able to figure something out, we can narrow down our questions and write a succinct email (long texts usually don't encourage a good response in these situations) to the address they list for clarification: yt-live-caps-support@google.com