openstenoproject / plover

Open source stenotype engine
http://opensteno.org/plover
GNU General Public License v2.0

YouTube live captioning plugin #902

morinted opened this issue 6 years ago (status: Open)

morinted commented 6 years ago

A Plover plugin could be made to interface with YouTube live streams through their captions API available here:

https://support.google.com/youtube/answer/6077032?&ref_topic=2853697#
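For reference, here is a minimal sketch of what a caption POST might look like, assuming (as the support page's examples suggest) a body consisting of a millisecond-precision timestamp line followed by the caption text, plus an incrementing `seq` query parameter. The helper names and exact body layout are assumptions to verify against the documentation:

```python
# Sketch of posting one caption to a YouTube live-caption ingestion URL.
# Assumed body format: one ISO-style timestamp line, then the caption text.
from datetime import datetime, timezone
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def format_caption_body(text, when=None):
    """Build a POST body: a timestamp line, then the caption text."""
    when = when or datetime.now(timezone.utc)
    stamp = when.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3]  # millisecond precision
    return "{}\n{}\n".format(stamp, text)


def post_caption(ingestion_url, seq, text):
    """POST one caption; returns the HTTP status code."""
    url = "{}&{}".format(ingestion_url, urlencode({"seq": seq}))
    body = format_caption_body(text).encode("utf-8")  # docs say UTF-8
    request = Request(url, data=body, headers={"Content-Type": "text/plain"})
    with urlopen(request) as response:
        return response.status
```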

Some high-level acceptance criteria:

A good implementation would likely be a Tool plugin. For an example of a Tool plugin, see the Plover WPM Meter.
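Independent of the GUI side, the core of such a plugin would be mirroring Plover's text output into a caption buffer. A rough sketch, assuming Plover's `send_string`/`send_backspaces` engine hooks (the hook names and wiring should be checked against the plugin documentation):

```python
# Output-tracking core for a hypothetical captioning Tool plugin.
# A real plugin would connect these methods to the engine, e.g. something
# like engine.hook_connect('send_string', buf.send_string) -- an assumption
# to verify against Plover's plugin docs.
class CaptionBuffer:
    SENTENCE_ENDINGS = ".?!"

    def __init__(self, on_sentence):
        self._text = ""
        self._on_sentence = on_sentence  # called with each finished sentence

    def send_string(self, s):
        """Mirror text Plover sends to the OS."""
        self._text += s
        self._flush_sentences()

    def send_backspaces(self, n):
        """Mirror corrections (asterisk strokes) as deletions."""
        if n:
            self._text = self._text[:-n]

    def _flush_sentences(self):
        # Emit everything up to the last sentence-ending punctuation mark.
        cut = max(self._text.rfind(c) for c in self.SENTENCE_ENDINGS)
        if cut >= 0:
            self._on_sentence(self._text[:cut + 1].strip())
            self._text = self._text[cut + 1:]
```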

SeaLiteral commented 6 years ago

How about also being able to read files and cue pre-written sentences when the stream has scripted parts?

SeaLiteral commented 6 years ago

I was going to put this in the chat, but I just kept on writing, so it got long and it's barely edited. I think the redundancy may actually make it easier to understand, though:

There's also the question of how to synchronize the captions with the video in a sensible way. For offline captions these things are easiest to do manually, but for something live, I think you need rules like these:

- Show one sentence at a time, but merge short sentences into longer captions when they fit in two lines of fewer than N (typically 40) characters each.
- Simulate roll-up captions if you think that will let the captions lag less behind the video.
- Maybe even take hints from commas as to where to place line breaks.
- Some words look odd at the start of lines and others look odd at the end. It's usually bad to end a line with a preposition (at, to, in, on...) or an article (a, an, the, these, those, this, that, when followed by something other than a punctuation mark).
- If a first name (John) ends one line, you don't want the next line to start with the last name (Smith).
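The merging and line-break heuristics above could be sketched roughly like this (the word lists are illustrative, not exhaustive):

```python
# Greedily fill caption lines of at most MAX_CHARS characters, but avoid
# ending a line on a preposition or article by carrying such words down to
# the next line.
MAX_CHARS = 40
BAD_LINE_ENDERS = {"a", "an", "the", "at", "to", "in", "on",
                   "this", "that", "these", "those"}


def break_into_lines(text, max_chars=MAX_CHARS):
    lines = []
    current = []
    for word in text.split():
        if current and len(" ".join(current + [word])) > max_chars:
            carried = []
            # Pull trailing prepositions/articles down to the next line.
            while len(current) > 1 and current[-1].lower() in BAD_LINE_ENDERS:
                carried.insert(0, current.pop())
            lines.append(" ".join(current))
            current = carried
        current.append(word)
    if current:
        lines.append(" ".join(current))
    return lines
```

Rules about names (John / Smith) would need a lexicon or capitalization heuristics, so they are left out of this sketch.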

Also, if some part of the video is prerecorded (as sometimes happens on TV), maybe we could play back prewritten captions from a file, fetching timecodes from it as well.
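A minimal sketch of that playback idea, assuming a made-up cue file format of one "MM:SS.mmm text" line per caption (a real implementation might read SRT or WebVTT instead):

```python
# Parse timecoded cue lines and look up which captions are due at a given
# point in the stream. The cue format here is invented for illustration.
def parse_cues(lines):
    """Parse lines like '00:03.500 Hello there' into (seconds, text) pairs."""
    cues = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        stamp, text = line.split(None, 1)
        minutes, seconds = stamp.split(":")
        cues.append((int(minutes) * 60 + float(seconds), text))
    return sorted(cues)


def due_captions(cues, elapsed_seconds):
    """Return the captions whose timecodes have already passed."""
    return [text for at, text in cues if at <= elapsed_seconds]
```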

SeaLiteral commented 6 years ago

And do we know what to put in those region fields? The most obvious meaning would be simultaneously sending out captions in several languages, but the example provided says "reg1cue1", which doesn't really look like a language thing. Of course, "region" could be an odd way of naming different parts of the screen, in case you want to show some captions at the top of the video rather than at the bottom, but if that were the case, they'd probably have documented it. The region field is optional, but if it could do any of those things, some people might find it useful.

SeaLiteral commented 6 years ago

Do programmers understand everything on that page? As a translator I probably know more about subtitles for translation than about captions for accessibility, but I can tell it wasn't written by a captioner. By that I mean it describes what the caption "signal" looks like but says too little about what it actually means: each caption has a time, which I assume is a start time. So how long do captions stay on screen? That's not explained, and if we look at live captions on TV, that varies a lot between countries.

The examples have the caption text in all caps and, most of the time, one caption per word. But the page says captions should be UTF-8 encoded (which means there should be no technical reason to uppercase everything), so maybe that's just to make the text stand out from the protocol-specific parts. And there are a lot of situations where sending out one word at a time would seem like a bad idea: just imagine the stream has some scripted parts and you can't have pre-made captions for those parts. And if they're showing these as roll-up captions, they should either say so or change the first example to "I'm sending several segments". Otherwise it looks like some captions are shown for less than a second, and I don't think people would find that comfortable to read. Also, they do mention newlines in the captions, and those are much more useful for pop-on captions than for roll-up ones.

And it's obviously further complicated by the fact that it's technically possible to send one type of caption through a protocol designed for the other: you can send pop-on captions over a roll-up protocol by starting each caption with two newline characters (it might even be possible to blank the screen with a newlines-only caption), and you can do the reverse by sending one caption per segment ("This is//This is an//This is an example" and so on).
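Both tricks are easy to sketch:

```python
# Simulating pop-on captions over a roll-up protocol, and roll-up captions
# over a pop-on protocol, as described above.
def popon_over_rollup(caption):
    """Push earlier roll-up lines off screen by leading with two newlines."""
    return "\n\n" + caption


def rollup_over_popon(words):
    """Cumulative segments: each caption repeats everything sent so far."""
    return [" ".join(words[:i]) for i in range(1, len(words) + 1)]
```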

nsmarkop commented 6 years ago

I imagine most of these things will be cleared up once someone takes the time to start implementing this functionality. How long captions stay on screen, how subsequent POSTs are merged for display, what the region fields do if you put different data in them, etc. are all things we can find out through experimentation, so having them listed here will be a good reference for whoever digs in. I imagine YouTube implements a lot of its own guidelines and structure for how everything works, so we don't need to worry too much about the fine details, but we'll have to see.

If we end up not being able to figure something out, we can narrow down our questions and write a succinct email (long texts usually don't encourage a good response in these situations) to the address they list for clarification: yt-live-caps-support@google.com