tvkitchen / appliances

A one stop shop for official TV Kitchen Appliances
GNU Lesser General Public License v3.0

Improve caption extraction to handle corrections #133

Closed by slifty 3 years ago

slifty commented 3 years ago

Task

Description

In order to get letter-by-letter captions we use the CCExtractor rollup feature, which extracts each new caption line as it comes in.

This works well for some captions, but in cases where captions are being corrected it can make a huge mess:

image

We need to figure out how to detect when this kind of correction is happening and either (A) ignore the corrections or (B) handle them and emit them in a more effective way (e.g. maybe a data flag in the emitted payload marks the new atom as a correction).
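One possible way to detect a correction, sketched in Python for brevity: compare each new snapshot of the rollup line with the previous one. If the previous snapshot is no longer a prefix of the new one, something that was already emitted has changed. The names `RollupUpdate` and `diff_rollup` are hypothetical and are not part of CCExtractor or TV Kitchen; this assumes we can see successive snapshots of the same line.

```python
from dataclasses import dataclass


@dataclass
class RollupUpdate:
    """Hypothetical description of how one line snapshot differs from the last."""
    is_correction: bool  # True when already-emitted text was changed
    keep: int            # number of leading characters that are still valid
    text: str            # characters that follow the kept prefix


def diff_rollup(previous: str, current: str) -> RollupUpdate:
    """Compare two successive snapshots of the same caption line."""
    # Find the length of the longest common prefix.
    keep = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        keep += 1
    # If the whole previous snapshot survived, this is a plain append.
    is_correction = keep < len(previous)
    return RollupUpdate(is_correction, keep, current[keep:])


if __name__ == "__main__":
    print(diff_rollup("HELLO WOR", "HELLO WORLD"))
    print(diff_rollup("THE WHETHER", "THE WEATHER"))
```

With something like this, option (A) would drop updates where `is_correction` is true, and option (B) would pass the flag along with the emitted atom.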

Relevant Resources / Research

None yet

slifty commented 3 years ago

I bought the closed captioning handbook and read through it. There is a section about control characters which exist in the CC spec. These allow a broadcaster to move the cursor back in a given rollup line, or wipe out all characters after a certain point in the buffer.
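To make the control character behavior concrete, here is a rough simulation of a single caption row. The event names (`"char"`, `"backspace"`, `"cursor"`, `"erase_to_end"`) are made up for illustration and do not correspond to real byte codes from the spec or to anything CCExtractor exposes; it is a simplified model under those assumptions.

```python
def typed(text):
    """Turn a string into a list of hypothetical 'char' events."""
    return [("char", c) for c in text]


def apply_events(events):
    """Replay simplified caption events onto one row and return its text."""
    row = []    # characters currently on the row
    cursor = 0  # column where the next character will be written
    for kind, value in events:
        if kind == "char":
            # Write at the cursor, overwriting whatever is already there.
            if cursor < len(row):
                row[cursor] = value
            else:
                row.append(value)
            cursor += 1
        elif kind == "backspace":
            # Step back one column and erase that character.
            if cursor > 0:
                cursor -= 1
                if cursor == len(row) - 1:
                    row.pop()
                else:
                    row[cursor] = " "
        elif kind == "cursor":
            # Move the cursor to a column, clamped to the existing text.
            cursor = max(0, min(value, len(row)))
        elif kind == "erase_to_end":
            # Wipe everything from the cursor to the end of the row.
            del row[cursor:]
    return "".join(row)


if __name__ == "__main__":
    events = typed("THE WHETHER") + [("cursor", 5), ("erase_to_end", None)] + typed("EATHER")
    print(apply_events(events))
```

A consumer that only sees the characters and not the cursor or erase events would end up with both the original and the corrected text glued together, which matches the kind of mess described above.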

CCExtractor knows how to process these characters, but so far I don't see any indication that it will emit them. Really the issue is that CCExtractor doesn't directly handle the "stream of individual characters" use case. It does the magic behind the scenes, and what TV Kitchen has done is take all of that processing and then go back to simulate a rollup.

This is all a long way of saying that I believe what we need to do is ditch the simulated rollup, and instead just have the caption extractor emit payloads one line at a time. I think it can still break the lines into ATOM payloads, which will keep us more backwards compatible down the line if we decide to ditch CCExtractor and parse caption streams directly.
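As a rough sketch of what "one line at a time, still broken into atoms" could look like: take a finished line with its start and end time, and spread the atoms evenly across that window. Everything here is an assumption for illustration: the `Payload` fields, the `"TEXT.ATOM"` type string, treating each whitespace-separated word as one atom, and the even time split. It is not the appliance's actual code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Payload:
    """Hypothetical stand-in for a TV Kitchen payload."""
    data: str
    type: str
    position: int  # milliseconds from the start of the stream
    duration: int  # milliseconds


def line_to_atoms(line: str, start_ms: int, end_ms: int) -> List[Payload]:
    """Split one finished caption line into evenly spaced atom payloads."""
    words = line.split()
    if not words:
        return []
    # Give every atom an equal share of the line's time window.
    step = (end_ms - start_ms) // len(words)
    return [
        Payload(data=word, type="TEXT.ATOM",
                position=start_ms + index * step, duration=step)
        for index, word in enumerate(words)
    ]


if __name__ == "__main__":
    for atom in line_to_atoms("THE WEATHER TODAY", 1000, 1900):
        print(atom)
```

Because the line is only split after CCExtractor has finished with it, any corrections are already applied by the time atoms are emitted. The trade-off is that atom timing is interpolated rather than observed.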

By doing this I believe we will fix this bug, and possibly #139 as well.