tvkitchen / appliances

A one stop shop for official TV Kitchen Appliances
GNU Lesser General Public License v3.0

Cleaning up SRTs #110

Open slifty opened 3 years ago

slifty commented 3 years ago

Task

Description

SRTs sometimes contain a lot of extra whitespace. It is part of the caption data, but we can clean it up in our SRT payloads.

00:14:41,825 --> 00:14:44,444
         tuberculosis,      including infections,

00:14:44,444 --> 00:14:44,444

00:14:44,444 --> 00:14:46,029
        nervous system and    lymphoma, other cancers,

00:14:46,029 --> 00:14:47,581
     and allergic reactions

I think it would be reasonable to (A) trim the whitespace at the start and end and (B) collapse each run of whitespace (\s+) into a single space.

Note that this is not about changing the captions themselves -- it only applies to the SRT appliance.
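
A minimal sketch of the trim-and-collapse cleanup proposed above, written in Python purely for illustration. The function name `clean_caption_text` is hypothetical and is not part of the appliance; the sketch assumes the caption text arrives as a plain string. Note that `\s+` also matches newlines, so a multi-line caption would be flattened onto one line; that may or may not be desirable.

```python
import re


def clean_caption_text(text: str) -> str:
    """Hypothetical helper: normalize whitespace in one caption's text."""
    # (B) collapse every run of whitespace into a single space,
    # then (A) trim whatever is left at the start and end.
    return re.sub(r"\s+", " ", text).strip()


print(clean_caption_text("         tuberculosis,      including infections,"))
# prints "tuberculosis, including infections,"
```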

slifty commented 3 years ago

We should also not emit an SRT if it holds no content.
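
As a rough sketch of that rule (Python, for illustration only): the `Cue` type and `non_empty_cues` function below are hypothetical names, and the sketch assumes each SRT entry is available as a start time, an end time, and a text string. An entry whose text is empty or whitespace-only is simply skipped.

```python
from typing import Iterable, Iterator, NamedTuple


class Cue(NamedTuple):
    """Hypothetical stand-in for one SRT entry."""
    start: str
    end: str
    text: str


def non_empty_cues(cues: Iterable[Cue]) -> Iterator[Cue]:
    """Yield only the cues that contain visible text."""
    for cue in cues:
        # str.strip() returns "" for empty or whitespace-only text,
        # which is falsy, so those cues are never emitted.
        if cue.text.strip():
            yield cue


cues = [
    Cue("00:14:41,825", "00:14:44,444", "tuberculosis, including infections,"),
    Cue("00:14:44,444", "00:14:44,444", ""),
    Cue("00:14:44,444", "00:14:46,029", "nervous system and lymphoma, other cancers,"),
]
kept = list(non_empty_cues(cues))
print(len(kept))  # prints 2
```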

slifty commented 3 years ago

The whitespace issue could also be handled by the caption extractor, since that is where it is introduced.

I believe the whitespace may be a byproduct of position estimation / screen rendering. The caption extractor has really been framed as a way to extract ASCII / transcripts from a stream.

The reason to do the fix in the SRT appliance is that an SRT absolutely doesn't want the whitespace, whereas there could be a downstream use case for caption extractor data that wants the raw captions as they were originally encoded.