Normalize and de-dupe common transcript formats

simonw commented 2 years ago

vtt in particular is tricky:

WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:01.829 align:start position:0%

my<00:00:00.160><c> career</c><00:00:00.480><c> in</c><00:00:00.640><c> side</c><00:00:00.880><c> projects</c><00:00:01.280><c> and</c><00:00:01.520><c> open</c>

00:00:01.829 --> 00:00:01.839 align:start position:0%
my career in side projects and open

00:00:01.839 --> 00:00:04.550 align:start position:0%
my career in side projects and open
source<00:00:02.240><c> basically</c><00:00:02.800><c> this</c><00:00:03.040><c> is</c><00:00:03.199><c> so</c><00:00:03.360><c> i've</c><00:00:03.600><c> been</c>

00:00:04.550 --> 00:00:04.560 align:start position:0%
source basically this is so i've been

00:00:04.560 --> 00:00:07.349 align:start position:0%
source basically this is so i've been
knocking<00:00:04.880><c> around</c><00:00:05.279><c> with</c><00:00:05.680><c> um</c><00:00:06.399><c> uh</c><00:00:06.799><c> side</c><00:00:07.040><c> projects</c>

00:00:07.349 --> 00:00:07.359 align:start position:0%
knocking around with um uh side projects

github-actions[bot] commented 2 years ago

🚫 No URL found in issue body.

simonw commented 2 years ago

It looks to me like I might be able to get away with simply ignoring blocks in this file which have the <c> stuff in them.

simonw commented 2 years ago

Might not quite be that simple:

00:02:25.760 --> 00:02:27.830 align:start position:0%
luckily for me i hadn't gone to
university<00:02:26.160><c> yet</c><00:02:26.319><c> so</c><00:02:26.480><c> when</c><00:02:26.720><c> the</c><00:02:26.879><c> entire.com</c>

00:02:27.830 --> 00:02:27.840 align:start position:0%
university yet so when the entire.com

00:02:27.840 --> 00:02:28.470 align:start position:0%
university yet so when the entire.com
industry

00:02:28.470 --> 00:02:28.480 align:start position:0%
industry

00:02:28.480 --> 00:02:30.550 align:start position:0%
industry
crashed<00:02:28.959><c> i</c><00:02:29.120><c> could</c><00:02:29.520><c> go</c><00:02:29.760><c> and</c><00:02:29.920><c> shelter</c><00:02:30.400><c> in</c>

00:02:30.550 --> 00:02:30.560 align:start position:0%
crashed i could go and shelter in

00:02:30.560 --> 00:02:31.670 align:start position:0%
crashed i could go and shelter in
academia<00:02:31.200><c> for</c>

00:02:31.670 --> 00:02:31.680 align:start position:0%
academia for

If I ignore any blocks that contain a <c> I still get this:

00:02:27.830 --> 00:02:27.840 align:start position:0%
university yet so when the entire.com

00:02:27.840 --> 00:02:28.470 align:start position:0%
university yet so when the entire.com
industry

00:02:28.470 --> 00:02:28.480 align:start position:0%
industry

00:02:30.550 --> 00:02:30.560 align:start position:0%
crashed i could go and shelter in

00:02:31.670 --> 00:02:31.680 align:start position:0%
academia for

This bit still shows up twice:

university yet so when the entire.com
industry

Maybe I can remove duplicate lines though?
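
A minimal sketch of that idea, assuming the leftover text lines have already been pulled out of the file as a plain list of strings (the `lines` list and the `collapse_repeats` name are illustrative, not from any library). It collapses each run of identical consecutive lines down to a single line:

```python
from itertools import groupby

# Text lines left over after dropping the blocks that contain <c> tags
# (timestamp lines and blank lines already removed).
lines = [
    "university yet so when the entire.com",
    "university yet so when the entire.com",
    "industry",
    "industry",
    "crashed i could go and shelter in",
    "academia for",
]


def collapse_repeats(lines):
    """Keep the first line of every run of identical consecutive lines."""
    # groupby() groups adjacent equal items; the key of each group is the line.
    return [line for line, _run in groupby(lines)]


print("\n".join(collapse_repeats(lines)))
```

Only adjacent repeats are collapsed, so a phrase that is genuinely spoken again later in the video is kept.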

simonw commented 2 years ago

This SORT of works:

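import webvtt

path = "/Users/simon/Dropbox/Development/transcribe-videos/16/auto/Simon Willison： My career in side projects and open source [wqjye4QnWK0].en.vtt"
captions = webvtt.read(path)

prev_line = None
for c in captions:
    # Skip cues that carry word-level <c> timing markup.
    # c._lines is a private attribute, used here because it still contains that markup.
    if any("<c>" in line for line in c._lines):
        continue
    for line in c._lines:
        if not line.strip():
            continue
        # Print a line only if it differs from the previous non-blank line
        if prev_line != line:
            print(line)
        prev_line = line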

simonw commented 2 years ago

Started a new repo: https://github.com/simonw/webvtt-to-json

simonw commented 2 years ago

Released first version of this: https://pypi.org/project/webvtt-to-json/

simonw commented 2 years ago

Need a similar tool for:

https://github.com/simonw/transcribe-videos/blob/8c11be27f1aa5cbdebc492d82cfa7d64b5f491a3/17/auto/Simon%20Willison%EF%BC%9A%20My%20career%20in%20side%20projects%20and%20open%20source%20%5Bwqjye4QnWK0%5D.en.ttml#L13-L21

https://en.wikipedia.org/wiki/Timed_Text_Markup_Language
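
A rough sketch of what such a tool might do, using only the standard library. The `SAMPLE` document is invented for illustration (it is not copied from the linked file), and the `ttml_to_records` name and the `start`/`end`/`text` keys are assumptions rather than the output format of any published tool. It relies on TTML putting each caption in a `<p>` element with `begin` and `end` attributes, in the `http://www.w3.org/ns/ttml` namespace:

```python
import json
import xml.etree.ElementTree as ET

# Namespace used by TTML elements such as <tt>, <body>, <div> and <p>
TTML_NS = "{http://www.w3.org/ns/ttml}"

# Hypothetical sample document, made up for this sketch.
SAMPLE = """<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en">
  <body>
    <div>
      <p begin="00:00:00.000" end="00:00:04.550">my career in side projects and open<br/>source basically this is so i've been</p>
      <p begin="00:00:04.560" end="00:00:07.349">knocking around with um uh side projects</p>
    </div>
  </body>
</tt>"""


def ttml_to_records(ttml_text):
    """Return one dict per <p> caption with its timing and flattened text."""
    root = ET.fromstring(ttml_text)
    records = []
    for p in root.iter(TTML_NS + "p"):
        # itertext() yields the text on either side of any <br/>;
        # join the pieces and collapse runs of whitespace into single spaces.
        text = " ".join(" ".join(p.itertext()).split())
        records.append({"start": p.get("begin"), "end": p.get("end"), "text": text})
    return records


print(json.dumps(ttml_to_records(SAMPLE), indent=2))
```

Unlike the auto-generated VTT above, this sample has no repeated lines, so no de-duplication step is shown; whether real TTML files need one is not something this sketch settles.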

simonw commented 2 years ago

Shipped https://pypi.org/project/ttml-to-json/

simonw / action-transcription-prototype

Normalize and de-dupe common transcript formats #21