Closed simonw closed 2 years ago
It looks to me like I might be able to get away with simply ignoring blocks in this file which have the <c> stuff in them.
Might not quite be that simple:
00:02:25.760 --> 00:02:27.830 align:start position:0%
luckily for me i hadn't gone to
university<00:02:26.160><c> yet</c><00:02:26.319><c> so</c><00:02:26.480><c> when</c><00:02:26.720><c> the</c><00:02:26.879><c> entire.com</c>
00:02:27.830 --> 00:02:27.840 align:start position:0%
university yet so when the entire.com
00:02:27.840 --> 00:02:28.470 align:start position:0%
university yet so when the entire.com
industry
00:02:28.470 --> 00:02:28.480 align:start position:0%
industry
00:02:28.480 --> 00:02:30.550 align:start position:0%
industry
crashed<00:02:28.959><c> i</c><00:02:29.120><c> could</c><00:02:29.520><c> go</c><00:02:29.760><c> and</c><00:02:29.920><c> shelter</c><00:02:30.400><c> in</c>
00:02:30.550 --> 00:02:30.560 align:start position:0%
crashed i could go and shelter in
00:02:30.560 --> 00:02:31.670 align:start position:0%
crashed i could go and shelter in
academia<00:02:31.200><c> for</c>
00:02:31.670 --> 00:02:31.680 align:start position:0%
academia for
If I ignore any blocks that contain a <c> I still get this:
00:02:27.830 --> 00:02:27.840 align:start position:0%
university yet so when the entire.com
00:02:27.840 --> 00:02:28.470 align:start position:0%
university yet so when the entire.com
industry
00:02:28.470 --> 00:02:28.480 align:start position:0%
industry
00:02:30.550 --> 00:02:30.560 align:start position:0%
crashed i could go and shelter in
00:02:31.670 --> 00:02:31.680 align:start position:0%
academia for
This bit appears twice:
university yet so when the entire.com
industry
Maybe I can remove duplicate lines though?
This SORT of works:
import webvtt

path = "/Users/simon/Dropbox/Development/transcribe-videos/16/auto/Simon Willison: My career in side projects and open source [wqjye4QnWK0].en.vtt"
captions = webvtt.read(path)

prev_line = None
for c in captions:
    # Skip the word-by-word timing blocks, which contain inline <c> tags
    if any('<c>' in l for l in c._lines):
        continue
    for line in c._lines:
        # Ignore blank lines inside a block
        if not line.strip():
            continue
        # Only print a line if it differs from the previous non-blank line
        if prev_line != line:
            print(line)
        prev_line = line
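Here is a self-contained sketch of the same two steps (skip any block containing <c>, then drop consecutive duplicate lines) using only the standard library, so it can be tried without installing webvtt. The names TIMING, SAMPLE, parse_cues and dedupe_lines are made up for illustration and are not part of any library. It assumes every block starts with a "HH:MM:SS.mmm --> HH:MM:SS.mmm" timing line and does not handle NOTE blocks or cue identifiers.

```python
import re

# A block starts at a timing line such as
# "00:02:27.830 --> 00:02:27.840 align:start position:0%"
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}")

# Excerpt copied from the file shown earlier in this issue
SAMPLE = """\
WEBVTT

00:02:27.830 --> 00:02:27.840 align:start position:0%
university yet so when the entire.com

00:02:27.840 --> 00:02:28.470 align:start position:0%
university yet so when the entire.com
industry

00:02:28.470 --> 00:02:28.480 align:start position:0%
industry

00:02:28.480 --> 00:02:30.550 align:start position:0%
industry
crashed<00:02:28.959><c> i</c><00:02:29.120><c> could</c><00:02:29.520><c> go</c><00:02:29.760><c> and</c><00:02:29.920><c> shelter</c><00:02:30.400><c> in</c>

00:02:30.550 --> 00:02:30.560 align:start position:0%
crashed i could go and shelter in
"""


def parse_cues(vtt_text):
    """Return one list of non-blank text lines per block; timing lines are dropped."""
    cues = []
    current = None  # stays None until the first timing line, so the header is skipped
    for line in vtt_text.splitlines():
        if TIMING.match(line):
            current = []
            cues.append(current)
        elif current is not None and line.strip():
            current.append(line)
    return cues


def dedupe_lines(cues):
    """Skip blocks containing <c>, then drop lines equal to the previously kept line."""
    kept = []
    prev_line = None
    for lines in cues:
        if any("<c>" in line for line in lines):
            continue
        for line in lines:
            if line != prev_line:
                kept.append(line)
            prev_line = line
    return kept


result = dedupe_lines(parse_cues(SAMPLE))
for line in result:
    print(line)
```

On this excerpt it prints each of the three lines once. The catch, and probably why this only sort of works, is that any text which appears solely inside a <c> block is dropped entirely.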
Started a new repo: https://github.com/simonw/webvtt-to-json
Released first version of this: https://pypi.org/project/webvtt-to-json/
Need a similar tool for:
vtt in particular is tricky: