llimllib / yt-transcribe

Transcribe a youtube video into an easily readable HTML file
The Unlicense

option to use CC #2

Open llimllib opened 2 months ago

llimllib commented 2 months ago

Most YouTube auto-generated CCs suck, but some videos have high-quality CCs attached manually, and somebody might want to use those instead.

yt-transcribe -cc <video_url>

llimllib commented 2 months ago

If you download the captions, they come in a format whose cues straddle the segment boundaries we're placing. There are a few output formats, but vtt is the default and is representative:

WEBVTT
Kind: captions
Language: en-US

00:00:00.166 --> 00:00:01.766
Hey, I'm Sam
from Prismic. I'm here

00:00:01.766 --> 00:00:03.333
with Rich
Harris, creator of Svelte.

00:00:03.333 --> 00:00:05.800
And Rich is explaining to
me how you can get Rich quick by

00:00:05.800 --> 00:00:07.033
creating your
own JavaScript framework.

00:00:08.166 --> 00:00:09.233
Thanks for joining me, Rich.

Without the benefit of the nicely segmented transcript that whisper provides us, we have a couple problems:

Probably the answer to supporting both YouTube captions and whisper is to write functions that return arrays of transcript segments, one element for each fixed-length window in the video. So above, we'd parse the VTT file for frames whose timestamps fall in the [0-5] segment, then frames that fall in [5-10], and so on, and return an array (which must have empty elements for empty segments).
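A minimal sketch of that idea in Python, not the project's actual code. The names `parse_vtt` and `bucket_cues` are hypothetical, and the sketch makes two assumptions: timestamps always use the `hh:mm:ss.mmm` form shown in the sample, and a frame belongs to the window containing its start time, even if it ends in the next one.

```python
import re

# Matches a cue timing line such as "00:00:05.800 --> 00:00:07.033".
# Assumption: hours are always present, as in the sample above.
CUE_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(\d+):(\d{2}):(\d{2})\.(\d{3})"
)


def _seconds(h, m, s, ms):
    """Convert timestamp components (strings) to seconds as a float."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_vtt(text):
    """Return a list of (start, end, text) tuples, one per cue."""
    cues = []
    # Cues are separated by blank lines; the header block has no timing
    # line, so it is skipped automatically.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = CUE_TIMING.match(line.strip())
            if match:
                g = match.groups()
                body = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                cues.append((_seconds(*g[:4]), _seconds(*g[4:]), body))
                break
    return cues


def bucket_cues(cues, window=5.0):
    """Group cue text into fixed-length windows by cue start time.

    Returns one string per window; windows with no cues are empty strings.
    """
    if not cues:
        return []
    count = int(max(start for start, _end, _body in cues) // window) + 1
    buckets = [[] for _ in range(count)]
    for start, _end, body in cues:
        buckets[int(start // window)].append(body)
    return [" ".join(parts) for parts in buckets]
```

With the sample above, the first three frames would land in the [0-5] element and the last two in [5-10]. Bucketing by start time is only one possible policy; splitting a frame's text across the boundary it straddles would need word-level timing that the VTT file doesn't provide.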

llimllib commented 2 months ago

A sample video that has good manually-generated subtitles, useful for testing, is: https://www.youtube.com/watch?v=i-BkN3rTK0Q

It's also helpful to know that --skip-download --write-subs will only download subs if they were manually added, not automatically generated. For example:

yt-dlp --skip-download --write-subs -o "manual_subs_only.vtt" 'https://www.youtube.com/watch?v=i-BkN3rTK0Q'

If you try to use that on a video that only has automatic subs, you get:

$ yt-dlp --skip-download --write-subs -o "bulls.vtt" 'https://www.youtube.com/watch?v=lyJ6GyC4Yng' 
[info] There are no subtitles for the requested languages

$ echo $?
0

Unfortunately, as you can see, yt-dlp exits with status 0 even when it finds no subtitles, so we'd have to check for the presence or absence of bulls.vtt.en-US.vtt. That's trickier than it sounds because yt-dlp inserts the language code into the filename! Maybe I can figure out how to avoid that.
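One way to do that check without knowing the language code in advance is to glob for it. This is a rough sketch, assuming yt-dlp always names the file `<output>.<language>.vtt` as it did in the run above; `find_subtitle_files` and `download_manual_subs` are hypothetical helper names, and the yt-dlp flags are the same ones used in the commands above.

```python
import glob
import subprocess


def find_subtitle_files(output_name):
    """Return subtitle files written for `output_name`, whatever the language.

    Assumes the naming seen above: "bulls.vtt" -> "bulls.vtt.en-US.vtt".
    """
    # glob.escape protects any glob metacharacters in the output name.
    pattern = glob.escape(output_name) + ".*.vtt"
    return sorted(glob.glob(pattern))


def download_manual_subs(url, output_name):
    """Run yt-dlp and return the subtitle path, or None if none was written.

    The exit code can't be trusted here: yt-dlp exits 0 even when the video
    has no manually added subtitles, so we look for the file instead.
    """
    subprocess.run(
        ["yt-dlp", "--skip-download", "--write-subs", "-o", output_name, url],
        check=True,
    )
    files = find_subtitle_files(output_name)
    return files[0] if files else None
```

A stale file left over from an earlier run would also match the pattern, so writing into a fresh temporary directory is probably safer than reusing a fixed output name.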