wa3dbk / ScribeSalad

A collection of YouTube video transcripts: podcasts (Joe Rogan Experience, Tim Ferris, Jocko podcast, ...), lectures (YaleCourses, MIT lectures, Jordan B. Peterson talks, ...). A big transcript salad spanning history, geography, science, politics, filmmaking and more.
GNU General Public License v3.0

Some yt_auto captions all appear within first few seconds of webVTT timestamps #3

Open richieM opened 1 year ago

richieM commented 1 year ago

For some English yt_auto transcripts, the entire transcript incorrectly appears within the first few seconds of the WebVTT file.

Some examples:

https://github.com/wa3dbk/ScribeSalad/blob/master/transcripts/en/AndrewHuberman/yt_auto/DTCmprPCDqc.en.vtt
https://github.com/wa3dbk/ScribeSalad/blob/master/transcripts/en/8NewsNowLasVegas/yt_auto/-0IjUVDKY10.en.vtt
https://github.com/wa3dbk/ScribeSalad/blob/master/transcripts/en/GlobalNews/yt_auto/-2Yl-90jzi0.en.vtt

Seems to be a fairly widespread issue.
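
One rough way to flag affected files is to check whether the last cue ends within a few seconds even though the file contains a lot of text. The following is only a minimal sketch: the function names and the thresholds (`max_end_seconds`, `min_words`) are assumptions chosen for illustration, not anything used by the repo.

```python
import re

# Matches a WebVTT cue timing line such as "00:00:01.000 --> 00:00:03.500".
# The hours field is optional in WebVTT timestamps.
CUE_TIMING = re.compile(
    r"^\s*((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})\s+-->\s+((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})"
)


def to_seconds(stamp):
    """Convert "hh:mm:ss.ttt" or "mm:ss.ttt" to seconds."""
    parts = stamp.split(":")
    seconds = float(parts[-1])
    minutes = int(parts[-2])
    hours = int(parts[-3]) if len(parts) == 3 else 0
    return hours * 3600 + minutes * 60 + seconds


def last_cue_end(vtt_text):
    """Return the latest cue end time in seconds, or None if no cues are found."""
    ends = []
    for line in vtt_text.splitlines():
        match = CUE_TIMING.match(line)
        if match:
            ends.append(to_seconds(match.group(2)))
    return max(ends) if ends else None


def count_words(vtt_text):
    """Count words on lines that are not timing lines or the WEBVTT header."""
    total = 0
    for line in vtt_text.splitlines():
        if CUE_TIMING.match(line) or line.startswith("WEBVTT"):
            continue
        # Drop inline tags such as <c> or <00:00:01.000> before counting.
        total += len(re.sub(r"<[^>]+>", " ", line).split())
    return total


def looks_collapsed(vtt_text, max_end_seconds=10.0, min_words=200):
    """Heuristic (assumed thresholds): lots of text, but every cue ends very early."""
    end = last_cue_end(vtt_text)
    return end is not None and end <= max_end_seconds and count_words(vtt_text) >= min_words
```

Running `looks_collapsed(open(path, encoding="utf-8").read())` over the yt_auto directories would give a first estimate of how widespread the problem is; the thresholds would likely need tuning against known-good files.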

BTW, thanks for creating this repo, it's very useful :)

wa3dbk commented 1 year ago

Good observation! Subtitles generated automatically by YouTube (the ones in "yt_auto") are often misaligned, empty, or filled with useless tags and symbols (such as [music] and (♪♪)).

I plan to clean up these subtitles (as much as possible) and re-align the ones where the entire transcript appears within the first few seconds. This might take some time, given the amount of data that needs to be processed.
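
As an illustration of what the clean-up step might look like, here is a minimal sketch that strips non-speech markers from a line of caption text. It assumes that anything in square brackets (e.g. [music]) is a non-speech annotation, that parentheses containing only music notes can be dropped, and that inline tags in angle brackets carry no spoken content; the function name `clean_caption_text` is hypothetical. It does not attempt re-alignment, which would need the audio or reliable timing from another source.

```python
import re

# Assumption: angle-bracket tags (e.g. <c>, <00:00:01.000>) are markup, not speech.
INLINE_TAGS = re.compile(r"<[^>]+>")
# Assumption: square-bracketed spans such as [music] or [Applause] are non-speech.
BRACKETED = re.compile(r"\[[^\]]*\]")
# Parentheses that contain nothing but music notes and whitespace, e.g. (♪♪).
MUSIC_PARENS = re.compile(r"\([\s♪♫]*\)")
# Any stray music-note characters left over.
MUSIC_NOTES = re.compile(r"[♪♫]")


def clean_caption_text(text):
    """Remove non-speech markers and collapse the remaining whitespace."""
    for pattern in (INLINE_TAGS, BRACKETED, MUSIC_PARENS, MUSIC_NOTES):
        text = pattern.sub(" ", text)
    return " ".join(text.split())
```

For example, `clean_caption_text("[Music] hello <c>world</c> (♪♪)")` yields `"hello world"`, and a cue that contains only such markers becomes an empty string, which makes empty cues easy to drop afterwards.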

I'll probably start with the English videos and create a parallel "yt_auto_norm" or "yt_auto_realign" directory containing the cleaned-up and re-aligned transcripts, then work on the remaining languages later.

This process would make the entire repo usable for people interested in ASR (automatic speech recognition) or any kind of search/indexing.