Open richieM opened 1 year ago
Good observation! Subtitles generated automatically by YouTube (the ones in "yt_auto") are often misaligned, empty, or filled with useless tags and symbols (such as [music] and (♪♪)).
I plan to clean up these subtitles (as much as possible) and re-align the ones where the entire transcript appears within the first few seconds. This might take some time, given the amount of data that needs to be processed.
I'll probably start with videos in English and create a parallel "yt_auto_norm" or "yt_auto_realign" directory containing the cleaned-up and re-aligned transcripts, then work on the remaining languages later.
This process would make the entire repo usable for people interested in ASR (automatic speech recognition) or any kind of search/indexing.
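As a rough illustration of the clean-up step, here is a minimal sketch of stripping noise from a single cue's text. The function name `clean_cue_text` and the exact patterns are assumptions, not the repo's actual code: it assumes the noise consists of bracketed tags like [music], music-note symbols such as (♪♪), and WebVTT inline tags like `<c>`.

```python
import re

# Assumed noise patterns (hypothetical, tune against real yt_auto files):
_INLINE_TAG = re.compile(r"<[^>]+>")        # WebVTT inline tags, e.g. <c> or <00:00:01.000>
_BRACKET_TAG = re.compile(r"\[[^\]]*\]")    # bracketed sound tags, e.g. [music]
_NOTE_RUN = re.compile(r"\(\s*[♪♫][♪♫\s]*\)|[♪♫]+")  # (♪♪) or bare note symbols


def clean_cue_text(text: str) -> str:
    """Remove assumed noise tags and symbols from one cue's text."""
    text = _INLINE_TAG.sub("", text)
    text = _BRACKET_TAG.sub(" ", text)
    text = _NOTE_RUN.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()
```

A cue that comes back as an empty string contained only noise and could simply be dropped from the cleaned transcript.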
For some English yt_auto transcripts, the entire transcript incorrectly appears within the first few seconds of the WebVTT file.
Some examples:
https://github.com/wa3dbk/ScribeSalad/blob/master/transcripts/en/AndrewHuberman/yt_auto/DTCmprPCDqc.en.vtt
https://github.com/wa3dbk/ScribeSalad/blob/master/transcripts/en/8NewsNowLasVegas/yt_auto/-0IjUVDKY10.en.vtt
https://github.com/wa3dbk/ScribeSalad/blob/master/transcripts/en/GlobalNews/yt_auto/-2Yl-90jzi0.en.vtt
Seems to be a fairly widespread issue.
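To get a sense of how widespread it is, something like the sketch below could flag suspect files. It is only a heuristic under stated assumptions: the names `cue_spans` and `looks_collapsed`, and the thresholds `window_s` and `min_words`, are hypothetical and would need tuning against the real data. A file is flagged when every cue ends within the first few seconds even though it contains a lot of text.

```python
import re

# WebVTT timestamps are either mm:ss.ttt or hh:mm:ss.ttt.
_TS = r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
_CUE_LINE = re.compile(_TS + r"\s+-->\s+" + _TS)


def _to_seconds(h, m, s, ms):
    return int(h or 0) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0


def cue_spans(vtt_text):
    """Return (start, end) in seconds for every cue timing line."""
    spans = []
    for line in vtt_text.splitlines():
        match = _CUE_LINE.match(line.strip())
        if match:
            g = match.groups()
            spans.append((_to_seconds(*g[:4]), _to_seconds(*g[4:])))
    return spans


def looks_collapsed(vtt_text, window_s=10.0, min_words=200):
    """Heuristic: lots of text, but no cue ends after `window_s` seconds."""
    spans = cue_spans(vtt_text)
    if not spans:
        return False
    last_end = max(end for _, end in spans)
    # Count words on lines that are neither timing lines nor the file header.
    words = sum(
        len(line.split())
        for line in vtt_text.splitlines()
        if "-->" not in line and not line.startswith("WEBVTT")
    )
    return last_end <= window_s and words >= min_words
```

Running `looks_collapsed` over each file under `yt_auto` would give a rough count of affected transcripts, though short videos need the thresholds adjusted to avoid false positives.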
BTW, thanks for creating this repo, it's very useful :)