joegesualdo / vtt-to-json

Convert WebVTT file to JSON

Doesn't correctly parse some youtube vtt subtitles #1

Open joegesualdo opened 7 years ago

joegesualdo commented 7 years ago

Problem

The parser doesn't correctly handle some YouTube VTT subtitle files.

Why

Instead of separating words by the caption time, some YouTube VTT subtitles are split by syllable: the inline timestamps fall between fragments of a word rather than between whole words.

Example

Video: https://www.youtube.com/watch?v=xQPBPB8UDyQ

WEBVTT
Kind: captions
Language: en

00:00:05.472 --> 00:00:07.740 align:start position:0% line:0%

OF<00:00:05.505><c>FI</c><00:00:05.538><c>CE</c><00:00:05.572><c>RS</c><00:00:05.605><c>.</c>

00:00:07.740 --> 00:00:07.807 align:start position:0% line:0%
OFFICERS.

00:00:07.807 --> 00:00:12.412 align:start position:0% line:0%
OFFICERS.
SC<00:00:07.840><c>RI</c><00:00:07.873><c>PT</c><00:00:07.907><c>UR</c><00:00:07.940><c>E</c><00:00:08.441><c> T</c><00:00:08.474><c>EL</c><00:00:08.507><c>LS</c><00:00:08.708><c> U</c><00:00:08.741><c>S</c><00:00:09.075><c> T</c><00:00:09.108><c>HA</c><00:00:09.141><c>T</c><00:00:09.309><c> I</c><00:00:09.342><c>N</c><00:00:09.576><c> O</c><00:00:09.609><c>UR</c><00:00:11.678><c> </c>

00:00:12.412 --> 00:00:12.479 align:start position:0% line:0%
SCRIPTURE TELLS US THAT IN OUR 

00:00:12.479 --> 00:00:16.349 align:start position:0% line:0%
SCRIPTURE TELLS US THAT IN OUR 
SU<00:00:12.512><c>FF</c><00:00:12.545><c>ER</c><00:00:12.579><c>IN</c><00:00:12.612><c>GS</c><00:00:12.645><c>,</c><00:00:12.746><c> T</c><00:00:12.779><c>HE</c><00:00:12.812><c>RE</c><00:00:14.147><c> I</c><00:00:14.180><c>S</c><00:00:15.515><c> </c><00:00:15.782><c>GL</c><00:00:15.815><c>OR</c><00:00:15.848><c>Y.</c>

00:00:16.349 --> 00:00:16.416 align:start position:0% line:0%
SUFFERINGS, THERE IS GLORY.

00:00:16.416 --> 00:00:22.254 align:start position:0% line:0%
SUFFERINGS, THERE IS GLORY.
BE<00:00:16.449><c>CA</c><00:00:16.482><c>US</c><00:00:16.516><c>E</c><00:00:16.683><c> W</c><00:00:16.716><c>E</c><00:00:17.283><c> K</c><00:00:17.316><c>NO</c><00:00:17.349><c>W</c><00:00:18.485><c> T</c><00:00:18.518><c>HA</c><00:00:18.551><c>T</c><00:00:19.085><c> S</c><00:00:19.118><c>UF</c><00:00:19.151><c>FE</c><00:00:19.185><c>RI</c><00:00:19.218><c>NG</c><00:00:22.188><c> </c>

00:00:22.254 --> 00:00:22.321 align:start position:0% line:0%
BECAUSE WE KNOW THAT SUFFERING 

00:00:22.321 --> 00:00:27.827 align:start position:0% line:0%
BECAUSE WE KNOW THAT SUFFERING 
PR<00:00:22.354><c>OD</c><00:00:22.388><c>UC</c><00:00:22.421><c>ES</c><00:00:22.856><c> </c><00:00:26.693><c>PE</c><00:00:26.726><c>RS</c><00:00:26.759><c>EV</c><00:00:26.793><c>ER</c><00:00:26.826><c>AN</c><00:00:26.859><c>CE</c><00:00:26.893><c>.</c>

00:00:27.827 --> 00:00:27.894 align:start position:0% line:0%
PRODUCES PERSEVERANCE.

00:00:27.894 --> 00:00:28.561 align:start position:0% line:0%
PRODUCES PERSEVERANCE.
PE<00:00:27.927><c>RS</c><00:00:27.960><c>E</c>

00:00:28.561 --> 00:00:28.628 align:start position:0% line:0%
PERSE

00:00:28.628 --> 00:00:29.796 align:start position:0% line:0%
PERSE
PE<00:00:28.661><c>RS</c><00:00:28.694><c>EV</c><00:00:28.728><c>ER</c><00:00:28.761><c>AN</c><00:00:28.795><c>CE</c><00:00:28.828><c>,</c><00:00:28.861><c> </c><00:00:29.596><c>CH</c><00:00:29.629><c>AR</c><00:00:29.662><c>AC</c><00:00:29.696><c>TE</c><00:00:29.729><c>R.</c>

00:00:29.796 --> 00:00:29.862 align:start position:0% line:0%
PERSEVERANCE, CHARACTER.

00:00:29.862 --> 00:00:36.669 align:start position:0% line:0%
PERSEVERANCE, CHARACTER.
AN<00:00:29.896><c>D</c><00:00:30.663><c> </c><00:00:32.599><c>CH</c><00:00:32.632><c>AR</c><00:00:32.665><c>AC</c><00:00:32.699><c>TE</c><00:00:32.732><c>R,</c><00:00:32.799><c> </c><00:00:33.867><c>HO</c><00:00:33.900><c>PE</c><00:00:33.933><c>.</c>
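
One possible way to handle this, as a minimal sketch only: strip the `<c>` tags, split the line on the inline timestamps, and glue consecutive fragments back together until whitespace is reached. The helper names `timestampToMs` and `wordsFromTimedLine` are hypothetical and are not part of vtt-to-json. The sketch assumes each word should carry the time of its first fragment, and that the first fragment inherits the cue start time.

```javascript
// Hypothetical helpers for illustration; not part of the vtt-to-json API.

// Convert "hh:mm:ss.ttt" (or "mm:ss.ttt") to milliseconds.
function timestampToMs(ts) {
  const seconds = ts.split(":").reduce((total, part) => total * 60 + Number(part), 0);
  return Math.round(seconds * 1000);
}

// Rebuild whole words from a payload line whose inline timestamps
// fall between fragments of a word.
function wordsFromTimedLine(line, cueStartMs) {
  // Drop <c> and </c> tags but keep the timestamp tags.
  const text = line.replace(/<\/?c[^>]*>/g, "");
  // Splitting with a capture group leaves text at even indexes
  // and timestamps at odd indexes.
  const parts = text.split(/<((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})>/);
  const words = [];
  let current = null; // word being assembled, or null at a word boundary
  let time = cueStartMs; // time that applies to the next fragment
  for (let i = 0; i < parts.length; i++) {
    if (i % 2 === 1) {
      time = timestampToMs(parts[i]);
      continue;
    }
    // Separate whitespace runs from text runs inside the fragment.
    for (const piece of parts[i].split(/(\s+)/)) {
      if (piece === "") continue;
      if (/^\s+$/.test(piece)) {
        current = null; // whitespace closes the current word
      } else if (current === null) {
        current = { word: piece, time };
        words.push(current);
      } else {
        current.word += piece; // same word, later fragment
      }
    }
  }
  return words;
}

const line =
  "OF<00:00:05.505><c>FI</c><00:00:05.538><c>CE</c><00:00:05.572><c>RS</c><00:00:05.605><c>.</c>";
console.log(wordsFromTimedLine(line, 5472));
// → [ { word: 'OFFICERS.', time: 5472 } ]
```

Using the time of the first fragment is a design choice; a parser could equally keep every fragment time if finer granularity is wanted.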

zoutepopcorn commented 5 years ago

I have the same problem: the word times are not parsed.

index.js

const vttToJson = require("vtt-to-json");
const vttString = `
00:00:40.790 --> 00:00:40.800 align:start position:0%
soms bij andere ploegen helemaal niets

00:00:40.800 --> 00:00:43.160 align:start position:0%
soms bij andere ploegen helemaal niets
iets<00:00:41.010><c> ik</c><00:00:41.399><c> kan</c><00:00:41.550><c> daar</c><00:00:41.760><c> enorm</c><00:00:42.210><c> van</c><00:00:42.390><c> genieten</c>
`;

// Parse the VTT string, then print the number of cues and each parsed cue.
vttToJson(vttString)
    .then((result) => {
        console.log(result.length);
        for (const component of result) {
            console.log(component);
        }
    });

This will output:

const output = 
{ start: 40790,
  end: 40800,
  part: 'soms bij andere ploegen helemaal niets',
  words:
   [ { word: 'soms', time: undefined },
     { word: 'bij', time: undefined },
     { word: 'andere', time: undefined },
     { word: 'ploegen', time: undefined },
     { word: 'helemaal', time: undefined },
     { word: 'niets', time: undefined } ] }
{ start: 40790,
  end: 40800,
  part: '',
  words: [ { word: '', time: undefined } ] }
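
For comparison, here is a rough sketch of how a cue timing line maps to millisecond `start` and `end` values like the ones in the output above. `parseTimingLine` is a hypothetical helper, not part of vtt-to-json, and the `settings` field is only an assumption about how the trailing `align:start position:0%` part could be kept. With this reading, the second cue in the snippet would presumably start at 40800 and end at 43160.

```javascript
// Hypothetical helper for illustration; not part of the vtt-to-json API.
function parseTimingLine(line) {
  // "<start> --> <end> [setting:value ...]"
  const match = /^(\S+)\s+-->\s+(\S+)(.*)$/.exec(line.trim());
  if (!match) return null;

  // "hh:mm:ss.ttt" (or "mm:ss.ttt") to milliseconds.
  const toMs = (ts) =>
    Math.round(ts.split(":").reduce((total, part) => total * 60 + Number(part), 0) * 1000);

  // Collect cue settings such as align:start and position:0%.
  const settings = {};
  for (const pair of match[3].trim().split(/\s+/)) {
    const idx = pair.indexOf(":");
    if (idx > 0) settings[pair.slice(0, idx)] = pair.slice(idx + 1);
  }
  return { start: toMs(match[1]), end: toMs(match[2]), settings };
}

console.log(parseTimingLine("00:00:40.800 --> 00:00:43.160 align:start position:0%"));
// → { start: 40800, end: 43160, settings: { align: 'start', position: '0%' } }
```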