glut23 / webvtt-py

Read, write, convert and segment WebVTT caption files in Python.
MIT License

Better parsing of srt subtitles to remove double newlines/breaks #31

Open shubhank008 opened 4 years ago

shubhank008 commented 4 years ago

I am getting a Malformed exception on some of my srt files because they contain stray double line breaks, which I think break your parser.
I tried fixing it by replacing runs of 2 or 3 line breaks with a single line break, but that is not as accurate as a regex or a proper fix in the parser would be. I would appreciate it if you could add one.

Example subtitle (part of it)

00:01:10.733 --> 00:01:12.272
Aren't you excited?

00:01:14.143 --> 00:01:17.942
Let's find another place 

to hide out this year,

and play video 
games until it blows over.

00:01:17.943 --> 00:01:19.942

That'll get us through half a day, no problem.
shubhank008 commented 4 years ago

Another example

10
00:02:05,988 --> 00:02:10,987
CHAPITRE 12

BAPTÊME ET PARADIS DES DIEUX

11
00:02:13,278 --> 00:02:14,367
Je vois…

12
00:02:14,488 --> 00:02:17,747
Tu vas arrêter de travailler
pour M. Benno ?

13
00:02:19,368 --> 00:02:21,497
Oui. J’en ai parlé à Otto.
arqtiq commented 4 years ago

I'm also having this issue right now. I'm torn between writing my own converter and pre-patching the srt file to get rid of these line breaks.

shubhank008 commented 4 years ago

> I'm also having this issue right now, torn between writing my own converter or pre-patching srt file to get rid of these line breaks

I ended up writing a pre-patch that sanitizes my srt files before reading them with webvtt. It uses a mix of replace and regex to remove the extra line breaks, and I keep expanding the regex whenever I run into another formatting mess.

kicks66 commented 5 months ago

Hi @shubhank008, could you share the replace / regex that you used? I'm running into the same issue!
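
For anyone landing here with the same problem: the pre-patch mentioned above was never posted in this thread, so the following is only a minimal sketch of one way such a sanitizer could look, not shubhank008's actual code. It takes a line-based approach instead of a blanket replace, because collapsing every double line break would also remove the blank lines that separate cues. The names `sanitize_srt`, `TIMING` and `_next_nonblank` are made up for this example. It assumes that a digits-only line followed by a timing line is a cue index, and it renumbers cues from 1.

```python
import re

# A timing line such as "00:01:10.733 --> 00:01:12.272" or
# "00:02:05,988 --> 00:02:10,987" (both separators appear in the examples above).
TIMING = re.compile(
    r"^\d{1,2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{1,2}:\d{2}:\d{2}[,.]\d{3}"
)


def _next_nonblank(lines, start):
    """Return the first non-empty line at or after index `start`, or ''."""
    for line in lines[start:]:
        if line:
            return line
    return ""


def sanitize_srt(text):
    """Drop stray blank lines inside cues and renumber the cues.

    Assumption: a digits-only line whose next non-blank line is a timing
    line is a cue index, not caption text. A caption consisting only of
    digits in a file without indexes would be lost under this assumption.
    """
    lines = [line.strip() for line in text.replace("\r\n", "\n").split("\n")]
    cues = []  # each cue is [timing_line, text_line, ...]
    for i, line in enumerate(lines):
        if not line:
            continue  # blank lines carry no information once cues are grouped
        if TIMING.match(line):
            cues.append([line])
        elif line.isdigit() and TIMING.match(_next_nonblank(lines, i + 1)):
            continue  # original cue index; cues are renumbered on output
        elif cues:
            cues[-1].append(line)
        # Text before the first timing line (for example a header) is ignored.

    blocks = []
    # Cues that have a timing line but no text are dropped.
    kept = [cue for cue in cues if len(cue) > 1]
    for number, cue in enumerate(kept, start=1):
        blocks.append("\n".join([str(number), *cue]))
    return "\n\n".join(blocks) + "\n"
```

The cleaned text could then be written to a temporary `.srt` file and passed to `webvtt.from_srt()`. Whether that is enough for a given file depends on what else is wrong with it, so treat this as a starting point to extend.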