glut23 / webvtt-py

Read, write, convert and segment WebVTT caption files in Python.
MIT License

MalformedCaptionError: Invalid Time Format #44

Open ZhijingEu opened 1 year ago

ZhijingEu commented 1 year ago

Hey everyone - I just wanted to share a quick fix for a problem I noticed: webvtt-py raises MalformedCaptionError when timestamps are in the format 0:1:5.2 as opposed to 00:01:05.002.

I have written a regex find-and-replace that converts the format and shared it in this repo: https://github.com/ZhijingEu/VTT_File_Cleaner. There is also an accompanying video tutorial: https://www.youtube.com/watch?v=iZ0pOSL8JZw
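For readers who want the general idea without following the links, here is a minimal sketch of that kind of regex rewrite. It is not the code from the linked repo; the names `LOOSE_TIMESTAMP`, `normalize_timestamp` and `normalize_vtt_timestamps` are made up for illustration, and it assumes the fractional part is a decimal fraction of a second (so ".2" means 200 ms).

```python
import re

# Loose timestamp: optional hours, 1-2 digit minutes and seconds,
# '.' or ',' as the separator, and 1-3 fractional digits.
LOOSE_TIMESTAMP = re.compile(
    r"(?<![\d:.,])(?:(\d+):)?(\d{1,2}):(\d{1,2})[.,](\d{1,3})(?![\d:.,])"
)


def normalize_timestamp(match: re.Match) -> str:
    """Format one loose timestamp as zero-padded HH:MM:SS.mmm."""
    hours, minutes, seconds, fraction = match.groups()
    # Assumption: ".2" is a decimal fraction (200 ms), so pad on the right.
    millis = fraction.ljust(3, "0")
    return f"{int(hours or 0):02d}:{int(minutes):02d}:{int(seconds):02d}.{millis}"


def normalize_vtt_timestamps(text: str) -> str:
    """Rewrite timestamps on cue timing lines only (lines containing '-->')."""
    fixed = []
    for line in text.splitlines(keepends=True):
        if "-->" in line:
            line = LOOSE_TIMESTAMP.sub(normalize_timestamp, line)
        fixed.append(line)
    return "".join(fixed)
```

Restricting the substitution to lines containing `-->` leaves caption text untouched, even if it happens to contain something that looks like a time.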

Hope this helps someone facing this issue in the future.

apetresc commented 1 year ago

Thank you @ZhijingEu - this is certainly helpful, but I think the real solution to the problem is for webvtt-py to be much more forgiving in the way it parses VTTs. I don't know what the precise VTT spec says about time formats, but judging by the fact that mainstream sources like, e.g., the Microsoft Teams autogenerated transcripts, exhibit this behaviour, it would behoove webvtt-py to accommodate this relatively trivial change.

I'll hopefully open a PR for that soon.

filipsworks commented 1 year ago

https://www.w3.org/TR/webvtt1/#webvtt-timestamp

The standard requires exactly 3 digits for the milliseconds; otherwise things like VideoJS will stop execution. The only optional part is the hours, and only when they are 0.
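As a rough illustration of the rule described here, a strict check might look like the sketch below. It follows my reading of the linked section (optional hours of two or more digits, two-digit minutes and seconds in the range 00-59, a dot, exactly three digits); `STRICT_TIMESTAMP` and `is_strict_timestamp` are hypothetical names, not part of webvtt-py.

```python
import re

# Strict WebVTT timestamp: optional hours of two or more digits,
# two-digit minutes and seconds (00-59), a '.', and exactly three digits.
STRICT_TIMESTAMP = re.compile(r"(?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3}")


def is_strict_timestamp(value: str) -> bool:
    """Return True only if the whole string is a spec-shaped timestamp."""
    return STRICT_TIMESTAMP.fullmatch(value) is not None
```

With this check, `00:01:05.002` and `01:05.002` pass, while `0:1:5.2` and `00:01:05,002` are rejected.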

Basically Teams, AWS, and many other services are breaking the standard, and instead of getting it fixed there, everyone is writing their own hacks to handle the broken output.

apetresc commented 1 year ago

I get that, but pragmatically speaking, it's probably best for tools to be as permissive as they reasonably can, especially for spec violations that are common in the wild. Users of webvtt-py likely can't choose to consume transcripts from some other source, but they can choose to use some other VTT parser.

jrowen commented 1 year ago

I've found this works as a temporary fix for Teams time formats.

import io
import re

from webvtt import WebVTT, structures
from webvtt.parsers import WebVTTParser

# Relax both patterns to accept 1-2 digit minutes/seconds and 1-3
# fractional digits, as found in Teams transcripts.
structures.TIMESTAMP_PATTERN = re.compile(r'(\d+)?:?(\d{1,2}):(\d{1,2})[.,](\d{1,3})')
WebVTTParser.TIMEFRAME_LINE_PATTERN = re.compile(
    r'\s*((?:\d+:)?\d{1,2}:\d{1,2}[.,]\d{1,3})\s*-->\s*((?:\d+:)?\d{1,2}:\d{1,2}[.,]\d{1,3})'
)

# tcontent is assumed to hold the contents of the VTT file as a string.
for caption in WebVTT.read_buffer(io.StringIO(tcontent)):
    print(caption.start)
    print(caption.end)
    print(caption.text)

apetresc commented 1 year ago

Ah yeah, clever - you can just monkey-patch those variables directly in the module.

... still would be nicer not to have to do that, though 😅
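One way to keep such a monkey-patch from leaking into the rest of a program is to scope it with `unittest.mock.patch.object` from the standard library, which restores the original attribute when the `with` block exits. The sketch below uses a `SimpleNamespace` as a stand-in for a module like `webvtt.structures` so that it runs without webvtt-py installed; the stand-in, its default pattern, and the `matches` helper are assumptions for illustration. The approach only helps if the patched attribute is looked up at call time rather than copied at import.

```python
import re
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for a module such as webvtt.structures, with a
# strict default pattern (two-digit fields, three fractional digits).
structures = SimpleNamespace(
    TIMESTAMP_PATTERN=re.compile(r"(\d+)?:?(\d{2}):(\d{2})[.,](\d{3})")
)

# Lenient replacement, taken from the workaround in this thread.
LENIENT = re.compile(r"(\d+)?:?(\d{1,2}):(\d{1,2})[.,](\d{1,3})")


def matches(timestamp: str) -> bool:
    # Looks the pattern up at call time, so a patch takes effect here.
    return structures.TIMESTAMP_PATTERN.match(timestamp) is not None


before = matches("0:1:5.2")   # strict pattern in effect
with mock.patch.object(structures, "TIMESTAMP_PATTERN", LENIENT):
    during = matches("0:1:5.2")  # lenient pattern in effect
after = matches("0:1:5.2")    # original pattern restored on exit
```

Here the loose timestamp is accepted only inside the `with` block; outside it, the stand-in behaves as before.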