Open ZhijingEu opened 2 years ago
Thank you @ZhijingEu - this is certainly helpful, but I think the real solution to the problem is for webvtt-py to be much more forgiving in the way it parses VTTs. I don't know what the precise VTT spec says about time formats, but judging by the fact that mainstream sources like, e.g., the Microsoft Teams autogenerated transcripts, exhibit this behaviour, it would behoove webvtt-py to accommodate this relatively trivial change.
I'll hopefully open a PR for that soon.
https://www.w3.org/TR/webvtt1/#webvtt-timestamp
The standard requires exactly 3 digits for the milliseconds; otherwise players like VideoJS will stop execution. The only optional part is the hours field, and only when it is 0.
Basically, Teams, AWS and many other services are breaking the standard, and instead of getting it fixed there, everyone is writing their own hacks to handle broken files.
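For anyone who wants to check a file against the rules in the linked section, here is a rough sketch of a strict timestamp check. It is not a full validator, and the name `is_strict_timestamp` is made up for illustration. It assumes the grammar is: optional hours of two or more digits, exactly two digits each for minutes and seconds (00 to 59), a full stop, then exactly three digits.

```python
import re

# Approximation of the WebVTT timestamp grammar: [hh+:]mm:ss.ttt
# with minutes and seconds restricted to 00-59.
STRICT_TIMESTAMP = re.compile(r'(?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3}')


def is_strict_timestamp(text: str) -> bool:
    """Return True if text looks like a spec-conformant WebVTT timestamp."""
    return STRICT_TIMESTAMP.fullmatch(text) is not None


# Print each sample next to the result of the strict check.
for sample in ("00:01:05.002", "01:05.002", "0:1:5.2", "00:01:05,002"):
    print(sample, is_strict_timestamp(sample))
```

The first two samples pass; the Teams-style `0:1:5.2` and the comma-separated variant do not.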
I get that, but pragmatically speaking, it's probably best for tools to be as permissive as they reasonably can, especially for spec violations that are common in the wild. Users of webvtt-py likely can't choose to consume transcripts from some other source, but they can choose to use some other VTT parser.
I've found this works as a temporary fix for Teams time formats.
import io
import re

from webvtt import WebVTT, structures
from webvtt.parsers import WebVTTParser

# Relax both patterns to accept 1-2 digit minutes/seconds and 1-3 digit fractions.
structures.TIMESTAMP_PATTERN = re.compile(r'(\d+)?:?(\d{1,2}):(\d{1,2})[.,](\d{1,3})')
WebVTTParser.TIMEFRAME_LINE_PATTERN = re.compile(r'\s*((?:\d+:)?\d{1,2}:\d{1,2}[.,]\d{1,3})\s*-->\s*((?:\d+:)?\d{1,2}:\d{1,2}[.,]\d{1,3})')

# tcontent is assumed to hold the transcript text as a str.
for caption in WebVTT.read_buffer(io.StringIO(tcontent)):
    print(caption.start)
    print(caption.end)
    print(caption.text)
Ah yeah, clever - you can just monkey-patch those variables directly in the module.
... still would be nicer not to have to do that, though 😅
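If you want to sanity-check a relaxed pattern before patching anything, you can try it on a sample line with plain `re`, no webvtt-py needed. This is only a sketch: `STRICT_LINE` is my assumed stand-in for a fixed-digit-count pattern (check what your installed version actually ships), `RELAXED_LINE` is written in the spirit of the workaround above, and the sample line is made up.

```python
import re

# Assumed stand-in for a strict cue timing pattern (fixed digit counts).
STRICT_LINE = re.compile(
    r'\s*((?:\d+:)?\d{2}:\d{2}[.,]\d{3})\s*-->\s*((?:\d+:)?\d{2}:\d{2}[.,]\d{3})'
)
# Relaxed variant: 1-2 digit minutes/seconds, 1-3 digit fraction.
RELAXED_LINE = re.compile(
    r'\s*((?:\d+:)?\d{1,2}:\d{1,2}[.,]\d{1,3})'
    r'\s*-->\s*((?:\d+:)?\d{1,2}:\d{1,2}[.,]\d{1,3})'
)

line = "0:1:5.2 --> 0:1:9.75"  # made-up Teams-style timing line

strict_match = STRICT_LINE.match(line)
relaxed_match = RELAXED_LINE.match(line)

print(strict_match)            # None
print(relaxed_match.groups())  # ('0:1:5.2', '0:1:9.75')
```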
Hey everyone - I just wanted to share a quick fix for a problem I noticed: webvtt-py does not do well when timestamps are in the format 0:1:5.2 as opposed to 00:01:05.002.
I have written a regex find-and-replace that converts the format and shared it in this repo: https://github.com/ZhijingEu/VTT_File_Cleaner. There is also an accompanying video tutorial: https://www.youtube.com/watch?v=iZ0pOSL8JZw
Hope this helps someone out there in the future facing this issue
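For readers who just want the general idea, a regex-based normalization pass could look roughly like the sketch below. This is not the code from the linked repo; the names `LOOSE_TIMESTAMP`, `normalize_timestamp` and `normalize_vtt` are hypothetical. One assumption to flag loudly: it treats the fraction as a decimal fraction of a second, so `.2` becomes `.200`. If your source writes a millisecond count instead, so that `.2` should become `.002`, swap `ljust` for `zfill`.

```python
import re

# Loose timestamp: optional hours, 1-2 digit minutes/seconds, 1-3 fraction digits.
LOOSE_TIMESTAMP = re.compile(r'(?:(\d+):)?(\d{1,2}):(\d{1,2})[.,](\d{1,3})')


def normalize_timestamp(match: re.Match) -> str:
    """Rebuild one matched timestamp as hh:mm:ss.ttt."""
    hours, minutes, seconds, fraction = match.groups()
    # Assumption: ".2" means 0.2 s (200 ms). Use fraction.zfill(3) instead
    # if the source writes a plain millisecond count.
    millis = fraction.ljust(3, "0")
    return f"{int(hours or 0):02d}:{int(minutes):02d}:{int(seconds):02d}.{millis}"


def normalize_vtt(text: str) -> str:
    """Rewrite timestamps on cue timing lines only (lines containing '-->')."""
    fixed = []
    for line in text.splitlines():
        if "-->" in line:
            line = LOOSE_TIMESTAMP.sub(normalize_timestamp, line)
        fixed.append(line)
    return "\n".join(fixed)


sample = "WEBVTT\n\n0:1:5.2 --> 0:1:9.75\nHello"
print(normalize_vtt(sample))
```

Restricting the substitution to lines containing `-->` leaves cue text alone, so something like "see you at 1:30.5" in a caption is not rewritten.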