lostdistance opened this issue 4 years ago
Synchronization is not achieved by aligning to frames, but by times. Re-sampling the frame rate of the video makes no difference to the overall timing; an utterance that starts at 2s and lasts for 3s still ends at 5s, no matter how many frames of video were played. So I do not understand your conclusion; there is no drift.
He's talking about a simple speedup and slowdown, such as is done with NTSC and PAL conversions, for example, not about frame interpolation methods. The same number of frames is displayed faster or slower, meaning that all individual frames will show up earlier or later.
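To put numbers on "all individual frames will show up earlier or later", here is a quick back-of-the-envelope sketch in Python (illustrative values, not taken from any particular file):

```python
# A film-to-PAL conversion plays the same frames at 25 fps instead of 24 fps.
# Frame N is therefore shown at N/25 seconds instead of N/24 seconds.

def frame_time(frame: int, fps: float) -> float:
    """Wall-clock time at which `frame` is displayed."""
    return frame / fps

# A cue authored to start 120 s into a 24 fps master sits on frame 2880.
frame = int(120 * 24)                # 2880

original = frame_time(frame, 24.0)   # 120.0 s
sped_up = frame_time(frame, 25.0)    # 115.2 s

print(f"frame {frame}: {original:.1f}s at 24fps, {sped_up:.1f}s at 25fps")
print(f"the picture now arrives {original - sped_up:.1f}s early")
```

The gap grows linearly with runtime, which is exactly the drift being described.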
Ironically, aligning to timecodes instead of frames, as you mentioned, is the cause of this issue.
If you display the video at the wrong rate, then you need to adjust timestamps. Isn't that obvious?
It is, but the problem here is that, because of all the legacy support the old broadcasters and content producers are saddled with, they are forced to distribute differently encoded videos in different markets (i.e., pre-encoded at the wrong rate), so there will necessarily be two sets of subtitles and videos even though their content is the same.
It's not hard to imagine cases where users will somehow end up with a non-matching set of video and subtitles. The simple subtitle-delay option that's present in most players, and is pretty intuitive for laymen, will not be able to help them; they would need to know about the concept of framerates, find some converter, and fix the problem manually.
Also, subtitle files are used in all kinds of tools in the content-generating pipeline; they're not just for distributing content. Unless there's some way to encode metadata inside them to help with automation (such as this framerate conversion), it will have to be tracked externally, which is error-prone, so I can see producers eventually each inventing their own ad hoc NOTE syntaxes to solve the same problem.
I've just thought of an even more obvious example. Even 24 fps and 23.976 fps, which are for all intents and purposes the same framerate, and which those not in the know often treat as the same, will drift by almost 2 seconds in the first half hour, and by almost 4 seconds in one hour.
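Those figures check out; a minimal sketch of the arithmetic, using the exact NTSC-film rate 24000/1001:

```python
# Drift between subtitles timed for 24 fps and the same frames played at
# 23.976 fps (= 24000/1001): every frame takes slightly longer, so each
# event slips progressively later relative to the authored timestamps.

def drift(authored_seconds: float, authored_fps: float, played_fps: float) -> float:
    """How late an event lands when its frame is played at `played_fps`."""
    frames = authored_seconds * authored_fps
    return frames / played_fps - authored_seconds

half_hour = drift(30 * 60, 24.0, 24000 / 1001)   # ~1.80 s
one_hour = drift(60 * 60, 24.0, 24000 / 1001)    # ~3.60 s

print(f"{half_hour:.2f}s after 30 min, {one_hour:.2f}s after 60 min")
```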
Considering that cameras film at both framerates, some physical and broadcast formats support only one or the other (23.976 for NTSC DVDs and even current HD broadcasts in historically NTSC countries; both for Blu-rays, which nevertheless usually use 24 fps in Europe), and each digital platform, film festival and whatnot seems to have its own preference, it's really a miracle that mistakes don't happen constantly.
So if you use VTT as the master format, it would be pretty handy to have this metadata available.
There have been plans for a standardised metadata section at the start of a WebVTT file, including things like language, kind and aspect-ratio (see e.g. https://dvcs.w3.org/hg/text-tracks/raw-file/default/608toVTT/608toVTT.html#metadata-xds in relation to 608 metadata).
This does not have to be part of the WebVTT spec - it could be a schema.org schema that is referred to in a NOTE section at the start, a bit like it's being done with HTML.
It's not metadata. What you're asking for is the ability to apply a scaling factor to the VTT timestamps; e.g. "please multiply all timestamps by 24/23.976". This is a normative, not informative statement.
I guess you're right. It could be something like a timeStretch variable that is 1.0 by default, but could be set lower or higher than 1.
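To sketch what such a hypothetical timeStretch could do (timeStretch is the variable proposed above, not anything in the current spec), here is one way a processor might rescale cue timings; the regex only covers WebVTT's basic hh:mm:ss.mmm and mm:ss.mmm timestamp forms:

```python
import re

# Hypothetical timeStretch handling: multiply every cue timestamp by a
# constant factor, e.g. 24/25 when retiming cues authored for a 24 fps
# master onto the same material sped up to 25 fps.

TS = re.compile(r"(?:(\d+):)?([0-5]\d):([0-5]\d)\.(\d{3})")

def stretch_line(line: str, factor: float) -> str:
    """Rescale every WebVTT timestamp found in `line` by `factor`."""
    def repl(m: re.Match) -> str:
        h = int(m.group(1) or 0)
        total_ms = ((h * 60 + int(m.group(2))) * 60 + int(m.group(3))) * 1000 + int(m.group(4))
        total_ms = round(total_ms * factor)
        s, ms = divmod(total_ms, 1000)
        mins, s = divmod(s, 60)
        h, mins = divmod(mins, 60)
        return f"{h:02d}:{mins:02d}:{s:02d}.{ms:03d}"
    return TS.sub(repl, line)

print(stretch_line("00:02:00.000 --> 00:02:03.000", 24 / 25))
# 00:01:55.200 --> 00:01:58.080
```

Note that a real implementation would apply this at render time rather than rewriting the file, so the same file could serve both rates.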
I understand that WebVTT timecode (unlike some other timecodes in use) is not frame-based. It is time-only. If (as TTML permits) the timecode could include frame numbers, then (as in TTML) the framerate would need to be specified.
Am I missing some variant in WebVTT that allows timecode specified in frames? If not, then I agree with Dave that something is wrong here. WebVTT time drift relative to video/audio indicates to me that something is broken in implementations.
So if you use VTT as the master format, it would be pretty handy to have this metadata available.
@aaaxx @lostdistance It's well worth analysing the requirements you have for your master format and choosing the format that best meets those requirements. This area of authoring vs distribution is complex, and you've hit on one of the significant areas of complexity. Potentially you might find that e.g. TTML using smpte timebase is a better fit, with a downstream conversion to whatever distribution format you need. For example the EBU-TT profile of TTML1 (EBU Tech3350), and the EBU-TT Metadata schema (EBU Tech3390) support some of the more complex exchange use cases including frame rate variance. Of course you might also find that such a solution introduces other complexities and that there's a balance to be made between competing needs.
WebVTT doesn't allow timing to be specified in frames. It only has representations of hours, minutes, seconds, and milliseconds (https://w3c.github.io/webvtt/#webvtt-timestamp).
Providing a framerate in the WebVTT file, as metadata or however, does make sense from a certain perspective, but I think this is slightly tricky considering WebVTT's usage in the web platform. The media element doesn't expose a framerate that is accessible to developers, so this could never be polyfilled without external knowledge; only native WebVTT implementations would be able to account for it. A timestretch option would make it easier to implement, but I think it would make it a bit trickier for human readers, and it would still only allow the file to be associated with one particular video. Unless this is a new attribute on the track element, as out-of-band info? A person looking at a value of 1.001 wouldn't necessarily know that the file's timings were written for 23.976 fps but are being displayed at 24 fps. Looking into TTML, it seems to have both frameRate and frameRateMultiplier, which would solve some of my issues with the above, though then you need to keep track of two properties.
Either way, we would need to update both WebVTT and HTML. Not that we shouldn't pursue it for that reason alone, but with our limited resources (namely, just me) it does make it harder.
To me, the safest approach right now seems to just be including the framerate as metadata so that people can easily tell what framerate the WebVTT file was authored against. Then, it could be expanded to include other capabilities should we need/want them.
The more I think about it, the more I think there are two things to consider:
1/ the framerate that the WebVTT file was authored against. Is this actually relevant? I mean: we're authoring against a video file that simply plays back in standard time, so it doesn't really matter what the video's framerate was.
2/ the relationship between the WebVTT file and the video file in relation to playback speed. Typically, we have kept such information out of WebVTT; it was instead added to attributes of the <track> element.
I wonder if this is actually more a problem of the video file rather than of the WebVTT file? Could the problem of relative synchronisation between the two be dealt with by the player knowing more about the video file's "stretched" or "compressed" framerate?
For 1, I guess it depends on how much we care about improperly re-sampled videos. If the original video that the WebVTT was written against is 24 fps but is then displayed at 23.976, 25, or 30 fps, the timing of the cues will not match up properly. Knowing the fps of the video associated with the WebVTT file will let us know whether they will get out of sync. With proper re-sampling, the timing should stay the same.
This definitely could be solved by the player knowing about the two frame rates and adjusting things accordingly.
Indeed, I guess another question is why you wouldn't author the new video file with the correct timestamps, so as to retain timing. Nothing in MP4 stops you assigning a strange timescale to the video track, or weird-looking timestamps.
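To illustrate that MP4 point: MP4 stores integer tick counts against an integer per-track timescale, so even the "odd" NTSC-film rate can be carried exactly (the numbers below are illustrative, not from any particular file):

```python
from fractions import Fraction

# With a timescale of 24000 ticks per second and a frame duration of 1001
# ticks, the track rate is exactly 24000/1001 fps (~23.976) -- no rounding.
timescale = 24000
frame_duration = 1001

fps = Fraction(timescale, frame_duration)
print(fps, float(fps))          # 24000/1001, ~23.976

# The presentation time of frame N, in seconds, is likewise exact:
frame = 43200                   # 30 minutes of 24 fps material
pts_seconds = Fraction(frame * frame_duration, timescale)
print(float(pts_seconds))       # 1801.8 s
```

So a correctly retimed video track can carry exact timestamps, and a matching WebVTT file would then need no adjustment at all.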
Another issue with just having it as metadata, while helpful, is that stale and incorrect metadata is worse than no metadata.
For reference, the EBU-TT-D profile v1 of TTML included ebuttm:authoredFrameRate and ebuttm:authoredFrameRateMultiplier for a similar reason, with the explanation that their presence is informative only and "should only be used for conformance and quality checking purposes, for example to determine when the video playback rate of the related media object may differ from the expected playback rate when the subtitle document was authored (e.g. 24fps video played at 25fps by speeding up the playback rate by 4.1666%)". So there is precedent for this kind of metadata.
I often play media which does not include audio tracks or subtitles in my preferred language. My solution is to acquire standalone subtitle files from external sources.
However, one omission from the SRT file format (used as the basis for WebVTT) is the presumed video frame rate. Without this, rendering software is provided no mechanism to automatically adjust timestamps as needed. Such a need arises when a subtitle file authored for one frame rate (for example, 24 fps) is to be rendered with media that has undergone standards conversion to another frame rate (for example, 25 fps). Without adjustment, the subtitles will progressively lose synchronization with the media.
WebVTT defines timestamps but does not define the presumed video frame rate required to automatically adjust timestamps as needed.
Please define the presumed video frame rate.