w3c / media-and-entertainment

Repository for the Media and Entertainment Interest Group

Frame accurate seeking of HTML5 MediaElement #4

Open tidoust opened 6 years ago

tidoust commented 6 years ago

I've heard a couple of companies point out that one of the problems that makes it hard (or at least harder than it could be) to do post-production of video in Web browsers is that there is no easy way to process media elements on a frame-by-frame basis, whereas that is the default mode of operation in Non-Linear Editors (NLEs).

The currentTime property takes a time, not a frame number or an SMPTE timecode. Converting between times and frame numbers is doable, but it requires knowing the framerate of the video, which is not exposed to Web applications (a generic NLE would thus not know it). Plus, that framerate may actually vary over time.

Also, internal rounding of time values may mean that one seeks to the end of the previous frame instead of the beginning of a specific video frame.

Digging around, I've found a number of discussions and issues around the topic, most notably:

  1. A long thread from 2011 on Frame accuracy / SMPTE, which led to improvements in the precision of seeks in browser implementations: https://lists.w3.org/Archives/Public/public-whatwg-archive/2011Jan/0120.html
  2. A list of use cases from 2012 for seeking to specific frames. Not sure if these use cases remain relevant today: https://www.w3.org/Bugs/Public/show_bug.cgi?id=22678
  3. A question from 2013 on whether there was interest to expose "versions of currentTime, fastSeek(), duration, and the TimeRanges accessors, in frames, for video data": https://www.w3.org/Bugs/Public/show_bug.cgi?id=8278#c3
  4. A proposal from 2016 to add a rational time value for seek() to solve rounding issues (still open as of June 2018): https://github.com/whatwg/html/issues/609

There have probably been other discussions around the topic.

I'm raising this issue to collect practical use cases and requirements for the feature, and gauge interest from media companies to see a solution emerge. It would be good to precisely identify what does not work today, what minimal updates to media elements could solve the issue, and what these updates would imply from an implementation perspective.

palemieux commented 6 years ago

There have probably been other discussions around the topic.

Yes. Similar discussions happened during the MSE project: https://www.w3.org/Bugs/Public/show_bug.cgi?id=19676

chrisn commented 6 years ago

There's some interesting research here, with a survey of current browser behaviour.

The current lack of frame accuracy effectively closes off entire fields of possibilities from the web, such as non-linear video editing, but it also has unfortunate effects on things as simple as subtitle rendering.

jpiesing commented 6 years ago

I should also mention that there is some uncertainty about the precise meaning of currentTime - particularly when you have a media pipeline where the frame/sample coming out of the end may be 0.5s further along the media timeline than the ones entering the media pipeline. Some people think currentTime reflects what is coming out of the display/speakers/headphones. Some people think it should reflect the time where video and graphics are composited, as this is easy to test and suits apps trying to sync graphics to video or audio. Simple implementations may re-use a time available in a media decoder.

Daiz commented 6 years ago

what minimal updates to media elements could solve the issue

Related to the matter of frame accuracy on the whole, one idea would be to add a new property to VideoElement called .currentFrameTime which would hold the presentation time value of the currently displayed frame. As mentioned in the research repository of mine (also linked above), .currentTime is not actually sufficient right now in any browser for determining the currently displayed frame even if you know the exact framerate of the video. .currentFrameTime could at least solve this particular issue, and could also be used for monitoring the exact screen refreshes when displayed frames change.
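As a rough sketch of how such a property could be used, assuming the proposed name .currentFrameTime (hypothetical, not an existing API), a page could poll it from requestAnimationFrame to detect the exact refresh at which the displayed frame changes:

```js
const video = document.querySelector('video');
let lastFrameTime = null;

function watchFrames() {
  // Hypothetical property: presentation time of the frame currently on screen.
  const t = video.currentFrameTime;
  if (t !== lastFrameTime) {
    lastFrameTime = t;
    // The displayed frame just changed; react here (e.g. update an overlay).
  }
  requestAnimationFrame(watchFrames);
}
requestAnimationFrame(watchFrames);
```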

jpiesing commented 6 years ago

Related to the matter of frame accuracy on the whole, one idea would be to add a new property to VideoElement called .currentFrameTime which would hold the presentation time value of the currently displayed frame.

The currently displayed frame can be hard to determine, e.g. if the UA is running on a device without a display with video being output over HDMI or (perhaps) a remote playback scenario ( https://w3c.github.io/remote-playback/ ).

mfoltzgoogle commented 6 years ago

Remote playback cases are always going to be best effort to keep the video element in sync with the remote playback state. For video editing use cases, remote playback is not as relevant (except maybe to render the final output).

There are a number of implementation constraints that are going to make it challenging to provide a completely accurate instantaneous frame number or presentation timestamp in a modern browser during video playback.

Some estimates could be made based on knowing the latency of the downstream pipeline. It might be more useful to surface the last presentation timestamp submitted to the renderer and the estimated latency until frame paint.

It may also be more feasible to surface the final presentation timestamp/time code when a seek is completed. That seems more useful from a video editing use case.

Understanding the use cases here and what exactly you need to know would help guide concrete feedback from browsers.

Daiz commented 6 years ago

One of the main use cases for me would be the ability to synchronize content changes outside the video to frame changes in the video. As a simple example, the test case in the frame-accurate-ish repo shows this with the background color change. In my case the main thing would be the ability to accurately synchronize custom subtitle rendering with frame changes. Being even one or two screen refreshes off becomes a noticeable issue when you want to ensure subtitles appear/disappear in sync with scene changes - even a frame or two of subtitles hanging on the screen after a scene change is very noticeable and ugly to look at during playback.

mfoltzgoogle commented 6 years ago

It depends on the inputs to the custom subtitle rendering algorithm. How do you determine when to render a text cue?

Daiz commented 6 years ago

Currently, I'm using video.currentTime and doing calculations based on the frame rate to try to have cues appear/disappear when the displayed frame changes (which is the behavior I want to achieve). As mentioned before, this is not sufficient for frame-accurate rendering even if you know the exact frame rate of the video. There are ways to improve the accuracy with some non-standard properties (like video.mozPaintedFrames in Firefox), but even then the results aren't perfect.
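For illustration, a minimal sketch of that kind of calculation, assuming a known, constant frame rate (as noted, this is not reliably frame accurate in current browsers; renderCuesForFrame is a hypothetical app-side function):

```js
const video = document.querySelector('video');
const FPS = 24000 / 1001; // assumed constant frame rate (~23.976)

function updateCues() {
  // Estimate the currently displayed frame from currentTime.
  const frame = Math.floor(video.currentTime * FPS);
  renderCuesForFrame(frame); // hypothetical: show/hide cues mapped to frame numbers
  requestAnimationFrame(updateCues);
}
requestAnimationFrame(updateCues);
```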

jpiesing commented 6 years ago

It depends on the inputs to the custom subtitle rendering algorithm. How do you determine when to render a text cue?

Perhaps @palemieux could comment on how the imsc.js library handles this?

jpiesing commented 6 years ago

One of the main use cases for me would be the ability to synchronize content changes outside the video to frame changes in the video. As a simple example, the test case in the frame-accurate-ish repo shows this with the background color change. In my case the main thing would be the ability to accurately synchronize custom subtitle rendering with frame changes. Being even one or two screen refreshes off becomes a noticeable issue when you want to ensure subtitles appear/disappear in sync with scene changes - even a frame or two of subtitles hanging on the screen after a scene change is very noticeable and ugly to look at during playback.

This highlights the importance of being clear what currentTime means, as hardware-based implementations or devices outputting via HDMI may have several frames of difference between the media time of the frame being output to the display and the frame being composited with graphics.

ingararntzen commented 6 years ago

With the timingsrc [1] library we are able to sync content changes outside the video with errors <10ms (less than a frame).

The library achieves this by:

  1. using an interpolated clock approximating currentTime (timing object)
  2. synchronizing video (mediasync) relative to a timing object (errors about 7ms)
  3. synchronizing JavaScript cues (sequencer - based on setTimeout) relative to the same timing object (errors about 1ms)

This still leaves delays from DOM changes to on-screen rendering.

In any case, this should typically be sub-framerate sync.

This assumes that currentTime is a good representation of the reality of video presentation. If it isn't, but you know how wrong it is, you can easily compensate.

Not sure if this is relevant to the original issue, which I understood to be about accurate frame stepping - not sync during playback?

Ingar Arntzen

[1] https://webtiming.github.io/timingsrc/

nigelmegitt commented 6 years ago

how the imsc.js library handles this

@jpiesing I can't speak for @palemieux obviously but my understanding is that imsc.js does not play back video and therefore does not do any alignment; it merely identifies the times at which the presentation should change.

However it is integrated into the dash.js player which does need to synchronise the subtitle presentation with the media. I believe it uses Text Track Cues, and from what I've seen they can be up to 250ms late depending on when the Time Marches On algorithm happens to be run, which can be as infrequent as every 250ms, and in my experience often is.

As @Daiz points out, that's not nearly accurate enough.

palemieux commented 6 years ago

What @nigelmegitt said :)

What is needed is a means of displaying/hiding HTML (or TTML) snippets at precise offsets on the media timeline.

ingararntzen commented 6 years ago

What is needed is a means of displaying/hiding HTML (or TTML) snippets at precise offsets on the media timeline.

@palemieux this is exactly what I described above.

The sequencer of the timingsrc library does this. It may be used with any data, including HTML or TTML.

chrisn commented 6 years ago

Not sure if this is relevant to the original issue, which I understood to be about accurate frame stepping - not sync during playback?

@ingararntzen It is a different use case, but a good one nonetheless. Presumably, frame accurate time reporting would help with synchronised media playback across multiple devices, particularly where different browser engines are involved, each with a different pipeline delay. But, you say you're already achieving sub-frame rate sync in your library, based on currentTime, so maybe not?

nigelmegitt commented 6 years ago

@ingararntzen forgive my lack of detailed knowledge, but the approach you describe does raise some questions at least in my mind:

  1. Does it change the event handling model so that it no longer uses Time Marches On?
  2. What happens if the event handler for event n completes after event n+1 should begin execution?
  3. Does the timing object synchronise against the video or does it cause the video to be synchronised with it? In other words, in the case of drift, what moves to get back into alignment?
  4. How does the interpolating clock deal with non-linear movements along the media timeline in the video, such as pause, fast forward and rewind?

Just questions for my understanding, I'm not trying to be negative!

Daiz commented 6 years ago

On the matter of "sub-framerate sync", I would like to point out that for the purposes of high quality media playback, this is not enough. Things like subtitle scene bleeds (where a cue remains visible after a scene change occurs in the video) are noticeable and ugly even if they remain on-screen for just an extra 15-30 milliseconds (i.e. less than a single 24FPS frame, which is ~42ms) after a scene change occurs. Again, you can clearly see this yourself with the background color change in this test case (which has various tricks applied to increase accuracy) - it is very clear when the sync is even slightly off. Desktop video playback software outside browsers does not have issues in this regard, and I would really like to be able to replicate that on the web as well.

ingararntzen commented 6 years ago

@nigelmegitt These are excellent questions, thank you 👍

does it change the event handling model so that it no longer uses Time Marches On?

Yes. The sequencer is separate from the media element (which also means that you can use it for use cases where you don't have a media element). It takes direction from a timing object, which is basically just a thin wrapper around the system clock. The sequencer uses setTimeout() to schedule enter/exit events at the correct time.

What happens if the event handler for event n completes after event n+1 should begin execution?

Being run in the JS environment, sequencer timeouts may be subject to delay if there are many other activities going on (just like any app code). The sequencer guarantees the correct ordering, and will report how much it was delayed. If something like the sequencer were implemented natively by browsers, this situation could be improved further, I suppose. The sequencer itself is lightweight, and you may use multiple sequencers for different data sources and/or different timing objects.
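For illustration, a minimal sketch of the underlying pattern (not the actual timingsrc API; clock() and the cue shape here are assumptions):

```js
// Schedule enter/exit callbacks for a cue against an interpolated media clock.
// clock() is assumed to return the current media position in seconds.
function scheduleCue(cue, clock, onEnter, onExit) {
  const now = clock();
  if (cue.start > now) {
    setTimeout(() => onEnter(cue), (cue.start - now) * 1000);
  }
  if (cue.end > now) {
    setTimeout(() => onExit(cue), (cue.end - now) * 1000);
  }
  // A real sequencer also handles pause, seek and playback rate changes
  // by cancelling and rescheduling these timeouts.
}
```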

Does the timing object synchronise against the video or does it cause the video to be synchronised with it? In other words, in the case of drift, what moves to get back into alignment?

Excellent question! The model does not mandate one or the other. You may either 1) continuously update the timing object from currentTime, or 2) continuously monitor and adjust currentTime to match the timing object (e.g. using a variable playback rate).

Method 1) is fine if you only have one media element, you are doing sync only within one webpage, and you are ok with letting the media element be the master of whatever else you want to synchronize. In other scenarios you'll need method 2), for at least (N-1) synchronized things. We use method 1) only occasionally.

The timingsrc has a mediasync function for method 2) and a reversesync function for method 1) (...I think)

How does the interpolating clock deal with non-linear movements along the media timeline in the video, such as pause, fast forward and rewind?

The short answer: using mediasync or reversesync you don't have to think about that, it's all taken care of.

Some more details: The mediasync library creates an interpolated clock internally as an approximation of currentTime. It can distinguish the natural increments and jitter of currentTime from hard changes by listening to events (e.g. seeks, playback rate changes, etc.).
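As a crude sketch of method 2 (the real mediasync logic is considerably more refined; targetPosition() is an assumed accessor returning the timing object's position in seconds):

```js
function nudge(video, targetPosition) {
  const error = targetPosition() - video.currentTime; // positive: video is behind
  if (Math.abs(error) > 1) {
    video.currentTime = targetPosition();             // large error: hard seek
  } else {
    // Small error: adjust the playback rate slightly to catch up or fall back.
    video.playbackRate = 1 + Math.max(-0.1, Math.min(0.1, error));
  }
}
// Called periodically, e.g. setInterval(() => nudge(video, getPosition), 200);
// where getPosition is a hypothetical accessor for the timing object's position.
```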

ingararntzen commented 6 years ago

@chrisn

Presumably, frame accurate time reporting would help with synchronised media playback across multiple devices, particularly where different browser engines are involved, each with a different pipeline delay. But, you say you're already achieving sub-frame rate sync in your library, based on currentTime, so maybe not?

So, while the results are pretty good, there is no way to ensure that they are always that good (or that they will stay this good), unless these issues are put on the agenda through standardization work.

There are a number of ways to improve/simplify sync.

nigelmegitt commented 6 years ago

you don't have to think about that

@ingararntzen in this forum we certainly do want to think about the details of how the thing works so we can assure ourselves that eventual users genuinely do not have to think about them. Having been "bitten" by the impact of timeupdate and Time Marches On we need to get it right next time!

nigelmegitt commented 6 years ago

Having noted that Time Marches On can conformantly be run too infrequently to meet subtitle and caption use cases, it does have a lot of other things going for it, like smooth handling of events that take too long to process.

In the spirit of making the smallest change possible to resolve it, here's an alternative proposal: increase the frequency at which Time Marches On is run, and timeupdate events are fired, to 50Hz.

I would expect that to be enough to get frame accuracy at 25fps.

ingararntzen commented 6 years ago

@nigelmegitt - sure thing - I was more thinking of the end user here - not you guys :)

If you want me to go more into details that's ok too :)

kevinmarks-b commented 6 years ago

Assuming that framerates are uniform is going to go astray at some point, as mp4 can contain media with different rates. The underlying structure has Movie time and Media time - the former is usually an arbitrary fraction, the latter a ratio specifically designed to represent the timescale of the actual samples, so for US-originated video this will be 1001/30000.

Walking through the media rates and getting frame times is going to give you glitches with longer files.

If you want to construct an API like this I'd suggest mirroring what QuickTime did - it had two parts: the movie export API, which would give you callbacks for each frame rendered in sequence, telling you the media and movie times, and the GetNextInterestingTime() API, which you could call iteratively and which would do the work of walking the movie, track edits and media to get you the next frame or keyframe.

Mozilla did make seekToNextFrame, but that was deprecated: https://developer.mozilla.org/en-US/docs/Web/API/HTMLMediaElement/seekToNextFrame

mfoltzgoogle commented 6 years ago

@Daiz For your purposes, is it more important to have a frame counter, or an accurate currentTime? What do you believe currentTime should represent?

Daiz commented 6 years ago

@mfoltzgoogle That depends - what exactly do you mean by a frame counter? As in, a value that would tell me the absolute frame number of the currently displayed frame? For example, if I have a 40000 frame long video with a constant frame rate of 23.976 FPS, and currentTime is about 00:12:34.567 (754.567s), this hypothetical frame counter would have a value of 18091. This would most certainly be useful for me.

To reiterate, for me the most important use case for frame accuracy right now would be to accurately snap subtitle cue changes to frame changes. A frame counter like described above would definitely work for this. Though since I personally work on premium VOD content where I'm in full control of the content pipeline, accurate currentTime (assuming that it means that with a constant frame rate / full frame rate information I would be able to reliably calculate the currently displayed frame number) would also work. But I think the kind of frame counter described above would be a better fit as more general purpose functionality.
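For reference, the arithmetic behind that example, assuming a constant 24000/1001 frame rate:

```js
const fps = 24000 / 1001;                     // ≈ 23.976
const currentTime = 754.567;                  // seconds
const frame = Math.floor(currentTime * fps);  // 18091
```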

mfoltzgoogle commented 6 years ago

We would need to consider skipped frames, buffering states, splicing MSE buffers, and variable FPS video to nail down the algorithm to advance the "frame counter", but let's go with that as a straw-man. Say, adding a .frameCounter read-only property to <video>.

When you observe the .frameCounter for a <video> element, say in requestAnimationFrame, which frame would that correspond to?

palemieux commented 6 years ago

@mfoltzgoogle Instead of a "frame counter", which is video-centric, I would consider adding a combination of timelineOffset and timelineRate, with timelineOffset being an integer and timelineRate a rational, i.e. two integers. The absolute offset (in seconds) is then given by timelineOffset divided by timelineRate. If timelineRate is set to the frame rate, then timelineOffset is equal to an offset in # of frames. This can be adapted to other kinds of essence that do not have "frames".
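A sketch of what reading such properties might look like (names as proposed above; purely hypothetical, not an existing API):

```js
// Hypothetical additions to the media element:
//   video.timelineOffset : integer
//   video.timelineRate   : rational, e.g. { numerator: 30000, denominator: 1001 }
const { numerator, denominator } = video.timelineRate;
const offsetSeconds = video.timelineOffset * denominator / numerator;
// If timelineRate equals the frame rate, timelineOffset is a count of frames.
```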

Daiz commented 6 years ago

When you observe the .frameCounter for a <video> element

For frame accuracy purposes, it should obviously correspond to the currently displayed frame on the screen.

Also, something that I wanted to say is that I understand there's a lot of additional complexity to this subject under various playback scenarios, and that it's probably not possible to guarantee frame accuracy under all scenarios. However, I don't think that should stop us from pursuing frame accuracy where it would indeed be possible. Like if I have just a normal browser window in full control of video playback, playing video on a normal screen attached to my computer, even having frame accuracy just there alone would be a huge win in my books.

nigelmegitt commented 6 years ago

The underlying structure has Movie time and Media time - the former is usually an arbitrary fraction, the latter a ratio specifically designed to represent the timescale of the actual samples, so for US-originated video this will be 1001/30000.

@kevinmarks-b "media time" is also used elsewhere as a generic term for "the timeline related to the media", independently of the syntax used, i.e. it can be expressed as an arbitrary fraction or a number of frames etc, for example in TTML.

nigelmegitt commented 6 years ago

the most important use case for frame accuracy right now would be to accurately snap subtitle cue changes to frame changes. A frame counter like described above would definitely work for this.

@Daiz I agree the use case is important and would like to achieve the same result, but I disagree that a frame counter would work. In fact, a frame counter would absolutely not work!

The reason is that we typically distribute a single subtitle file but have multiple profiles of video encoding, where one approach to managing the bitrate adaptively is to vary the frame rate. I think our lowest frame rate profile is about 6.25 fps. In that situation, quantizing subtitle times to the frame rate is a very bad idea. For more on this, see EBU-TT-D annex E.

That's why we use media time expressions with decimal fractions of seconds, and arrange that those time expressions work against the media at some canonical frame rate, such as 25fps in the knowledge that it will also work at other frame rates.

Daiz commented 6 years ago

@nigelmegitt Do note that I was primarily talking about my use case - I do the same thing with multiple profiles and single subtitle file(s), but I keep the frame rate consistent across all the variants.

Still, even for your use case with varying frame rates I'd expect the frame counter to be useful: even if you couldn't use the frame numbers themselves, you could still observe the value for the exact moments when frames change and act on that. And if you have information about which loaded chunks are lower framerate, it shouldn't be too hard to make use of the frame number itself either. This really applies to variable frame rate in general: as long as full information about the variations is exposed to JS (which could even be, e.g., pre-formatted data sent by the server for the application, not necessarily something provided by VideoElement/MediaElement itself), it should theoretically be possible to always know where you are in the video, both frame- and time-wise.

kevinmarks-b commented 6 years ago

This is the difficulty when the subtitles are outside the composition engine, and this is where losing QT's multi-dataref abstraction for media hurts. Text tracks did make it into mp4, but I don't think the ability to edit them dynamically did. @nigelmegitt for your decimation use case, having the subtitles on the timescale of the highest framerate video makes sense.

The baseline assumption that captions and subtitles should obscure the video is also odd to me - it's a hangover from analogue displays and title-safe areas with overscan. With the abundance of pixels we have now, rendering the captions or subtitles in a separate screenspace that doesn't obscure the action seems hugely preferable, and would mitigate the composition issues.

Daiz commented 6 years ago

The baseline assumption that captions and subtitles should obscure the video is also odd to me

With the abundance of pixels we have now, rendering the captions or subtitles in a separate screenspace that doesn't obscure the action seems hugely preferable

Rendering the subtitles outside the video area is generally a pretty terrible idea from both readability and ergonomic standpoints. When the subtitles are on screen, in a decent font, at a decent size, with decent margins and decent styling, you can read them by basically just glancing while keeping your primary focus on the video itself the whole time, which is important because you are watching constantly progressing and moving video at the same time. Not to mention that video content is often watched from a much larger distance (e.g. on TV screens). If the subtitles are outside the video frame, suddenly you have to constantly move your eyes in order to read them, which would make for a terrible viewing experience all around.

To borrow a demonstration from an old Twitter thread of mine, here's an example of bad subtitle styling:

[Image: styling_bad]

Some of the issues:

  1. Way too little vertical margin - more likely to require active eye movement in order to read the subtitles at all
  2. Small font size - making the text too small will require more focusing in order to take in the text
  3. Non-optimal styling in general - the border is thin and there's no shadow to "elevate" the subs from the video, which can result in the subs blending into the background and thus becoming harder to read
  4. No line breaking - it's faster to read two short lines in a Z-like motion than to move your eyes across one wide horizontal line

Here's the same subs again with the aforementioned issues fixed:

[Image: styling_good]

From a pixel counting perspective, these subs indeed obscure more of the video than the former example (or if you placed the subs outside the video frame entirely), but from a user experience standpoint, it actually enables the viewer to focus much better on the content since they don't have to actively divert their attention and focus away from the video to read the subtitles.

Apologies for the long and mostly off-topic post, but I think it's important to point out that "why don't we just render the subs off screen" is not a good strategy to pursue and recommend.

Snarkdoof commented 6 years ago

In response to @nigelmegitt's suggestion [1] to increase the timeupdate frequency to 50Hz, I'd just like to point out two things:

  1. As transferring the playback state (e.g. currentTime) from the player to JS takes more than zero time, the value received in JS will always be outdated. We see this very clearly as massive jitter in the currentTime reporting. Going to 50Hz will make this error smaller, but it will still be there.

  2. Triggering events with event handlers at 50Hz will increase resource usage by a lot, even if there is no real reason for it. For example, updating a progress bar at 50Hz is just silly; similarly, a subtitle that is shown for 5.4 seconds will have executed the code for checking whether it should be removed 269 times for no reason.

In my opinion, using a sequencer as @ingararntzen suggests is a much better idea - as long as the play state continues normally, timeouts will ensure minimal resource usage and high precision. This means that we need a timestamp added to the event (the time when it was generated). This holds for all media events.

[1] https://github.com/w3c/media-and-entertainment/issues/4#issuecomment-397209849

ingararntzen commented 6 years ago

@Snarkdoof - as you indicate, if timeUpdate events were timestamped by the media player (using a common clock - say performance.now), then it would become much easier to maintain a precise, interpolated media clock in JS. Importantly, the precision of such an interpolated clock would not be highly sensitive to the frequency of the timeUpdate event (so we could avoid turning that up unnecessarily).

In addition, a precise interpolated media clock is a good basis for using timeouts to fire enter/exit events for subtitles etc. If one would prefer enter/exit events to be generated natively within the media element, using timeouts instead of poll-based time-marches-on should be an attractive solution there as well.

So, adding timestamps to media events seems to me like a simple yet important step forward. Similarly, there should be timestamps associated with control events such as play, pause, seekTo, ...
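A minimal sketch of the interpolation this would enable, assuming the timeupdate event carried a timestamp, on the performance.now() clock, for when currentTime was actually sampled in the media pipeline (today's Event.timeStamp reflects event creation, not pipeline sampling, which is the point being made):

```js
const video = document.querySelector('video');
let last = null;

video.addEventListener('timeupdate', (e) => {
  // Assumption: e.timeStamp marks when currentTime was sampled in the pipeline.
  last = { mediaTime: video.currentTime, sampledAt: e.timeStamp, rate: video.playbackRate };
});

function estimatedPosition() {
  if (!last) return video.currentTime;
  const elapsed = (performance.now() - last.sampledAt) / 1000;
  return last.mediaTime + elapsed * last.rate;
}
```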

nigelmegitt commented 6 years ago

you could still observe the value for the exact moments when frames change and act on that

@Daiz It is more important to align with audio primarily, and video frames secondarily. Otherwise you end up with the quantisation problem where you only update subtitles on frame boundaries, and the system breaks down at low frame rates. Whatever design we end up with can't require alignment with encoded video frames in all cases for that reason. Note that playback systems sometimes generate false frames after decoding, which generates a new set of frame boundaries!

nigelmegitt commented 6 years ago

The baseline assumption that captions and subtitles should obscure the video is also odd to me - it's a hangover from analogue displays and title-safe ares with overscan. With the abundance of pixels we have now, rendering the captions or subtitles in a separate screenspace that doesn't obscure the action seems hugely preferable, and would mitigate the composition issues.

@kevinmarks-b Testing with the audience has shown that in limited cases they do prefer subtitles outside the video, particularly when the video is embedded in a bigger page and is relatively small. In general though the audience does prefer subtitles super-imposed over the video. I speculate that smaller eye movements are easier and result in less of the video content being missed during the eye movement.

nigelmegitt commented 6 years ago

@Snarkdoof , in response to https://github.com/w3c/media-and-entertainment/issues/4#issuecomment-397593999, you make good points, thank you!

kevinmarks-b commented 6 years ago

@nigelmegitt that is interesting - have you tested that for letterboxed content? Putting subtitles in the black rather than over the picture?

Daiz commented 6 years ago

@kevinmarks-b I can't necessarily speak for in general but I personally prefer for the subtitles to remain in the actual video area even with 2.35:1 content, though I tend to pair that with a smaller vertical margin than what I'd use with 16:9 content:

[Image: subvideoframe]

nigelmegitt commented 6 years ago

@kevinmarks-b I'm not aware of tests with letterboxed content, though I know it is a practice that is used for example in DVD players that play 16:9 video in a 4:3 aspect ratio video output. I've not seen many implementations that work well though - there is often some overlap between the subtitles and the active image, which is somewhat jarring.

More importantly, letterboxing tends to be an exceptional case rather than the norm, so it is not a general solution.

nigelmegitt commented 6 years ago

Thinking more about https://github.com/w3c/media-and-entertainment/issues/4#issuecomment-397593999 :

transferring the playback state (e.g. currentTime) from player to JS takes more than 0 time, the value received in JS will always be outdated.

Can this not be modelled or measured adaptively though to minimise the impact?

Triggering events with event handlers at 50hz will make increase resource usage by a lot, even if there is no real reason for it.

@Snarkdoof have you got data to back this up? Running time marches on when the list of events to add to the queue is empty is not going to take many CPU clock cycles. Relative to the resource needed to play back video, I suspect it is tiny. But the advantage of doing it is huge when there is an event to add to the queue. It'd be good to move from speculation (on both our parts!) to measurement.

Daiz commented 6 years ago

@nigelmegitt Aligning subtitle cue changes with video frames is very much important too for high quality media playback purposes. Here's a quick demonstration of subtitle scene bleeding that you can get if the cue changes are not properly aligned with frame changes - in this example the timing is a whole frame off (~42ms), but shorter bleeds are similarly noticeable and extremely ugly. It's true that frame alignment may not always be desirable (mostly if you're dealing with very low FPS video), but it should definitely be possible. As I mentioned earlier, desktop playback software does not have issues in this regard, and I'd really like for that to be the case for web video as well.

Snarkdoof commented 6 years ago

@nigelmegitt The transfer time between the player and JS execution can, as I probably worded badly, be compensated for by adding a timestamp for when the event was created (or the data "created", if you wish). A timestamp should in my opinion be added to ALL events, in particular media events - one can then see that 32ms ago the player was at position X, which means that you can now be quite certain that the player is at X+32ms right now. :)

I don't have any data to back up the resource claim, but logic dictates that any code looping at high frequency will necessarily keep the CPU awake. For many devices, video decoding is largely done in HW, and keeping CPUs running on lower power states is terribly important. My largest point really is that if we export the clock properly (e.g. performance now + currentTime), it's trivial to calculate the correct position at any time in JS. This makes it very easy to use timeouts to wake up directly from JS.

I've got a suspicion that you are not really looking for ways to use JS to cover your needs with regard to timed data, but rather to have built-in support in the players and use Data Cues etc. Just to be clear - the JS sequencer @ingararntzen has mentioned typically wakes up and provides callbacks with well under 1ms error on most devices - if that doesn't cover all subtitle needs, I don't know what kind of eyesight you guys have. ;-) We have larger issues with CSS updates (e.g. style.opacity=1) taking longer on slower devices (e.g. Raspberry Pis) than we do with synchronizing the actual function call.

tidoust commented 6 years ago

One comment on @nigelmegitt's https://github.com/w3c/media-and-entertainment/issues/4#issuecomment-397209849 and @Snarkdoof's https://github.com/w3c/media-and-entertainment/issues/4#issuecomment-397593999. There seems to be a slight confusion between the frequency at which the "time marches on" algorithm runs and the frequency at which that algorithm triggers timeupdate events.

The "time marches on" algorithm only triggers events when needed, and timeupdate events once in a while. Applications willing to act on cues within a particular text track should not rely on timeupdate events but rather on cuechange events of the TextTrack object (or on enter/exit events of individual cues), which are fired as needed whenever the "time marches on" algorithm runs.

The HTML spec requires the user agent to run the "time marches on" algorithm when the current playback position of a media element changes, and notes that this means that "these steps are run as often as possible". The spec also mandates that the "current playback position" be increased monotonically when the media element is playing. I'm not sure how to read that in terms of minimum/maximum frequency. Probably as giving full leeway to implementations. Running the algorithm at 50Hz seems doable though (and wouldn't trigger 50 events per second unless there are cues that need to switch to a different state). Implementations may optimize the algorithm as long as it produces the same visible behavior. In other words, they could use timeouts if that's more efficient than looping through cues each time.
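For reference, a minimal example of relying on cue enter/exit rather than timeupdate, using the existing TextTrack APIs that "time marches on" drives (showSubtitle/hideSubtitle are hypothetical app-side rendering functions):

```js
const video = document.querySelector('video');
const track = video.addTextTrack('subtitles', 'English', 'en');
track.mode = 'hidden'; // cues are tracked by "time marches on" but not rendered natively

const cue = new VTTCue(754.567, 757.0, 'Example subtitle text');
cue.onenter = () => showSubtitle(cue); // hypothetical custom renderer
cue.onexit = () => hideSubtitle(cue);
track.addCue(cue);

// Alternatively, react to any change in the set of active cues:
track.addEventListener('cuechange', () => {
  console.log('Active cues:', track.activeCues.length);
});
```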

tidoust commented 6 years ago

When you observe the .frameCounter for a <video> element, say in requestAnimationFrame, which frame would that correspond to?

For frame accuracy purposes, it should obviously correspond to the currently displayed frame on the screen.

@Daiz requestAnimationFrame typically runs at 50-60Hz, so once every 16-20ms, before the next repaint. You mentioned elsewhere that 15-30ms delays were noticeable for subtitles. Wouldn't it be more interesting to get the "frame that will be rendered on next repaint" instead, to avoid reacting up to 16ms late to a change of frame?

I'm not saying that it's easy to get from an implementation perspective, given the comment raised by @mfoltzgoogle https://github.com/w3c/media-and-entertainment/issues/4#issuecomment-396701652. In particular, I suspect browser repaints are not necessarily synchronized with video repaints, but the problem seems to exist in any case.

nigelmegitt commented 6 years ago

Thanks @tidoust that is helpful. Last time I looked at this, a month or so back, I assured myself that time marches on itself could be run only every 250ms conformantly, but the spec text you pointed to suggests that timing constraint only applies to timeupdate events. Now I wonder if I misread it originally.

Nevertheless, time marches on frequency is dependent on some unspecified observation of the current playback position of the media element changing, which looks like it should be more often than 4Hz (every frame? every audio sample?).

In practice, I don't think browsers actually run time marches on whenever the current playback position advances by e.g. 1 frame or 1 audio sample. The real world behaviour seems to match the timing requirements for firing timeupdate events, at the less frequent end.

Snarkdoof commented 6 years ago

@tidoust It's a good point that the "internal" loop of Time Marches On does not trigger JS events every time, but increasing the speed of any loop (or doing more work in each pass) will use more resources. As I see it there are two main ways of timing things that are relevant to media on the web:

  1. Put the "sequencer" logic (what's being triggered when) inside the browser and trigger events
  2. Put the "sequencer" logic in JS and trigger events

If 1) is chosen, a lot more complexity is moved to a very tight loop that's frankly busy with more important stuff. It is also less flexible as arbitrary code cannot be run in this way (nor would we want it to!). 2) depends solely on exporting a timestamp with the currentTime (and preferably other media events too), which would allow a JS Timing Object to accurately export the internal clock of the media. As such, a highly flexible solution can be made using fairly simple tools, like the open timingsrc implementation. Why would we not want to choose a solution that is easier, more flexible and if anything, saves CPU cycles?

cuechange also has a lot of other annoying issues, like not triggering when a skip event occurs (e.g. jumping "mid" subtitle), making it necessary to cut and paste several lines of code that check the active cues in order to get the expected behaviour.

nigelmegitt commented 6 years ago

If 1) is chosen, a lot more complexity is moved to a very tight loop that's frankly busy with more important stuff.

@Snarkdoof is it really busy with more important stuff? Really?

2: Put the "sequencer" logic in JS and trigger events

Browsers only give a single thread for event handling and JS, right? So adding more code to run in that thread doesn't really help address contention issues.

cuechange also has a lot of other annoying issues, like not trigging when as skip event occurs

The spec is explicit that it is supposed to trigger in this circumstance. Is this a spec vs implementation-in-the-real-world issue?

I have the sense that we haven't got good data about how busy current CPUs are handling events during media playback in a browser, with subtitles alongside. The strongest requirements statement we can make is that we do want to achieve adequate synchronisation (whatever "adequate" is defined as) with minimal additional resource usage.