w3c / media-and-entertainment

Repository for the Media and Entertainment Interest Group

Frame accurate seeking of HTML5 MediaElement #4

Open tidoust opened 6 years ago

tidoust commented 6 years ago

I've heard a couple of companies point out that one of the problems that makes it hard (at least harder than it could be) to do post-production of videos in Web browsers is that there is no easy way to process media elements on a frame by frame basis, whereas that is the usual default in Non-Linear Editors (NLE).

The currentTime property takes a time, not a frame number or an SMPTE timecode. Converting between times and frame numbers is doable, but it requires knowing the framerate of the video, which is not exposed to Web applications (a generic NLE would thus not know about it). Plus, that framerate may actually vary over time.

Also, internal rounding of time values may mean that one seeks to the end of the previous frame instead of the beginning of a specific video frame.
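For illustration, here is a minimal sketch of the time/frame conversion, assuming a constant framerate known out of band (the fps value is an assumption; as noted above, the platform does not expose it). Targeting the middle of a frame is a common mitigation for the rounding issue just described:

const fps = 25; // assumed to be known out of band

function timeToFrame(t) {
  return Math.floor(t * fps);
}

function frameToSeekTime(frame) {
  // aim at the middle of the frame so that internal rounding does not
  // land on the end of the previous frame
  return (frame + 0.5) / fps;
}

// e.g. video.currentTime = frameToSeekTime(100);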

Digging around, I've found a number of discussions and issues around the topic, most notably:

  1. A long thread from 2011 on Frame accuracy / SMPTE, which led to improvements in the precision of seeks in browser implementations: https://lists.w3.org/Archives/Public/public-whatwg-archive/2011Jan/0120.html
  2. A list of use cases from 2012 for seeking to specific frames. Not sure if these use cases remain relevant today: https://www.w3.org/Bugs/Public/show_bug.cgi?id=22678
  3. A question from 2013 on whether there was interest to expose "versions of currentTime, fastSeek(), duration, and the TimeRanges accessors, in frames, for video data": https://www.w3.org/Bugs/Public/show_bug.cgi?id=8278#c3
  4. A proposal from 2016 to add a rational time value for seek() to solve rounding issues (still open as of June 2018): https://github.com/whatwg/html/issues/609

There have probably been other discussions around the topic.

I'm raising this issue to collect practical use cases and requirements for the feature, and gauge interest from media companies to see a solution emerge. It would be good to precisely identify what does not work today, what minimal updates to media elements could solve the issue, and what these updates would imply from an implementation perspective.

Daiz commented 6 years ago

@tidoust

Wouldn't it be more interesting to get the "frame that will be rendered on next repaint" instead, to avoid reacting up to 16ms late to a change of frame?

Good point. The key thing is indeed to be able to act essentially in sync with the frame changes, so that you can match external actions to frame changes in the video.

jpiesing commented 6 years ago

Wouldn't it be more interesting to get the "frame that will be rendered on next repaint" instead, to avoid reacting up to 16ms late to a change of frame?

Good point. The key thing is indeed to be able to act essentially in sync with the frame changes, so that you can match external actions to frame changes in the video.

I agree with that aim, but then you need to be very careful about definitions, as there may be several frame-times' worth of delay between where graphics and video are composited and what the user is actually seeing. I suspect both are needed!

nigelmegitt commented 6 years ago

The key thing is indeed to be able to act essentially in sync with the frame changes, so that you can match external actions to frame changes in the video.

@Daiz as I just pointed out on the M&E call (minutes), this is only known to be true at the 25-30fps sort of rate, designed to be adequately free of flicker for video. It's unknown at high frame rates, and entirely inadequate at low frame rates, where synchronisation with audio is more important.

We should avoid generalising based on assumptions that the 25-30fps rate will continue to be prevalent, and gather data where we don't yet have it. We also need a model that works for other kinds of data than subtitles and captions, since they may have more or less stringent synchronisation requirements.

tidoust commented 6 years ago

@Snarkdoof Like @nigelmegitt, I don't necessarily follow you on the performance penalties. Regardless, what I'm getting out of this discussion on subtitles is that there are possible different ways to improve the situation (they are not necessarily exclusive).

One possible way would be to have the user agent expose a frame number, or a rational number. This seems simple in theory, but apparently hard to implement. Good thing is that it would probably make it easy to act on frame boundaries, but these boundaries might be slightly artificial (because the user agent will interpolate these values in some cases).

Another way would be to make sure that an application can relate currentTime to the wall clock, possibly completed with some indication of the downstream latency. This is precisely what was done in the Web Audio API (see the definition of the AudioContext interface and notably the getOutputTimestamp() method and the outputLatency property). It seems easier to implement (it may be hard to compute the output latency, but adding a timestamp whenever currentTime changes seems easy). Now an app will still have some work to do to detect frame boundaries, but at least we don't ask the user agent to report possibly slightly incorrect values.
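For reference, here is a minimal sketch of the Web Audio API pattern referred to above, which relates a position on the audio timeline to the performance.now() clock; the suggestion is that something comparable could exist for media elements, not that it does today:

const ctx = new AudioContext();

// contextTime: a position on the audio context's timeline
// performanceTime: the performance.now() value at which that position
// reaches (or will reach) the output device
const { contextTime, performanceTime } = ctx.getOutputTimestamp();

console.log('audio time', contextTime, 'maps to wall clock', performanceTime);
console.log('estimated output latency (s):', ctx.outputLatency);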

I note that this thread started with Non-Linear Editors. If someone can elaborate on scenarios there and why frame numbers are needed there, that would be great! Supposing they are, does the application need to know the exact frame being rendered during media playback, or is it good enough if that number is only exact when the media is paused/seeked?

Snarkdoof commented 6 years ago

@nigelmegitt @tidoust - I guess I just never understood the whole "time marches on" algorithm, to be honest; it seems like a very strange way to wait for a timeout to happen, in particular when the time to wait can very reliably be calculated well in advance. The added benefit of doing this properly in JS is that the flexibility is excellent - there is no looping anywhere, there is an event after a setTimeout, re-calculated when some other event is triggered (skip, pause, play, etc). We use it for all kinds of things - showing subtitles, switching between sources, altering CSS, preloading images at a fixed time, etc. Preloading is trivial if you give a sequencer a time-shifted timing object. Say you need up to 9 seconds to prepare an image - time shift it to 10 seconds more than the playback clock and do nothing else!

I might of course be absolutely in the dark on "time marches on" and text and data cues (I did test them, and found them horrible a couple of years ago). But the only thing I crave is the timestamp on the event - it will solve almost all our needs, and at barely any cost. :)
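A rough sketch of this kind of JS-side sequencing, assuming cues of the form { time, action } with times in seconds; the timeout is recomputed whenever the playback state changes:

function scheduleCue(video, cue) {
  let timer = null;

  function rearm() {
    clearTimeout(timer);
    if (video.paused) return;
    const delayMs = ((cue.time - video.currentTime) / video.playbackRate) * 1000;
    if (delayMs < 0) return;
    timer = setTimeout(cue.action, delayMs);
  }

  // re-calculate the timeout whenever the position or rate changes
  ['play', 'seeked', 'ratechange'].forEach(ev => video.addEventListener(ev, rearm));
  video.addEventListener('pause', () => clearTimeout(timer));
  rearm();
}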

Daiz commented 6 years ago

@nigelmegitt As I also mentioned earlier, yes, I recognize that there are different things that are important too, but for the here and now (and I don't expect this to change anytime soon), I want the ability to act on exact frame boundaries so that I can get as close to replicating the desktop high quality media playback experience on the web as possible, and having subtitles align on frame boundaries for scene changes in order to avoid scene bleeding is a basic building block of that.

I'm not too concerned with the exact details of how we get there, so that's open for discussion and what we're here for, but the important thing is that we do get there eventually in a nice and performant fashion (ie. one shouldn't have to compile a video decoder with emscripten to do it etc).

ingararntzen commented 6 years ago

In response to @Snarkdoof's post about the two approaches to synchronizing cue events, and @nigelmegitt's response:

The strongest requirements statement we can make is that we do want to achieve adequate synchronisation (whatever "adequate" is defined as) with minimal additional resource usage.

I don't have any input on the question of resource consumption, but here is a point concerning maximization of precision:

It is an important principle to put the synchronization as close as possible to what is being synchronized. Another way to put it is to say that the final step matters.

In approach 1, with sequencing logic internally in the media element, the last step is transport of the cue events across threads to JS.

In approach 2, with sequencing logic in JS, the final step is the firing of a timeout in JS. This seems precise down to 1 or 2 ms. Additionally, the correctness of the timeout calculation depends on the precision with which currentTime can be calculated in JS, which is also very good (and could easily be improved).

I don't know the relevant details of approach 1). I'm worried that the latency of the thread switch might be unknown or variable, and perhaps different across architectures. If so, this would detract from precision, but I don't know how much. Does anyone know?

Also, in my understanding a busy event loop in JS affects both approaches similarly.

nigelmegitt commented 6 years ago

I want the ability to act on exact frame boundaries so that I can get as close to replicating the desktop high quality media playback experience on the web

@Daiz OK, within the constraints of your use case, I share the requirement. Outside of those constraints, it gets more complicated. Seems from the thread as though that's something we can both agree to.

nigelmegitt commented 6 years ago

There's been some speculation here about thread switching and the impact that may have, and if indeed there are multiple threads executing the script and therefore processing the event queue. It's always been my understanding that the script is only executed in a single thread. Can anyone clarify this point, perhaps a browser implementer?

boushley commented 6 years ago

Throwing my hat in the ring here with a couple alternative use cases. As background my company manages a large amount of police body camera video. We support redaction of video via the browser, as well as playback of evidence via the browser.

For the redaction and evidence playback use cases, our customers want the ability to step through a video frame by frame. If you assume a constant framerate and are able to determine that framerate out of band, then you can get something that approximates frame-by-frame seeking. However, there are many scenarios (be it rounding of the currentTime value, or encoder delay that renders a frame a few ms late) that can result in a frame being skipped (which is a big worry for our customers). There are hacks around this (rendering frames on the server and shipping them down for frame-by-frame viewing), but all the info we need is already in the browser; it would be great if we had the ability to progress through a video frame by frame.

For redaction we have a use case that is similar to the subtitles sync issue. When users are in the editing phase of redaction we do a preview of what will be redacted where we need JS controlled objects to be synced with the video as tightly as we can. In this use case it's slightly easier than subtitles because when playing back at normal speed (or 2x or 4x) redaction users are usually ok with some slight de-sync. If they see something concerning they usually pause the video and then investigate it frame-by-frame.

Some of the suggested solutions, like currentFrameTime, could be extended to enable the frame-by-frame use case.

tidoust commented 6 years ago

@boushley Thanks, that is useful! From a user experience perspective, how would the frame-by-frame stepping work in your case, ideally?

  1. The user activates frame-by-frame stepping. Video playback is paused. The user controls which frame to render and when a new frame needs to be rendered (e.g. with a button or arrow keys). Under the hood, the page seeks to the right frame, and video playback is effectively paused during the whole time.
  2. The user activates frame-by-frame stepping. The video moves from one frame to the other in slow motion without user interaction. Under the hood, the page does that by setting playbackRate to some low value such as 0.1, and the user agent is responsible for playing back the video at that speed.

In both cases, it seems indeed hard to do frame by frame stepping without exposing the current frame/presentation time, and allowing the app to set it to some value to account for cases where the video uses variable framerate.

It seems harder to guarantee precision in 2. as seen in this thread [1], but perhaps that's doable when video is played back at low speed?

[1] https://github.com/w3c/media-and-entertainment/issues/4#issuecomment-396701652

dholroyd commented 6 years ago

I note that this thread started with Non-Linear Editors. If someone can elaborate on scenarios there and why frame numbers are needed there, that would be great! Supposing they are, does the application need to know the exact frame being rendered during media playback, or is it good enough if that number is only exact when the media is paused/seeked?

We also perform media manipulation server-side on the basis of users choosing points in the media timeline in a browser-based GUI. Knowing exactly what the user is seeing when the media is paused is critical.

Challenges that we've found with current in-browser capabilities include:

We currently do frame-stepping by giving the js knowledge (out of band) of the frame-rate and seeking in frame-sized steps.

Users also want to be able to step backwards and forwards by many frames at a time (e.g. hold 'shift' to skip faster). That's currently implemented by just seeking in larger steps.
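A sketch of that approach, with the frame rate assumed to be supplied out of band (the fps parameter and the step sizes below are illustrative; the platform does not provide them):

function stepByFrames(video, fps, frames) {
  video.pause();
  // small epsilon guards against currentTime sitting just below a frame boundary
  const currentFrame = Math.floor(video.currentTime * fps + 1e-4);
  // seek to the middle of the target frame to reduce the risk of
  // rounding back into the previous frame
  video.currentTime = (currentFrame + frames + 0.5) / fps;
}

// single-frame step, or a larger skip while 'shift' is held
// stepByFrames(video, 25, 1);
// stepByFrames(video, 25, 10);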

boushley commented 6 years ago

@tidoust our current experience is that the user has a skip ahead / skip back X seconds control. When they pause, that changes to a frame forward / frame back control. So we're definitely looking at use case 1. And if you're going for playback at something like 1/10 of normal speed (or 3-6 fps), you can pretty easily pull that off in JS if you have a way of progressing to the next or previous frame. This use case feels like it should be easily doable, although I think it'll be interesting to see if we can do it in a way that enables other use cases as well.

@dholroyd we've definitely seen some of these off-by-a-single-frame issues in our redaction setup. It would be great if there were a better way of identifying and linking frames between a web client and a backend process manipulating the video. I believe the key for the editing-style use case is that while we want playback to be as accurate as possible, when paused it needs to be exactly accurate.

mfoltzgoogle commented 6 years ago

@Daiz I spoke with the TL of Chrome's video stack and they gave me a pointer to an implementation that you can play around with now.

First, behind --enable-experimental-canvas-feature, there are some additional attributes on HTMLVideoElement that contain metadata about frames uploaded as WebGL textures, including a timestamp. [1]

The longer term plan is a WebGL extension to expose this data [2], and implementation has begun [3] but I am not sure of its status.

I agree there are use cases outside of WebGL upload for accurate frame timing data, and it should be possible to provide it on HTMLVideoElements that are not uploaded as textures. However, if the canvas/WebGL solution works for you, then that makes a stronger case to expose it elsewhere.

Note that any solution may be racy with vsync depending on the implementation and it may be off by 16ms depending on where vsync happens in relation to the video frame rendering and the execution of rAF.

That's really all the help I can provide at this time. There are many other use cases and scenarios discussed here that I don't have time to address or investigate right now.

Thanks.

[1] https://bugs.chromium.org/p/chromium/issues/detail?id=639174
[2] https://www.khronos.org/registry/webgl/extensions/proposals/WEBGL_video_texture/
[3] https://bugs.chromium.org/p/chromium/issues/detail?id=776222

chrisn commented 6 years ago

This is a great discussion, identifying a number of different use cases. I suggest that the next step is to consolidate this into an explainer document that describes each use case and identifies any spec gaps or current implementation limitations. A simple markdown document in this repo would be fine. Would anyone like to start such a document?

KilroyHughes commented 6 years ago

One detail for such a document (which I'm not volunteering to write) is video frame reordering. Widely deployed video codecs such as AVC reorder and often offset the presentation time of pictures relative to their order and timing in the compressed bitstream. For instance, frames 1, 2, 3, 4 in the compressed stream might be displayed in the order 2, 1, 4, 3, and presentation time can be delayed by several frames. Frame rate changes are not unusual in adaptively streamed video. Operations such as seeking, editing, and splicing of the compressed stream, e.g. in an MSE buffer, do not happen at the presentation times often assumed. Audio, TTML, HTML, events, etc. must take presentation reordering and delay into account for frame-accurate synchronization at some "composition point" in the media pipeline.

nigelmegitt commented 6 years ago

@KilroyHughes I've always made the assumption that all those events are related to the post-decode (and therefore post-reordering) output. It would make no sense to address out of order frame counts from the compressed bitstream in specifications whose event time markers relate to generic video streams and for which video codecs are out of scope.

Certainly in TTML, the assumption is that there is a well defined media timeline against which times in the media timebase can be related; taking DASH/MP4 as an example, the track fragment decode time as modified by the presentation time offset provides that definition.

I'd push back quite strongly against any requirement to traverse the architectural layers and impose content changes on a resource like a subtitle document, whether it is provided in-band or out-of-band, just to take into account a specific set of video encoding characteristics.

nigelmegitt commented 5 years ago

There's a Chromium bug about synchronisation accuracy of Text Track Cue onenter() and onexit() events in the context of WebVTT at https://bugs.chromium.org/p/chromium/issues/detail?id=576310 and another (originally from me, via @beaufortfrancois) asking for developer input on the feasibility of reducing the accuracy threshold in the spec from the current 250ms, at https://bugs.chromium.org/p/chromium/issues/detail?id=907459 .

1c7 commented 5 years ago

Because this thread is way too long, I didn't read it all. Let me provide one more use case:

Subtitle Editing software

I want to build subtitle editing software using Electron.js because Aegisub is not good enough (hotkeys, night mode, etc.).

The point is:

I want to build something simple that improves one part of the workflow, not something that aims to replace Aegisub, because it has way too many features.

So

Frame-by-frame stepping and precise control down to the millisecond (like 00:00:12:333) are important.

Here is my design (it's a screenshot from InVision Studio, not an actual desktop app):

[screenshot]

I designed many versions because I want this to be beautiful:

[screenshot]

Here is the Electron app (an actually working app):

[screenshot]

As you can see, the Electron app is still a work in progress, half-built.

And now I have found out that there is no frame-by-frame stepping or precise control down to the millisecond (like 00:00:12:333), which is very bad.

Conclusion

Use some hack like <canvas>, OR abandon web tech (HTML/CSS/JS, Electron.js) and just build a native app (Objective-C & Swift in Xcode).

tidoust commented 4 years ago

A couple of updates:

  1. The Media & Entertainment IG discussed the issue at TPAC. Also see the Frame accurate synchronization slides I presented to guide the discussion.

  2. Also, for frame-accurate rendering scenarios (during playback), note the recent proposal to extend HTMLVideoElement with a requestAnimationFrame function to allow web authors to identify when and which frame has been presented for composition (see the sketch below).
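That proposal has since shipped in some browsers as HTMLVideoElement.requestVideoFrameCallback(); support still varies. A minimal sketch of the per-frame metadata it exposes:

const video = document.querySelector('video');

function onFrame(now, metadata) {
  // metadata.mediaTime: presentation timestamp (in seconds) of the frame
  // that was submitted for composition
  // metadata.presentedFrames: count of frames presented so far
  console.log('frame', metadata.presentedFrames, 'at media time', metadata.mediaTime);
  video.requestVideoFrameCallback(onFrame); // re-register for the next frame
}

video.requestVideoFrameCallback(onFrame);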

nigelmegitt commented 4 years ago

This issue was originally raised for the general HTML media element, and the discussion has mainly been about video elements. I have just come upon another use case, for audio elements. Setting currentTime on an audio element whose resource is a WAV file works well. However when the resource is an MP3 file the accuracy is very poor (I checked on Chrome and Firefox).

I'm pretty sure the cause is something that occurs in general with compressed media, either audio or video: depending on the file format, it can be complex to work out where in the compressed media to seek to in order to get to an arbitrary desired point. I guess some kind of heuristic is generally used.

When there are no timestamps within the compressed media, that's even harder, and of course such timestamps would reduce the efficiency of the compression. Effectively the only way to do it reliably is to play back the entire media, which might be very long, and generate a map that connects audio sample count to file location.

Clearly doing that would be a costly operation, in general. Nevertheless, perhaps there is some processing that can be done to try to improve the heuristics, without doing a full decode? An API call to pre-process the media to generate such a map could provide an opt-in for applications that need it, without imposing it on those applications that do not need it.
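Purely as an illustration of what such an opt-in might look like (the method name below is invented for this sketch; nothing like it exists today):

async function prepareAccurateSeek(audio) {
  // hypothetical API: ask the UA to scan the resource and build a
  // sample-accurate map from timestamps to byte offsets
  if ('buildSeekIndex' in audio) {
    await audio.buildSeekIndex(); // hypothetical, potentially expensive
  }
  // subsequent seeks would then be expected to be sample-accurate
  audio.currentTime = 12.345;
}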

MDN doesn't really hint about the seek accuracy of audio codecs at https://developer.mozilla.org/en-US/docs/Web/Media/Formats/Audio_codecs and it looks like the HTMLMediaElement interface itself doesn't offer this kind of accurate seek preparation; there is perhaps an analogy with the preload attribute that defines how much data to load, but it is clearly a different thing.

nigelmegitt commented 4 years ago

An example of this audio seeking accuracy issue can be observed at https://bbc.github.io/Adhere/ (in Firefox or Chrome, definitely) by loading the Adhere demo video and comparing the experience loading and playing the demo TTML2 "Adhere demo with pre-recorded audio, single WAV file" with the single MP3 file version. The playback start times are very different in the two cases. There seems to be an end effect sometimes too, but that's something else.

Laurian commented 4 years ago

I've seen this issue with MP3 before, and it is always with VBR ones; CBR worked fine (but most MP3s are VBR):

mpck adhere_demo_audio.mp3
SUMMARY: adhere_demo_audio.mp3
    version                       MPEG v2.0
    layer                         3
    average bitrate               59527 bps (VBR)
    samplerate                    22050 Hz
    frames                        2417
    time                          1:03.137
    unidentified                  0 b (0%)
    errors                        none
    result                        Ok

nigelmegitt commented 4 years ago

Thanks for the extra analysis, @Laurian. I suspect you're right that MP3 is a particular offender, but we should not focus on one format specifically; rather, we should focus on the more general problem that for some media encodings it can be difficult to seek accurately, and look for a solution that might work more widely.

Typically I think implementers have gone down the route of finding some detailed specifications of media types that work for their particular application. In the web context it seems to me that we need something that would work widely. The two approaches I can think of so far that might work are:

  1. Categorise the available media types as "accurately seekable" and "not accurately seekable" and have something help scripts discover which one they have at runtime, depending on UA capabilities, so they can take some appropriate action.
  2. Add a new interface that requests UAs to pre-process media in advance in preparation for accurate seeking, even if that is a costly operation. This seems better to me than an API for "no really please do seek accurately to this time" because that would have an arbitrary performance penalty that would be hard to predict, so not great for editing applications if performance is desirable.

chrisn commented 4 years ago

Nigel, I'm not seeing the difference in the demo you shared. With either MP3 or WAV selected, playback starts at time zero. I must be doing something wrong..?

nigelmegitt commented 4 years ago

@chrisn listen to the audio description clips as they play back - the words you hear should match the text that shows under the video area, but they don't, especially for the MP3 version.

giuliogatto commented 4 years ago

@nigelmegitt good work! I can't find the BBC Adhere repo anymore.. was it moved or removed?

nigelmegitt commented 4 years ago

@giuliogatto unfortunately the repo itself is still not open - we're tidying some bits up before making it open source, so please bear with us. It's taking us a while to get around to alongside other priorities 😔

giuliogatto commented 4 years ago

@nigelmegitt ok thanks! Keep up the good work!

1c7 commented 4 years ago

@Daiz I saw a new method here: https://stackoverflow.com/questions/60645390/nodejs-ffmpeg-play-video-at-specific-time-and-stream-it-to-client

How

  1. Use ffmpeg to live stream local video
  2. Use Electron.js to display live stream video

Do you think it's possible to use this approach to achieve subtitle display (with near-perfect sync)?

I haven't experimented with this myself, so I am not sure if it works.

I was thinking of building this project: https://github.com/1c7/Subtitle-Timeline-Editor/blob/master/README-in-English.md

in Swift & OC & SwiftUI as a Mac-only desktop app, but it seems an ffmpeg + Electron.js live stream is somewhat possible too.

1c7 commented 4 years ago

One more possible way to do it (for desktop).

If building a desktop app with Electron.js, node-mpv can be used to control a local mpv instance.

So loading and displaying subtitles is doable (.ass is fine), editing and then reloading them is also possible, and frame-by-frame playback with the left and right arrow keys is also possible.

Node.js code

const mpvAPI = require('node-mpv');
const mpv = new mpvAPI({},
    [
        "--autofit=50%", // initial windows size
    ]);

mpv.start()
    .then(() => {
        // video
        return mpv.load('/Users/remote_edit/Documents/1111.mp4')
    })
    .then(() => {
        // subtitle
        return mpv.addSubtitles('/Users/remote_edit/Documents/1111.ass')
    })
    .then(() => {
        return mpv
    })
    // this catches every error from above
    .catch((error) => {
        console.log(error);
    });

// This will bind this function to the stopped event
mpv.on('stopped', () => {
    console.log("Your favorite song just finished, let's start it again!");
    // mpv.loadFile('/path/to/your/favorite/song.mp3');
});

package.json

{
  "name": "test-mpv-node",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "dependencies": {
    "node-mpv": "^2.0.0-beta.0"
  }
}

Conclusion

nigelmegitt commented 2 years ago

@nigelmegitt ok thanks! Keep up the good work!

Apologies, forgot to update this thread: the library part of the Adhere project was moved to https://github.com/bbc/adhere-lib/ so that we could open it up.

tobiasBora commented 2 years ago

Just to make it clear: if I do video.currentTime = frame / framerate, do I have a guarantee that the video will indeed seek to the appropriate frame? I understand that reading from currentTime is not reliable, but I would expect that writing to currentTime is. From my experience, doing video.currentTime = frame / framerate + 0.0001 seems to work quite reliably (not sure if the 0.0001 is needed), but I'd like to be sure I'm not missing subtle edge cases.

chrisn commented 2 years ago

As a next step, I suggest that we summarise this thread into a short document that covers the use cases and current limitations. It should take into account what can be achieved using new APIs such as WebCodecs and requestVideoFrameCallback, and be based on practical experience.

This thread includes discussion of frame accurate seeking and frame accurate rendering of content, so I suggest that the document includes both, for completeness.

Is anyone interested in helping to do this? Specifically, we'd be looking for someone who could edit such a document.

tobiasBora commented 2 years ago

It would be really cool to have guarantees on how to reach a specific frame. For instance, I was thinking that:

this.video.currentTime = (frame / this.framerate) + 0.00001;

always reached the correct frame... but it turns out it doesn't! (at least not using Chromium 95.0). Sometimes I need a larger value for the additional term; for at least one frame, I needed to do:

this.video.currentTime = (frame / this.framerate) + 0.001;

(This appears to fail for me when trying to reach, for instance, frame 1949 of a 24fps video.)

Edit: similarly, reading this.video.currentTime (even when paused using requestVideoFrameCallback) seems to be not frame accurate.

tomasklaen commented 1 year ago

It's way worse for me. I've made a 20 fps testing video, where seeking currentTime to 0.05 should display the 2nd frame, but I have to go all the way to 0.072 for it to finally flip.

This makes it impossible to implement frame-accurate video cutting/editing tools, as the time ffmpeg needs to seek to a frame is always quite different from what the video element needs to display it, and trying to add or subtract these arbitrary numbers just feels like a different kind of footgun.

bhack commented 4 months ago

What is the state of the art on this? Can it currently be achieved only with the WebCodecs API?

mzur commented 4 months ago

Here is an approach that uses requestVideoFrameCallback() as a workaround to seek to the next/previous frame: https://github.com/angrycoding/requestVideoFrameCallback-prev-next

bhack commented 4 months ago

Is that one really working? Because on https://web.dev/articles/requestvideoframecallback-rvfc it says:

Note: Unfortunately, the video element does not guarantee frame-accurate seeking. This has been an ongoing subject of discussion. The WebCodecs API allows for frame accurate applications.
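For reference, a rough WebCodecs sketch; the demuxing step (e.g. with mp4box.js) is assumed and omitted, and the codec string is only an example. This is where exact per-frame control becomes possible, at the cost of doing much more work in the page:

const decoder = new VideoDecoder({
  output: (frame) => {
    // frame.timestamp is in microseconds, so exact frame selection is possible
    console.log('decoded frame at', frame.timestamp / 1e6, 's');
    frame.close();
  },
  error: (e) => console.error(e),
});

decoder.configure({ codec: 'avc1.64001f' }); // example codec string
// for each demuxed sample:
//   decoder.decode(new EncodedVideoChunk({ type, timestamp, duration, data }));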

JonnyBurger commented 4 months ago

The technique by @mzur leads to better accuracy, but in our experience it doesn't always produce perfect results either.