w3c / media-and-entertainment

Repository for the Media and Entertainment Interest Group

Frame accurate seeking of HTML5 MediaElement #4

Open tidoust opened 6 years ago

tidoust commented 6 years ago

I've heard a couple of companies point out that one of the problems that makes it hard (at least harder than it could be) to do post-production of videos in Web browsers is that there is no easy way to process media elements on a frame by frame basis, whereas that is the usual default in Non-Linear Editors (NLE).

The currentTime property takes a time, not a frame number or an SMPTE timecode. Converting between times and frame numbers is doable, but it requires knowing the framerate of the video, which is not exposed to Web applications (a generic NLE would thus not know about it). Plus, that framerate may actually vary over time.

Also, internal rounding of time values may mean that one seeks to the end of the previous frame instead of the beginning of a specific video frame.
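For illustration, here is a minimal sketch of the time/frame conversion, assuming a constant framerate known out of band (the fps value is an assumption; as noted above, the platform does not expose it). Targeting the middle of a frame is a common mitigation for the rounding issue just described:

const fps = 25; // assumed to be known out of band

function timeToFrame(t) {
  return Math.floor(t * fps);
}

function frameToSeekTime(frame) {
  // aim at the middle of the frame so that internal rounding does not
  // land on the end of the previous frame
  return (frame + 0.5) / fps;
}

// e.g. video.currentTime = frameToSeekTime(100);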

Digging around, I've found a number of discussions and issues around the topic, most notably:

  1. A long thread from 2011 on Frame accuracy / SMPTE, which led to improvements in the precision of seeks in browser implementations: https://lists.w3.org/Archives/Public/public-whatwg-archive/2011Jan/0120.html
  2. A list of use cases from 2012 for seeking to specific frames. Not sure if these use cases remain relevant today: https://www.w3.org/Bugs/Public/show_bug.cgi?id=22678
  3. A question from 2013 on whether there was interest to expose "versions of currentTime, fastSeek(), duration, and the TimeRanges accessors, in frames, for video data": https://www.w3.org/Bugs/Public/show_bug.cgi?id=8278#c3
  4. A proposal from 2016 to add a rational time value for seek() to solve rounding issues (still open as of June 2018): https://github.com/whatwg/html/issues/609

There have probably been other discussions around the topic.

I'm raising this issue to collect practical use cases and requirements for the feature, and gauge interest from media companies to see a solution emerge. It would be good to precisely identify what does not work today, what minimal updates to media elements could solve the issue, and what these updates would imply from an implementation perspective.

Daiz commented 6 years ago

@tidoust

Wouldn't it be more interesting to get the "frame that will be rendered on next repaint" instead, to avoid reacting up to 16ms late to a change of frame?

Good point. The key thing is indeed to be able to act essentially in sync with the frame changes, so that you can match external actions to frame changes in the video.

jpiesing commented 6 years ago

Wouldn't it be more interesting to get the "frame that will be rendered on next repaint" instead, to avoid reacting up to 16ms late to a change of frame?

Good point. The key thing is indeed to be able to act essentially in sync with the frame changes, so that you can match external actions to frame changes in the video.

I agree with that aim, but then you need to be very careful about definitions, as there may be several frame-times' worth of delay between where graphics and video are composited and what the user is actually seeing. I suspect both are needed!

nigelmegitt commented 6 years ago

The key thing is indeed to be able to act essentially in sync with the frame changes, so that you can match external actions to frame changes in the video.

@Daiz as I just pointed out on the M&E call (minutes), this is only known to be true at the 25-30fps sort of rate, designed to be adequately free of flicker for video. It's unknown at high frame rates, and entirely inadequate at low frame rates, where synchronisation with audio is more important.

We should avoid generalising based on assumptions that the 25-30fps rate will continue to be prevalent, and gather data where we don't yet have it. We also need a model that works for other kinds of data than subtitles and captions, since they may have more or less stringent synchronisation requirements.

tidoust commented 6 years ago

@Snarkdoof Like @nigelmegitt, I don't necessarily follow you on the performance penalties. Regardless, what I'm getting out of this discussion on subtitles is that there are possible different ways to improve the situation (they are not necessarily exclusive).

One possible way would be to have the user agent expose a frame number, or a rational number. This seems simple in theory, but apparently hard to implement. Good thing is that it would probably make it easy to act on frame boundaries, but these boundaries might be slightly artificial (because the user agent will interpolate these values in some cases).

Another way would be to make sure that an application can relate currentTime to the wall clock, possibly completed with some indication of the downstream latency. This is precisely what was done in the Web Audio API (see the definition of the AudioContext interface and notably the getOutputTimestamp() method and the outputLatency property). It seems easier to implement (it may be hard to compute the output latency, but adding a timestamp whenever currentTime changes seems easy). Now an app will still have some work to do to detect frame boundaries, but at least we don't ask the user agent to report possibly slightly incorrect values.
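For reference, here is a minimal sketch of the Web Audio API pattern referred to above, which relates a position on the audio timeline to the performance.now() clock; the suggestion is that something comparable could exist for media elements, not that it does today:

const ctx = new AudioContext();

// contextTime: a position on the audio context's timeline
// performanceTime: the performance.now() value at which that position
// reaches (or will reach) the output device
const { contextTime, performanceTime } = ctx.getOutputTimestamp();

console.log('audio time', contextTime, 'maps to wall clock', performanceTime);
console.log('estimated output latency (s):', ctx.outputLatency);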

I note that this thread started with Non-Linear Editors. If someone can elaborate on scenarios there and why frame numbers are needed there, that would be great! Supposing they are, does the application need to know the exact frame being rendered during media playback, or is it good enough if that number is only exact when the media is paused/seeked?

Snarkdoof commented 6 years ago

@nigelmegitt @tidoust - I guess I just never understood the whole "time marches on" algorithm, to be honest; it seems like a very strange way to wait for a timeout to happen, in particular when the time to wait can very reliably be calculated well in advance. The added benefit of doing this properly in JS is that the flexibility is excellent - there is no looping anywhere, there is an event after a setTimeout, re-calculated when some other event is triggered (skip, pause, play, etc). We use it for all kinds of things - showing subtitles, switching between sources, altering CSS, preloading images at a fixed time, etc. Preloading is trivial if you give a sequencer a time-shifted timing object. Say you need up to 9 seconds to prepare an image - time shift it to 10 seconds more than the playback clock and do nothing else!

I might of course be absolutely in the dark on "time marches on" and text and data cues (I did test them, and found them horrible a couple of years ago). But the only thing I crave is the timestamp on the event - it will solve almost all our needs, and at barely any cost. :)
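A rough sketch of this kind of JS-side sequencing, assuming cues of the form { time, action } with times in seconds; the timeout is recomputed whenever the playback state changes:

function scheduleCue(video, cue) {
  let timer = null;

  function rearm() {
    clearTimeout(timer);
    if (video.paused) return;
    const delayMs = ((cue.time - video.currentTime) / video.playbackRate) * 1000;
    if (delayMs < 0) return;
    timer = setTimeout(cue.action, delayMs);
  }

  // re-calculate the timeout whenever the position or rate changes
  ['play', 'seeked', 'ratechange'].forEach(ev => video.addEventListener(ev, rearm));
  video.addEventListener('pause', () => clearTimeout(timer));
  rearm();
}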

Daiz commented 6 years ago

@nigelmegitt As I also mentioned earlier, yes, I recognize that there are different things that are important too, but for the here and now (and I don't expect this to change anytime soon), I want the ability to act on exact frame boundaries so that I can get as close to replicating the desktop high quality media playback experience on the web as possible, and having subtitles align on frame boundaries for scene changes in order to avoid scene bleeding is a basic building block of that.

I'm not too concerned with the exact details of how we get there, so that's open for discussion and what we're here for, but the important thing is that we do get there eventually in a nice and performant fashion (ie. one shouldn't have to compile a video decoder with emscripten to do it etc).

ingararntzen commented 6 years ago

In response to @Snarkdoof's post about the two approaches to synchronizing cue events, and @nigelmegitt's response:

The strongest requirements statement we can make is that we do want to achieve adequate synchronisation (whatever "adequate" is defined as) with minimal additional resource usage.

I don't have any input on the question of resource consumption, but here is a point concerning maximization of precision:

It is an important principle to put the synchronization as close as possible to what is being synchronized. Another way to put it is to say that the final step matters.

In approach 1, with sequencing logic internally in the media element, the last step is transport of the cue events across threads to JS.

In approach 2, with sequencing logic in JS, the final step is the firing of a timeout in JS. This seems precise down to 1 or 2 ms. Additionally, the correctness of the timeout calculation depends on the precision with which currentTime can be calculated in JS, which is also very good (and could easily be improved).

I don't know the relevant details of approach 1). I'm worried that the latency of the thread switch might be unknown or variable, and perhaps different across architectures. If so, this would detract from precision, but I don't know how much. Does anyone know?

Also, in my understanding a busy event loop in JS affects both approaches similarly.

nigelmegitt commented 6 years ago

I want the ability to act on exact frame boundaries so that I can get as close to replicating the desktop high quality media playback experience on the web

@Daiz OK, within the constraints of your use case, I share the requirement. Outside of those constraints, it gets more complicated. Seems from the thread as though that's something we can both agree to.

nigelmegitt commented 6 years ago

There's been some speculation here about thread switching and the impact that may have, and if indeed there are multiple threads executing the script and therefore processing the event queue. It's always been my understanding that the script is only executed in a single thread. Can anyone clarify this point, perhaps a browser implementer?

boushley commented 6 years ago

Throwing my hat in the ring here with a couple alternative use cases. As background my company manages a large amount of police body camera video. We support redaction of video via the browser, as well as playback of evidence via the browser.

For the redaction and evidence playback use cases, our customers want the ability to step through a video frame by frame. If you assume a constant framerate and are able to determine that framerate out of band, then you can get something that approximates frame-by-frame seeking. However, there are many scenarios (be it rounding of the currentTime value, or encoder delay that renders a frame a few ms late) that can result in a frame being skipped (which is a big worry for our customers). There are hacks around this (rendering frames on the server and shipping them down for frame-by-frame viewing), but all the info we need is already in the browser; it would be great if we had the ability to progress through a video frame by frame.

For redaction we have a use case that is similar to the subtitles sync issue. When users are in the editing phase of redaction we do a preview of what will be redacted where we need JS controlled objects to be synced with the video as tightly as we can. In this use case it's slightly easier than subtitles because when playing back at normal speed (or 2x or 4x) redaction users are usually ok with some slight de-sync. If they see something concerning they usually pause the video and then investigate it frame-by-frame.

Some of the suggested solutions, like currentFrameTime, could be extended to enable the frame-by-frame use case.

tidoust commented 6 years ago

@boushley Thanks, that is useful! From a user experience perspective, how would the frame-by-frame stepping work in your case, ideally?

  1. The user activates frame-by-frame stepping. Video playback is paused. The user controls which frame to render and when a new frame needs to be rendered (e.g. with a button or arrow keys). Under the hood, the page seeks to the right frame, and video playback is effectively paused during the whole time.
  2. The user activates frame-by-frame stepping. The video moves from one frame to the other in slow motion without user interaction. Under the hood, the page does that by setting playbackRate to some low value such as 0.1, and the user agent is responsible for playing back the video at that speed.

In both cases, it seems indeed hard to do frame by frame stepping without exposing the current frame/presentation time, and allowing the app to set it to some value to account for cases where the video uses variable framerate.

It seems harder to guarantee precision in 2. as seen in this thread [1], but perhaps that's doable when video is played back at low speed?

[1] https://github.com/w3c/media-and-entertainment/issues/4#issuecomment-396701652

dholroyd commented 6 years ago

I note that this thread started with Non-Linear Editors. If someone can elaborate on scenarios there and why frame numbers are needed there, that would be great! Supposing they are, does the application need to know the exact frame being rendered during media playback, or is it good enough if that number is only exact when the media is paused/seeked?

We also perform media manipulation server-side on the basis of users choosing points in the media timeline in a browser-based GUI. Knowing exactly what the user is seeing when the media is paused is critical.

Challenges that we've found with current in-browser capabilities include:

We currently do frame-stepping by giving the js knowledge (out of band) of the frame-rate and seeking in frame-sized steps.

Users also want to be able to step backwards and forwards by many frames at a time (e.g. hold 'shift' to skip faster). That's currently implemented by just seeking in larger steps.
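A sketch of that approach, with the frame rate assumed to be supplied out of band (the fps parameter and the step sizes below are illustrative; the platform does not provide them):

function stepByFrames(video, fps, frames) {
  video.pause();
  // small epsilon guards against currentTime sitting just below a frame boundary
  const currentFrame = Math.floor(video.currentTime * fps + 1e-4);
  // seek to the middle of the target frame to reduce the risk of
  // rounding back into the previous frame
  video.currentTime = (currentFrame + frames + 0.5) / fps;
}

// single-frame step, or a larger skip while 'shift' is held
// stepByFrames(video, 25, 1);
// stepByFrames(video, 25, 10);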

boushley commented 6 years ago

@tidoust our current experience is that the user has a skip ahead / skip back X seconds control. When they pause, that changes to a frame forward / frame back control. So we're definitely looking at use case 1. And if you're going for playback at something like 1/10 of normal speed (or 3-6 fps), you can pretty easily pull that off in JS if you have a way of progressing to the next or previous frame. This use case feels like it should be easily doable, although I think it'll be interesting to see if we can do it in a way that enables other use cases as well.

@dholroyd we've definitely seen some of these off-by-a-single-frame issues in our redaction setup. It would be great if there were a better way of identifying and linking frames between a web client and a backend process manipulating the video. I believe the key for the editing-style use case is that while we want playback to be as accurate as possible, when paused it needs to be exactly accurate.

mfoltzgoogle commented 6 years ago

@Daiz I spoke with the TL of Chrome's video stack and they gave me a pointer to an implementation that you can play around with now.

First, behind --enable-experimental-canvas-feature, there are some additional attributes on HTMLVideoElement that contain metadata about frames uploaded as WebGL textures, including a timestamp. [1]

The longer term plan is a WebGL extension to expose this data [2], and implementation has begun [3] but I am not sure of its status.

I agree there are use cases outside of WebGL upload for accurate frame timing data, and it should be possible to provide it on HTMLVideoElements that are not uploaded as textures. However, if the canvas/WebGL solution works for you, then that makes a stronger case to expose it elsewhere.

Note that any solution may be racy with vsync depending on the implementation and it may be off by 16ms depending on where vsync happens in relation to the video frame rendering and the execution of rAF.

That's really all the help I can provide at this time. There are many other use cases and scenarios discussed here that I don't have time to address or investigate right now.

Thanks.

[1] https://bugs.chromium.org/p/chromium/issues/detail?id=639174
[2] https://www.khronos.org/registry/webgl/extensions/proposals/WEBGL_video_texture/
[3] https://bugs.chromium.org/p/chromium/issues/detail?id=776222

chrisn commented 6 years ago

This is a great discussion, identifying a number of different use cases. I suggest that the next step is to consolidate this into an explainer document that describes each use case and identifies any spec gaps or current implementation limitations. A simple markdown document in this repo would be fine. Would anyone like to start such a document?

KilroyHughes commented 6 years ago

One detail for such a document (which I'm not volunteering to write) is video frame reordering. Widely deployed video codecs such as AVC reorder and often offset the presentation time of pictures relative to their order and timing in the compressed bitstream. For instance, frames 1, 2, 3, 4 in the compressed stream might be displayed in the order 2, 1, 4, 3, and presentation time can be delayed by several frames. Frame rate changes are not unusual in adaptively streamed video. Operations such as seeking, editing, and splicing of the compressed stream, e.g. in an MSE buffer, do not happen at the presentation times often assumed. Audio, TTML, HTML, events, etc. must take presentation reordering and delay into account for frame-accurate synchronization at some "composition point" in the media pipeline.

nigelmegitt commented 6 years ago

@KilroyHughes I've always made the assumption that all those events are related to the post-decode (and therefore post-reordering) output. It would make no sense to address out of order frame counts from the compressed bitstream in specifications whose event time markers relate to generic video streams and for which video codecs are out of scope.

Certainly in TTML, the assumption is that there is a well defined media timeline against which times in the media timebase can be related; taking DASH/MP4 as an example, the track fragment decode time as modified by the presentation time offset provides that definition.

I'd push back quite strongly against any requirement to traverse the architectural layers and impose content changes on a resource like a subtitle document, whether it is provided in-band or out-of-band, just to take into account a specific set of video encoding characteristics.

nigelmegitt commented 5 years ago

There's a Chromium bug about synchronisation accuracy of Text Track Cue onenter() and onexit() events in the context of WebVTT at https://bugs.chromium.org/p/chromium/issues/detail?id=576310 and another (originally from me, via @beaufortfrancois) asking for developer input on the feasibility of reducing the accuracy threshold in the spec from the current 250ms, at https://bugs.chromium.org/p/chromium/issues/detail?id=907459 .

1c7 commented 5 years ago

Because this thread is way too long, I didn't read it all. Let me provide one more use case:

Subtitle Editing software

I want to build subtitle editing software using Electron.js because Aegisub is not good enough (hotkeys, night mode, etc.).

The point is:

I want to build something simple that improves one part of the workflow, not something that aims to replace Aegisub, because it has way too many features.

So

Frame-by-frame stepping and precise control down to the millisecond (like 00:00:12:333) are important.

Here is my design (it's a screenshot from InVision Studio, not an actual desktop app):

[screenshot]

I designed many versions because I want this to be beautiful:

[screenshot]

Here is the Electron app (an actually working app):

[screenshot]

As you can see, the Electron app is still a work in progress, half-built.

And now I have found out that there is no frame-by-frame stepping or precise control down to the millisecond (like 00:00:12:333), which is very bad.

Conclusion

Use some hack like <canvas>, OR abandon web tech (HTML/CSS/JS, Electron.js) and just build a native app (Objective-C & Swift in Xcode).

tidoust commented 4 years ago

A couple of updates:

  1. The Media & Entertainment IG discussed the issue at TPAC. Also see the Frame accurate synchronization slides I presented to guide the discussion.

  2. Also, for frame-accurate rendering scenarios (during playback), note the recent proposal to extend HTMLVideoElement with a requestAnimationFrame function to allow web authors to identify when and which frame has been presented for composition (see the sketch below).
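That proposal has since shipped in some browsers as HTMLVideoElement.requestVideoFrameCallback(); support still varies. A minimal sketch of the per-frame metadata it exposes:

const video = document.querySelector('video');

function onFrame(now, metadata) {
  // metadata.mediaTime: presentation timestamp (in seconds) of the frame
  // that was submitted for composition
  // metadata.presentedFrames: count of frames presented so far
  console.log('frame', metadata.presentedFrames, 'at media time', metadata.mediaTime);
  video.requestVideoFrameCallback(onFrame); // re-register for the next frame
}

video.requestVideoFrameCallback(onFrame);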

nigelmegitt commented 4 years ago

This issue was originally raised for the general HTML media element, and the discussion has mainly been about video elements. I have just come upon another use case, for audio elements. Setting currentTime on an audio element whose resource is a WAV file works well. However when the resource is an MP3 file the accuracy is very poor (I checked on Chrome and Firefox).

I'm pretty sure the cause is something that occurs in general with compressed media, either audio or video: depending on the file format, it can be complex to work out where in the compressed media to seek to in order to get to an arbitrary desired point. I guess some kind of heuristic is generally used.

When there are no timestamps within the compressed media, that's even harder, and of course such timestamps would reduce the efficiency of the compression. Effectively the only way to do it reliably is to play back the entire media, which might be very long, and generate a map that connects audio sample count to file location.

Clearly doing that would be a costly operation, in general. Nevertheless, perhaps there is some processing that can be done to try to improve the heuristics, without doing a full decode? An API call to pre-process the media to generate such a map could provide an opt-in for applications that need it, without imposing it on those applications that do not need it.
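Purely as an illustration of what such an opt-in might look like (the method name below is invented for this sketch; nothing like it exists today):

async function prepareAccurateSeek(audio) {
  // hypothetical API: ask the UA to scan the resource and build a
  // sample-accurate map from timestamps to byte offsets
  if ('buildSeekIndex' in audio) {
    await audio.buildSeekIndex(); // hypothetical, potentially expensive
  }
  // subsequent seeks would then be expected to be sample-accurate
  audio.currentTime = 12.345;
}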

MDN doesn't really hint about the seek accuracy of audio codecs at https://developer.mozilla.org/en-US/docs/Web/Media/Formats/Audio_codecs and it looks like the HTMLMediaElement interface itself doesn't offer this kind of accurate seek preparation; there is perhaps an analogy with the preload attribute that defines how much data to load, but it is clearly a different thing.

nigelmegitt commented 4 years ago

An example of this audio seeking accuracy issue can be observed at https://bbc.github.io/Adhere/ (in Firefox or Chrome, definitely) by loading the Adhere demo video and comparing the experience loading and playing the demo TTML2 "Adhere demo with pre-recorded audio, single WAV file" with the single MP3 file version. The playback start times are very different in the two cases. There seems to be an end effect sometimes too, but that's something else.

Laurian commented 4 years ago

I've seen this issue with MP3 before, and it is always with VBR ones; CBR worked fine (but most MP3s are VBR):

mpck adhere_demo_audio.mp3
SUMMARY: adhere_demo_audio.mp3
    version                       MPEG v2.0
    layer                         3
    average bitrate               59527 bps (VBR)
    samplerate                    22050 Hz
    frames                        2417
    time                          1:03.137
    unidentified                  0 b (0%)
    errors                        none
    result                        Ok

nigelmegitt commented 4 years ago

Thanks for the extra analysis, @Laurian. I suspect you're right that MP3 is a particular offender, but we should not focus on one format specifically; rather, we should focus on the more general problem that for some media encodings it can be difficult to seek accurately, and look for a solution that might work more widely.

Typically I think implementers have gone down the route of finding some detailed specifications of media types that work for their particular application. In the web context it seems to me that we need something that would work widely. The two approaches I can think of so far that might work are:

  1. Categorise the available media types as "accurately seekable" and "not accurately seekable" and have something help scripts discover which one they have at runtime, depending on UA capabilities, so they can take some appropriate action.
  2. Add a new interface that requests UAs to pre-process media in advance in preparation for accurate seeking, even if that is a costly operation. This seems better to me than an API for "no really please do seek accurately to this time" because that would have an arbitrary performance penalty that would be hard to predict, so not great for editing applications if performance is desirable.

chrisn commented 4 years ago

Nigel, I'm not seeing the difference in the demo you shared. With either MP3 or WAV selected, playback starts at time zero. I must be doing something wrong..?

nigelmegitt commented 4 years ago

@chrisn listen to the audio description clips as they play back - the words you hear should match the text that shows under the video area, but they don't, especially for the MP3 version.

giuliogatto commented 4 years ago

@nigelmegitt good work! I can't find the BBC Adhere repo anymore.. was it moved or removed?

nigelmegitt commented 4 years ago

@giuliogatto unfortunately the repo itself is still not open - we're tidying some bits up before making it open source, so please bear with us. It's taking us a while to get around to alongside other priorities 😔

giuliogatto commented 4 years ago

@nigelmegitt ok thanks! Keep up the good work!

1c7 commented 4 years ago

@Daiz I saw a new method here: https://stackoverflow.com/questions/60645390/nodejs-ffmpeg-play-video-at-specific-time-and-stream-it-to-client

How

  1. Use ffmpeg to live stream local video
  2. Use Electron.js to display live stream video

Do you think it's possible to use this approach to achieve subtitle display (with near-perfect sync)?

I haven't experimented with this myself, so I am not sure if it works.

I was thinking of building this project: https://github.com/1c7/Subtitle-Timeline-Editor/blob/master/README-in-English.md

in Swift & OC & SwiftUI as a Mac-only desktop app, but it seems an ffmpeg + Electron.js live stream is somewhat possible too.

1c7 commented 4 years ago

One more possible way to do it (for desktop).

If building a desktop app with Electron.js, node-mpv can be used to control a local mpv instance.

So loading and displaying subtitles is doable (.ass is fine), editing and then reloading them is also possible, and frame-by-frame playback with the left and right arrow keys is also possible.

Node.js code

const mpvAPI = require('node-mpv');
const mpv = new mpvAPI({},
    [
        "--autofit=50%", // initial windows size
    ]);

mpv.start()
    .then(() => {
        // video
        return mpv.load('/Users/remote_edit/Documents/1111.mp4')
    })
    .then(() => {
        // subtitle
        return mpv.addSubtitles('/Users/remote_edit/Documents/1111.ass')
    })
    .then(() => {
        return mpv
    })
    // this catches every error from above
    .catch((error) => {
        console.log(error);
    });

// This will bind this function to the stopped event
mpv.on('stopped', () => {
    console.log("Your favorite song just finished, let's start it again!");
    // mpv.loadFile('/path/to/your/favorite/song.mp3');
});

package.json

{
  "name": "test-mpv-node",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "dependencies": {
    "node-mpv": "^2.0.0-beta.0"
  }
}

Conclusion

nigelmegitt commented 2 years ago

@nigelmegitt ok thanks! Keep up the good work!

Apologies, forgot to update this thread: the library part of the Adhere project was moved to https://github.com/bbc/adhere-lib/ so that we could open it up.

tobiasBora commented 2 years ago

Just to make it clear: if I do video.currentTime = frame / framerate, do I have a guarantee that the video will indeed seek to the appropriate frame? I understand that reading from currentTime is not reliable, but I would expect that writing to currentTime is. From my experience, doing video.currentTime = frame / framerate + 0.0001 seems to work quite reliably (not sure if the 0.0001 is needed), but I'd like to be sure I'm not missing subtle edge cases.

chrisn commented 2 years ago

As a next step, I suggest that we summarise this thread into a short document that covers the use cases and current limitations. It should take into account what can be achieved using new APIs such as WebCodecs and requestVideoFrameCallback, and be based on practical experience.

This thread includes discussion of frame accurate seeking and frame accurate rendering of content, so I suggest that the document includes both, for completeness.

Is anyone interested in helping to do this? Specifically, we'd be looking for someone who could edit such a document.

tobiasBora commented 2 years ago

It would be really cool to have guarantees on how to reach a specific frame. For instance, I was thinking that:

this.video.currentTime = (frame / this.framerate) + 0.00001;

always reached the correct frame... but it turns out it doesn't! (at least not using Chromium 95.0). Sometimes I need a larger value for the additional term; for at least one frame, I needed to do:

this.video.currentTime = (frame / this.framerate) + 0.001;

(This appears to fail for me when trying to reach, for instance, frame 1949 of a 24fps video.)

Edit: similarly, reading this.video.currentTime (even when paused using requestVideoFrameCallback) seems to be not frame accurate.

tomasklaen commented 1 year ago

It's way worse for me. I've made a 20 fps testing video, where seeking currentTime to 0.05 should display the 2nd frame, but I have to go all the way to 0.072 for it to finally flip.

This makes it impossible to implement frame-accurate video cutting/editing tools, as the time ffmpeg needs to seek to a frame is always quite different from what the video element needs to display it, and trying to add or subtract these arbitrary numbers just feels like a different kind of footgun.

bhack commented 4 months ago

What is the state of the art on this? Can it currently be achieved only with the WebCodecs API?

mzur commented 4 months ago

Here is an approach that uses requestVideoFrameCallback() as a workaround to seek to the next/previous frame: https://github.com/angrycoding/requestVideoFrameCallback-prev-next

bhack commented 4 months ago

Is that one really working? Because on https://web.dev/articles/requestvideoframecallback-rvfc it says:

Note: Unfortunately, the video element does not guarantee frame-accurate seeking. This has been an ongoing subject of discussion. The WebCodecs API allows for frame accurate applications.
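For reference, a rough WebCodecs sketch; the demuxing step (e.g. with mp4box.js) is assumed and omitted, and the codec string is only an example. This is where exact per-frame control becomes possible, at the cost of doing much more work in the page:

const decoder = new VideoDecoder({
  output: (frame) => {
    // frame.timestamp is in microseconds, so exact frame selection is possible
    console.log('decoded frame at', frame.timestamp / 1e6, 's');
    frame.close();
  },
  error: (e) => console.error(e),
});

decoder.configure({ codec: 'avc1.64001f' }); // example codec string
// for each demuxed sample:
//   decoder.decode(new EncodedVideoChunk({ type, timestamp, duration, data }));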

JonnyBurger commented 4 months ago

The technique by @mzur leads to better accuracy, but in our experience it doesn't always produce perfect results either.