[Track Stats API] track.getFrameStats() allocates memory, adding to the GC pile

w3c / mediacapture-extensions

Extensions to Media Capture and Streams by the WebRTC Working Group

https://w3c.github.io/mediacapture-extensions/

Other

21 stars 16 forks source link

[Track Stats API] track.getFrameStats() allocates memory, adding to the GC pile #98

Closed jan-ivar closed 1 year ago

jan-ivar commented 1 year ago

track.getFrameStats() (getStats after #97) creates a new JS object with properties to be read, adding to the GC pile every time it's called, which may be dozens of times per second, per track.

Feedback from @padenot is that real-time JS apps try to avoid garbarge collection whenever possible, and that allocating an object just to read values seems inherently unnecessary. Other ways to read real-time values exist in the Media WG, which we may wish to consider. E.g. audioContext.outputLatency.

Why it allocates

The existing async API queues a single task when the application requests data, creating a garbage-collectable dictionary to hold the results. In contrast, a synchronous getter, we thought, would require the UA to queue tasks continuously to update its main-thread internal slot, just in case the JS app decides to call it, even if it never does.

Ways to avoid it

The Media WG shows there are other ways. outputLatency is synchronous without an internal slot. In Firefox, it uses a lock, but can be implemented without. The Media WG decided to violate § 5.2. Preserve run-to-completion semantics here, which I don't like, but that seems trivial to remedy by caching reads.

To further reduce overhead, we could put getters on a dedicated interface instead of on the track itself, and have a lazy getter for that (sub) interface attribute. TL;DR:

const {deliveredFrames, discardedFrames, totalFrames} = track.videoStats;

const {
  totalFrameDuration,
  droppedFrameDuration,
  droppedFrameEvents,
  framesCaptured,
  totalCaptureDelay
} = track.audioStats; // TypeError if wrong kind

An implementation could do a read-one-read-all approach to maintain consistency between values. WDYT?

henbos commented 1 year ago

The API shape (sync vs async) is a separate question from which metrics it returns.

Regardless of the outcome of this issue, can I merge https://github.com/w3c/mediacapture-extensions/pull/97 as this seems ortogonal?

The API shape is something that we already got WG consensus on. The track.getFrameStats() API was discussed during multiple Virtual Interims, including if it should return a promise or not. The reason we made this async is we wanted to make as little assumptions about the browser stack as possible. For example both Safari and Chrome has separate capture processes, so this is not just a mutex problem but an IPC problem as well.

I invision this method being called once per second per track, if this can be abused it might be appropriate to throttle or cache queries.

But if GC is a concern we should take seriously, how about storing the result of the (async) query in an interface like so?

const stats = new MediaStreamTrackStats();  // interface for holding the result
for many tracks:
  await track.getStats(stats);
  console.log(stats);

jan-ivar commented 1 year ago

Consensus must be confirmed on the list, so we can receive broader input. Adding audio statistics I think warrant input from audio experts. IPC is an interesting problem, but how does Chrome and Safari implement APIs like audioContext.outputLatency?

Regarding the API shape you propose, it seems user-error prone. E.g. what would the following produce?

const stats = new MediaStreamTrackStats();
await Promise.all([trackA.getStats(stats), trackB.getStats(stats)]);
console.log(stats); // trackA or trackB stats?

May I suggest, if I understand the concern, that the following seems sufficient?

await track.audioStats.refresh(); 
console.log(track.audioStats);

But even then, I think we should first make sure user agents can't implement the following performantly:

console.log(track.audioStats);

It seems to me a user agent could detect repeated reads and optimize for accuracy then.

karlt commented 1 year ago

The Media WG decided to violate § 5.2. Preserve run-to-completion semantics here, which I don't like, but that seems trivial to remedy by caching reads.

A cache would emulate run-to-completion to some extent, but it would not provide consistency with other APIs or objects like run-to-completion would. Calling getStats() on two different MediaTracks during a single JS task, for example, would not necessarily provide stats from the same point in time. Arguably that would not be a disadvantage when compared with a Promise-based API with parallel steps, but a Promise-based API provides some indication that results from successive calls may not be from a consistent point in time.

henbos commented 1 year ago

Do we have a sense of the threshold number of JS objects per second would constitute a performance concern for a real-time application? In the envisioned use cases, this API producing more than 50 objects per second is on the extreme end of the spectrum.

padenot commented 1 year ago

I think we will reach resolution on this matter reframing the discussion around use-cases, as @henbos rightfully points out.

It seems like the initial use-case was about being able to assess quality (and please correct me if I'm wrong -- I'm late to the discussion and I have probably missed a number of discussions).

There are numbers of statistics that are proposed:

For audio:
- The input latency: the duration from the point in time the signal is captured by the capture device to the point in time it's available to script. This latency varies in time, because the audio input device and the script are most of the time not in the same clock domain. OS and/or browser need to do something about the clock drifts, and this can mean buffering and/or resampling and/or frame dropping and/or frame insertion (implementations can use a variety of techniques here, and historically have been able to compete on the quality of the implementation, namely in terms of glitches, latency stability and overall latency duration). It is important for some use case to have this value at each time, and not aggregated (unlike in #97, and the aggregation as it is doesn't make sense anyway).
- The number of audio frames that have not reached the script, despite having been captured by the underlying system (often called dropped frame). Those frames can have been dropped by the OS or the browser engine. It would be more useful for this to represent a number that doesn't count frames dropped by the clock drift correction technique in the previous point, and instead be solely about performance. Maybe there is value in having different counters here if it's important to distinguish between the OS dropping frames and the engine dropping frames.
- The total number of frames captured, this is useful to derive a relative quality metric using the number of frames dropped.
- The number of frame drop events, I'm not sure why, what the use case is.
- The total audio frame duration, this is equivalent to the number of frames successfully captured, but is independent of the sample-rate, it can be useful I suppose.
For video:
- The number of video frames delivered, but this is in fact the number of frames received, that can be forwarded downstream
- The number of video frames discarded to achieve a target frame rate
- The total number of video frames that this source has seen. By subtracting the sum of the first two number to this number, it's possible to know the number of frames dropped, that impact quality.

For each of those numbers (audio and video), it's important to know why they exist, to then be able to know how to design the API, here's what I'd use them for. In addition, an estimation of the frequency at which those APIs might be called:

input latency: this is important to implement an AEC, synchronize the A/V of outbound streams, and be able to implement multi-track recording (e.g. in a digital audio workstation, being able to synchronize a track that's being recorded over a backing track that's being played back). This is called rather frequently, because existing implementations can have an input latency that vary over time (this is true in Firefox if using multiple audio input devices simultaneously, but also other implementations). This number is best used not on the main thread however. The Audio Working group would also like to be able to have the audio input latency from within the audio real-time thread, but that's probably for another design iteration.
number of audio frames that have not reached the script: we can use this to warn users that there is a quality issue on the input, and that the recording might not be perfect. Web Apps can use this to try to back-off from using expensing audio processing, or alert that the CPU load might be high, asking the user to maybe close another program or tab. This is probably queries at requestAnimationFrame rate in any audio software that want to alert users that something is wrong and that e.g. the recording might have glitches.
total number of frames: we can use this to compute a percentage value with the previous number. I'll note that it's important to be able to reset those counters, because doing absolute percentages over the duration of e.g. a call are not a good quality metric in the presence of load spike or other transient problems. Same frequency as the previous numbers.
number of frame drop events: this seems to be a crude aggregation of the above. I don't know about the frequency.
total audio frame duration: this is a number derived from another value above, not strictly necessary. I don't know about the frequency either.
number of video frames delivered: this is just a counter, not useful on its own.
number of video frames discarded: this is something that can be computed based on other numbers (namely the native frame rate and the target frame rate).
total number of video frames: this is the number of frames that this source has seen. If we do a computation on those three numbers, we can finally have the information we care about: is the pipeline struggling to keep up yes or no? I'd call all those numbers at requestAnimationFrame rate, or at the video capture rate, to be able to warn users that something isn't going smoothly, and the recording won't be perfect.

Please either agree or propose your own use-case and frequency. Of course, the use-case that is the hardest will guide the API design, because it's always possible to implement simpler use-cases on top.

Now, in terms of possible API design, we're talking about getting a small number of integers or floating point values from somewhere in the web browser to either the main thread or a worker thread, possibly cross-process. This information is metadata about the data that flows. The data that flows is order of magnitudes bigger than what we need here, and flows at a rate that is a lot higher than any of the calling rate that has been mentioned.

Updating those values unconditionally per track is not going to impact the performance profile of the implementation. In particular, some of these values are already shipped with the data (audio latencies, timestamps, etc.) in current implementations. There is no need for synchronicity via promises, because the values are already there. Having an async API is also absolutely not ergonomic, and can even lead to problems if the value is used for synchronization and the event loop is busy.

Finally, we can decide to preserve run-to-completion semantics (like most attributes on the web platform). This can be done by only updating those values during stable state, and updating them all at once.

We can also decide to not preserve run-to-completion semantics (à la e.g. performance.now()), in which case there's nothing to be done.

henbos commented 1 year ago

Please either agree or propose your own use-case and frequency.

We want to poll the API once per second or less. If there are other use cases we can unblock we should discuss that, but I do not see any reason to poll it more frequently. More about this below.

It seems like the initial use-case was about being able to assess quality

Yes.

For quality analysis you want to be able to do something similar to chrome://webrtc-internals/ i.e. plotting graphs in real-time by polling at 1 Hz. In this case you only want aggregation over 1 second intervals. E.g. "avg frame rate over the last second."
For A/B testing (e.g. an app changing frontend or backend code) you want to be able to do A/B experiments and see if you're improving or regressing quality, polling once every 10 seconds may suffice or even less frequently, e.g. "avg frame rate over the entire call."

Detecting quality issues and notifying the user is also a possibility, but this can be done at 1 Hz or less frequently.

It is important for some use case to have this value at each time, and not aggregated

Aggregation does not exclude getting a value each time.

Aggregated metrics allow you to get averages over any time period because you can calculate averages as "delta metric / delta time". If you want the average frame rate every time you call getStats() at any desired frequency and apply the formula (currStats.deliveredFrames - prevStats.deliveredFrames) / (currStats.timestamp - prevStats.timestamp). This works whether you want to poll it every second or every minute.

This is a lesson learned from WebRTC getStats() API: when we exposed instantaneous values (e.g. "bitrate" instead of total "bytesSent"), apps got into this bad habit of polling the API at super high frequency just to make sure they didn't miss any dips. When the API was standardized we used aggregated counters, and apps could reduce polling frequency and still detect dips even when polling once every few seconds.

This is important for performance. Not having to poll APIs unnecessarily often gives the CPU more opportunity to reach idle power states and is a generally a good idea to reduce CPU frequency scaling. Reducing scheduling has been a successful performance optimization strategy in Chrome, so I would prefer we don't encourage apps to poll the API more often than needed by exposing instantaneous values. (This is primarily of concern when the app has its own timer, and less of a concern when the app is already responding to existing events in which case the CPU is not idle anyway.)

The number of frame drop events, I'm not sure why, what the use case is.

Frame drop events are needed to tell the difference between if there was one big drop and then we recovered or if there was continuously smaller glitches happening evenly throughout the call, for instance.

If we don't provide this for the app, the app may poll the API 10 times per second instead, preventing CPU idling.

input latency: this is important to implement an AEC, synchronize the A/V of outbound streams, and be able to implement multi-track recording (e.g. in a digital audio workstation, being able to synchronize a track that's being recorded over a backing track that's being played back). This is called rather frequently, ...

For the use cases I've mentioned this does not need to be called frequently because you can using the "delta total latency / delta total samples" compute the average latency experienced for the desired time interval (e.g. 1 second or 10 seconds).

But if this can unblock more use cases then that should be discussed. AEC and A/V sync for instance has traditionally been solved by the browser (at least for WebRTC), but the trend seems to be to let the app to control more and more of the stack.

number of audio frames that have not reached the script [...] This is probably queries at requestAnimationFrame rate [...] in any audio software that want to alert users that something is wrong [...]

total number of video frames [...] I'd call all those numbers at requestAnimationFrame rate, or at the video capture rate, to be able to warn users that something isn't going smoothly [...]

Why would you want to poll it in every requestAnimationFrame? This is a real-time application, so I think you need to establish that more than one frame has gone missing before that warrants notifying the user. Besides, unless the intended user is Spider-Man, I don't think they need to react to a problem on the very same frame that a problem occurred. Here, too, I imagine you'd want to poll every 1 or every 10 seconds. Glitches down to the millisecond level are not actionable. Variation is to be expected. But if a significant portion go missing over an extended period of time, then we need to act...

total number of frames: [...] I'll note that it's important to be able to reset those counters [...]

As previously mentioned, the "delta X / delta Y" pattern means you never have to reset any counters because you are already in full control of what interval you are looking at.

This is actually another lesson learned from the WebRTC getStats() API: different parts of the code may be interested in metrics for different reasons or under different circumstances. We don't want calling the API to have side-effects such as resetting the counters, because then that means that one consumer of the API affects all other consumers of the API. This has downsides without upsides.

Now, in terms of possible API design, we're talking about getting a small number of integers or floating point values from somewhere in the web browser to either the main thread or a worker thread, possibly cross-process. This information is metadata about the data that flows. The data that flows is order of magnitudes bigger than what we need here, and flows at a rate that is a lot higher than any of the calling rate that has been mentioned.

This is a good point. If you "piggyback" metadata onto the data that is either way flowing then you don't have to increase the number of IPC calls. But what if the browser does both capturing and rendering in the GPU process? Then there are no IPC calls to the JavaScript process. Such browser might throttle the frequency of IPC updates, and then the synchronous API is no more powerful than a throttled promise-based API.

But if this is a moot point, then synchronous vs asynchronous might not matter as much as I have been thinking it does. It's hard to have an opinion without knowing everyone's implementation.

I think one of the reasons I hesitate with synchronous API is that at 1 Hz polling frequency, it is hard to imagine the GC issue causing any real problems, but I've been surprised before.

henbos commented 1 year ago

TL;DR version:

1 Hz or less is the intended use case, but if the same API can solve multiple problems we should consider that.
Aggregated metrics allows averages over any time period which has the benefit of not encouraging sub-second polling, not needing to reset counters, and allowing the app to decide any polling freqeuncy.
Maybe the synchronous API is something we can do, but let's make sure we take everyone's browser stack into account.

alvestrand commented 1 year ago

1) The getStats() design is based on the idea that getting consistent values of all counters, together with the timestamp of when they are measured, is critical. 2) The values (as implemented in Chrome) are stored in a different thread than the main JS thread. So getting them involves IPC. This militates strongly against both a synchronous API (and cache-on-first-read is still synchronous) and calling the API at very high frequency (certainly more than once per frame / sample block; preferably a lot lower than that (we do IPC per frame anyway)). 3) If GC is a main concern, the solution should be based on BYOB, not on changing the threadedness, something like:

valueBuffer = await track.getStats(valueBuffer);

This may mean defining the value buffer as some other object than "dictionary"; I don't know how BYOB semantics works with other APIs.

youennf commented 1 year ago

The getStats() design is based on the idea that getting consistent values of all counters, together with the timestamp of when they are measured, is critical.

Can you explain which use cases make this critical? I see the stats API as a way to build aggregated data that will allow assessing overall quality.

FWIW, the current promise-based no-byob approach seems ok to me. It is consistent with existing MMediaStreamTrack API, which is not a realtime API, as can be seen from the usage of events and promises in this API.

For things like input latency/AEC, I would expect the API to be exposed where the actual processing is happening (say Audio Worklet, or MediaStreamTrackProcessor or VideoFrame...), not through MediaStreamTrack.

henbos commented 1 year ago

Getting multiple metrics at the same time is important for when subtracting one value from the other, if taken from different points in time you’ll get errors. Timestamp is needed for ”delta seconds” when calculating rates (e.g. frames per second).

For things like input latency/AEC, I would expect the API to be exposed where the actual processing is happening (say Audio Worklet, or MediaStreamTrackProcessor or VideoFrame...), not through MediaStreamTrack.

I think these use cases also should happen on real-time threads and with different metrics than has been discussed for adding to track, so I’m not sure how much overlap there is between the different use cases discussed.

jan-ivar commented 1 year ago

Some points I haven't seen mentioned:

Both aggregate and instantaneous values seem implementable either way, thus orthogonal to the discussion
Consistency between values seems implementable either way — regardless of approach, a UA can use a get-one-get-all approach under the hood to expose a consistent record of values per task — thus orthogonal to topic
Retrieval from a different thread doesn't involve IPC which would only be needed for out-of-process state (which I'm not hearing anyone claiming they have)
What is and isn't "real-time" seems to lack a definition. MST can be transferred to workers ("real-time threads") and seems just as real-time as e.g. the live HTMLMediaElement object with its currentTime attribute.
Short-lived objects may be a smaller hit in implementations that implement GC nurseries

IOW, I don't think the facts support there being a big difference either way implementation-wise, and that much of this seems to come down to choice and consistency, where the Media WG and WebRTC WG have diverged. MediaCapture is sort of in the middle — based on WebRTC's unsuccessful attempts to pawn it over to the Media WG — so it makes sense for this to come up here.

We seem to be arguing about § Consider whether objects should be live or static, and I'd rather we decide based on end-user ergonomics than implementation difficulties I'm not seeing. This also seems like a more general discussion over API — future of attributes? — so it might be good to solicit advice from a broader audience.

henbos commented 1 year ago

Agree let's get back to the original topic. But I do think aggregate counters are an important feature because as we learned by pc.getStats(), we need to be able to calculate average rates over user-specified time intervals.

We seem to be arguing about § Consider whether objects should be live or static, and I'd rather we decide based on end-user ergonomics than implementation difficulties I'm not seeing.

I think currStats.foo - prevStats.foo is an important feature, which necessitates that at least prevStats is a static object (or at least static since the last time it was refreshed).

I like track.audioStats.refresh(), but prevStats = track.audioStats doesn't work if audioStats is an interface rather than a dictionary. That's why I think track.getStats(result) seems the most straight-forward, but if you're concerned this is user-error prone, how about marrying the two proposals into this third option?

// Initiallly
let prevStats = new MediaStreamTrackStats();
let currStats = new MediaStreamTrackStats();

// Periodically
await currStats.refresh(track);
console.log(
    "FPS: " +
    (currStats.frames - prevStats.frames) / (currStats.timestamp - prevStats.timestamp) * 1000;
[prevStats, currStats] = [currStats, prevStats];

What do you think?

youennf commented 1 year ago

3. Retrieval from a different thread doesn't involve IPC which would only be needed for out-of-process state (which I'm not hearing anyone claiming they have)

I haven't looked precisely at the implementation strategy we would use for those stats. The audio unit in WebKit is living out of process so I would not initially forbid IPC based strategies.

4. MST can be transferred to workers ("real-time threads") and seems just as real-time as e.g. the live HTMLMediaElement object with its currentTime attribute.

MST can be transferred to workers, not audio worklets, which are really real time threads where GC might be a bigger issue. In general, I see these stats API as a convenient way to get information you could gather from either audio worklet plugged to MST or to MediaStreamTrackProcessor+VideoFrames by getting individual chunks, checking timestamps, counters...

This also seems like a more general discussion over API — future of attributes? — so it might be good to solicit advice from a broader audience.

Agreed if we start to design things differently than what is being done today.

There are plenty of APIs (say rvfc) which are creating dictionaries/objects for every call, and these calls can be made at 30Hz/60Hz. I do not see this particular API as more expensive/requiring more optimizations than existing APIs.

If we are trying to optimize things, we need to have a clear understanding at what we are trying to achieve with testbed/measurements and so on. If we do not have this, I prefer we stick with existing API design.

Also, BYOB can be quickly broken if we desire to extend stats (new objects, sequences...). I see ease of use (nicely bucketed dictionaries, sequences when it makes sense) and ease of extension as more important than optimization strategies.

jan-ivar commented 1 year ago

... we need to be able to calculate average rates over user-specified time intervals.

FWIW my recent experience outside of WebRTC is folks seem to prefer standard metrics over custom math, but YMMV.

I think currStats.foo - prevStats.foo is an important feature, which necessitates that at least prevStats is a static object (or at least static since the last time it was refreshed).

I don't think it necessitates that. You do:

const {foo} = track.videoStats;
await wait(1000);
const foops = track.videoStats.foo - foo;

But sure if the main use case is bulk data collection to feed some graph library, then static copies may have an edge since they're desired.

await currStats.refresh(track); ... What do you think?

A refresh method (which I introduced) is not without usability pitfalls either, the main one being unaware it exists. After my last comment, I'm not convinced implementation concerns should outweigh ergonomics here.

The audio unit in WebKit is living out of process so I would not initially forbid IPC based strategies.

Are downstream audio sinks also out of process? If getting these metrics is costly, are we adding them in the wrong place? Having JS trivially trigger IPC repeatedly seems like a bad idea.

In general, I see these stats API as a convenient way to get information you could gather from either audio worklet plugged to MST or to MediaStreamTrackProcessor+VideoFrames by getting individual chunks, checking timestamps, counters...

Convenience seems a weak argument. If the information is already available, why do we need a new API?

Agreed if we start to design things differently than what is being done today.

The Web Platform design principles suggest dictionary use for inputs, not outputs, and there's some history suggesting they were primarily designed for that. So I'd describe attributes as the conservative design choice, and dictionary-returning methods as new. It might be useful to have general guidelines for deciding when to use one over the other, and I'm observing that we don't have that.

There are plenty of APIs (say rvfc) which are creating dictionaries/objects for every call, and these calls can be made at 30Hz/60Hz. I do not see this particular API as more expensive/requiring more optimizations than existing APIs.

I find this a compelling argument regarding performance concerns, but not a compelling argument against attributes as the simpler choice.

youennf commented 1 year ago

Are downstream audio sinks also out of process? If getting these metrics is costly, are we adding them in the wrong place? Having JS trivially trigger IPC repeatedly seems like a bad idea.

Most downstream audio sinks are going to the content process in WebKit, but they might not have all data available out of process. For instance, audio drops are not known from audio sinks in WebKit's current implementation.

The same could be said about discardedFrames for camera capture.

I'd describe attributes as the conservative design choice, and dictionary-returning methods as new.

Current APIs seem to induce that both are conservative design choice. We are using dictionary for output where it best fits. One case where it makes sense is when there is a group of different data items that make sense together, and/or when we think having an extension point is desirable.

Looking at MediaTrackFrameStats, we knew from the start that we might add non delivered frame categories in the future. This validates the extension point. The data items are also synchronised in the sense that totalFrames - discardedFrames - deliveredFrames is the number of frames not delivered for other reasons. Hence the desire for a dictionary other 3 separate attributes.

I find this a compelling argument regarding performance concerns, but not a compelling argument against attributes as the simpler choice.

AIUI, the new information provided in this thread is specifically about performance concerns, in particular GC in realtime environments. Albeit this, I am not sure what additional new information there is to change the current API design.

Note again that MST audio stats would not happen in a realtime audio environment since MST is not transferable to audio worklets. Exposing such data in audio worklet is a different problem with different constraints.

jan-ivar commented 1 year ago

Most downstream audio sinks are going to the content process in WebKit, but they might not have all data available out of process. For instance, audio drops are not known from audio sinks in WebKit's current implementation.

Right, I think the argument was this metadata could piggyback on the data without adding much overhead, if it led to more ergonomic web surfaces.

Note again that MST audio stats would not happen in a realtime audio environment since MST is not transferable to audio worklets.

The characterization of the OP feedback as "real-time" was mine, and not meant to disqualify concerns outside of audio worklets. Rather, the OP points out lack of consistency with existing Media WG APIs like audioContext.outputLatency and mediaElement.currentTime which are capable of returning up-to-date information without async methods or dictionaries.

It's worth considering consistency across the web platform and not just within a single WG.

One case where it makes sense is when there is a group of different data items that make sense together, and/or when we think having an extension point is desirable. ... The data items are also synchronised

Grouping and synchronization both seem orthogonal to this discussion, e.g. track.audioStats and track.videoStats are groups, and strategies were proposed to update attributes in lockstep and follow § 5.2.

youennf commented 1 year ago

Sure, what I am saying is that using a dictionary is the current usual way of doing things, developers know it well. The audioStats/videoStats/refresh is a new pattern AFAIK, the main benefit being GC reduction/performance which it seems we decided to leave out of this discussion.

henbos commented 1 year ago

We are trying to expose capture metrics to the main thread or workers, this...

Inherently involves IPC (whether or not we can piggyback).
Is happening in contexts which are NOT real-time.
Is not just convenience, this is new data about capture.

It's worth considering consistency across the web platform and not just within a single WG.

Some differences between the mentioned other APIs and our API:

mediaElement.currentTime is "an approximation of the current playback position" expressed in seconds, it does not need to be updated at high frequency.
audioContext.outputLatency is an estimation, so this not being exact is baked in to the definition, which means it does not need to be updated every 10 ms in a non-audio worklet.

I do see the appeal of synchronous APIs like this, but they are not suitable for metrics that update on every audio sample (e.g. every 10 ms) for measuring things in a separate process. It's funny that this is what we're discussing, considering the issue was about making the API more performant, not less. (And even if we could find an example of something that is bad doesn't mean we should copy it.)

Right, I think the argument was this metadata could piggyback on the data without adding much overhead, if it led to more ergonomic web surfaces.

Please let's not go down this path of assuming we can always piggyback. We are interested in moving consumers that today live on the renderer process to the GPU process for performance reasons, so it is a very real possibility that signals that we could piggyback on today may not be there in the future. We should not bake in assumptions about the browser architecture into the API unless we have to, so for an API that isn't even real-time it seems wrong to bake in the requirement that metrics are pushed in real-time to non-real-time contexts.

FWIW my recent experience outside of WebRTC is folks seem to prefer standard metrics over custom math, but YMMV.

As previously discussed, I think total counters are superior to current rates because it allows us to measure between any interval we desire.

I think some variation of the async APIs we have discussed is superior, in order of personal preference but any of them do it for me: await track.getStats(), await track.getStats(results) or results.refresh(track).

henbos commented 1 year ago

Async APIs: suitable for grabbing snapshots from other contexts, possibly involving IPC. Sync APIs: suitable only if we want estimates such as smoothed values (i.e. not necessarily updated every frame), but I prefer total counters.

youennf commented 1 year ago

Is not just convenience, this is new data about capture.

The point I was trying to make is that, should this info be useful for realtime environments (and it might), we should expose it there too, in a more efficient/synchronized way. This API would then become an optimized convenience.

I tend to prefer await track.getStats(), I do not see real value in await track.getStats(results). results.refresh(track) would require evaluation from other WGs since this is a new approach AFAIK.

jan-ivar commented 1 year ago

Sure, what I am saying is that using a dictionary is the current usual way of doing things, developers know it well.

Are you sure? Do you have examples outside of WebRTC? I couldn't find a comprehensive list so for fun I asked GPT4 and it found only a few odd ones that weren't interfaces and finally said "WebIDL dictionaries are primarily used for passing data or configuration objects as method parameters, and as such, they are less commonly used as return values." 🙂

I think the winner here is attributes.

The audioStats/videoStats/refresh is a new pattern AFAIK

The refresh method is a red herring. See OP. In my mind, the only two ergonomic options are, e.g. for video:

const {deliveredFrames, discardedFrames, totalFrames} = track.videoStats;
const {deliveredFrames, discardedFrames, totalFrames} = await track.getStats();

jan-ivar commented 1 year ago

mediaElement.currentTime is "an approximation of the current playback position" expressed in seconds, it does not need to be updated at high frequency.

It's "an approximation of the current playback position that is kept stable while scripts are running." to respect w § 5.2.

I.e. it's an approximation because it's not allowed to advance while JS is running, not because it's inaccurate.

As you can see here, it updates every ~10 ms in all browsers at what appears to be millisecond accuracy.

No await needed.

jan-ivar commented 1 year ago

There's even a frame counter in the fiddle, though I fear it was never standardized:

const frames = v => v.mozPaintedFrames || v.webkitDecodedFrameCount;

I wonder what the story was there.

henbos commented 1 year ago

In my mind, the only two ergonomic options are, e.g. for video:

const {deliveredFrames, discardedFrames, totalFrames} = track.videoStats;
const {deliveredFrames, discardedFrames, totalFrames} = await track.getStats();

We've narrowed it down :) that's progress. At this point, it's hard to argue that there is a big difference between the two from an ergonomics point of view. Tomato tomato. We have to things we need to decide:

Return a dictionary or an interface?
Promise or not?

From a purely ergonomics point of view I think we want to allow rewiring the above to the following:

const firstStats = track.videoStats OR await track.getStats();
...
const secondStats = track.videoStats OR await track.getStats();
console.log("Discarded frames: " + secondStats.discardedFrames - firstStats.discardedFrames);

This would support dictionary over interface based on ergonomics now that we've concluded GC is no longer a concern. But either will work, it's just that interface is more error-prone.

As you can see here, it updates every ~10 ms in all browsers at what appears to be millisecond accuracy.

I think what's happening is we're piggybacking on existing signals. If we send frames directly from Capture process to GPU process and had no signals to piggyback on, would we still want to update every ~10 ms or would we relax it a bit? I think the answer to that question depends on the use cases: are we supporting real-time or non-real time contexts?

youennf commented 1 year ago

About currentTime, I think there are use cases for this API for synchronising the web content with the media content being rendered, hence why it is aggressively updated.

In the getStats API case, I do not think there is a desire to support any such scenario (we have other APIs for this kind of processing). About dictionary vs. interface, interface seems overkill to me, dictionary is largely sufficient.

jan-ivar commented 1 year ago

const firstStats = track.videoStats OR await track.getStats();
...
const secondStats = track.videoStats OR await track.getStats();
console.log("Discarded frames: " + secondStats.discardedFrames - firstStats.discardedFrames);
This would support dictionary over interface based on ergonomics now that we've concluded GC is no longer a concern. But either will work, it's just that interface is more error-prone.

Glad you remembered await 😏

Hopefully I don't have to argue why synchronous attributes are simpler for developers than a dictionary promise method.

If the extra interface is confusing we can of course put the attributes on the track itself:

const firstDiscardedFrames = track.discardedFrames;
...
console.log(`Discarded frames: ${track.discardedFrames - firstDiscardedFrames}`);

Also note there's no await so we didn't have to yield to other application code just to get this info. We also didn't need the second temporary. KISS

This would support dictionary over interface based on ergonomics now that we've concluded GC is no longer a concern. But either will work, it's just that interface is more error-prone.

I don't think we've concluded GC is no longer a concern. It would clearly happen in one API and not the other. Yes some browsers may be quick to clean up the garbage, and other methods may litter more, but that's no reason to create garbage if it's not helpful. I also don't speak for @padenot.

padenot commented 1 year ago

We are trying to expose capture metrics to the main thread or workers, this...
* Inherently involves IPC (whether or not we can piggyback).

This is not a valid argument, the data has to be in the content process anyway (see the exhaustive breakdown below). It's like saying any network data is out of content processes: it's true, but it's irrelevant because it has to be in the content process eventually, and sooner than later if the implementation is of high quality.

* Is happening in contexts which are NOT real-time.

Real-time is a spectrum. There is hard real-time (like real-time audio processing), and there are softer flavours of real-time, such as graphics rendering at 60Hz. A digital audio workstation is of higher quality if it has this piece of information reliably on-time. Similarly, something that processes and renders video at 60Hz will want to know extremely frequently if there are quality issue, maybe it wants to back-off some expensive processing.

* Is not just convenience, this is new data about capture.

That is correct.

It's worth considering consistency across the web platform and not just within a single WG.

Some differences between the mentioned other APIs and our API:
* mediaElement.[currentTime](https://html.spec.whatwg.org/multipage/media.html#dom-media-currenttime-dev) is "an _approximation_ of the current playback position" expressed in _seconds_, it does not need to be updated at high frequency.

* audioContext.[outputLatency](https://webaudio.github.io/web-audio-api/#dom-audiocontext-outputlatency) is an _estimation_, so this not being exact is baked in to the definition, which means it does not need to be updated every 10 ms in a non-audio worklet.
I do see the appeal of synchronous APIs like this, but they are not suitable for metrics that update on every audio sample (e.g. every 10 ms) for measuring things in a separate process. It's funny that this is what we're discussing, considering the issue was about making the API more performant, not less. (And even if we could find an example of something that is bad doesn't mean we should copy it.)

Those two APIs you're mentioning are both updating at a higher rate than 10ms: an AudioContext's currentTime updates at 128 / AudioContext.sampleRate (e.g. 128 / 44100 ~= 0.003s = 3ms, this is the slowest it can update since the sample rate can be higher). A high quality implementation of outputLatency will update at more or less the same rate (on some OSes, the number is made available from the real-time audio callback, that's also used to drive the clock -- same frequency of update).

Right, I think the argument was this metadata could piggyback on the data without adding much overhead, if it led to more ergonomic web surfaces.

Please let's not go down this path of assuming we can always piggyback. We are interested in moving consumers that today live on the renderer process to the GPU process for performance reasons, so it is a very real possibility that signals that we could piggyback on today may not be there in the future. We should not bake in assumptions about the browser architecture into the API unless we have to, so for an API that isn't even real-time it seems wrong to bake in the requirement that metrics are pushed in real-time to non-real-time contexts.

It's often not useful to make a general statement when we know everything about the specifics of a situation. Here we have complete information, and we can make a good API design that's better for users and developers, that's it.

Please remember the priorities:

User needs come before the needs of web page authors, which come before the needs of user agent implementors, which come before the needs of specification writers, which come before theoretical purity.

Being able to have accurate values quickly leads to better software. Having a sync API with attributes, that doesn't GC is more ergonomic than having a promise API with a dict.

It's not hard for us browser developers to send a few dozen numbers alongside other pieces of information that we have to make available to the content processes, always, and that can never not be made available because script can and do observe those values.

Let's comment on the proposed metrics:

totalFrameDuration: a trivial sum -- always available in the content process, already update at the correct rate
droppedFrameDuration: an integer that's incremented sometimes
droppedFrameEvents: an integer, related to the previous integer, also incremented sometimes
framesCaptured: a trivial sum
totalCaptureDelay: a trivial sum of information already sent over in the content process
delivered video: already present in the content process
discarded video: not present in the content process, but probably derivable
total video frames: already present in the content process for other APIs (https://w3c.github.io/media-playback-quality/#concepts -- this is already a sync API).

In a competitive implementation, you're sending and receiving on the order of 128 * channel-count float32 each 128 / 44100Hz (this translates to sub-10ms roundtrip latency with e.g. a macbook). We're proposing to add a small number of integers, which are metrics about the audio data, each time we're sending this audio data. We're already sending some metadata anyways. For video, we're already sending metrics, and they are already available synchronously.

FWIW my recent experience outside of WebRTC is folks seem to prefer standard metrics over custom math, but YMMV.

As previously discussed, I think total counters are superior to current rates because it allows us to measure between any interval we desire.

This is something we can discuss, but the current way of doing it (as specced in the draft) isn't satisfactory because it masks transient events: it forces authors to rely on time averages, which is inherently a low-pass filter.

I think some variation of the async APIs we have discussed is superior, in order of personal preference but any of them do it for me: await track.getStats(), await track.getStats(results) or results.refresh(track).

async APIs are never superior, always slower, and never more ergonomic than sync APIs when the information is already present and updated in the content process, and it is, here.

padenot commented 1 year ago

About currentTime, I think there are use cases for this API for synchronising the web content with the media content being rendered, hence why it is aggressively updated.

This is precisely why we need frequently updated input and output latency: to synchronize with media content being rendered.

currentTime includes latency, for A/V sync of video playback. Here we want to synchronize audio input and audio output, and potentially visuals, so we need all numbers frequently.

I'll note that there are already commercial and free-software products 0 [2] that go to great length to estimate audio input latency, and it's only best effort (a static number). Those products cannot function without an accurate audio input latency.

[2]: https://wam-studio.i3s.univ-cotedazur.fr/ free software academic project

dontcallmedom-bot commented 1 year ago

This issue was discussed in WebRTC May 2023 meeting – 16 May 2023 (Issue #98 track.getFrameStats() allocates memory, adding to the GC pile)

henbos commented 1 year ago

Sorry for the late response, this was covered in the last interim (slides and recordings available) but I had not read your latest comments at that point. My understanding is that this is an implementation versus ergonomics question and I tried to phrase the discussion around that.

The sense I get from your comments, as well as Jan-Ivar's comment (he tried to convey some of your comments during the meeting), is that the implementation concerns are not valid so the API shape should favor the user. But what are your thoughts on?

1: We neither want to PostTask on every update or have to use a mutex. Because we want to grab multiple metrics at the same time, I don't think there is a third option so in practise we would be forced to implement this with a mutex and a cache which seems undesirable. Agree/disagree?

2: The "overhead from number of updates per second" and "we may be forced to do IPC" concerns that I have stem from wanting to allow implementer flexibility and the fact that I think it is very likely that more metrics will be added in the future. This could include having to process metrics of more tracks (e.g. 54 remote tracks, leading to thousands of updates per second) or new metrics not possible to collect without IPC. Not forcing sync is a way to not paint ourselves into a corner in the future. Thoughts?

During the interim I heard support for async API (Youenn, Harald) but only Mozilla in favor of the sync API (Jan-Ivar). I did also present a compromise, which would be the sync API but to throttle the update frequency of metrics pushes but I didn't hear any takers for that.

Is there something essential that the WG has not talked about yet?

henbos commented 1 year ago

Regarding "this is already available", I think there are cases like AudioContext where the piggybacking only happens if you've opted in to it by using AudioContext, and then there are cases like Media Playout Quality where you can't opt out but these counters also don't necessarily need to run at hundres or thousands of updates per second. Nobody would complain if their playout counters "only" refreshed at the requestAnimationFrame frequency, clearly.

If we approach the sync track stats proposal with that mindset, i.e. same precision as requestAnimationFrame update frequency, then I think we can ensure a future proof API where we can always piggyback and get sufficient precision and performance but still have the API be sync.

WDYT @padenot @youennf ?

henbos commented 1 year ago

If all of this really is all available anyway and I'm making a big fuzz about nothing, we could proceed with sync and if performance every becomes a real concern browsers would be free to optimize, but perhaps it's an implementation detail at that point. But Youenn didn't you say Safari already has IPC constraints here, or did I misinterpret?

youennf commented 1 year ago

Browsers may do different tradeoffs between perf vs. accuracy. If we go with sync, we would probably need to provide clear guidance on refresh rate or we might end up in web compat issues.

And again, this information is not available for free. For instance, Safari rendering processes cannot currently compute some of the metrics you mention, for instance audio drops. We can of course send this additional information to the rendering process but I do not think we would like to do this preemptively nor posting tasks for no other reasons than updating those counters. This would require introducing a new stats object which would trigger active gathering of the stats.

I think we agreed perf is not the main issue here. From a web dev point of view, the async API seems easier to understand than introducing a new object with sync getters. From a browser implementor point of view, an async API is also probably easier to implement.

henbos commented 1 year ago

I think we agreed perf is not the main issue here. From a web dev point of view, the async API seems easier to understand than introducing a new object with sync getters. From a browser implementor point of view, an async API is also probably easier to implement.

jan-ivar commented 1 year ago

From a web dev point of view, the async API seems easier to understand than introducing a new object with sync getters.

Would you say an async API is also easier to understand than sync getters on the track itself (the latest proposal)?

This seems to assume a level of comfort with promises I doubt exists among the average web developer. Without more information to why that might be so, I fear it puts what I consider simple vs hard in JS on its head.

henbos commented 1 year ago

I think being comfortable with promises is a very low bar to ask of a developer working with getUserMedia (which returns a promise).

And for the record, we've conducted several experiments where we remove PostTasks or IPC calls and get measurable CPU% improvements out of it.

youennf commented 1 year ago

sync getters on the track itself (the latest proposal)?

I might have missed that. Do we agree that such an API would introduce a constant cost while the track is live, no matter whether the API is used or not?

I think being comfortable with promises is a very low bar to ask of a developer working with getUserMedia

Agreed.

Would you say an async API is also easier to understand than sync getters on the track itself (the latest proposal)?

I was comparing an async getter vs. a new object with sync getters.

I think it is fair to say that the data we want to expose here changes very often and lives in a separate process. Given the use case is clearly not to do realtime or near realtime processing (say raf/rvfc), using an async API is a good compromise.

henbos commented 1 year ago

I like async because it allows more implementations in theory. But in practise we do pass through the renderer process, so I think this is a matter of mutex versus PostTaskAndReply in Chromium.

henbos commented 1 year ago

Youenn does Safari have its audio frames pass through the renderer process unconditionally?

youennf commented 1 year ago

In Safari, each audio frame is not individually sent to the content process. Instead a ring buffer with shared memory is used.

Camera video frame pixels are not sent to the content process, except in case of sw encoding or VideoFrame.copyTo which are async tasks.

If we add more stats in the future, this will further increase the cost of sending data to the content process without the content process actually using it.

youennf commented 1 year ago

Here we want to synchronize audio input and audio output, and potentially visuals, so we need all numbers frequently.

Audio input / audio output will be done in web audio presumably, so in a worklet where there is no MediaStreamTrack. The proposed API does not seem well suited for this use case AFAIUI.

I understand the desire to have an API to get that info and that this info would be updated every time there is a new audio chunk to process. Should it be a separate API (that could include input and output latency estimates)?

henbos commented 1 year ago

I hold the opinion that synchronous APIs should read from internal slots such that you neither have to use a mutex or maintain a cache in case the API is called multiple times in the same task execution cycle.

I am in favor if sync API if the information is truly available on that thread, either because we are in a context where the information is available anyway (web audio worklet maybe?), or in a context where the app subscribes to an event that fires when the information is available.

I think(?) we all agree that we don't want to PostTask on every frame to the main thread to update metrics, but note that if the app wants to poll the API frequently we still have to post microtasks on ever get() to clear the cache that we're forced to maintain on the main thread. This is what we had to do for getContributingSources().

jan-ivar commented 1 year ago

To simplify discussion, nobody is suggesting PostTask. Queuing lots of task on main-thread would be a terrible implementation, risking queue buildup which might itself cause outdated information.

Implementation would either be lock-less (or a mutex until that can be implemented properly), or however existing sync web APIs mentioned in this thread are accomplished, e.g.v.webkitDecodedFrameCount and v.currentTime.

It might shed some clarity if someone could explain how those existing APIs are possible, and why we cannot follow their existing pattern (one of the metrics requested even appears to be the same: decodedframe count).

Because we want to grab multiple metrics at the same time, I don't think there is a third option

I believe this was answered off-thread, but for completeness, I believe the answer was: a lock-less ring buffer enables discrete access to consistent data sets.

Instead a ring buffer with shared memory is used.

This is what Firefox does as well I believe, which is what allows for lock-less access to metrics accompanying the data.

Camera video frame pixels are not sent to the content process, except in case of sw encoding or VideoFrame.copyTo which are async tasks.

Right, but even if the pixels stay, I understand something representing them is still sent over IPC? The overhead would be to add a minimal set of metrics to that IPC, which seems all that's needed to support these APIs and stay consistent with existing patterns in HTMLVideoElement.

youennf commented 1 year ago

Right, but even if the pixels stay, I understand something representing them is still sent over IPC?

This is the case now. For perf reasons, we could stop doing this in some reasonably frequent cases, say if camera is used solely for local playback and MediaRecorder for instance.

Note that the presented usecase for this kind of data for realtime processing is audio latency, where we do not have IPC for each audio chunk and where an API at MediaStreamTrack level does not seem like a great fit.

In that same vein, HTMLMediaElement.requestVideoFrameCallback might be a better API to expose information useful for realtime processing, or MediaStreamTrackProcessor/VideoFrame streams for video processing.

Given this, I am still interested in understanding more about the desire to shape this API for realtime processing, given it would likely have an impact on performance.

henbos commented 1 year ago

To simplify discussion, nobody is suggesting PostTask.

Except for a microtask to clear the cache every time the getter is used by JS, right?

It might shed some clarity if someone could explain how those existing APIs are possible

If I'm code searching correctly (so IIUC...), this is how Chrome does it:

v.webkitDecodedFrameCount currently uses a lock. The actual data appears to be periodically updated based on IPC messages, at least for some implementations of the statistics update interface. Since no caching is involved here I believe you could poll this API multiple times in the same task execution cycle and get different results, i.e. v.webkitDecodedFrameCount == v.webkitDecodedFrameCount does not hold IIUC.
v.currentTime is "updated periodically" according to a comment... it uses a microtask to say the position needs to be updated to ensure v.currentTime == v.currentTime holds. But IIUC when you use the getter it gets updated on request to the current time. This uses a lock to get the latest frame, which appears to be known because the compositor lives on the renderer.

I can vouch for the fact that currently Chrome is aware of each audio and video frame in the renderer process because of historical implementation decisions. I'm not convinced the implementation would necessarily have to look this way, but Chrome is like this.

So at least in Chrome, we could implement any statistics API with locks and microtasks to clear caches.

jan-ivar commented 1 year ago

Except for a microtask to clear the cache every time the getter is used by JS, right?

While that might be one way to invalidate a cache, it doesn't seem like the only way. E.g. a hypothetical implementation that counted its tasks could simply compare its current task number against the task number in the cache, and learn whether to refresh the cache that way. No microtask needed.

henbos commented 1 year ago

OK so ring buffer and task number would do the trick in that case.

Note that the presented usecase for this kind of data for realtime processing is audio latency, where we do not have IPC for each audio chunk and where an API at MediaStreamTrack level does not seem like a great fit.

In this case should extra IPC be added for the sake of sync API or would the sync API just need to return frequently updated information that may be less frequent than each audio chunk?

henbos commented 1 year ago

I'd like us to unblock this issue. Even if I don't like accessing off-thread information and caching it, both track.stats and await track.getStats() get the job done. We should come to a decision to unblock prototyping/shipping.

henbos commented 1 year ago

There is a lot of implementation freedom (mutex, ring buffer, or periodically updated internal slot) as long as the exposed information only has to be "relatively recent"

henbos commented 1 year ago

Does the lockless ring buffer block the producer if the consumer has not read from it yet? If we're talking about something like this and its performance remarks