`onrtpreceived` has `readReceivedRtp` and `onpacketizedrtpavailable` has `readPacketizedRtp` (so you don't have to handle the event), but `onrtpsent` currently does require per-packet processing, and `onrtpacksreceived` requires per-feedback-message processing. So that's where we would need attention, more so on `onrtpsent`.
Feedback messages are at least somewhat aggregated: they are sent every 50–250 ms in libwebrtc, and recommended per frame or less often in the IETF draft.
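To make the asymmetry concrete, here is a minimal sketch of the two patterns. Only the method and handler names come from the discussion above; every signature, field, and object name is invented for the example:

```ts
// Hypothetical shapes for illustration only; every signature and field here
// is an assumption, not the spec.
interface RtpPacket { sequenceNumber: number; payload: ArrayBuffer; }
interface RtpReceiveStream {
  readReceivedRtp(maxCount: number): RtpPacket[];
}
interface RtpSendStream {
  onrtpsent: ((ev: { packet: RtpPacket }) => void) | null;
}

declare const rtpReceiveStream: RtpReceiveStream;
declare const rtpSendStream: RtpSendStream;
declare function processIncoming(p: RtpPacket): void;
declare function noteSentPacket(ev: { packet: RtpPacket }): void;

// Receive side: drain whatever has accumulated, e.g. on a timer, so no
// per-packet JS callback is needed.
setInterval(() => {
  for (const p of rtpReceiveStream.readReceivedRtp(128)) {
    processIncoming(p);
  }
}, 20);

// Send side as currently sketched: one JS event (and one task) per packet,
// so callback cost scales with the packet rate.
rtpSendStream.onrtpsent = (ev) => noteSentPacket(ev);
```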
`onrtpsent` is my bigger concern, indeed. Is it enough to just get the actual sent timestamps somewhere in the `RtpAcks` message and not be notified on every send? That depends on whether it's the UA deciding that packets are lost and notifying up (akin to the `TransportPacketsFeedback` struct in libwebrtc that is passed to the GoogCC implementation), or the JS making its own decision based on being notified of a send but not of a remote feedback ack.
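For illustration, one possible shape for ack feedback that carries the actual sent timestamps, with hypothetical field names; this is a sketch of the idea, not what the draft defines:

```ts
// Hypothetical shape only, assuming the ack feedback itself carried the
// UA-recorded send timestamps.
interface AckedPacket {
  sequenceNumber: number;
  sentTime: DOMHighResTimeStamp;            // stamped by the UA at actual send time
  remoteReceiveTime?: DOMHighResTimeStamp;  // absent if the packet was reported lost
}
interface RtpAcks {
  acks: AckedPacket[];
  feedbackReceivedTime: DOMHighResTimeStamp;
}

// A JS congestion controller could then run purely off aggregated feedback,
// roughly analogous to libwebrtc feeding TransportPacketsFeedback into GoogCC,
// without needing a per-packet onrtpsent callback.
function onAcks(feedback: RtpAcks): void {
  for (const a of feedback.acks) {
    const lost = a.remoteReceiveTime === undefined;
    // ...update the estimator with (a.sentTime, a.remoteReceiveTime, lost)...
  }
}
```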
It would be good to do some measurements to validate that we are not over-optimising at the expense of API ease of use. For instance, it is possible for the UA to batch the RTP-sent events and do a single hop to the worker thread for multiple `rtpsent` events. That might not fix the JS event-management overhead, but maybe that overhead is not a blocker.
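A rough sketch of what that UA-side batching could look like, using a hypothetical `rtpsentbatch` event purely for illustration; nothing here is proposed API:

```ts
// Hypothetical batched event: the UA accumulates sends and delivers them in
// one task, instead of one event per packet.
interface SentRtp {
  sequenceNumber: number;
  sendTime: DOMHighResTimeStamp;
}
interface RtpSentBatchEvent extends Event {
  sent: SentRtp[]; // everything sent since the previous event
}

declare const sendStream: EventTarget;
declare function recordSendTime(s: SentRtp): void;

// One worker-thread hop now covers N packets, amortising the event overhead
// while JS still observes every send.
sendStream.addEventListener("rtpsentbatch", (e) => {
  for (const s of (e as RtpSentBatchEvent).sent) {
    recordSendTime(s);
  }
});
```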
I definitely agree we need some more numbers here to get a better sense of the sweet spot of the tradeoff. A previous design discussion came to the broad consensus that something like >10 kHz of events is probably too much and <100 Hz is probably fine, but that was largely based on gut feeling, iirc. The best evidence we've found for the overheads in Chromium has come from looking at trace recordings and pprofs: e.g. a single audio Encoded Frame transform, so running at 50 Hz, was increasing the CPU load of a Meet page by 0.2% just to get the audio data from webrtc wrapped in an ArrayBuffer and into a JS event (internal ref: http://shortn/_ZzJdD2pcQS). I'll try to get something less anecdotal that can be shared publicly in detail. Assuming this is representative though, and scales ~linearly, launches needing to get towards the kHz range to be useful would start being blocked as major CPU regressions.
This did at least result in some V8 optimizations (e.g. crbug.com/40287747), but there's still plenty of cost at high frequency.
This issue was mentioned in the WebRTC Interim, 21 May 2024 (RtpTransport (Peter Thatcher)).
Note from discussion: Would be good to have "batch read" methods for `onrtpsent` and `onrtpacksreceived`, like `readSentRtp` and `readAckedRtp`.
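A sketch of how such batch-read methods might be used; only the method and handler names come from the note above, while the signatures, return shapes, and the coalesced "data available" event semantics are assumptions:

```ts
// Illustrative shapes only.
interface SentRtp { sequenceNumber: number; sendTime: DOMHighResTimeStamp; }
interface AckedRtp { sequenceNumber: number; remoteReceiveTime?: DOMHighResTimeStamp; }
interface RtpSendStream {
  onrtpacksreceived: (() => void) | null;   // fired as a coalesced "data available" signal
  readSentRtp(maxCount: number): SentRtp[];
  readAckedRtp(maxCount: number): AckedRtp[];
}

declare const sendStream: RtpSendStream;
declare function updateBandwidthEstimate(sent: SentRtp[], acked: AckedRtp[]): void;

sendStream.onrtpacksreceived = () => {
  // One callback drains everything queued since the last read.
  const acked = sendStream.readAckedRtp(1000);
  const sent = sendStream.readSentRtp(1000);
  updateBandwidthEstimate(sent, acked);
};
```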
Slides I had at the Discussion Group earlier this week pertaining to this issue: https://docs.google.com/presentation/d/1bIuSiUsAiYsokxfBbxVZiaQwiFqO7D2e6my7dGacHEo/edit#slide=id.p
TL;DR: there are three sources of CPU/power-consumption cost from an API with high-frequency events (detailed in the slides).
I'll prep a PR adding batch interfaces where an app would currently be forced to install Event listeners to get all of the required info.
The alternative to an event that exposes multiple packets is a ReadableStream of packets, which is more or less what WebTransport chose. A packet per microtask is probably cheaper than a packet per event-loop task. Why not make the same choice here?
Also, supporting multiple packet processors/listeners is probably not a requirement, which makes events less interesting.
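A minimal sketch of that stream-based consumption pattern, assuming a hypothetical ReadableStream of sent packets; the stream name and packet shape are made up for the example:

```ts
// Illustrative packet shape and stream; not proposed API.
interface RtpPacket { sequenceNumber: number; payload: ArrayBuffer; }

declare const sentRtpReadable: ReadableStream<RtpPacket>;
declare function handleSentPacket(p: RtpPacket): void;

async function consumeSentRtp(): Promise<void> {
  const reader = sentRtpReadable.getReader();
  for (;;) {
    // Each read resolves as a microtask, so several packets can be consumed
    // within a single event-loop task; this is the cost argument made above.
    const { value, done } = await reader.read();
    if (done) break;
    handleSentPacket(value);
  }
}
```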
We've talked before in the Design Discussion meetings about the need to reduce the CPU & power overhead of using RtpTransport APIs. One large source we've seen in Chrome when working with the Encoded Frame APIs is that frequent (100s of Hz) JS event scheduling brings a significant amount of overhead due to context switching, thread wakeups, JS event management, etc.; e.g. see crbug.com/40942405.
To avoid these issues at the packet level, which is inherently higher frequency than frames, we would need to avoid requiring apps to have JS callbacks execute individually per packet, for send, receive and BWE alike, in order to achieve their use cases.
We've already talked about this a few times in the design discussion meetings, and come to ideas such as having APIs to read batches of all available packets. The same needs to be considered for BWE feedback signals & mechanisms.
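For context, a rough sketch of that batch-reading direction with placeholder names, showing how the app's processing frequency could be decoupled from the packet rate; every identifier here is hypothetical:

```ts
// Placeholder shapes; not proposed API.
interface SendRecord { sequenceNumber: number; sendTime: DOMHighResTimeStamp; }
interface FeedbackRecord { sequenceNumber: number; remoteReceiveTime?: DOMHighResTimeStamp; }
interface BatchedRtpTransport {
  readSentRtp(maxCount: number): SendRecord[];
  readAckedRtp(maxCount: number): FeedbackRecord[];
}

declare const transport: BatchedRtpTransport;
declare function runCustomBwe(sent: SendRecord[], feedback: FeedbackRecord[]): void;

// Run the app's bandwidth estimator at ~20 Hz regardless of packet rate,
// so the JS-side cost no longer scales with packets per second.
setInterval(() => {
  const sent = transport.readSentRtp(Number.MAX_SAFE_INTEGER);
  const feedback = transport.readAckedRtp(Number.MAX_SAFE_INTEGER);
  if (sent.length > 0 || feedback.length > 0) {
    runCustomBwe(sent, feedback);
  }
}, 50);
```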