kixelated / moq-js

TypeScript library for Media over QUIC
Apache License 2.0

Fix audio #3

Open kixelated opened 1 year ago

kixelated commented 1 year ago

I disabled audio because I was getting tired of working on the player. It needs to be synchronized with video, which is actually kind of annoying thanks to WebAudio.

zenomt commented 10 months ago

i recommend adopting the Flash Player "Tin-Can" timing model:

as an example, tcaudio.js + tcaudioprocessor.js in my RTWebSocket repo implement a Tin-Can-like interface* for WebAudio playback (with a volume control even), and tcmedia.js uses WebCodecs to decode video & audio and play them back synchronized (including if audio frames are dropped), with adjustable buffering and jitter adaptation, using tcaudio.

the only tricky part is actually caused by WebCodecs. at least the last time i checked the Chrome WebCodecs implementation, even though the documentation says the timestamp is supposed to be carried through unmodified between feeding a coded frame to an audio decoder and getting the decoded raw samples out, it actually isn't. instead, the timestamps on the decoded frames are just incremented by the duration of each decoded frame starting from the last decoder init/flush, so if any frames are missing, the timestamps of the decoded samples will be out of sync and wrong. this requires some janky heuristics to work around. you can't flush the decoder after every frame because (for stateful codecs like AAC) that causes an audio glitch.

* with some of your favorites like bufferTime, bufferLength, currentTime (for NetStream.time), and status events like NetStream.Buffer.Full, NetStream.Buffer.Empty, and NetStream.Buffer.Flush.
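
To make the renumbering concrete: because the decoder restarts its timestamp counter from the last init/flush, one missing coded frame shifts every subsequent output timestamp by one frame duration. A minimal detection heuristic might look like this (hypothetical names, not zenomt's actual code; assumes AAC at 48 kHz):

```typescript
// Sketch (assumed names, not from moq-js or RTWebSocket): detect when a
// Chrome AudioDecoder has renumbered its output timestamps after a
// missing coded frame. `expectedUs` is the timestamp fed in with the
// coded chunk; `decodedUs` is what came back on the decoded output.
// One AAC frame (1024 samples) at 48 kHz is ~21333 microseconds.
const FRAME_US = Math.round((1024 / 48000) * 1e6); // 21333

function needsResync(expectedUs: number, decodedUs: number): boolean {
  // If input and output timestamps diverge by more than half a frame,
  // the decoder has lost sync; flushing it (accepting one small glitch)
  // restarts its internal timestamp counter from the next input.
  return Math.abs(decodedUs - expectedUs) > FRAME_US / 2;
}
```

The half-frame tolerance is an assumption; any threshold smaller than one frame duration would catch a single dropped frame.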

chrisprobst commented 10 months ago

@zenomt How would you deal with clock drift without something like NetEQ? Playing audio at wall clock seems reasonable, but it can't compensate for the fact that 20 ms on my machine are not 20 ms on your machine. You have to stretch and shrink audio using something like NetEQ. Or am I missing something?

zenomt commented 10 months ago

my implementation handles jitter and clock drift the same way:

if you wanted to be extra fancy (my implementation doesn't do this), then to stave off an underrun, you could double some% of audio frames if the minimum buffered amount falls below a safety threshold.

for codecs like AAC and Opus, dropping or doubling audio frames isn't tremendously disruptive as long as the "some" percent is small (like less than 1%).

unfortunately, the Chrome implementation of the WebCodecs decoder doesn't carry through the timestamp on every coded audio frame, so dropping or doubling a frame will disrupt the timestamps on the output of the decoder. the workaround is to flush the decoder whenever you know you're dropping (or doubling) a coded frame, which will cause a little "pop" but reset/resync the timestamps on the decoded frames.
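
The drop/double policy described above could be sketched like this (assumed thresholds and names, not zenomt's actual implementation):

```typescript
// Sketch of a drop/double policy: compare buffered audio against a
// target and occasionally drop or double one frame to converge, keeping
// the adjustment rate small so it stays inaudible with AAC/Opus.
type Adjustment = "play" | "drop" | "double";

function chooseAdjustment(
  bufferedMs: number,
  targetMs: number,
  frameMs: number,    // e.g. 21.3 for AAC at 48 kHz
  frameIndex: number, // running frame counter
  rate = 100          // adjust at most 1 in every `rate` frames (1%)
): Adjustment {
  if (frameIndex % rate !== 0) return "play"; // cap the adjustment rate
  if (bufferedMs > targetMs + frameMs) return "drop";   // running long: skip a frame
  if (bufferedMs < targetMs - frameMs) return "double"; // running short: repeat a frame
  return "play";
}
```

With `rate = 100`, at most 1% of frames are adjusted, matching the "less than 1%" guideline above; per the workaround in the previous paragraph, a real implementation would also flush the decoder around each drop or double.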

chrisprobst commented 10 months ago

Thanks for going into details.

zenomt commented 10 months ago

@chrisprobst (and Luke): you can see how the above works in my implementation here:

chrisprobst commented 10 months ago

@zenomt I have trouble understanding how buffering alone solves the underrun issue effectively. With clock drift, every audio packet would play slightly too fast, and because audio is inherently linked to video, essentially every audio packet would cause an underrun, or A/V sync would force a short silence. I believe for smooth playout you really need something like NetEQ.

zenomt commented 10 months ago

most computer audio hardware is reasonably accurate and pretty stable (if it was unstable you'd hear that easily, and if the sample rate was way off the pitch would be noticeably wrong). i've observed sample rates around 48008/s for a nominal rate of 48000/s (about 1.7 parts in 10000). for AAC and a nominal sample rate of 48000/s, each AAC frame (1024 samples) is 21.3 ms long. at the quantum of "AAC frame", and assuming a "very live" setting with an instantaneous resume (rather than accumulating a longer buffer), you'd underrun one frame time every 125 seconds (2 minutes), and that underrun would only be 21.3 ms of silence. if you accumulated a longer buffer when rebuffering, the interval between underruns requiring a rebuffer would be proportionally longer. for a half-second buffer and the above drift, you'd need to rebuffer (for half a second) every 49 minutes.
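
The arithmetic above can be checked directly. Using the exact drift of 8 samples per second gives 128 s and 50 min; the 125 s and 49 min figures quoted above come from using the rounded 1.7-parts-in-10000 value instead:

```typescript
// Verify the drift arithmetic: real rate 48008 Hz vs nominal 48000 Hz.
const nominal = 48000;
const actual = 48008;
const driftPerSecond = (actual - nominal) / nominal; // ~1.67e-4 (~1.7 parts in 10000)

// One AAC frame is 1024 samples, ~21.3 ms at 48 kHz.
const frameSeconds = 1024 / nominal;

// Time to accumulate one frame of drift (a 21.3 ms underrun):
const secondsPerFrameUnderrun = frameSeconds / driftPerSecond; // 128 s, ~2 minutes

// Time to drain a half-second buffer at the same drift rate:
const bufferSeconds = 0.5;
const secondsPerRebuffer = bufferSeconds / driftPerSecond; // 3000 s, ~50 minutes
```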

kelvinkirima014 commented 1 month ago

Hey @kixelated,

I'd love to enable audio in this library and implement a decent form of A/V sync to make it more usable. What's the first thing I should look at, or where should I start? Any pointers would be greatly appreciated. :)

kixelated commented 1 month ago

I think the audio worklet needs a rewrite, but I don't know the precise problem. The current approach of using ordered Streams kinda sucks and results in blindly queuing data with no expectation of when it will (and should) be played.

kelvinkirima014 commented 1 month ago

I have been down a rabbit hole; turns out, web audio is hard. The first goal is to get any kind of audio out, even if it's noise, as all we get right now is silence. Looking at the code here, it seems we only send the video canvas and never forward the audio channels?

I also encounter this warning when I run the code locally and try watching a published stream:

The AudioContext was not allowed to start. It must be resumed (or created) after a user gesture on the page.

Apparently this could be fixed by making sure getAudioContext().resume() is called somewhere, similar to what you did with watch here, but I'm not sure where exactly that fits in the publisher.
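
One possible shape for that fix (a hypothetical helper, not existing moq-js code): attach one-shot gesture listeners that resume the context, since browsers refuse to start an AudioContext until a user gesture such as a click or keypress occurs.

```typescript
// Hypothetical helper: resume a suspended AudioContext on the first
// user gesture. Typed structurally so it accepts any context-like
// object with `state` and `resume()`.
type GestureTarget = {
  addEventListener(type: string, cb: () => void, opts?: { once?: boolean }): void;
};

type ResumableContext = {
  state: string;
  resume(): Promise<void> | void;
};

function resumeOnFirstGesture(ctx: ResumableContext, target: GestureTarget): void {
  const tryResume = () => {
    if (ctx.state !== "running") void ctx.resume();
  };
  // One-shot listeners: each fires at most once, on the first gesture.
  for (const type of ["click", "keydown", "touchstart"]) {
    target.addEventListener(type, tryResume, { once: true });
  }
}
```

A call like `resumeOnFirstGesture(getAudioContext(), document)` during page setup would cover both the watch and publish paths; `getAudioContext` here is assumed from the existing code.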

zenomt commented 1 month ago

gentle reminder that i posted links to running code using WebAudio, as well as a suggestion on a timing model for A/V sync, above in this Issue back in october 2023.