w3c / webcodecs

WebCodecs is a flexible web API for encoding and decoding audio and video.
https://w3c.github.io/webcodecs/

Audio codec, priming samples and padding, without container or information from the encoder #626

Open padenot opened 1 year ago

padenot commented 1 year ago

How are implementations supposed to handle encoder delay / priming samples / packet padding?

If PCM is encoded with a particular codec, and then decoded, does it roundtrip properly? I don't think it can, because the number of priming samples is not returned by the encoder, so there's no way to mux the file properly to signal this (e.g., using an edit list in an mp4, or any other codec/container pair specific way).

It's possible to make a guess for some codecs (e.g. AAC is frequently 2112 frames), but this isn't ideal (and sometimes it depends on the implementation of a codec).

This is important for accurate synchronization, and a hard requirement for musical applications (because missing this information breaks sample-accurate looping).

dalecurtis commented 1 year ago

Agreed! Chrome doesn't do anything today for this. @tguilbert-google

tguilbert-google commented 1 year ago

You're right.

Chrome pads the end of Opus encodes with silence.

Would the idea be to spec how much priming/padding should be used, or just to expose the information and let users trim accordingly?

tguilbert-google commented 1 year ago

As briefly discussed with Dale today, this could probably be an addition to EncodedAudioChunkMetadata

padenot commented 1 year ago

A design that would be fairly clean would be to have this information on the EncodedAudioChunk themselves.

On the encode side, the encoder would output EncodedAudioChunks whose timestamp can be negative, and whose duration can differ from the number of audio frames actually obtained when decoding (internally).

Then when decoding, it's clear that if the pts is negative, the output frames are trimmed up to a start time of zero, and if the duration is different from the number of frames, the frames are also trimmed so that the number of frames matches the duration. This can be done by the WebCodecs implementation, so there's no confusion.
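In a rough sketch (not spec text; the helper names are hypothetical, only EncodedAudioChunk's microsecond timestamp and duration are real fields):

```ts
// Hypothetical sketch of the trimming a UA would apply under this proposal.
function trimWindow(
  chunkTimestampUs: number, // EncodedAudioChunk.timestamp; negative means pre-roll
  chunkDurationUs: number,  // EncodedAudioChunk.duration; may be shorter than the decoded output
  decodedFrames: number,    // frames the decoder actually produced for this chunk
  sampleRate: number
): { frontTrim: number; framesToKeep: number } {
  const usToFrames = (us: number) => Math.round((us * sampleRate) / 1e6);
  // Negative pts: trim from the front, up to a start time of zero.
  const frontTrim =
    chunkTimestampUs < 0 ? Math.min(usToFrames(-chunkTimestampUs), decodedFrames) : 0;
  // Duration shorter than the decoded output: trim the excess from the end.
  const framesToKeep = Math.min(decodedFrames - frontTrim, usToFrames(chunkDurationUs));
  return { frontTrim, framesToKeep };
}
```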

This means that encoding / decoding round-trips properly without having to do anything, and that there is no extradata to carry at any point; each packet carries the information. Additionally, with this design it's now impossible to ignore the fact that some audio isn't to be rendered, generally yielding better output. Moreover, this means that comparing timestamps for audio and video makes sense, because they're on the same timeline and not offset by the encoder delay. This also makes looping of audio assets seamless by default when looping the PCM output (while keeping in mind that decoders need to be reset and that all packets, including the ones with negative timestamps, are to be decoded again for the output to be correct on the second loop iteration).

On the other side of this issue, authors who write demuxers for use with WebCodecs can decide to create EncodedAudioChunks with negative timestamps / adjusted durations if they know the encoder delay / padding. Depending on the container, some subtraction will be needed.
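Concretely, that subtraction could look something like this (all names hypothetical; an mp4 edit list or a Xing header are just examples of where the values come from):

```ts
// Hypothetical demuxer-side mapping: shift every packet's pts back by the
// encoder delay (the first chunk then goes negative) and shorten the last
// chunk's duration by the trailing padding.
function chunkTiming(
  packetPtsUs: number,
  packetDurationUs: number,
  encoderDelayFrames: number,    // e.g. from an mp4 edit list or a Xing header
  trailingPaddingFrames: number,
  sampleRate: number,
  isLastPacket: boolean
): { timestamp: number; duration: number } {
  const framesToUs = (frames: number) => Math.round((frames * 1e6) / sampleRate);
  return {
    timestamp: packetPtsUs - framesToUs(encoderDelayFrames),
    duration: isLastPacket
      ? packetDurationUs - framesToUs(trailingPaddingFrames)
      : packetDurationUs,
  };
}
```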

Authors who write muxers will be able to look at the timestamps of EncodedAudioChunks and map them to what the container prefers (e.g. this can be the same negative timestamps, or sometimes an explicit encoder delay plus a total media time to derive the padding, etc.).

dalecurtis commented 1 year ago

This is kind of the scheme ffmpeg uses, and while it works it does have some warts:

  • You can't assume start time is zero; e.g., first packet timestamp might be 0 < timestamp < padding.
  • High sample rates will affect accuracy of a timestamp/duration relative to frames trimmed. E.g., 1/768000 falls below microsecond precision.
  • Users may not always want negative timestamps trimmed.

Having an explicit sample count to trim resolves these issues. In general I haven't loved this scheme over the years, but it is mostly tried and true -- so long as there is a precise set of WPT tests for each codec, whatever scheme we choose is fine.

padenot commented 1 year ago
  • You can't assume start time is zero; e.g., first packet timestamp might be 0 < timestamp < padding.

In general, we set the first packet timestamp to be negative (when that indicates the preroll), not something in between 0 and the padding (maybe you meant preroll/encoder delay?).

It's also frequently 0, and it can be anything greater than zero, but that doesn't have a relation with the encoder delay either: it can be any file that's been demuxed / remuxed without re-encoding or any other processing (e.g. part of a live stream). That's a bit broken, because quite a few codecs need this delay, so rendering might be wrong, unless the author actually splits the file a bit before the intended start point and then correctly discards the first few decoded output buffers.

But here Web Codecs will just decode the packets, and it is assumed folks know what they are doing. A demuxer or post-demuxing/pre-decoding pass can also rebase the clock if it knows the intended start point, and offset everything, so it's back to signaling the encoder delay, discarding, and everything is fine again.

If we find EncodedAudioChunks with weird/negative timestamps, because the demuxer is passing the timestamps through without looking at them, anything might happen indeed.

  • High sample rates will affect accuracy of a timestamp/duration relative to frames trimmed. E.g., 1/768000 falls below microsecond precision.

This is technically true, but I'm not sure that it matters, because 768kHz audio won't be in AAC (max 192kHz) or MP3 (max 48kHz) or Opus (always 48kHz) or Vorbis (looks like 192kHz) or anything else but PCM, not even FLAC (max 65535 * 10 Hz, which is 655350Hz). And WAV doesn't have encoder delay / padding.
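For reference, here is the precision point in isolation (illustrative numbers only):

```ts
// Integer-microsecond timestamps can't represent exact frame boundaries at
// very high sample rates, so frame counts stop round-tripping.
const rate = 768_000;                        // frames per second
const oneFrameUs = Math.round(1e6 / rate);   // 1.302… µs rounds to 1 µs
const roundTrip = (oneFrameUs * rate) / 1e6; // 0.768 frames: no longer one frame
console.log(oneFrameUs, roundTrip);
```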

  • Users may not always want negative timestamps trimmed.

If negative timestamps are considered to be pre-roll / padding, then I'm not sure when they'd like to have them, but maybe I'm missing something.

Having pre-roll/padding separately and in sample-frames does indeed resolve some issues, at the expense of requiring all authors to write the same code for most codecs, which I don't find satisfactory, because that won't happen and web apps will generally be worse.

We can also stick two more properties on EncodedAudioChunk and have a facility on AudioData to be able to automatically get the PCM data correctly trimmed (probably with two more attributes). We could make those sample-frame-based and not microsecond-based. During the initial design, we kind of said that for any precise audio work, authors would have to count sample-frames anyway. This would also allow authors to get the PCM audio data that's not to be rendered, although I'm not sure why anybody would want to do that.
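Sketched out, that could look like this (entirely hypothetical names; nothing here is spec'd):

```ts
// Hypothetical sample-frame-based additions: two attributes on the chunk,
// mirrored on AudioData so the trimmed PCM can be obtained automatically.
interface TrimmedEncodedAudioChunk /* additions to EncodedAudioChunk */ {
  readonly prerollFrames: number; // sample-frames to discard from the front
  readonly paddingFrames: number; // sample-frames to discard from the end
}

interface TrimmedAudioData /* additions to AudioData */ {
  readonly frontTrimFrames: number; // decoded frames that aren't to be rendered
  readonly endTrimFrames: number;
}
```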

Also, to answer your last point: I've got a rather extensive test suite for this precise problem (it tests Firefox's AudioContext.decodeAudioData, HTMLMediaElement.duration, and sample-accurate seamless HTMLMediaElement.loop), and it's easily portable to WPT in the sense that it doesn't use any internal testing facility/API.

It's a collection of WAV files generated exactly for this purpose, of known length (sometimes exactly 0.5s, sometimes exactly 1.0s), at various sample rates and channel counts, that loop seamlessly without discontinuity. They were then converted with all the encoders I have access to, for all the codecs we care about, and muxed into all the containers that make sense for those codecs, sometimes with various ways of indicating preroll/padding (depending on the container). It's fairly probable that I end up extending those tests and porting them to WPT while implementing the audio side of Web Codecs.

dalecurtis commented 1 year ago

In general, we set the first packet timestamp to be negative (when that indicates the preroll), not something in between 0 and the padding (maybe you meant preroll/encoder delay?).

I meant we've seen streams where the first packet has discard padding and the pts > 0. If negative timestamps are the only way to express discard, that will require client-side manipulation.

If we find EncodedAudioChunks with weird/negative timestamps, because the demuxer is passing the timestamps through without looking at them, anything might happen indeed.

This will definitely happen :) We still get a couple weird timestamp bugs each quarter.

This is technically true, but I'm not sure that it matters, because 768kHz audio won't be in AAC (max 192kHz) or MP3 (max 48kHz) or Opus (always 48kHz) or Vorbis (looks like 192kHz) or anything else but PCM, not even FLAC (max 65535 * 10 Hz, which is 655350Hz). And WAV doesn't have encoder delay / padding.

Correct, I mention it only as future proofing.

  • Users may not always want negative timestamps trimmed.

If negative timestamps are considered to be pre-roll / padding, then I'm not sure when they'd like to have them, but maybe I'm missing something.

Maybe this is what you meant by padding, but mp4 edit lists may shift arbitrary amounts before zero for discard -- so it's not always just decoder delay.

Having pre-roll/padding separately and in sample-frames does indeed resolve some issues, at the expense of requiring all authors to write the same code for most codecs, which I don't find satisfactory, because that won't happen and web apps will generally be worse.

We can also stick two more properties on EncodedAudioChunk and have a facility on AudioData to be able to automatically get the PCM data correctly trimmed (probably with two more attributes).

Sorry, I wasn't arguing for both separately and in-frame, but rather either in-frame or metadata. My preference would be EncodedAudioChunkMetadata if we want clients to handle it manually, and AudioDecoderConfig if we want the UA to handle things automatically. I pictured this as discardFrames = {fromStart: #, fromEnd: #} on one of those structures.
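Something along these lines (the dictionary and field names are illustrative; only EncodedAudioChunkMetadata and AudioDecoderConfig are real spec structures):

```ts
// Hypothetical shape of the discardFrames suggestion.
interface DiscardFrames {
  fromStart: number; // sample-frames to drop from the start of the decoded output
  fromEnd: number;   // sample-frames to drop from the end
}

// Option A: per chunk, for clients that handle trimming manually.
interface ChunkMetadataWithDiscard /* addition to EncodedAudioChunkMetadata */ {
  discardFrames?: DiscardFrames;
}

// Option B: one-shot on the config, for the UA to handle automatically.
interface DecoderConfigWithDiscard /* addition to AudioDecoderConfig */ {
  discardFrames?: DiscardFrames;
}
```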

I like this approach, since container discard metadata is often specified as a duration / frame count / time range without regard to how many frames each individual packet has. By putting it on the chunk metadata / decoder config, we can avoid clients manually mapping this data onto input chunks (which may require codec-level understanding) or worrying about shifting timestamps across the entire stream.

I think the AudioDecoderConfig approach requires less consideration (one-shot versus per-chunk) and less code than negative timestamps -- which AFAIK aren't common outside of Ogg containers. E.g., MP3 and ADTS use Xing headers, and MP4 uses iTunSMPB or elst.

elst can cut from anywhere, but I feel we already provide the necessary AudioData.copyTo() options to handle such cuts, so I don't see a lot of value in providing per-frame discard annotations.
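For instance, something along these lines already works against the spec'd AudioData.copyTo() options (frameOffset/frameCount/planeIndex/format are real fields; the cut sizes are illustrative):

```ts
// Apply an elst-style cut on the client using the existing AudioData.copyTo().
function copyWithCut(
  data: AudioData,
  frontCutFrames: number,
  endCutFrames: number
): Float32Array {
  const keep = data.numberOfFrames - frontCutFrames - endCutFrames;
  const out = new Float32Array(keep * data.numberOfChannels);
  for (let ch = 0; ch < data.numberOfChannels; ch++) {
    // Copy each channel plane, skipping the discarded frames at both ends.
    data.copyTo(out.subarray(ch * keep, (ch + 1) * keep), {
      planeIndex: ch,
      frameOffset: frontCutFrames,
      frameCount: keep,
      format: "f32-planar",
    });
  }
  return out;
}
```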

Also, to answer your last point: I've got a rather extensive test suite for this precise problem (it tests Firefox's AudioContext.decodeAudioData, HTMLMediaElement.duration, and sample-accurate seamless HTMLMediaElement.loop), and it's easily portable to WPT in the sense that it doesn't use any internal testing facility/API.

Great! That will be really helpful.

dalecurtis commented 1 year ago

Sorry, I just realized fromEnd won't work, since we don't know EOS ahead of time and, unless it's Vorbis, most codecs generate output for each input. So I'd amend my suggested solution to just discardFrontFrames or discardFrontDuration, but acknowledge that this can't round-trip a signal cleanly without the client manually trimming.

I'm also worried about reusing timestamp/duration, since containers are infamous for these fields being imprecise. I think we'd apply trimming more often accidentally than intentionally.

So my vote is still for explicit fields: either both on the chunk (accepting that discardFrontFrames may be larger than a single packet), or discardFrontFrames on the config and discardEndFrames on the chunk, to ensure values are always limited to a single packet (and no cumulative / cancellation effects need to be considered).
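A sketch of that second variant (hypothetical names, not spec'd):

```ts
// Front discard is one-shot on the config; end discard is per chunk so the
// value stays limited to a single packet.
interface DecoderConfigAmended /* addition to AudioDecoderConfig */ {
  discardFrontFrames?: number; // may span several packets' worth of frames
}
interface ChunkInitAmended /* addition to EncodedAudioChunkInit */ {
  discardEndFrames?: number;   // limited to this packet's frames
}
```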

We can also stick two more properties on EncodedAudioChunk and have a facility on AudioData to be able to automatically get the PCM data correctly trimmed (probably with two more attributes).

I didn't follow this point. Why would we need something on AudioData? I assumed we're providing this mechanism on the chunk to automatically remove padding frames; so AudioData would only have the trimmed signal in it.

dalecurtis commented 1 year ago

One new proposal: