w3c / webcodecs

WebCodecs is a flexible web API for encoding and decoding audio and video.
https://w3c.github.io/webcodecs/

Unexpected number of AudioEncoder samples #624

Closed vjeux closed 1 year ago

vjeux commented 1 year ago

I'm trying to re-encode an H.264 video. I got the video part working, but I'm struggling to get the audio working.

I created the smallest end-to-end reproduction, where I re-encode a single audio frame. I would expect the encoder, when passed a single frame, to encode it as a single frame with the same duration, regardless of the codec. But in practice, various codecs / versions of Chrome give different results. Could people with knowledge help me understand whether I'm misunderstanding how audio codecs work, or whether there's a bug in the implementation? Thanks!

Using the opus codec in Chrome production 108.0.5359.124: it gives one frame, but the duration is wrong.

decoded 0 AudioData {format: 'f32-planar', sampleRate: 48000, numberOfFrames: 1024, numberOfChannels: 1, duration: 21333, …}
encoded 0 EncodedAudioChunk {type: 'key', timestamp: 0, byteLength: 279, duration: 60000}

Using the opus codec in Chrome canary 111.0.5522.0: it gives two frames, with a duration that is closer (20000) but still not the input's (21333), which it should be, since (1 / 48000) * 1024 * 1e6 = 21333.33, i.e. (one second / sampleRate) * numberOfFrames.

decoded 0 AudioData {format: 'f32-planar', sampleRate: 48000, numberOfFrames: 1024, numberOfChannels: 1, duration: 21333, …}
encoded 0 EncodedAudioChunk {type: 'key', timestamp: 0, byteLength: 108, duration: 20000}
encoded 20000 EncodedAudioChunk {type: 'key', timestamp: 20000, byteLength: 95, duration: 20000}

Using mp4a.40.2 (or any of the variations listed in the spec) in both the production and canary versions of Chrome: the duration is correct, but it outputs 4 frames instead of one.

decoded 0 AudioData {format: 'f32-planar', sampleRate: 48000, numberOfFrames: 1024, numberOfChannels: 1, duration: 21333, …}
encoded 0 EncodedAudioChunk {type: 'key', timestamp: 0, byteLength: 4, duration: 21333}
encoded 21333 EncodedAudioChunk {type: 'key', timestamp: 21333, byteLength: 157, duration: 21333}
encoded 42666 EncodedAudioChunk {type: 'key', timestamp: 42666, byteLength: 98, duration: 21334}
encoded 64000 EncodedAudioChunk {type: 'key', timestamp: 64000, byteLength: 4, duration: 21333}

Here is the standalone test file I created to reproduce.

<script>
async function f() {
  const audioDecoder = new AudioDecoder({
    async output(audioData) {
      console.log('decoded', audioData.timestamp, audioData);
      audioEncoder.encode(audioData); // AudioEncoder.encode() takes no options; keyFrame only applies to VideoEncoder
      audioData.close();
    },
    error(error) {
      console.error(error);
    }
  });

  audioDecoder.configure({
    codec: 'mp4a.40.02',
    numberOfChannels: 1,
    sampleRate: 48000,
  });

  const audioEncoder = new AudioEncoder({
    output(chunk, metadata) {
      console.log('encoded', chunk.timestamp, chunk);
    },
    error(error) {
      console.error(error);
    }
  });

  audioEncoder.configure({
    codec: 'opus', // Change encoding codec here
    numberOfChannels: 1,
    sampleRate: 48000,
  });

  audioDecoder.decode(new EncodedAudioChunk({
    duration: 21333.333333333332,
    timestamp: 0,
    type: "key",
    data: new Uint8Array([0, 248, 23, 173, 52, 166, 85, 134, 2, 33, 96, 161, 152, 72, 103, 250, 248, 214, 166, 43, 139, 187, 67, 89, 139, 141, 10, 109, 168, 107, 172, 176, 160, 19, 133, 196, 201, 240, 10, 86, 109, 180, 181, 177, 186, 49, 223, 251, 71, 109, 151, 64, 180, 246, 210, 90, 190, 91, 41, 34, 211, 50, 118, 166, 219, 251, 110, 180, 221, 229, 76, 83, 34, 93, 129, 215, 177, 25, 118, 83, 109, 197, 42, 145, 117, 102, 101, 192, 136, 156, 232, 109, 29, 14, 97, 255, 119, 155, 114, 86, 170, 154, 84, 134, 128, 123, 113, 251, 182, 174, 234, 109, 139, 182, 232, 242, 223, 47, 130, 97, 182, 160, 107, 109, 175, 56, 196, 8, 185, 44, 146, 182, 113, 45, 221, 123, 50, 125, 27, 23, 192, 172, 59, 146, 124, 121, 69, 159, 249, 124, 210, 113, 245, 231, 54, 65, 60, 106, 238, 153, 169, 209, 162, 97, 182, 166, 153, 54, 24, 156, 62, 235, 170, 28, 82, 185, 64, 67, 13, 117, 34, 152, 95, 138, 135, 36, 29, 73, 222, 222, 235, 248, 89, 191, 103, 232, 219, 78, 14, 21, 0, 234, 112, 148, 17, 43, 71, 79, 242, 52, 225, 42, 230, 155, 63, 9, 164, 225, 196, 220, 64, 65, 146, 163, 69, 241, 117, 77, 109, 202, 36, 114, 148, 105, 44, 89, 43, 142, 54, 82, 90, 237, 26, 1, 130, 2, 239, 67, 232, 218, 120, 127, 5, 115, 238, 94, 103, 89, 234, 164, 112, 4, 185, 72, 218, 226, 106, 169, 13, 3, 69, 128, 183, 222, 224, 99, 37, 247, 185, 192, 201, 83, 119, 192, 91, 62, 197, 110, 164, 184, 121, 175, 254, 6, 172, 247, 185, 222, 169, 19, 127, 135, 72, 96, 110, 150, 2, 251, 72, 167, 224, 172, 175, 27, 30, 188, 180, 229, 78, 200, 231, 141, 157, 128, 227, 114, 53, 197, 56, 54, 250, 151, 224]),
  }));
  await audioDecoder.flush();
  await audioEncoder.flush();
}
f();
</script>

This is the first audio frame from: mp4_with_sound.mov.zip

vjeux commented 1 year ago

I just discovered that AAC uses a rolling (overlapping) window, so the algorithm can't actually encode the first frame of data one-to-one. The workaround people use is to send a few silent frames at the beginning to "warm up" the encoder. I'll probably have to do this little dance in userland to be able to encode audio with AAC and have the same duration as the input.

https://developer.apple.com/library/archive/documentation/QuickTime/QTFF/QTFFAppenG/QTFFAppenG.html
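
Something like this is the warm-up dance I have in mind (a rough, untested sketch; the 1024-sample priming length is an assumption, since the actual encoder delay varies by implementation):

// Hypothetical warm-up: feed a frame of silence before the real audio so the
// AAC encoder's rolling-window state is primed. PRIMING_SAMPLES = 1024 is an
// assumption; the real encoder delay depends on the implementation.
const PRIMING_SAMPLES = 1024;
const silence = new AudioData({
  format: 'f32-planar',
  sampleRate: 48000,
  numberOfChannels: 1,
  numberOfFrames: PRIMING_SAMPLES,
  timestamp: 0,
  data: new Float32Array(PRIMING_SAMPLES), // zero-filled = silence
});
audioEncoder.encode(silence);
silence.close();
// ...then encode the real AudioData frames, with timestamps shifted by the
// priming duration, and trim that duration back out when muxing.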

dalecurtis commented 1 year ago

I think this is working as intended. If you don't specify a frame duration to the audio encoder (supported on canary), we'll assume a default (60 ms on stable, 20 ms on canary). Chunks you provide are aggregated into a single encoded buffer matching the configured encoding size.

Generally you don't need to worry about one-in, one-out for audio encoding. Just send in the chunks and flush when you're done to get everything that remains.
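
For example, something along these lines on canary (a sketch; opus.frameDuration comes from the WebCodecs Opus codec registration and is expressed in microseconds):

// Explicitly request 20 ms Opus frames instead of relying on the default.
audioEncoder.configure({
  codec: 'opus',
  numberOfChannels: 1,
  sampleRate: 48000,
  opus: { frameDuration: 20000 }, // microseconds
});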

@tguilbert-google to close if everything looks good.

vjeux commented 1 year ago

When encoding the entire audio track, we can see a consistent pattern: the first frame is empty and the third frame is double the size. And at the end, we get 3 more frames whose sizes trail off.

frame #0, encoded size: 4    // first frame is basically empty
frame #1, encoded size: 166
frame #2, encoded size: 359 // third frame is double the size
frame #3, encoded size: 124
frame #4, encoded size: 159
frame #5, encoded size: 164
frame #6, encoded size: 168
frame #7, encoded size: 162
frame #8, encoded size: 161
// ...
frame #479, encoded size: 160
frame #480, encoded size: 158
frame #481, encoded size: 171
frame #482, encoded size: 162
frame #483, encoded size: 161
frame #484, encoded size: 157
frame #485, encoded size: 169 // we encoded 486 frames so it should be the last one
frame #486, encoded size: 173 // but we get more data that's trailing off in terms of size
frame #487, encoded size: 152
frame #488, encoded size: 34

So it looks like this is just how the AAC encoding algorithm works, and I'll have to do extra work on top to get the durations to line up. It reminds me of the video where Marques Brownlee re-encodes the same video over and over and the audio drifts over time; they probably re-encode directly without any workaround for this.
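
In case it helps anyone else, the bookkeeping I'm planning is roughly this (a sketch; how you apply the trim depends on the container, e.g. an MP4 edit list):

// Tally encoded vs. source duration so the priming/trailing AAC frames can be
// accounted for at mux time.
let inputDurationUs = 0;   // microseconds of AudioData fed to the encoder
let encodedDurationUs = 0; // microseconds of EncodedAudioChunks received

const audioEncoder = new AudioEncoder({
  output(chunk, metadata) {
    encodedDurationUs += chunk.duration;
    // ...hand the chunk to the muxer...
  },
  error(error) {
    console.error(error);
  }
});

// For every AudioData sent: inputDurationUs += audioData.duration;
// After flush(), anything beyond the source duration is encoder priming /
// padding: const excessUs = encodedDurationUs - inputDurationUs;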

padenot commented 1 year ago

https://github.com/w3c/webcodecs/issues/626 is related (but decode side).