w3c / webcodecs

WebCodecs is a flexible web API for encoding and decoding audio and video.
https://w3c.github.io/webcodecs/

Byte stream formats #38

Closed sandersdan closed 3 years ago

sandersdan commented 4 years ago

As currently defined, WebCodecs supports packetized codecs, where we expect one decoded frame per encoded chunk. For some codecs (eg. H.264 in Annex B format), it makes sense to use a byte stream instead.

This changes the interface of an encoder or decoder, so it's not a trivial change. It doesn't seem to be compatible with our flush or configure model unless streams gain support for flush.
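For context, the current per-chunk model looks roughly like this (a minimal sketch; the codec string and buffers are placeholders, and each decode() call carries exactly one frame's worth of bytes):

const decoder = new VideoDecoder({
  output: (frame: VideoFrame) => {
    // One VideoFrame is expected per submitted chunk.
    frame.close();
  },
  error: (e: DOMException) => console.error(e),
});

decoder.configure({ codec: "avc1.42001f" });

// `accessUnit` must already be one complete encoded frame; the decoder does
// not scan an unframed byte stream for frame boundaries.
function decodeOne(accessUnit: Uint8Array, timestampUs: number, isKey: boolean) {
  decoder.decode(new EncodedVideoChunk({
    type: isKey ? "key" : "delta",
    timestamp: timestampUs,
    data: accessUnit,
  }));
}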

chcunningham commented 4 years ago

Discussed with @sandersdan; we lean toward won't-fix here. Byte streams still have discrete frame boundaries, so "chunks" can be identified and provided to the codec. The alternatives seem to make the API much more complicated for little benefit.

aboba commented 4 years ago

AV1 also uses a byte stream format. Are you saying that WebCodecs won't be able to support AV1??

sandersdan commented 4 years ago

A different way to say what @chcunningham is saying: if it can be packaged in an MP4, then we do not need a ReadableByteStreamController.

The terms are different between streams and media codecs: a Streams "chunk" can be any size, while a media "chunk" is typically exactly one sample (one encoded frame).

There are unpacketized bitstreams (eg. H.264 Annex B), but I am unaware of any that don't also have standard packetizations.

There is a spectrum of possible implementations in WebCodecs: accept arbitrary chunks, accept only meaningful chunks, or accept only chunks that are exactly one sample.

I prefer the last one because it allows us to accept frame metadata (such as timestamp) alongside the bytestream chunks, but it's conceivable that there exists (or will exist) a format for which this doesn't make sense.

aboba commented 4 years ago

@sandersdan @DanilChapovalov

The AV1 bitstream is packetized as specified in the AV1 RTP payload specification. The AV1 bitstream format uses OBUs (similar to H.264 NAL units), including Temporal Delimiter (TD), Sequence Header (SH), Metadata (MD), Tile Group (TG), and Frame Header (FH) OBUs. As an example, the following bitstream:

TD  SH MD MD(0,0) FH(0,0) TG(0,0) MD(0,1) FH(0,1) TG(0,1)

would typically be packetized as follows:

[ SH MD MD(0,0) FH(0,0) TG(0,0) ] [ MD(0,1) FH(0,1) TG(0,1) ]

This seems like it might qualify as "arbitrary chunks" or "meaningful chunks", but probably not "chunks that are exactly one sample".

sandersdan commented 4 years ago

AV1 is also packetized in chunks that are exactly one sample in the ISO BMFF binding.
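For illustration, a rough sketch of that kind of framing for the low-overhead AV1 bitstream: walk the OBU headers and cut a new chunk at each Temporal Delimiter. It assumes every OBU carries obu_has_size_field = 1 and omits error handling.

const OBU_TEMPORAL_DELIMITER = 2;

// Reads an unsigned LEB128 value (used for obu_size) starting at `pos`.
function readLeb128(data: Uint8Array, pos: number): { value: number; next: number } {
  let value = 0;
  for (let i = 0; i < 8; i++) {
    const byte = data[pos + i];
    value += (byte & 0x7f) * 2 ** (7 * i);
    if ((byte & 0x80) === 0) return { value, next: pos + i + 1 };
  }
  throw new Error("leb128 too long");
}

// Splits a low-overhead AV1 bitstream into temporal units, one per chunk.
function splitTemporalUnits(data: Uint8Array): Uint8Array[] {
  const units: Uint8Array[] = [];
  let start = 0;
  let pos = 0;
  while (pos < data.length) {
    const header = data[pos];
    const obuType = (header >> 3) & 0x0f;
    const hasExtension = (header >> 2) & 1;
    const { value: obuSize, next: payloadStart } = readLeb128(data, pos + 1 + hasExtension);
    if (obuType === OBU_TEMPORAL_DELIMITER && pos > start) {
      units.push(data.subarray(start, pos)); // close the previous temporal unit
      start = pos;
    }
    pos = payloadStart + obuSize;
  }
  if (pos > start) units.push(data.subarray(start, pos));
  return units;
}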

sandersdan commented 4 years ago

Also worth noting that a decision here could affect #13, and theoretical future video formats that support progressive decoding.

My gut instinct is that progressive decoding is for still images and should be a separate API, but I'd like to understand that design space better.

sandersdan commented 4 years ago

And one more note: for low-latency streams, it may be beneficial to submit slices/tiles to the decoder individually as they arrive from the network, and likewise for the encoder to emit them individually as they are produced. (So 'meaningful chunks'.)

If we support that, it's important to make sure we don't also make muxing harder for less latency-sensitive cases. A 'partial' flag for input and output chunks may be enough (and could be added in a v2).

murillo128 commented 4 years ago

I would be against byte-stream progressive decoding, that is, feeding the decoder a byte stream without explicit boundaries (boundaries may be inline, as in H.264 with the NAL start code "00 00 01") and letting the decoder decide where the relevant start/end bytes of each decodable chunk are.

I think the real question is whether we serialize the encoding units that the encoder produces (the group of NALs in H.264, OBUs in AV1, or partitions in VP8) into a byte array (i.e. the byte stream format), or whether we just output an array of chunks so the app can packetize them at will.

Note that encoders typically provide the latter; for example, in VP8 you encode the frame and then retrieve each partition: https://github.com/webmproject/libvpx/blob/master/examples/simple_encoder.c#L124

  const vpx_codec_err_t res =
      vpx_codec_encode(codec, img, frame_index, 1, flags, VPX_DL_GOOD_QUALITY);
  if (res != VPX_CODEC_OK) die_codec(codec, "Failed to encode frame");

  /* Drain every output packet the encoder produced for this call. */
  while ((pkt = vpx_codec_get_cx_data(codec, &iter)) != NULL) {
    got_pkts = 1;

    if (pkt->kind == VPX_CODEC_CX_FRAME_PKT) {
      const int keyframe = (pkt->data.frame.flags & VPX_FRAME_IS_KEY) != 0;
      if (!vpx_video_writer_write_frame(writer, pkt->data.frame.buf,
                                        pkt->data.frame.sz,
                                        pkt->data.frame.pts)) {
        die_codec(codec, "Failed to write compressed frame");
      }
      printf(keyframe ? "K" : ".");
      fflush(stdout);
    }
  }

x264 does the same, providing the array of NALs as output of x264_encoder_encode.

There are pros and cons to doing it this way (which would also affect what we accept as input to the decoder).

The good part is that, given the individual encoding units (NALs/OBUs/partitions), it is easier to convert them to any frame-based stream format (for example, H.264 Annex B) and easier to do RTP packetization (otherwise you would typically have to parse the byte stream to find the NALs/OBUs and apply packetization afterward).

The bad part is that this requires the serialization to be done on the app side before passing the data to the appropriate transport (WebRTC could be different, since packetization would be done inside it).
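As a rough sketch of that serialization step: joining raw NAL units (without start codes, e.g. straight from the encoder) into an H.264 Annex B buffer is just a matter of prefixing each with a start code.

// Joins raw NAL units into an H.264 Annex B byte stream by prefixing each
// unit with a 4-byte start code (0x00000001). The input array is assumed to
// hold the NAL units of exactly one access unit, without start codes.
function toAnnexB(nalUnits: Uint8Array[]): Uint8Array {
  const startCode = new Uint8Array([0, 0, 0, 1]);
  const total = nalUnits.reduce((sum, nal) => sum + startCode.length + nal.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const nal of nalUnits) {
    out.set(startCode, offset);
    offset += startCode.length;
    out.set(nal, offset);
    offset += nal.length;
  }
  return out;
}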

murillo128 commented 4 years ago

Also, as a side note, SVC codecs (like VP9) produce several "frames" per input video frame, so it would not be easy to produce a single chunk from the encoder.

Tauka commented 3 years ago

Hello! For H.264, does this mean we have to group NAL units ourselves before creating an EncodedVideoChunk? Currently I am receiving individual NAL units from an RTSP stream and trying to figure out the correct way to decode them via the WebCodecs API. I would appreciate any help.

sandersdan commented 3 years ago

That's correct.

If your source is not framed then you will need to identify access unit boundaries. If your source includes AUD (Access Unit Delimiter) units then that's quite easy (break right before each AUD). It's also relatively easy if you know there is only one slice per frame and no redundant or auxiliary slices (break after each slice). Beyond that you'll probably want to read the H.264 spec.
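A rough sketch of the AUD approach, assuming NAL units arrive one at a time without start codes (as RTP/RTSP depacketization usually delivers them) and that the stream actually contains AUDs:

// Access Unit Delimiter NAL unit type in H.264.
const AUD = 9;

// Groups incoming NAL units into access units by breaking right before
// each AUD. Assumes NAL units arrive without start codes.
class AccessUnitAssembler {
  private pending: Uint8Array[] = [];

  // Returns the previous access unit (as a list of NAL units) when a new
  // AUD arrives, otherwise null.
  push(nal: Uint8Array): Uint8Array[] | null {
    const nalType = nal[0] & 0x1f;
    let finished: Uint8Array[] | null = null;
    if (nalType === AUD && this.pending.length > 0) {
      finished = this.pending;
      this.pending = [];
    }
    this.pending.push(nal);
    return finished;
  }
}

Each completed group can then be serialized with start codes (an H.264 decoder configured without a description expects Annex B data) and passed as the data of a single EncodedVideoChunk.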

sandersdan commented 3 years ago

Note for VP9 spatial SVC: My current understanding is that the several encoded frames should in fact be handled as separate chunks, but sharing the same timestamp. There is an asymmetry here; for encoding you should only be passing in the highest-resolution version of each frame.

I expect our encoders will output multiple chunks (one for each resolution) but they will have the same timestamp.

I still need to do some research to figure out if it's technically valid to bundle them into a single chunk. (Presumably libvpx is/would already be bundling them like that if it's valid.)
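If encoders do emit one chunk per spatial layer, a rough sketch of regrouping them on the output side by timestamp (the expected layer count is something the app would know from its own configuration):

// Collects encoder output chunks that share a timestamp, on the assumption
// that spatial layers of the same source frame are emitted as separate
// chunks with identical timestamps. `layersPerFrame` is app knowledge.
function makeOutputCollector(
  layersPerFrame: number,
  onFrameComplete: (chunks: EncodedVideoChunk[]) => void,
) {
  const byTimestamp = new Map<number, EncodedVideoChunk[]>();
  return (chunk: EncodedVideoChunk) => {
    const group = byTimestamp.get(chunk.timestamp) ?? [];
    group.push(chunk);
    byTimestamp.set(chunk.timestamp, group);
    if (group.length === layersPerFrame) {
      byTimestamp.delete(chunk.timestamp);
      onFrameComplete(group);
    }
  };
}

The returned function could be used directly as a VideoEncoder output callback; the metadata argument would simply be ignored here.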

chcunningham commented 3 years ago

To the core issue of slices/tiles vs 'meaningful' chunks: Chrome's longstanding behavior has been 'meaningful' chunks, and this has been demonstrated to work great for a variety of use cases (RTC, low-latency streaming, video editing, etc.). If slices/tiles are later desired, we should do this without breaking the API (e.g. specified as an option in VideoDecoderConfig, for which the default is 'meaningful' chunks). Hence I've marked the issue as 'extension'.

Having said that, we've had no real demand for this from users and I vote to just close the issue until demand arrives. @sandersdan WDYT?

> If your source is not framed then you will need to identify access unit boundaries. If your source includes AUD (Access Unit Delimiter) units then that's quite easy (break right before each AUD). It's also relatively easy if you know there is only one slice per frame and no redundant or auxiliary slices (break after each slice). Beyond that you'll probably want to read the H.264 spec.

The codec registry should document this. Work tracked in #155.

> I still need to do some research to figure out if it's technically valid to bundle them into a single chunk. (Presumably libvpx is/would already be bundling them like that if it's valid.)

We discussed this more with SVC folks and learned that separate chunks is how it's done.

sandersdan commented 3 years ago

Closing is acceptable to me. Even if there is demand, breaking a stream into chunks may fit better in a containers API anyway.