Right now `retina::codec::VideoFrame` implements `Buf` and supports only H.264. It always has two chunks: an AVC length prefix and the header+body of the (single) picture NAL (note: I'll need to extend it for multiple NALs). The idea was to support zero-copy, but I think right now it's kind of the worst of all worlds:

*   if the NAL is fragmented (most of the bytes, and maybe even most of the NALs, on a typical stream), it copies it all into a `BytesMut` with a guesstimated size. It might copy again to grow partway through, and then probably (I haven't looked at profiles) copies it again on `BytesMut::freeze` due to tokio-rs/bytes#401.
*   if the caller wants a single contiguous buffer, they'll end up copying it themselves (maybe with `bytes::Buf::copy_to_bytes`, maybe not).
*   you can only iterate through it once, which might be a problem for some usages.
`AudioFrame` and `MessageFrame` provide a `Bytes` directly; there's no good reason the three frame types shouldn't all present their data in the same way.
I'd prefer to follow one of two paths; I haven't made up my mind which:

1.  Truly support zero-copy via `data(&self) -> impl Buf + '_`. `VideoFrame` needs a `Vec` of chunks. If you want to iterate multiple times, you can just call `data` as many times as you want. You can use `Buf::chunks_vectored` + `std::io::Write::write_vectored` / `tokio::io::AsyncWriteExt::write_vectored`.
2.  Put everything into a single buffer. Accumulate `Bytes` in a reused `Vec` in `Depacketizer::push`, then concatenate them during `Depacketizer::pull` when the total size is known, avoiding the extra copies mentioned above.
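A minimal sketch of path 2, with `Vec<u8>` standing in for `Bytes` so the example is dependency-free (the real code would push `Bytes` and `freeze()` a `BytesMut`), and with a stripped-down `Depacketizer` that is only a stand-in for retina's actual type:

```rust
// Sketch of path 2: accumulate packet payloads during `push`, then copy
// once in `pull` when the total length is known. `Vec<u8>` stands in for
// `bytes::Bytes` to keep this stdlib-only.
struct Depacketizer {
    pieces: Vec<Vec<u8>>, // reused across frames; draining keeps capacity
}

impl Depacketizer {
    fn push(&mut self, pkt: Vec<u8>) {
        self.pieces.push(pkt);
    }

    fn pull(&mut self) -> Vec<u8> {
        // The total size is known only now, so a single allocation and a
        // single copy suffice -- no guesstimated capacity, no regrowth.
        let len = self.pieces.iter().map(Vec::len).sum();
        let mut out = Vec::with_capacity(len);
        for p in self.pieces.drain(..) {
            out.extend_from_slice(&p);
        }
        out
    }
}
```

The point of the reused `Vec` is that the per-frame allocation churn is limited to the one output buffer; the list of pieces keeps its capacity between frames.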
Arguments in favor of zero-copy (a custom `Buf` implementation with multiple chunks):

*   People like zero-copy; it's generally assumed to be more efficient (but see below).
*   One neat API trick this would allow (for H.264) is selecting the Annex B encoding (`00 00 00 01` between NALs) or the AVC encoding (length prefix before each NAL) via something like `h264(&self) -> Option<&H264Frame>` then `data_as_annexb(&self) -> impl Buf` or `data_as_avc(&self, len_prefix_size: usize) -> impl Buf`. With the single-buffer approach I'd probably make folks choose when setting up the depacketizer instead. (I guess it'd also be fairly efficient to have a `&mut` accessor on the `VideoFrame` which does a one-way conversion from 4-byte AVC to Annex B, but that's a weird API.) I can imagine someone doing some fan-out thing where they actually want both encodings.
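For what that one-way `&mut` conversion might look like: a hypothetical in-place rewrite of 4-byte AVC length prefixes into Annex B start codes (the function name and shape are illustrative, not retina's API). It works in place precisely because a 4-byte length prefix and the 4-byte start code are the same size:

```rust
// Hypothetical helper: rewrite each 4-byte big-endian AVC length prefix
// into the Annex B start code 00 00 00 01, in place. Only valid when the
// buffer was written with 4-byte length prefixes.
fn avc_to_annex_b(data: &mut [u8]) {
    let mut i = 0;
    while i + 4 <= data.len() {
        let nal_len =
            u32::from_be_bytes([data[i], data[i + 1], data[i + 2], data[i + 3]]) as usize;
        data[i..i + 4].copy_from_slice(&[0, 0, 0, 1]);
        i += 4 + nal_len; // skip the NAL body to reach the next prefix
    }
}
```

Note this only goes one direction: start codes carry no length, so converting back would require re-parsing NAL boundaries.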
Arguments in favor of a single `Bytes` or `Vec<u8>`:

*   More convenient / simpler to get right, IMHO. The zero-copy APIs seem half-baked/finicky; e.g., tokio's `tokio::io::AsyncWriteExt::write_buf` writes only the first chunk, and there's no stable `write_all_vectored` in either `std::io::Write` or `tokio::io::AsyncWriteExt`.
*   It might actually be more efficient. The answer isn't obvious to me. With my IP cameras, the IDR frames can be half a megabyte, fragmented across hundreds of RTP packets of 1400ish bytes. Just the `&[Bytes]` is then tens of kilobytes (four pointers per element): far too big for the no-alloc path of a `SmallVec<[Bytes; N]>`. And if someone's doing a `writev` call later, they have to set up/iterate through hundreds of `IoSlice`s to write the whole thing at once. (And you can't even reuse a `Vec<IoSlice>` between iterations without trickery, because it expects to operate on a mutable slice of initialized `IoSlice` objects.) It wouldn't surprise me if zero-copy is actually slower.
*   I'm not sure, but I think `Bytes` is mostly just a tokio thing; async-std folks might use plain `Vec<u8>` or something instead. (Although I'll likely keep using `Bytes` internally anyway, to keep the individual packets around between `push` and `pull`.)
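To illustrate the finicky part: a sketch of the write-everything-with-writev loop a caller would need. The `IoSlice` list has to be rebuilt after every partial write because `write_vectored` gives no way to resume mid-list (the function here is mine, not a real API):

```rust
use std::io::{self, IoSlice, Write};

// What callers must do themselves absent a stable write_all_vectored:
// loop on write_vectored, advancing past whatever was written and
// rebuilding the IoSlice list each iteration.
fn write_all_chunks<W: Write>(w: &mut W, chunks: &[&[u8]]) -> io::Result<()> {
    let mut chunks: Vec<&[u8]> =
        chunks.iter().copied().filter(|c| !c.is_empty()).collect();
    while !chunks.is_empty() {
        // IoSlice borrows the current chunk slices, which change after a
        // partial write, so this Vec can't simply be reused.
        let slices: Vec<IoSlice<'_>> = chunks.iter().map(|c| IoSlice::new(c)).collect();
        let mut n = w.write_vectored(&slices)?;
        if n == 0 {
            return Err(io::ErrorKind::WriteZero.into());
        }
        // Drop fully written chunks; trim the partially written one.
        while n > 0 {
            if n >= chunks[0].len() {
                n -= chunks[0].len();
                chunks.remove(0);
            } else {
                chunks[0] = &chunks[0][n..];
                n = 0;
            }
        }
    }
    Ok(())
}
```

With hundreds of chunks per IDR frame, that's hundreds of `IoSlice`s constructed per write attempt, which is the overhead being weighed against the single copy of the one-buffer approach.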
Right now I'm leaning toward a single `Bytes`. I might try benchmarking both, but if the performance is close, I think simplicity should win.