errors when feeding VStarcam data to ffmpeg

As discussed in #13, connecting to a VStarcam camera and feeding its frames to ffmpeg produced ffmpeg errors. From discussion there:

@lucaszanella wrote:

However, in my app, while frame production works, when I pass retina::codec::VideoFrame::data().borrow() to the ffmpeg NAL unit parser, it sometimes rejects the data and sometimes parses it and sends it to the decoder, which produces

[h264 @ 0x559374083fc0] non-existing PPS 16 referenced
[h264 @ 0x559374083fc0] Invalid NAL unit 0, skipping.
[h264 @ 0x559374083fc0] Invalid NAL unit 0, skipping.
[h264 @ 0x559374083fc0] Invalid NAL unit 0, skipping.
[h264 @ 0x559374083fc0] Invalid NAL unit 0, skipping.
[h264 @ 0x559374083fc0] no frame!

Here's one VideoFrame:

[2021-07-27T19:52:20Z INFO liborwell::rtsp::retina_client] video frame: VideoFrame { timestamp: 189124 (mod-2^32: 189124), npt 2.061, start_ctx: RtspMessageContext { pos: 42441, received_wall: WallTime(Timespec { sec: 1627415540, nsec: 361449153 }), received: Instant { tv_sec: 421014, tv_nsec: 736674930 } }, end_ctx: RtspMessageContext { pos: 42441, received_wall: WallTime(Timespec { sec: 1627415540, nsec: 361449153 }), received: Instant { tv_sec: 421014, tv_nsec: 736674930 } }, loss: 0, new_parameters: None, is_random_access_point: false, is_disposable: false, data_len: 383 }

I wrote:

I haven't tried feeding video directly from retina to ffmpeg yet, but in principle it should work. The frames should be fine to pass to ffmpeg. How are you setting up the stream with ffmpeg? You'll likely need to pass it the extra_data from VideoParameters.

The log messages from ffmpeg suggest it's not seeing a valid stream: NAL unit types should never be 0, and I think it's rare for the PPS id to be 16 rather than 0. But maybe the problem is just that without the extra data, ffmpeg is expecting a stream in Annex B format, and my code is passing it instead in AVC format. (The former means that NAL units are separated by the bytes 00 00 01, and the latter means that each NAL unit is preceded by its length in bytes as a big-endian number which can be 1, 2, or 4 bytes long. My code uses 4 bytes.) If you prefer to get Annex B data, it'd be possible to add a knob to retina to tell it that. Or conversion isn't terribly difficult: you can scan through NAL units and change the prefix to be 00 00 01.
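As a rough illustration of the conversion described above, here is a minimal Rust sketch. `avc_to_annex_b` is a hypothetical helper, not part of retina's API. It assumes 4-byte big-endian length prefixes, and it writes the 4-byte start code 00 00 00 01 (equally valid in Annex B) so that the output has the same size as the input.

```rust
/// Converts one access unit from 4-byte length-prefixed (AVC) form to
/// Annex B form. Hypothetical helper for illustration only.
/// Returns `None` if a length prefix is truncated or runs past the buffer.
fn avc_to_annex_b(avc: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(avc.len());
    let mut rest = avc;
    while !rest.is_empty() {
        if rest.len() < 4 {
            return None; // truncated length prefix
        }
        let (prefix, after) = rest.split_at(4);
        let len = u32::from_be_bytes([prefix[0], prefix[1], prefix[2], prefix[3]]) as usize;
        if after.len() < len {
            return None; // NAL claims more bytes than are available
        }
        let (nal, tail) = after.split_at(len);
        out.extend_from_slice(&[0, 0, 0, 1]); // start code replaces the length
        out.extend_from_slice(nal);
        rest = tail;
    }
    Some(out)
}
```

Because the prefix and the start code are both 4 bytes here, the same rewrite could also be done in place on a mutable buffer.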

I suppose I could add a retina example that decodes with ffmpeg into raw images or something. What ffmpeg crate are you using?

When you don't get the "packet follows marked packet with same timestamp" error, have you tried saving a .mp4 and playing it back in your favorite video player? Does it work?

@lucaszanella wrote:

For ffmpeg I'm using https://github.com/lucaszanella/rust-ffmpeg-1 which uses https://github.com/lucaszanella/rust-ffmpeg-sys-1 (this one is not needed, I just added some vdpau linking stuff, the original could be used). I had to modify rust-ffmpeg-1 to add support for ffmpeg's av_parser_parse2, which parses the individual NAL units. The original project does not have this and he doesn't want to maintain it. My patch is very experimental.

I've never needed to pass additional parameters to ffmpeg, just the nal units. I extracted the h264 bitstream from a big buck bunny .mp4 file and passed to ffmpeg calling av_parser_parse2 to break into individual nal units and then passed those units using avcodec_send_packet and it works. The same process is not working for retina. When my code used to be all C++, I used to pass the output of ZLMediaKit to ffmpeg in this way also and it worked.

Even though av_parser_parse2 has the option to pass pts, dts, etc., I never used them, but I'll read more about these parameters.

VideoParameters debug:

Some(Video(VideoParameters { rfc6381_codec: "avc1.4D002A", pixel_dimensions: (1920, 1080), pixel_aspect_ratio: None, frame_rate: Some((2, 15)), extra_data: Length: 41 (0x29) bytes
0000:   01 4d 00 2a  ff e1 00 1a  67 4d 00 2a  9d a8 1e 00   .M.*....gM.*....
0010:   89 f9 66 e0  20 20 28 00  00 03 00 08  00 00 03 00   ..f.  (.........
0020:   7c 20 01 00  04 68 ee 3c  80                         | ...h.<. }))

I've sent you a dump of the camera via email.

If you prefer to get Annex B data, it'd be possible to add a knob to retina to tell it that. Or conversion isn't terribly difficult: you can scan through NAL units and change the prefix to be 00 00 01.

do you have experience in which types the rtsp clients out there do these things? I've never taken a deep look at how ZLMediaKit does it; I simply used it, and now I'm getting deeper into RTSP/RTP/h264/etc. because Rust had no RTSP clients, so I had to make one.

This is how I extracted the big buck bunny to make it work:

ffmpeg -i BigBuckBunny_512kb.mp4 -vbsf h264_mp4toannexb -vcodec copy -an big_buck_bunny_1280_720.h264

as you see by h264_mp4toannexb, it's as you supposed.

May I know why you use the AVC format in your code? Isn't the Annex B proper for streaming?

I've never needed to pass additional parameters to ffmpeg, just the nal units.

H.264 decoders need to have the parameter sets (SPS and PPS) to work. RFC 6184 says that parameter sets can be passed "in-band" (meaning as part of the RTP data), "out-of-band" (in the SDP of the DESCRIBE), or both. In your camera's case, it appears to be just out-of-band. In particular, when I look at the packet dump you sent me with Wireshark, I see that the very first RTP data is an IDR slice NAL (at packet 92), not an SPS or PPS NAL. So the out-of-band data is important.

Currently, retina doesn't copy out-of-band parameter sets into the VideoFrame's data. They're just part of the VideoParameters. So with your camera, just giving ffmpeg the video frame data isn't going to be enough, even ignoring the AVC vs Annex B format thing. I would just use ffmpeg's extra_data for this; that's what it's meant for. (Although I suppose you could instead copy it into the beginning of each IDR frame's data.)
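To make the out-of-band parameter sets concrete, the following Rust sketch pulls the SPS and PPS NALs out of an AVC-style extra_data blob such as the 41-byte one dumped earlier in this thread. `parse_avc_config` is a hypothetical helper, not a retina or ffmpeg function. It assumes the standard layout: a 5-byte header, a byte whose low 5 bits give the SPS count, each SPS preceded by a 2-byte big-endian length, then a PPS count byte and the PPS entries in the same form.

```rust
/// Extracts the SPS and PPS NAL units from an AVC configuration record.
/// Hypothetical helper for illustration; returns `None` on malformed input.
fn parse_avc_config(rec: &[u8]) -> Option<(Vec<&[u8]>, Vec<&[u8]>)> {
    // Reads `count` NALs, each preceded by a 2-byte big-endian length,
    // and returns them together with the unread remainder.
    fn read_nals<'a>(mut buf: &'a [u8], count: usize) -> Option<(Vec<&'a [u8]>, &'a [u8])> {
        let mut nals = Vec::with_capacity(count);
        for _ in 0..count {
            if buf.len() < 2 {
                return None;
            }
            let len = u16::from_be_bytes([buf[0], buf[1]]) as usize;
            nals.push(buf.get(2..2 + len)?);
            buf = &buf[2 + len..];
        }
        Some((nals, buf))
    }

    // Byte 0 is the configuration version, which must be 1.
    if rec.len() < 6 || rec[0] != 1 {
        return None;
    }
    let num_sps = (rec[5] & 0x1f) as usize; // low 5 bits of byte 5
    let (sps, rest) = read_nals(&rec[6..], num_sps)?;
    let (&num_pps, rest) = rest.split_first()?;
    let (pps, _) = read_nals(rest, num_pps as usize)?;
    Some((sps, pps))
}
```

Applied to the dump above, this yields one 26-byte SPS starting with 67 and one 4-byte PPS, 68 ee 3c 80.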

I don't know if other RTSP clients copy parameters into the frame. I don't have any cameras that only do out-of-band data so I haven't tried this scenario.

do you have experience in which [format: Annex B or AVC] the rtsp clients out there do these things?

I know ffmpeg's built-in RTSP client uses Annex B. Not sure about others.

May I know why you use the AVC format in your code? Isn't the Annex B proper for streaming?

The format you need depends on what you're doing with it. I'm using retina to ultimately produce .mp4 files, so I need the AVC format. And it's more expensive to convert from Annex B to AVC than the reverse. To find NAL boundaries in Annex B, you have to scan through all the data for that sequence. To find them in AVC, you can just skip n bytes ahead.
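The cost difference can be seen in a small Rust sketch of the Annex B side. Both functions are hypothetical helpers for illustration, not retina APIs. `split_annex_b` has to examine the data byte by byte looking for 00 00 01, whereas a length-prefixed reader can jump straight from one NAL to the next.

```rust
/// Splits an Annex B byte stream into NAL units by scanning for 00 00 01.
/// Hypothetical helper; note that it must look at every byte of the input.
fn split_annex_b(data: &[u8]) -> Vec<&[u8]> {
    // Trailing zero bytes are padding or the leading zero of a following
    // 4-byte start code (00 00 00 01); a NAL unit does not end in 00.
    fn trim_zeros(mut nal: &[u8]) -> &[u8] {
        while nal.last() == Some(&0) {
            nal = &nal[..nal.len() - 1];
        }
        nal
    }

    let mut nals = Vec::new();
    let mut start: Option<usize> = None; // first byte of the current NAL
    let mut i = 0;
    while i + 3 <= data.len() {
        if data[i..i + 3] == [0, 0, 1] {
            if let Some(s) = start {
                nals.push(trim_zeros(&data[s..i]));
            }
            start = Some(i + 3);
            i += 3;
        } else {
            i += 1;
        }
    }
    if let Some(s) = start {
        nals.push(trim_zeros(&data[s..]));
    }
    nals
}

/// Converts Annex B data to AVC form with 4-byte big-endian length prefixes.
fn annex_b_to_avc(data: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(data.len() + 4);
    for nal in split_annex_b(data) {
        out.extend_from_slice(&(nal.len() as u32).to_be_bytes());
        out.extend_from_slice(nal);
    }
    out
}
```

This sketch relies on H.264 emulation prevention, which keeps the sequence 00 00 01 from appearing inside a NAL unit's payload.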

I'm pretty sure you can convince ffmpeg to accept either format. But again we also could add a knob for Retina to output either. Maybe via an extra parameter to Session::setup to configure the stream.

Look at https://github.com/FFmpeg/FFmpeg/blob/master/doc/examples/decode_video.c#L162 ... It's an official example, it's where I based my examples from. As you see, it does not pass PTS or DTS, and it simply passes the raw buffer (I presume, a raw chunk of an mp4 file) to av_parser_parse2,

That example hardcodes MPEG1_VIDEO, not H.264. I don't see a great example in that directory. qsvdec.c at least is for H.264 and populates extradata, although it's for some older Intel hardware acceleration API and has a lot of cruft to deal with that.

out-of-band parameters should be the DTS and PTS being delivered in RTSP instead of inside nal units, am I right?

No, "parameters" refers to the SPS and PPS. Decoding won't work without them, so either they must be before the first slice NAL (actual encoded part of a picture) of the first frame or in extra data.

I'm reasonably sure the timestamps (DTS and PTS) don't matter to decoding. ffmpeg just copies them from its input to its output.

Could it be that these are present in mp4 but not in retina, so that's why ffmpeg does not work? Remember that it also worked in my big buck bunny video where I extracted annexb from an mp4.

In a .mp4, parameters are always supposed to be out of band within the stsd box, which is the ISO 14496-12 equivalent of ffmpeg's extradata field. They are sometimes unnecessarily present in-band in the actual frame data, but that's not the case in BigBuckBunny_512kb.mp4. If you load it into https://gpac.github.io/mp4box.js/test/filereader.html you can see from the first moov.trak.mdia.minf.stbl.stco box that the first video chunk is at byte position 344307 and from that box's sibling stsz that the first video frame is 855 bytes long. You can view that with e.g. xxd -s 344307 -l 855 BigBuckBunny_512kb.mp4. It has the following NALs:

length 0000 01ec, header 06 (nal_ref_idc 0, nal_type SEI)
length 0000 0163, header 65 (nal_ref_idc 3, nal_type slice layer without partitioning idr)

so no SPS or PPS.
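The header bytes listed above decode as follows: the top bit is forbidden_zero_bit, the next two bits are nal_ref_idc, and the low five bits are the NAL unit type. A minimal Rust sketch (both function names are hypothetical, chosen for illustration):

```rust
/// Decodes a one-byte H.264 NAL header into (nal_ref_idc, nal_unit_type).
/// Returns `None` if forbidden_zero_bit is set.
fn parse_nal_header(header: u8) -> Option<(u8, u8)> {
    if header & 0x80 != 0 {
        return None; // forbidden_zero_bit must be 0
    }
    Some(((header >> 5) & 0x03, header & 0x1f))
}

/// Names for a few common NAL unit types.
fn nal_type_name(nal_unit_type: u8) -> &'static str {
    match nal_unit_type {
        1 => "non-IDR slice",
        5 => "IDR slice",
        6 => "SEI",
        7 => "SPS",
        8 => "PPS",
        9 => "access unit delimiter",
        _ => "other",
    }
}
```

Note that a 00 byte decodes to type 0. That is consistent with, though not proof of, ffmpeg's "Invalid NAL unit 0" messages coming from bytes of a length prefix being read as NAL headers.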

But the ffmpeg -i BigBuckBunny_512kb.mp4 -vbsf h264_mp4toannexb -vcodec copy -an big_buck_bunny_1280_720.h264 conversion command you're using copies them into the frame data. See h264_mp4toannexb_filter:

            /* If this is a new IDR picture following an IDR picture, reset the idr flag.
             * Just check first_mb_in_slice to be 0 as this is the simplest solution.
             * This could be checking idr_pic_id instead, but would complexify the parsing. */
            if (!new_idr && unit_type == H264_NAL_IDR_SLICE && (buf[1] & 0x80))
                new_idr = 1;

            /* prepend only to the first type 5 NAL unit of an IDR picture, if no sps/pps are already present */
            if (new_idr && unit_type == H264_NAL_IDR_SLICE && !sps_seen && !pps_seen) {
                if (ctx->par_out->extradata)
                    count_or_copy(&out, &out_size, ctx->par_out->extradata,
                                  ctx->par_out->extradata_size, -1, j);
                new_idr = 0;
            /* if only SPS has been seen, also insert PPS */
            } else if (new_idr && unit_type == H264_NAL_IDR_SLICE && sps_seen && !pps_seen) {
                if (!s->pps_size) {
                    LOG_ONCE(ctx, AV_LOG_WARNING, "PPS not present in the stream, nor in AVCC, stream may be unreadable\n");
                } else {
                    count_or_copy(&out, &out_size, s->pps, s->pps_size, -1, j);
                }
            }

so your .h264 file (raw Annex B, no place to put out-of-band parameters) ends up like this, with 000001 giving the boundary between NALs:

00000000: 0000 0001 0605 ffe8 dc45 e9bd e6d9 48b7  .........E....H.
...
000001f0: 0000 0001 6742 c00d ab40 d0fd ff80 1400  ....gB...@......
...
00000210: 0000 0168 ce32 c800 0001 6588 8200 1f5f  ...h.2....e...._

so now there's the SEI (0000 0001 06...), the SPS (0000 0001 67... is nal_ref_idc 3, nal_type sps), the PPS (0000 0168 is nal_ref_idc 3, nal_type pps), and then the slice layer (00 0001 65...).

Your finalize_access_unit places the length before the NAL unit and possibly the extradata (couldn't see where it places some extradata). On mine, I'd just place the 0001:

Yeah, that's the spot, and I think we could select AVC vs Annex B via a per-stream option passed to Session::setup.

possibly the extradata (couldn't see where it places some extradata)

With AVC, the extra_data should be an AVCDecoderConfigurationRecord (ISO/IEC 14496-15), which is a pain to construct. It's passed along here:

https://github.com/scottlamb/retina/blob/59e513c9be90afa52c907839fa0f6c5ceb7fe61c/src/codec/h264.rs#L726

Annex B's actually easier. The extra_data is just supposed to be 00 00 01 sps 00 00 01 pps.
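A minimal Rust sketch of that Annex B layout, assuming the caller already has the raw SPS and PPS NAL bytes (without any length prefix or start code). `annex_b_extradata` is a hypothetical name, not an existing retina or ffmpeg function.

```rust
/// Builds Annex B style extra data: 00 00 01 <sps> 00 00 01 <pps>.
/// Hypothetical helper for illustration only.
fn annex_b_extradata(sps: &[u8], pps: &[u8]) -> Vec<u8> {
    const START_CODE: [u8; 3] = [0, 0, 1];
    let mut out = Vec::with_capacity(sps.len() + pps.len() + 2 * START_CODE.len());
    out.extend_from_slice(&START_CODE);
    out.extend_from_slice(sps);
    out.extend_from_slice(&START_CODE);
    out.extend_from_slice(pps);
    out
}
```

The same bytes could alternatively be prepended to each IDR frame's data for a decoder that is given no extra data at all.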

nvdec does not accept a bitstream of nal units.

I've never used nvdec before, but I see their docs say "After de-muxing and parsing, the client can submit the bitstream which contains a frame or field of data to hardware for decoding." That sounds like the definition of access unit, what Retina is already doing.

We could output single NALs, yes, but I'm a little afraid of confusing folks by making it unclear what forms you see when. It's not hard for the caller to break an access unit down into NALs in terms of coding or (particularly for the length-prefixed AVC form) efficiency.

Doesn't that mean passing one nal unit at a time? I remember I had to do that.

Only sometimes. There can be more than one slice NAL per frame (encoder's choice) and also SPS+PPS NALs, SEI NALs, etc.

From my quick look at NVDEC's docs, I think what Retina is doing now matches what NVDEC is expecting, with the likely exception of AVC vs Annex B encoding. If it doesn't work for some reason, then we can make changes, but speculation isn't a good way to design an easy-to-use API.

Wouldn't this be redundant as retina parses the nal units to construct a stream of nal units in AVCC format?

It's not the same thing.

Retina's RTP H.264 depacketization logic is messy because it has to understand fragmentation+aggregation packet types, deal with packet loss, and handle bugs in different IP camera models. Callers shouldn't and don't have to deal with this.

Reading a length and then that number of bytes is more straightforward.
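For comparison, here is roughly what that caller-side split looks like in Rust for the length-prefixed form. `split_avc` is a hypothetical helper, not something retina provides, and it assumes 4-byte big-endian length prefixes.

```rust
/// Splits a length-prefixed (AVC) access unit into its NAL units.
/// Hypothetical helper; returns `None` if the buffer is malformed.
fn split_avc(mut au: &[u8]) -> Option<Vec<&[u8]>> {
    let mut nals = Vec::new();
    while !au.is_empty() {
        let prefix = au.get(..4)?; // fails on a truncated length prefix
        let len = u32::from_be_bytes([prefix[0], prefix[1], prefix[2], prefix[3]]) as usize;
        let end = 4usize.checked_add(len)?;
        nals.push(au.get(4..end)?); // fails if the NAL runs past the buffer
        au = &au[end..]; // skip directly to the next length prefix
    }
    Some(nals)
}
```

Each returned slice borrows from the original buffer, so splitting does not copy any frame data.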

I'm doing some spring cleaning. Is it fair to say that everything here is either fixed (#13 maybe) or covered by other issues (#19, #21, and #44 maybe)? If not, could you help me sort out what's left?

Yes, ffmpeg works now. We just need to add the NAL splitter (which I have written, but I still haven't made a proper PR for it; the issue for it is somewhere here).

scottlamb / retina

errors when feeding VStarcam data to ffmpeg #15