scottlamb / retina

High-level RTSP multimedia streaming library, in Rust
https://crates.io/crates/retina
Apache License 2.0
244 stars 48 forks source link

Reolink Bad SPS #102

Closed Curid closed 6 months ago

Curid commented 7 months ago
Bad SPS: RbspReaderError(RemainingData)
  conn: 172.17.0.2:52680(me)->192.168.1.212:46665@2024-03-03T16:49:12
  stream: TCP, interleaved channel ids 0-1
  ssrc: 40b2a2e8
  seq: 000014b6
  pkt: 49598@2024-03-03T16:49:12

reolink_working_vlc_pcap.zip

danielgoz commented 7 months ago

the camera is a Reolink Video Doorbell WiFi

scottlamb commented 7 months ago

This camera sends an "out-of-band" sequence parameter set (in the DESCRIBE response's SDP): Z2QAM6wVFKCgPZA= in base64, 67640033ac1514a0a03d90 in hex. h264_reader says this is valid. This SPS has no VUI (which is fine).

Then the camera sends another SPS "in-band" (as part of the RTP stream): 67640033ac1514a0a03da1000004f6000063380404. It's invalid according to h264_reader. RbspReaderError(RemainingData) means there was unexpected extra stuff after h264_reader parsed everything it expected to find. If I remove the last byte, it does parse. The difference between the two is as follows:

--- out-of-band sps
+++ in-band sps without last byte
@@ -1,38 +1,57 @@
 SeqParameterSet {
     profile_idc: ProfileIdc(
         100,
     ),
     constraint_flags: ConstraintFlags {
         flag0: false,
         flag1: false,
         flag2: false,
         flag3: false,
         flag4: false,
         flag5: false,
         reserved_zero_two_bits: 0,
     },
     level_idc: 51,
     seq_parameter_set_id: SeqParamSetId(
         0,
     ),
     chroma_info: ChromaInfo {
         chroma_format: YUV420,
         separate_colour_plane_flag: false,
         bit_depth_luma_minus8: 0,
         bit_depth_chroma_minus8: 0,
         qpprime_y_zero_transform_bypass_flag: false,
         scaling_matrix: SeqScalingMatrix,
     },
     log2_max_frame_num_minus4: 9,
     pic_order_cnt: TypeZero {
         log2_max_pic_order_cnt_lsb_minus4: 9,
     },
     max_num_ref_frames: 1,
     gaps_in_frame_num_value_allowed_flag: true,
     pic_width_in_mbs_minus1: 39,
     pic_height_in_map_units_minus1: 29,
     frame_mbs_flags: Frames,
     direct_8x8_inference_flag: true,
     frame_cropping: None,
-    vui_parameters: None,
+    vui_parameters: Some(
+        VuiParameters {
+            aspect_ratio_info: None,
+            overscan_appropriate: Unspecified,
+            video_signal_type: None,
+            chroma_loc_info: None,
+            timing_info: Some(
+                TimingInfo {
+                    num_units_in_tick: 1270,
+                    time_scale: 25400,
+                    fixed_frame_rate_flag: false,
+                },
+            ),
+            nal_hrd_parameters: None,
+            vcl_hrd_parameters: None,
+            low_delay_hrd_flag: None,
+            pic_struct_present_flag: false,
+            bitstream_restrictions: None,
+        },
+    ),
 }

so essentially the new SPS additionally declares a fixed frame rate of 10 fps (a tick every 1,270 / 25,400 seconds, 2 ticks per frame because H.264 supports interlaced video). That sounds perfectly plausible and, except for the extra byte, valid.

Next question: which is wrong: h264_parser or the camera? I'm skimming the H.264 spec (see here to find a copy). Section 7.3.2.1 defines the SPS data syntax. Section E.1.1 defines the VUI data syntax. To me the h264_parser code looks correct / consistent with that. Putting that together with the data minus the trailing byte looking plausible, I think the camera is actually just sending a strange extra byte at the end for some infuriating reason.

Next question: what should we do about it? I'm unsure:

  1. Accept that this camera doesn't work with Retina? Then we wouldn't be upholding a promise in the README: "Works around brokenness in cheap closed-source cameras."
  2. Make h264-parser not error if there's extra trailing data? This risks masking bugs. These bitstreams aren't exactly JSON or protobuf. They don't include the names or tag numbers for each field they're sending. If the sender and the receiver don't have identical expectations for what fields are being sent, in what order, and how many bits are used for each, the receiver will end up with gibberish. So allowing a little surprise extra stuff at the end is not terribly appealing. I guess it doesn't have to be all-or-nothing though. It's even possible to specify at the Retina level even by passing in a wrapping BitRead here that overrides the provided finish_rbsp with something more tolerant. It could log error and continue. We could choose to do this based on context and/or policy knob.
  3. When Retina receives an invalid SPS after having previously received a valid one, keep using the previously valid one? This is a bit similar to option 2. It could also mask problems that later produce gibberish, as iirc the SPS/PPS have stuff in them like how many bits are used to represent parts of the actual video data, or "quantization parameters" that...I dunno, I'm really not a codec expert, somehow also ultimately affect how the data stream turns into video. But again, we could log error and make this optional.

Thoughts?

I'm also wondering what other clients do. Did VLC log any errors/warnings when talking to this camera?

ffmpeg uses some "interesting" macros...but I think is actually checking that the rbsp trailing bits are as expected, just as h264_parser / retina do. The fixed macro appears to be passing in the expected bit as the min and max of a range to xu which then I guess bails if the actual found valid doesn't match.

#define fixed(width, name, value) do { \
        av_unused uint32_t fixed_value = value; \
        xu(width, name, fixed_value, value, value, 0, ); \
    } while (0)

// ...

#define xu(width, name, var, range_min, range_max, subs, ...) do { \
        uint32_t value; \
        CHECK(ff_cbs_read_unsigned(ctx, rw, width, #name, \
                                   SUBSCRIPTS(subs, __VA_ARGS__), \
                                   &value, range_min, range_max)); \
        var = value; \
    } while (0)

// ...

static int FUNC(rbsp_trailing_bits)(CodedBitstreamContext *ctx, RWContext *rw)
{
    int err;

    fixed(1, rbsp_stop_one_bit, 1);
    while (byte_alignment(rw) != 0)
        fixed(1, rbsp_alignment_zero_bit, 0);

    return 0;
}

// ...

static int FUNC(sps)(CodedBitstreamContext *ctx, RWContext *rw,
                     H264RawSPS *current)
{
    int err, i;

    // ...

    flag(vui_parameters_present_flag);
    if (current->vui_parameters_present_flag)
        CHECK(FUNC(vui_parameters)(ctx, rw, &current->vui, current));
    else
        CHECK(FUNC(vui_parameters_default)(ctx, rw, &current->vui, current));

    CHECK(FUNC(rbsp_trailing_bits)(ctx, rw));

    return 0;
}
scottlamb commented 7 months ago

Another point of comparison: deepch/vdk just stops when it has all the data it knows to look for, here. This program...

package main

import (
    "encoding/hex"
    "fmt"
    "github.com/deepch/vdk/codec/h264parser"
    "log"
)

func main() {
    raw, err := hex.DecodeString("67640033ac1514a0a03da1000004f6000063380404")
    if err != nil {
        log.Fatalf("hex string wouldn't parse: %v", err)
    }
    sps, err := h264parser.ParseSPS(raw)
    if err != nil {
        log.Fatalf("sps wouldn't parse: %v", err)
    }
    fmt.Printf("sps: %+v\n", sps)
}

prints: sps: {Id:0 ProfileIdc:100 LevelIdc:51 ConstraintSetFlag:0 MbWidth:40 MbHeight:30 CropLeft:0 CropRight:0 CropTop:0 CropBottom:0 Width:640 Height:480 FPS:10}

Curid commented 7 months ago

I think the camera is actually just sending a strange extra byte at the end for some infuriating reason.

Is there some reference H264 implementation we could try it with? Why just 0x04, does it repeat the last byte? It'd be interesting to dump the firmware and find out.

Next question: what should we do about it? I'm unsure:

Accept that this camera doesn't work with Retina? Then we wouldn't be upholding a promise in the README: "Works around brokenness in cheap closed-source cameras."

This may not be the only camera with this issue though, and it's harder to find replacements for door bell cameras compared to normal ones.

Make h264-parser not error if there's extra trailing data? This risks masking bugs. These bitstreams aren't exactly JSON or protobuf. They don't include the names or tag numbers for each field they're sending. If the sender and the receiver don't have identical expectations for what fields are being sent, in what order, and how many bits are used for each, the receiver will end up with gibberish. So allowing a little surprise extra stuff at the end is not terribly appealing. I guess it doesn't have to be all-or-nothing though. It's even possible to specify at the Retina level even by passing in a wrapping BitRead here that overrides the provided finish_rbsp with something more tolerant. It could log error and continue. We could choose to do this based on context and/or policy knob.

We could add a check for the in-band SPSs to see if the extra byte has length 1 and equals 0x04 to reduce the risk of masking bugs.

When Retina receives an invalid SPS after having previously received a valid one, keep using the previously valid one? This is a bit similar to option 2. It could also mask problems that later produce gibberish, as iirc the SPS/PPS have stuff in them like how many bits are used to represent parts of the actual video data, or "quantization parameters" that...I dunno, I'm really not a codec expert, somehow also ultimately affect how the data stream turns into video. But again, we could log error and make this optional.

If it randomly receives an invalid SPS several hours after the stream started I'd prefer if it threw an error.

Curid commented 7 months ago

bluenviron ignores extra bytes.

https://github.com/bluenviron/mediacommon/blob/main/pkg/codecs/h264/sps.go#L702 fmt.Printf("pos: %v size: %v\n", pos, len(buf)*8) // pos: 125 size: 136

package main

import (
    "encoding/hex"
    "fmt"
    "log"
    "github.com/bluenviron/mediacommon/pkg/codecs/h264"
)

func main() {
    raw, err := hex.DecodeString("67640033ac1514a0a03da1000004f6000063380404")
    if err != nil {
        log.Fatalf("hex string wouldn't parse: %v", err)
    }
    var sps h264.SPS
    err = sps.Unmarshal(raw)
    if err != nil {
        log.Fatalf("sps wouldn't parse: %v", err)
    }
    fmt.Printf("sps: %+v\n", sps)
        fmt.Printf("vui: %+v\n", *sps.VUI)
}

prints

sps: {
  ProfileIdc:100
  ConstraintSet0Flag:false
  ConstraintSet1Flag:false
  ConstraintSet2Flag:false
  ConstraintSet3Flag:false
  ConstraintSet4Flag:false
  ConstraintSet5Flag:false
  LevelIdc:51
  ID:0
  ChromaFormatIdc:1
  SeparateColourPlaneFlag:false
  BitDepthLumaMinus8:0
  BitDepthChromaMinus8:0
  QpprimeYZeroTransformBypassFlag:false
  ScalingList4x4:[]
  UseDefaultScalingMatrix4x4Flag:[]
  ScalingList8x8:[]
  UseDefaultScalingMatrix8x8Flag:[]
  Log2MaxFrameNumMinus4:9
  PicOrderCntType:0
  Log2MaxPicOrderCntLsbMinus4:9
  DeltaPicOrderAlwaysZeroFlag:false
  OffsetForNonRefPic:0
  OffsetForTopToBottomField:0
  OffsetForRefFrames:[]
  MaxNumRefFrames:1
  GapsInFrameNumValueAllowedFlag:true
  PicWidthInMbsMinus1:39
  PicHeightInMapUnitsMinus1:29
  FrameMbsOnlyFlag:true
  MbAdaptiveFrameFieldFlag:false
  Direct8x8InferenceFlag:true
  FrameCropping:<nil>
  VUI:0xc00009a040
}
vui: {
  AspectRatioInfoPresentFlag:false
  AspectRatioIdc:0
  SarWidth:0
  SarHeight:0
  OverscanInfoPresentFlag:false
  OverscanAppropriateFlag:false
  VideoSignalTypePresentFlag:false
  VideoFormat:0 VideoFullRangeFlag:false
  ColourDescriptionPresentFlag:false
  ColourPrimaries:0
  TransferCharacteristics:0
  MatrixCoefficients:0
  ChromaLocInfoPresentFlag:false
  ChromaSampleLocTypeTopField:0
  ChromaSampleLocTypeBottomField:0
  TimingInfo:    {NumUnitsInTick:1270 TimeScale:25400 FixedFrameRateFlag:false}
  NalHRD:<nil>
  VclHRD:<nil>
  LowDelayHrdFlag:false
  PicStructPresentFlag:false
  BitstreamRestriction:<nil>
}
scottlamb commented 7 months ago

Is there some reference H264 implementation we could try it with?

Turns out yes. https://www.itu.int/rec/T-REC-H.264.2-201602-I/en has a zip file with a bunch of stuff in it. Some of it seems to be specifically tailored for stuff like Scalable Video Codec. The basic thing seems to be the H.264.2_JM_19_0 directory; the readme says the latest version can be found at http://iphome.hhi.de/suehring/tml. The code is...garbage quality. But you can run it as follows.

$ python3 -c 'open("sps-and-pps", "wb").write(bytes.fromhex("0000000167640033ac1514a0a03da1000004f60000633804040000000168ee3cb0"))'
$ CFLAGS=-g make -j
$ lldb bin/ldecod.exe -- -p InputFile=sps-and-pps
(lldb) run

It will crash trying to interpret video frames that aren't in the file, but it seems to parse the bogus SPS first. It doesn't care about the trailing RBSP bits being correct; it just stops reading a given NAL when it finds what it wants.

I guess that's three implementations so far that just stop reading. Should we do the same? 🤷

Why just 0x04, does it repeat the last byte? It'd be interesting to dump the firmware and find out.

🤷 That level of reverse engineering is more than I can do in the time I have.

If it randomly receives an invalid SPS several hours after the stream started I'd prefer if it threw an error.

It certainly would be possible to define a policy where it only allows the invalid SPS say before the first frame of video.

Curid commented 7 months ago

I guess that's three implementations so far that just stop reading. Should we do the same? 🤷

You're right that it may mask other bugs, so it might be better to add a special exception for just this camera.

@dholroyd what do you think?

dholroyd commented 6 months ago

I can see that there is an example of other syntax in the spec where it's expected to have additional data between seq_parameter_set_data() and rbsp_trailing_bits()

subset_seq_parameter_set_rbsp( ) {
  seq_parameter_set_data( )
  ...load of stuff...
  rbsp_trailing_bits( )
}

But I can't see this being needed in a normal seq_parameter_set_rbsp(), and I can't see any flags that would allow seq_parameter_set_data() to be extended in a later version of the spec (and there's a separate NAL Unit type specifically for SPS extensions).

So I agree that the example you describe seems wrong.

How would it be if h264_reader::nal::sps::SpsError was altered to contain those parts of the SPS that were apparently parsed prior to a syntax error? Something like that would let the calling code make its own choices about how much to trust the data, and could also provide useful diagnostic context when trying to work out what's wrong with a bitstream.

This design still falls short on the diagnostic front because the data structures can't represent having parsed 'half' of a SeqParameterSet - the best that could be done is to e.g. return SeqParameterSet with vui_parameters: None if we hit an issue in vui_parameters( ) syntax.

Anyway, in this case, the error would contain the complete SeqParameterSet of course.

scottlamb commented 6 months ago

This design still falls short on the diagnostic front because the data structures can't represent having parsed 'half' of a SeqParameterSet - the best that could be done is to e.g. return SeqParameterSet with vui_parameters: None if we hit an issue in vui_parameters( ) syntax.

How would this work if the error is before some mandatory element?

I'm leaning toward a more modest thing of just adding a Retina flag to ignore the trailing data part only. I want Retina to work with a variety of cameras, but as a general practice I prefer to understand each distinct issue to the extent I can, rather than doing something broader that might paper over some unrelated bug.

scottlamb commented 6 months ago

I got another report in which the camera sent the SPS 67640033ac1514a02800f190 both in the DESCRIBE and in-band. Decoding produces RbspReaderError(RemainingData) unless I chop off the last two bytes. That lets us answer a couple questions:

When Retina receives an invalid SPS after having previously received a valid one, keep using the previously valid one?

Wouldn't work; in this case it's never received a valid SPS.

Why just 0x04, does it repeat the last byte?

In this case it's two bytes [edit: or three, extra 0x00 bytes after the last bit set are unnecessary but explicitly allowed] or three and they don't match the previous two, so apparently not.

danielgoz commented 6 months ago

Any updates? Anything else I could help with?

Curid commented 6 months ago

Another report for the same error with a reolink cx410

Bad SPS: RbspReaderError(RemainingData)
  conn: 172.22.0.2:57512(me)->192.168.254.104:554@2024-05-21T14:58:16
  stream: TCP, interleaved channel ids 0-1
  ssrc: dd247e4e seq: 13956
  pkt: 294297@2024-05-21T14:58:18
scottlamb commented 6 months ago

Any updates? Anything else I could help with?

This should be fixed on Retina's tolerate-bad-sps branch. Give it a spin? If it works, I'll prep Retina and Moonfire releases.

Curid commented 6 months ago

cx410 and the doorbell camera confirmed working!