ultravideo / uvgRTP

An open-source library for RTP/SRTP media delivery
BSD 2-Clause "Simplified" License
325 stars 92 forks source link

H264 Failed to flush the message queue #204

Closed jrsnen closed 9 months ago

jrsnen commented 1 year ago

I see the same issue with RTP_FORMAT_H264, both with master and 2.3.0. However when I change it to H265 the problem goes away. This is really frustrating as I really like the API and easy of use, but looking at a bunch of h.264 issues reported here I can say claiming this library supports h.264 is a very bold statement - it should come with (alpha) disclaimer.

The problem is really easy to repro and it happens on both macOS and Ubuntu 22.04, and it looks like some sort of memory corruption issue when a big frame is being fragmented. Here's the test app (forwarding RTP from one port to another):

uvgrtp::context ctx;
uvgrtp::session *sess = ctx.create_session("127.0.0.1");
uvgrtp::media_stream *stream = sess->create_stream(8890, 8891, RTP_FORMAT_H264, RTP_NO_FLAGS);

while (true) {
    uvgrtp::frame::rtp_frame* frame = stream->pull_frame(1000);
    if (frame) {
        auto ret = stream->push_frame(frame->payload, frame->payload_len, RTP_COPY);
        std::cout << frame->payload_len << " bytes, res: " << ret << std::endl;
    }
}

and the output is (see how the error pops up on a large NALU)

[uvgRTP][ERROR][::flush_queue] Failed to flush the message queue: 14
29564 bytes, res: -5
9 bytes, res: 0
4400 bytes, res: 0
9 bytes, res: 0
3189 bytes, res: 0
12 bytes, res: 0
9 bytes, res: 0

Commented by @witaly-iwanow in https://github.com/ultravideo/uvgRTP/issues/201#issuecomment-1773673366

I tried to debug sendmmsg calls and I could see it fragments a NALU into something like 1200+1200+1200+...+300 byte chunks - all good at this point, but after that there are 20-30 tiny packets with apparently bogus destination address it tries to send out

Link https://github.com/ultravideo/uvgRTP/issues/201#issuecomment-1773674705

vitaly-castLabs commented 1 year ago

I ended up using two separate sessions and it seems to be working fine:

    std::string recvAddress = "127.0.0.1";
    uint16_t recvPort = 8000;
    std::string sendAddress = "127.0.0.1";
    uint16_t sendPort = 8001;

    uvgrtp::context uvgrtpCtx;
    uvgrtp::session* recvSession = uvgrtpCtx.create_session(recvAddress);
    uvgrtp::session* sendSession = uvgrtpCtx.create_session(sendAddress);
    uvgrtp::media_stream* sendStream = recvSession->create_stream(recvPort, RTP_FORMAT_H264, RCE_RECEIVE_ONLY | RCE_NO_H26X_PREPEND_SC);
    uvgrtp::media_stream* recvStream = sendSession->create_stream(sendPort, RTP_FORMAT_H264, RCE_SEND_ONLY);
festlv commented 10 months ago

I tried using uvgRTP for RTP video stream and also discovered this issue.

Here is a gist that contains neccessary information to reproduce: https://gist.github.com/festlv/31136ab25cf1c535c5e3b817309c857e

I am encoding the packets with libav. The gist contains a file with packet data embedded in source code (for ease of reproduction, no external data is needed for main.cpp). There's also output of the main.cpp which exhibits the issue described above, and a link to raw h.264 file to verify that it can be played back (using ffplay/mplayer).

My own debugging attempts lead me to believe that the issue lies in uvgrtp::formats::h26x::push_media_frame or one of the functions it calls- either the start code lookup, aggregation or fragmentation leads to invalid data being passed to fqueue_, which then fails to be transmitted.

festlv commented 10 months ago

I did some more debugging based on the reproducible failure above and got a bit closer to the root cause. The issue seems to be caused in uvgrtp::formats::h264::fu_division: https://github.com/ultravideo/uvgRTP/blob/ecf65ba5b0a0d67ebfd71d023e1f9c62eb48274d/src/formats/h264.cc#L161-L164 Only the first three elements of the buffer vector are used, however the buffer vector is not cleared for multiple invocations of fu_division, so if the packet contains multiple NALus that need to be fragmented, there will be extra data in buffers when they are later passed to enqueue (containing nullptr entries).

jrsnen commented 10 months ago

Hello @festlv

Does reducing the MTU size in uvgRTP help with the issue? (docs folder has more on this). The reduction can be done with configure_ctx function.

BR, Joni

festlv commented 10 months ago

@jrsnen: no, it does not help, it would make matters only worse (by increasing the number of NAL units that need to be fragmented). I have a fix for this issue that allows me to continue testing this library for my application: https://github.com/festlv/uvgRTP/commit/0264699a5bbd9d9868b53bfd084dfc44646e6f59

festlv commented 10 months ago

I did some more debugging and looked at the existing tests to see why this case wasn't covered. The following test case triggers the issue (the branch containing the test case already has a fix): https://github.com/ultravideo/uvgRTP/commit/6ef2a275f5c821cdf4cb89b64ad4a30612ae7660

Root cause is: if I understand the uvgRTP design correctly, the user of the library can provide several sequential NAL units to uvgrtp::media_stream::push_frame. The NAL units are then aggregated/fragmented into RTP packets as required.

The current test cases handle testing the fragmentation of a single NAL unit, which is working correctly. The issue appears if the data buffer passed to uvgrtp::media_stream::push_frame contains several NAL units that require fragmentation. See my explanation above on the fix.

jrsnen commented 10 months ago

Ok, I'm starting to understand what is the issue. Truth be told, we haven't yet specified whether input that contains multiple NAL units that need fragmentation are supported. I'm just wondering what the RTP timestamp should be in this case.

You are right with the buffers, it seems that init_transaction should be called before the second fragmentation begins. If it does not conflict with the RFC, I could try implementing this.

BR, Joni

jrsnen commented 10 months ago

Now that I'm reading the RFC for HEVC RTP, it might not be possible to support multiple fragmentable NAL units if they belong to different access units. We could improve the documentation on this at the very least.

festlv commented 10 months ago

I am now testing uvgRTP with h.264 encoded data from ffmpeg and at least the high level APIs return a single AVPacket for the encoded frame (which consists of multiple NALs). The path of least resistance would be to provide the data contained in AVPacket as-is to uvgrtp::media_stream::push_frame.

I'm also seeing some other issues related to aggregation (#211) and possibly SCL (or some other reason that causes 0-length NAL to be parsed out of encoded data).

jrsnen commented 10 months ago

Ok, interesting. I won't promise anything, but it would be interesting to investigate our compatibility with the likes of FFmpeg and GStreamer. From how you describe it, it does sound like they are part of the same access unit, so they should work.

H.264 is generally less tested in uvgRTP since we mostly deal with H.265 and H.266. It is unfortunate that you are running into issues with it.

festlv commented 10 months ago

@jrsnen do the encoders you use typically output individual NAL units as they become available (which you directly provide to push_frame)?

festlv commented 10 months ago

I have a fix for this issue that allows me to continue testing this library for my application: festlv@0264699

This fix introduces another subtle issue- fu_headers is reused, which causes sending RTP packets with incorrect NAL type (when fragmenting several NAL units of different types belonging to the same frame).

jrsnen commented 9 months ago

I think init_transaction should be called instead of directly resetting the buffers, but I haven't yet investigated this more.

festlv commented 9 months ago

Probably the queue should be flushed before that- I have a feeling that this will have quite a bit of performance impact.

jrsnen commented 9 months ago

Nothing would suggest there would be a meaningful performance impact, the performance impact should be similar to pushing the NAL units separately, which is small. @tampsa said he will take a look at this issue when he has time. If he doesn't, I will solve this at some point.

BR, Joni

tampsa commented 9 months ago

Hello,

This issue should be fixed by the new release 3.1.0. If you run into any further problems, feel free to open a new issue on the topic!