sipsorcery-org / sipsorcery

A WebRTC, SIP and VoIP library for C# and .NET. Designed for real-time communications apps.
https://sipsorcery-org.github.io/sipsorcery

Add H264 depacketisation #378

Open sipsorcery opened 3 years ago

sipsorcery commented 3 years ago

Currently the RTPSession class supports H264 packetisation but not depacketisation. That leaves VP8 as the only fully supported video codec. This issue is to capture the need to add the feature. Ideally H264 packetisation logic should be refactored out of RTPSession into a separate class at the same time.

rafcsoares commented 3 years ago

@sipsorcery You can use my payload processor to do it.

Just call ProcessRTPPayload and if the return value is not null, the frame is complete (Annex-B slices are supported). Just remember that this processor does not handle packets with different timestamps, as it is not a jitter buffer itself. A minimal usage sketch follows the class below.

Example: the frame with timestamp 100 below will be lost, because the processor clears its buffer when it receives a new timestamp (200).

seqNum 1 timestamp 100 markbit 0
seqNum 3 timestamp 100 markbit 1
seqNum 4 timestamp 200 markbit 1
seqNum 2 timestamp 100 markbit 0

Example 2: if slices with the same timestamp are received out of order, the payload processor can sort them, so the sequence below still produces a valid H264 frame.

seqNum 1 timestamp 100 markbit 0
seqNum 3 timestamp 100 markbit 1
seqNum 2 timestamp 100 markbit 0
seqNum 4 timestamp 200 markbit 1

/// <summary>
/// Based on https://github.com/BogdanovKirill/RtspClientSharp/blob/master/RtspClientSharp/MediaParsers/H264VideoPayloadParser.cs
/// Distributed under MIT License
/// 
/// @author raf.csoares@kyubinteractive.com
/// </summary>

using System;
using System.Collections.Generic;
using System.IO;

namespace SIPSorcery
{
    public class H264PayloadProcessor
    {
        #region Consts

        const int SPS = 7;            // sequence parameter set
        const int PPS = 8;            // picture parameter set
        const int IDR_SLICE = 5;      // coded slice of an IDR picture (key frame)
        const int NON_IDR_SLICE = 1;  // coded slice of a non-IDR picture

        #endregion

        #region Private Variables

        //Payload Helper Fields
        uint previous_timestamp = 0;
        int norm = 0, fu_a = 0, fu_b = 0, stap_a = 0, stap_b = 0, mtap16 = 0, mtap24 = 0; // diagnostic counters for each packetisation type
        List<KeyValuePair<int, byte[]>> temporary_rtp_payloads = new List<KeyValuePair<int, byte[]>>(); // used to assemble the RTP packets that form one RTP frame
        MemoryStream fragmented_nal = new MemoryStream(); // used to concatenate fragmented H264 NALs where a NAL is split over multiple RTP packets

        #endregion

        #region Public Functions

        public virtual MemoryStream ProcessRTPPayload(byte[] rtpPayload, ushort seqNum, uint timestamp, int markbit, out bool isKeyFrame)
        {
            List<byte[]> nal_units = ProcessRTPPayloadAsNals(rtpPayload, seqNum, timestamp, markbit, out isKeyFrame);

            if (nal_units != null)
            {
                //Calculate total buffer size
                long totalBufferSize = 0;
                for (int i = 0; i < nal_units.Count; i++)
                {
                    var nal = nal_units[i];
                    long remaining = nal.Length;

                    if (remaining > 0)
                        totalBufferSize += remaining + 4; //nal + 0001
                    else
                    {
                        nal_units.RemoveAt(i);
                        i--;
                    }
                }

                //Merge the NALs into the same buffer using the Annex-B separator (00 00 00 01)
                MemoryStream data = new MemoryStream(new byte[totalBufferSize]);
                foreach (var nal in nal_units)
                {
                    data.WriteByte(0);
                    data.WriteByte(0);
                    data.WriteByte(0);
                    data.WriteByte(1);
                    data.Write(nal, 0, nal.Length);
                }
                return data;
            }
            return null;
        }

        public virtual List<byte[]> ProcessRTPPayloadAsNals(byte[] rtpPayload, ushort seqNum, uint timestamp, int markbit, out bool isKeyFrame)
        {
            List<byte[]> nal_units = ProcessH264Payload(rtpPayload, seqNum, timestamp, markbit, out isKeyFrame);

            return nal_units;
        }

        #endregion

        #region Payload Internal Functions

        protected virtual List<byte[]> ProcessH264Payload(byte[] rtp_payload, ushort seqNum, uint rtp_timestamp, int rtp_marker, out bool isKeyFrame)
        {
            if (previous_timestamp != rtp_timestamp && previous_timestamp > 0)
            {
                temporary_rtp_payloads.Clear();
                previous_timestamp = 0;
                fragmented_nal.SetLength(0);
            }

            // Add to the list of payloads for the current Frame of video
            temporary_rtp_payloads.Add(new KeyValuePair<int, byte[]>(seqNum, rtp_payload)); // TODO could optimise this and go direct to Process Frame if just 1 packet in frame
            if (rtp_marker == 1)
            {
                //Reorder by sequence number to correct for out of order UDP delivery
                if (temporary_rtp_payloads.Count > 1)
                    temporary_rtp_payloads.Sort((a, b) => { return a.Key.CompareTo(b.Key); });

                // End marker is set. Process the list of RTP packets (forming one RTP frame) and extract the NAL units.
                List<byte[]> nal_units = ProcessH264PayloadFrame(temporary_rtp_payloads, out isKeyFrame);
                temporary_rtp_payloads.Clear();
                previous_timestamp = 0;
                fragmented_nal.SetLength(0);

                return nal_units;
            }
            else
            {
                isKeyFrame = false;
                previous_timestamp = rtp_timestamp;
                return null; // we don't have a frame yet. Keep accumulating RTP packets
            }
        }

        // Process a RTP Frame. A RTP Frame can consist of several RTP Packets which have the same Timestamp
        // Returns a list of NAL Units (with no 00 00 00 01 header and with no Size header)
        protected virtual List<byte[]> ProcessH264PayloadFrame(List<KeyValuePair<int, byte[]>> rtp_payloads, out bool isKeyFrame)
        {
            bool? isKeyFrameNullable = null;
            List<byte[]> nal_units = new List<byte[]>(); // Stores the NAL units for a Video Frame. May be more than one NAL unit in a video frame.

            for (int payload_index = 0; payload_index < rtp_payloads.Count; payload_index++)
            {
                // Examine the first byte of each RTP payload (the NAL header)
                int nal_header_f_bit = (rtp_payloads[payload_index].Value[0] >> 7) & 0x01;
                int nal_header_nri = (rtp_payloads[payload_index].Value[0] >> 5) & 0x03;
                int nal_header_type = (rtp_payloads[payload_index].Value[0] >> 0) & 0x1F;

                // If the NAL header type is in the range 1..23 this is a normal NAL (not fragmented),
                // so add it to the list of NAL units
                if (nal_header_type >= 1 && nal_header_type <= 23)
                {
                    norm++;
                    //Check if is Key Frame
                    CheckKeyFrame(nal_header_type, ref isKeyFrameNullable);

                    nal_units.Add(rtp_payloads[payload_index].Value);
                }
                // There are 4 types of aggregation packet, each carrying multiple NALs in a single RTP payload: STAP-A, STAP-B, MTAP16 and MTAP24
                else if (nal_header_type == 24)
                {
                    stap_a++;

                    // RTP packet contains multiple NALs, each prefixed with a 16 bit size field
                    //   Read the 16 bit size
                    //   Read the NAL
                    try
                    {
                        int ptr = 1; // start after the nal_header_type which was '24'
                        // if we have at least 2 more bytes (the 16 bit size) then consume more data
                        while (ptr + 2 < (rtp_payloads[payload_index].Value.Length - 1))
                        {
                            int size = (rtp_payloads[payload_index].Value[ptr] << 8) + (rtp_payloads[payload_index].Value[ptr + 1] << 0);
                            ptr = ptr + 2;
                            byte[] nal = new byte[size];
                            Buffer.BlockCopy(rtp_payloads[payload_index].Value, ptr, nal, 0, size); // copy the NAL

                            byte reconstructed_nal_type = (byte)((nal[0] >> 0) & 0x1F);
                            //Check if is Key Frame
                            CheckKeyFrame(reconstructed_nal_type, ref isKeyFrameNullable);

                            nal_units.Add(nal); // Add to list of NALs for this RTP frame. Start Codes like 00 00 00 01 get added later
                            ptr = ptr + size;
                        }
                    }
                    catch
                    {
                        // Malformed STAP-A payload; ignore the remainder of this packet.
                    }
                }
                else if (nal_header_type == 25)
                {
                    stap_b++;
                }
                else if (nal_header_type == 26)
                {
                    mtap16++;
                }
                else if (nal_header_type == 27)
                {
                    mtap24++;
                }
                else if (nal_header_type == 28)
                {
                    fu_a++;

                    // Parse Fragmentation Unit Header
                    int fu_indicator = rtp_payloads[payload_index].Value[0];
                    int fu_header_s = (rtp_payloads[payload_index].Value[1] >> 7) & 0x01;  // start marker
                    int fu_header_e = (rtp_payloads[payload_index].Value[1] >> 6) & 0x01;  // end marker
                    int fu_header_r = (rtp_payloads[payload_index].Value[1] >> 5) & 0x01;  // reserved. should be 0
                    int fu_header_type = (rtp_payloads[payload_index].Value[1] >> 0) & 0x1F; // Original NAL unit header

                    // Check Start and End flags
                    if (fu_header_s == 1 && fu_header_e == 0)
                    {
                        // Start of Fragment.
                        // Initialise the fragmented_nal byte array
                        // Build the NAL header with the original F and NRI flags but use the Type field from the fu_header_type
                        byte reconstructed_nal_type = (byte)((nal_header_f_bit << 7) + (nal_header_nri << 5) + fu_header_type);

                        // Empty the stream
                        fragmented_nal.SetLength(0);

                        // Add reconstructed_nal_type byte to the memory stream
                        fragmented_nal.WriteByte((byte)reconstructed_nal_type);

                        // copy the rest of the RTP payload to the memory stream
                        fragmented_nal.Write(rtp_payloads[payload_index].Value, 2, rtp_payloads[payload_index].Value.Length - 2);
                    }

                    if (fu_header_s == 0 && fu_header_e == 0)
                    {
                        // Middle part of Fragment
                        // Append this payload to the fragmented_nal
                        // Data starts after the NAL Unit Type byte and the FU Header byte
                        fragmented_nal.Write(rtp_payloads[payload_index].Value, 2, rtp_payloads[payload_index].Value.Length - 2);
                    }

                    if (fu_header_s == 0 && fu_header_e == 1)
                    {
                        // End part of Fragment
                        // Append this payload to the fragmented_nal
                        // Data starts after the NAL Unit Type byte and the FU Header byte
                        fragmented_nal.Write(rtp_payloads[payload_index].Value, 2, rtp_payloads[payload_index].Value.Length - 2);

                        var fragmented_nal_array = fragmented_nal.ToArray();
                        byte reconstructed_nal_type = (byte)((fragmented_nal_array[0] >> 0) & 0x1F);

                        //Check if is Key Frame
                        CheckKeyFrame(reconstructed_nal_type, ref isKeyFrameNullable);

                        // Add the NAL to the array of NAL units
                        nal_units.Add(fragmented_nal_array);
                        fragmented_nal.SetLength(0);
                    }
                }

                else if (nal_header_type == 29)
                {
                    fu_b++;
                }
            }

            isKeyFrame = isKeyFrameNullable != null ? isKeyFrameNullable.Value : false;

            // Output all the NALs that form one RTP Frame (one frame of video)
            return nal_units;
        }

        protected void CheckKeyFrame(int nal_type, ref bool? isKeyFrame)
        {
            if (isKeyFrame == null)
            {
                // SPS, PPS and IDR slices indicate a key frame; a non-IDR slice indicates a delta frame.
                isKeyFrame = nal_type == SPS || nal_type == PPS || nal_type == IDR_SLICE ? new bool?(true) :
                    (nal_type == NON_IDR_SLICE ? new bool?(false) : null);
            }
            else
            {
                // Once a non-IDR slice has been seen the frame cannot be a key frame.
                isKeyFrame = nal_type == NON_IDR_SLICE ? new bool?(false) : isKeyFrame;
            }
        }

        #endregion
    }
}
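
For reference, a minimal usage sketch (the packet fields here are assumed; in sipsorcery they would come from an RTP receive event such as OnRtpPacketReceived):

// Feed each received RTP packet to the processor; a non-null return
// means a complete Annex-B frame is ready.
var depacketiser = new SIPSorcery.H264PayloadProcessor();

void OnH264RtpPacket(byte[] payload, ushort seqNum, uint timestamp, int markerBit)
{
    var frame = depacketiser.ProcessRTPPayload(payload, seqNum, timestamp, markerBit, out bool isKeyFrame);

    if (frame != null)
    {
        byte[] annexB = frame.ToArray(); // 00 00 00 01 separated NAL units
        // Hand annexB to a decoder (e.g. FFmpeg), using isKeyFrame as needed.
    }
}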
sipsorcery commented 3 years ago

Awesome, thanks! I'll see about incorporating that class pronto.

rafcsoares commented 3 years ago

@sipsorcery I wonder if you could add a way to disable this "internal" frame processing, to prevent duplicate depacketisation when we want to use custom logic.

As I said, this internal implementation doesn't use any kind of jitter buffer... So if we want to apply our own custom logic to decode a frame using OnRtpPacketReceived, we are forced to perform the depacketisation twice, as inside RTPSession you always depacketise packets of a known format (H264/VP8).

Maybe letting us decide whether this internal depacketisation runs, by checking if (OnVideoFrameReceived != null) before calling the internal processors, is a good way to go (see the sketch after this comment)... what do you think?

Best regards Rafael
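
A minimal sketch of the suggested guard (everything here other than the ProcessRTPPayload call is a hypothetical shape for illustration, not the actual RTPSession code):

using System;
using SIPSorcery;

public class VideoReceivePath
{
    // Raised with a complete depacketised frame and a key frame flag.
    public event Action<byte[], bool> OnVideoFrameReceived;

    private readonly H264PayloadProcessor _depacketiser = new H264PayloadProcessor();

    public void OnRtpPacket(byte[] payload, ushort seqNum, uint timestamp, int markerBit)
    {
        // Only run the built-in depacketiser when something is subscribed
        // to the assembled-frame event; otherwise leave the raw packets to
        // the application's own custom depacketisation logic.
        if (OnVideoFrameReceived != null)
        {
            var frame = _depacketiser.ProcessRTPPayload(payload, seqNum, timestamp, markerBit, out bool isKeyFrame);

            if (frame != null)
            {
                OnVideoFrameReceived(frame.ToArray(), isKeyFrame);
            }
        }
    }
}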

sipsorcery commented 3 years ago

I added that change and merged the PR.

The H264 depacketisation now works in some circumstances, which is a big improvement! In other circumstances FFmpeg is unhappy with the encoded frames.

In the same situations VP8 works.

The good thing is there is now a starting point to work from.

rafcsoares commented 3 years ago

@sipsorcery I already tested this depacketisation logic at 720p with Chrome.

As I said before, you must ensure one timestamp is finished before applying another timestamp to the H264PayloadProcessor.

If you receive packets in the order below and feed them directly to the H264PayloadProcessor, both frames are lost...

seqNum 1 timestamp 100 markbit 0
seqNum 3 timestamp 100 markbit 1
seqNum 4 timestamp 200 markbit 1
seqNum 2 timestamp 100 markbit 0

Keyframes can be surprisingly long... you can receive more than 50 UDP packets for a single keyframe, so using the H264PayloadProcessor directly, without checking whether it is already processing another timestamp, will cause a huge number of lost frames (see the sketch after this comment).

Another tip is to always use a format description with packetization-mode == 1 before sending SDP with H264, as this depacketisation only handles that mode.

Best Regards Rafael
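
A sketch of the kind of guard Rafael describes (a hypothetical wrapper, not part of the library): buffer packets per timestamp and only feed a timestamp's packets to the processor once its marker packet has arrived, so a packet from a new frame cannot flush a frame that is still being reassembled.

using System.Collections.Generic;
using System.IO;
using System.Linq;

public class TimestampGuardedDepacketiser
{
    private readonly SIPSorcery.H264PayloadProcessor _processor = new SIPSorcery.H264PayloadProcessor();
    private readonly Dictionary<uint, List<(ushort seq, byte[] payload, int marker)>> _pending =
        new Dictionary<uint, List<(ushort, byte[], int)>>();

    public MemoryStream OnPacket(byte[] payload, ushort seq, uint timestamp, int marker, out bool isKeyFrame)
    {
        isKeyFrame = false;

        if (!_pending.TryGetValue(timestamp, out var packets))
        {
            packets = new List<(ushort, byte[], int)>();
            _pending[timestamp] = packets;
        }
        packets.Add((seq, payload, marker));

        // Wait for the marker packet, then feed the whole frame in sequence
        // number order, so an interleaved timestamp can't clear the
        // processor's buffer mid-frame. (A real implementation would also
        // evict stale timestamps and cope with a lost marker packet.)
        if (packets.Any(p => p.marker == 1))
        {
            MemoryStream frame = null;
            foreach (var p in packets.OrderBy(p => p.seq))
            {
                frame = _processor.ProcessRTPPayload(p.payload, p.seq, timestamp, p.marker, out isKeyFrame) ?? frame;
            }
            _pending.Remove(timestamp);
            return frame;
        }

        return null;
    }
}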

sipsorcery commented 3 years ago

Yep I understand what you mean regarding packet loss and out of sequence arrival. I recently added a log warning message so I can quickly identify when that occurs.

The Chrome 720p failure was my fault. In the FFmpeg logic I wasn't re-creating the pixel converter's dimensions when the decoded source frame changed. I fixed that and I'm able to decode H264 at 720p and 1080p.

The MicroSIP softphone is using packetization mode 0, but I don't think that's the issue as that's what Chrome is defaulting to as well (I'm sending the SDP offer without specifying a H264 packetization mode, so Chrome uses 0). I'll keep looking into this one.

rafcsoares commented 3 years ago

When using packetization mode == 0 the PPS and SPS (required to decode all frames) are not sent with keyframes... Instead, in Chrome, the PPS/SPS are only sent in the first packet, so if you lose it all subsequent packets are lost.

With packetization mode == 1 the SPS/PPS are sent with the keyframe, so the keyframe will have 3 or more NALs (SPS, PPS, (IDR)*)... In this mode you can always recover the SPS/PPS when a new keyframe arrives.
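
For reference, packetization-mode is negotiated through the fmtp attribute in the SDP (RFC 6184); the payload type number 102 below is just an example:

m=video 9 UDP/TLS/RTP/SAVPF 102
a=rtpmap:102 H264/90000
a=fmtp:102 packetization-mode=1;profile-level-id=42e01f;level-asymmetry-allowed=1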

sipsorcery commented 3 years ago

From my understanding the packetization mode does not influence the H264 byte stream produced by an encoder. Whether or not the byte stream includes additional PPS and SPS NALs does not change the RTP packetisation. The RTP layer does not understand the different types of NALs. All it does is split them up and package them into RTP packets for sending on the wire. I've refactored that logic out of RTPSession now and put it into a dedicated H264Packetiser class.

It also seems like packetisation-mode 0 and 1 are well understood by implementations. I tested with two different softphones as well as Chrome, and an H264 stream packetised as mode 1 was understood whether or not the parameter was set in the SDP.
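
To illustrate the split-and-package step described above, here is a minimal FU-A fragmentation sketch per RFC 6184 (illustrative only, not the actual H264Packetiser code):

using System;
using System.Collections.Generic;

public static class FuAFragmenter
{
    // Splits a single H264 NAL unit (without an Annex-B start code) into
    // FU-A fragments that each fit within maxPayloadSize.
    public static List<byte[]> Fragment(byte[] nal, int maxPayloadSize)
    {
        var fragments = new List<byte[]>();

        // Small NALs are sent as-is (single NAL unit mode).
        if (nal.Length <= maxPayloadSize)
        {
            fragments.Add(nal);
            return fragments;
        }

        byte nalHeader = nal[0];
        byte fuIndicator = (byte)((nalHeader & 0xE0) | 28); // keep F + NRI bits, type = 28 (FU-A)
        byte nalType = (byte)(nalHeader & 0x1F);

        int offset = 1; // skip the NAL header; the receiver rebuilds it from the FU header
        while (offset < nal.Length)
        {
            int chunk = Math.Min(maxPayloadSize - 2, nal.Length - offset);
            bool isFirst = offset == 1;
            bool isLast = offset + chunk == nal.Length;

            byte fuHeader = nalType;
            if (isFirst) fuHeader |= 0x80; // S (start) bit
            if (isLast) fuHeader |= 0x40;  // E (end) bit

            var payload = new byte[2 + chunk];
            payload[0] = fuIndicator;
            payload[1] = fuHeader;
            Buffer.BlockCopy(nal, offset, payload, 2, chunk);
            fragments.Add(payload);

            offset += chunk;
        }

        return fragments;
    }
}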

forlayo commented 1 year ago

Could it be an issue related to packet loss? I'm basing that guess on the receiving side being somehow corrupted, with an effect like this one (more or less) -> http://www.mediapro.cc/rtp抗丢包传输方案/

If so... I think I have a big problem, as this library doesn't have any mechanism for packet loss, right?

rafcsoares commented 1 year ago

There's no packet loss recovery mechanism for now... I started building support for it but it's not finished.

forlayo commented 1 year ago

@rafcsoares happy to hear that; if I can help somehow, count me in! If you have the work on a branch that I could contribute to or test, I'll be happy to help :)

forlayo commented 1 year ago

I saw a full implementation of FEC using XOR and also Reed-Solomon in a somewhat old Microsoft example, in case it helps -> https://github.com/conferencexp/conferencexp/blob/master/MSR.LST.Net.Rtp/fec.cs#L47

And a Java implementation of ULPFEC (like what's currently used in WebRTC): https://github.com/jitsi/libjitsi/tree/master/src/main/java/org/jitsi/impl/neomedia/transform/fec
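
For intuition, XOR is the simplest FEC scheme: one parity packet protects a group of media packets, and any single lost packet in the group can be rebuilt from the parity packet plus the survivors. A minimal sketch (illustrative only, not the ConferenceXP or libjitsi code; real schemes such as ULPFEC in RFC 5109 also protect lengths and header fields and pad payloads to a common size):

using System.Linq;

public static class XorFec
{
    // Builds a parity packet by XORing a group of equal-length payloads.
    public static byte[] BuildParity(byte[][] packets)
    {
        var parity = new byte[packets[0].Length];
        foreach (var pkt in packets)
        {
            for (int i = 0; i < parity.Length; i++)
            {
                parity[i] ^= pkt[i];
            }
        }
        return parity;
    }

    // Recovers a single missing packet: XORing the parity packet with all
    // the surviving packets yields the lost one.
    public static byte[] Recover(byte[] parity, byte[][] survivors)
    {
        return BuildParity(survivors.Append(parity).ToArray());
    }
}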

H4k4nn commented 6 months ago

There's no packet loss recovery mechanism for now... I started building support for it but it's not finished.

Is there any packet loss recovery available now? What should be done when UDP H264 packets drop and the decoder errors because an expected timestamp/sequence frame is lost?