RFC3261 violation: Freeswitch sending Re-INVITE, not allowing initial INVITE to run to completion when SDP has a=inactive attribute present.

eltoddo commented 4 years ago

Bug Description: Discovered a possible issue with non-compliance of RFC 3261: an initial INVITE comes to Freeswitch with SDP containing the a=inactive media attribute. When the user attempts to answer the ringing line, Freeswitch responds with a 200 OK with SDP containing the a=inactive media attribute set. The problem is that about 1ms later Freeswitch sends a re-INVITE with the a=inactive media attribute removed, not allowing the remote SIP peer to respond with an ACK to the 200 OK. In response, the remote SIP peer sends a 500 "Server Internal Error" triggering Freeswitch to tear down the call. The user picking up the ringing line is left wondering why the call didn't connect.

The current Freeswitch version with above stated behavior: 1.10.3-release+git~20200511T105455Z~129de34d84~64bit (git 129de34 2020-05-11 10:54:55Z 64bit)

A previous Freeswitch release, used to handle this behavior by allowing the SIP peer to respond with an ACK to the 200 OK. Previous working Freeswitch release:
1.6.20+git~20180621T142146Z~1a85cfc184~64bit (git 1a85cfc 2018-06-21 14:21:46Z 64bit)

The current Freeswitch release seems to be violating RFC 3261 - Section 13.2.1 which says:

Once the UAS has sent or received an answer to the initial offer, it MUST NOT generate subsequent offers in any responses to the initial INVITE. This means that a UAS based on this specification alone can never generate subsequent offers until completion of the initial transaction.

The SIP peer is acting according to the RFC by sending 500 "Server Internal Error" in response. RFC 3261 - Section 14.2 says:

A UAS that receives a second INVITE before it sends the final response to a first INVITE with a lower CSeq sequence number on the same dialog MUST return a 500 (Server Internal Error) response to the second INVITE and MUST include a Retry-After header field with a randomly chosen value of between 0 and 10 seconds.

Presumably, this behavior is specific to the fact that the media attribute a=inactive is set on a call arriving to Freeswitch, and the fact that when the user answers, Freeswitch wants media to flow. The re-INVITE seems to be an attempt to jump-start this media flow.

Expected behavior The expected behavior is that the re-INVITE is not sent to the Freeswitch SIP peer, it should wait for the ACK of the 200 OK, as it has done in previous releases.

If this can be fixed by configuration, please point to where this behavior might be altered to allow more time after 200 OK, or for the ACK to be received. Otherwise, if designed behavior, it is expected that the issue is addressed by design change such that the re-INVITE isn't sent, and instead the dialog can complete, per the SIP standards definition.

Trace logs and Other Attachments The following are being attached to this issue:

FS log snippet for failure case, Logs set to DEBUG with UUID logging enabled.
FS log snippet for success case - Logs set to DEBUG with UUID logging enabled.
Image showing call flow for failure case - marked-up with annotations
Image showing call flow for success case - marked-up with annotations
Source packet capture of failure case (zipped)
Source packet capture of success case (zipped)

Freeswitch Logs - failure case.txt Freeswitch Logs - success case.txt Call Flow - Failure Case Call Flow - Success Case 2020-07-22.5_ECL4.4_FAILURE_ARD-to-881234.pcap.zip 2020-08-05.1_Ecl4.1_ARD-inbound_Success.pcap.zip

briankwest commented 4 years ago

Need full FreeSWITCH debug logs with sip trace enabled.

briankwest commented 4 years ago

And 'sofia loglevel all 9', the sip and sofia loglevel are critical to understand this issue.

eltoddo commented 4 years ago

Hi Brian, appreciate the fast response.

I'll see what I can do, but access to the customer's systems is very difficult. They're a major bank in a major city. Remote access is extremely restricted, I need to access by another authorized user's TeamViewer terminal just to interact with the system. Also we'll need to schedule someone to physically go to site as the building is closed due to COVID restrictions.

The earliest possible day it could be done is Monday, and even then lining up the resources might take longer. I'll see what I can do.

It may actually be faster to set up SIPp to simulate what's happening on site, which I'll pursue in parallel.

Any chance any of the attached logs can even hint at what might be going on? I'm happy to try and supply any other information I do have immediate access to.

eltoddo commented 4 years ago

I've got everything lined up so we can do this testing with appropriate traces this afternoon. I'll be doing an online meeting with the customer at 2:30 PM EDT today. I don't think it'll take more than an hour for me to perform the trace.

I made sure to test to ensure I knew how to take the appropriate traces from the FS_CLI last night. I should be good there. I'm following the guidence here: https://freeswitch.org/confluence/display/FREESWITCH/Sofia+SIP+Stack

It says:

To report a bug, you can turn on debugging with more verbose sofia global siptrace on sofia loglevel all 9 sofia tracelevel alert console loglevel debug fsctl debug_level 10

@briankwest, should I just take the trace with the first 2 commands (siptrace on, loglevel all 9)? Or would you like to see the other 3 traces enabled too? We only get one shot at working with the customer on this.

Also, I'll try and make sure my terminal has enough scrollback to copy everything out of the FS_CLI debug console. But is there a way to push the trace to a file instead so I don't have to worry about copying everything/losing anything to terminal scrollback?

eltoddo commented 4 years ago

I believe I have succeeded in taking the traces from the customer systems.

Please see the attached files which are traces of the inbound call failing on FreeSWITCH 1.10.3 and succeeding on FreeSWITCH 1.8.20.

After the traces were captured on the 1.10.3 system, it was powered down and the system running FreeSWITCH 1.8.10 was powered up, and the exact same test case was run to capture the second set of traces.

Files attached for the Failing case: 2020-08-12.1 FS 1.10.3 Inbound Call Failure - All Putty Output.log 2020-08-12.1 FS 1.10.3 Inbound Call Failure - Copied Terminal Output.txt 2020-08-12.1 FS 1.10.3 Inbound Call Failure - sngrep.pcap.zip

Files attached for the Success Case: 2020-08-12.3 FS 1.8.20 Inbound Call Success - All Putty Output.log 2020-08-12.3 FS 1.8.20 Inbound Call Success - Copied Terminal Output.txt 2020-08-12.3 FS 1.8.20 Inbound Call Success - sngrep.pcap.zip

Please let me know if there are any more details needed to resume this investigation. We're eagerly awaiting any feedback you can provide with respect to whether this behavior can be restored via configuration or if a design change will be needed.

mjerris commented 4 years ago

as noted on slack, i think this is due to bypass_media_after_bridge. Note the call is being put in park and then bridge async. if you could put details in here of how exactly you are bridging, and if this happens without bypass media after bridge, that would help

eltoddo commented 4 years ago

One of our technicians managed to reproduce this on one of our test systems today before I could get the SIPp script to do the same. At least we don't need to go to the customer system to take the traces we need! Unfortunately I don't have a test to share. Unless that would significantly help to create, we won't be pursuing a SIPp based test tool, but I do have a system we can offer remote access to.

First, you were right, the dial plan entry for extensions in our system does have "bypass_media_after_bridge=true". Here's a snippet of what that looks like in the dial plan:

<extension name="Local Extension">
        <condition expression="^(?:MSLA_\w*)?(--|1000|1001|1322|1334|2000|2001|2002|5000|5001|5003|8000|8041999|9995)$" field="destination_number">
        <action application="log" data="INFO Local Extension"/>
        <action application="set" data="ringback=${us-ring}"/>
        <action application="ring_ready"/>
        <action application="set" data="bypass_media_after_bridge=true"/>
        <action application="park"/>
    </condition>
</extension>

Second, answering the question of how we're the bridging happens. 1) Incoming call gets parked in the dial plan for the extension caught by matching the condition expression shown above. 2) Java connected application server handles park event. Recognizes incoming call to extension, notifies users with appearances of extension. 3) A user answers the ringing line by indicating by a button press on a physical key that they want to pick up the ringing line. 4) Java application server issues a command such as this one, both creating a new call to the user's registered endpoint, and bridging it with the &intercept command to the UUID of the parked call:

api originate {sip_h_X-WCS-dn='5737',origination_caller_id_number='1208',dialed_extension='1208',origination_uuid='59d5def0-dda5-11ea-aade-4b2c130d9ffa'}user/F8-DC-7A-15-19-B3-right@192.168.20.200 &intercept(56e3b5aa-dda5-11ea-aacb-4b2c130d9ffa)

Third and finally, I tried testing the failing inbound call case with the <action application="set" data="bypass_media_after_bridge=true"/> in the relevant section of the dial plan commented out. Unfortunately, this did not cause a change in the failure. I do however clearly see in the logs that where the logging of the dialplan used to say Action set(bypass_media_after_bridge=true), this is no longer seen. I took another trace from the debug console as well as a packet capture from sngrep. Those are attached. 2020-08-13.1 FS 1.10.3 Inbound Call Failure - Locally Duplicated - Copied Terminal Output.txt 2020-08-13.1 FS 1.10.3 Inbound Call Failure - Locally Duplicated.pcap.zip

Looks like removing the media bypass application flag will not affect this issue.

eltoddo commented 4 years ago

Just wanted to report an update that may indicate that this problem may be affecting more common use cases than previously thought.

Initially it seemed like this was only affecting calls from analog trunks on a Mitel media gateway PBX. However we're now finding that there are other more common situations where a current-generation Mitel 6930 IP phone registered to the Mitel PBX wants to dial an extension on our Freeswitch registered endpoints. We're finding that the failures in our regression testing of Freeswitch 1.10.3 are coming from this issue. The trigger for the issue is the a=inactive media attribute being set in the initial SDP, followed by 200 OK from FS, SDP including the a=inactive flag, then a knee-jerk re-INVITE dropping the media attribute. This results in the 500 Server Internal Error response, and failure of call completion.

surendrasignalwire commented 3 years ago

@eltoddo please remove bypass_media_after_bridge=true from dialplan.you can set it on execute_on_answer with sched_api with +1 so it will send reinvite after 1 second of 200 ok.

eltoddo commented 3 years ago

Hi @surendrasignalwire, testing with the bypass_media_after_bridge=true flag removed from the dial plan was the suggestion from @mjerris on August 13, 2020. It didn't resolve the issue then, but I figured it would be worth testing again in case I got something wrong in the change to the dial plan's configuration. Unfortunately it's still an issue, verified by testing both with and without the flag in place. The flag was verified and confirmed to be in place or not in place appropriately via log analysis of the call.

We've worked around the issue, but this standards violation has been unaddressed for 7 months now. Is this ever going to get attention? Should we be worried about the broader implications this unpatched bug has freeswitch's prioritization of standards adherence?

We're happy to try other ideas for configuration-based changes to address this issue. Even an update if you think this area of the code has been enhanced. Maybe there's a way to keep freeswitch from responding to the 500 Server Internal Error with an ACK then BYE, since all Freeswitch really needs to do is wait for the ACK to the initial 200 OK with SDP answer, which would come if the call weren't immediately torn down by Freeswitch. Maybe just send the BYE.

Anyway, we're here to bounce ideas.

seven1240 commented 1 year ago

@eltoddo do you still have this problem? can you try with and without this pr? https://github.com/signalwire/freeswitch/pull/1804

signalwire / freeswitch

RFC3261 violation: Freeswitch sending Re-INVITE, not allowing initial INVITE to run to completion when SDP has a=inactive attribute present. #778