travelping / ergw

erGW - Erlang implementation of a GGSN and P-GW
GNU General Public License v2.0

UE disconnects from erGW but not from the eNodeB after a while #370

Closed gantonie closed 3 years ago

gantonie commented 3 years ago

erGW version

2.8.11

Erlang/OTP version

23

Current behavior

Hi, I am currently trying to connect two UEs to a network running erGW and UPG-VPP. The architecture of the setup is shown below. Both UEs connect to the network and can access the Internet, ping each other, and ping other addresses on the network.

However, after a while erGW considers the UEs disconnected even though there is no problem on the eNodeB side. In this state I cannot reach any other address or the Internet.

2021-04-26T14:19:25.268688+02:00 <0.1008.0> debug: DP Call {pfcp,v1,session_deletion_request,7820677958247942088,undefined,[]}
2021-04-26T14:19:25.268763+02:00 <0.1007.0> debug: PrepSend: {send_req,{192,168,1,1},30000,5,undefined,{pfcp,v1,session_deletion_request,7820677958247942088,undefined,[]},undefined,{ergw_sx_node,response,[<0.1008.0>,{call,{<0.1134.0>,#Ref<0.3885035724.1325137929.219164>}}]},-576460617572626536}
2021-04-26T14:19:25.268896+02:00 <0.1007.0> debug: PrepSend: Session Deletion Request(V: v1, SEID: 7820677958247942088, Seq: 34): []
2021-04-26T14:19:25.269102+02:00 <0.1007.0> debug: handle message 192.168.1.1:8805: <<33,55,0,179,108,136,161,42,212,24,135,200,0,0,34,0,0,19,0,1,1,0,79,0,158,0,81,0,4,0,0,0,1,0,104,0,4,0,0,0,0,0,63,0,3,0,8,0,0,75,0,4,228,49,45,215,0,76,0,4,228,49,46,77,0,66,0,49,63,0,0,0,0,0,14,183,174,0,0,0,0,0,0,220,26,0,0,0,0,0,13,219,148,0,0,0,0,0,0,5,4,0,0,0,0,0,0,1,163,0,0,0,0,0,0,3,97,0,67,0,4,0,0,0,118,0,69,0,4,228,49,45,216,0,70,0,4,228,49,46,76,128,3,0,10,72,249,228,49,46,77,68,222,12,0,128,4,0,10,72,249,228,49,45,215,41,226,60,0,128,5,0,10,72,249,228,49,46,77,41,226,60,0>>
2021-04-26T14:19:25.269422+02:00 <0.1007.0> debug: handle message 192.168.1.1:8805: Session Deletion Response(V: v1, SEID: 7820677958247942088, Seq: 34): [#pfcp_cause{cause = 'Request accepted'}, #usage_report_sdr{, group =, #{duration_measurement => #duration_measurement{duration = 118}, end_time => #end_time{time = 3828428365}, start_time => #start_time{time = 3828428247}, time_of_first_packet => #time_of_first_packet{time = 3828428248}, time_of_last_packet => #time_of_last_packet{time = 3828428364}, tp_now => #tp_now{now = -4.249350616909769e174}, tp_start_time => #tp_start_time{start = -4.248904891549682e174}, tp_stop_time => #tp_stop_time{stop = -4.249350219115811e174}, ur_seqn => #ur_seqn{number = 0}, urr_id => #urr_id{id = 1}, usage_report_trigger =>, #usage_report_trigger{, immer = 0,droth = 0,stopt = 0,start = 0,quhti = 0, timth = 0,volth = 0,perio = 0,eveth = 0,macar = 0, envcl = 0,monit = 0,termr = 1,liusa = 0,timqu = 0, volqu = 0,emrre = 0,quvti = 0,ipmjl = 0,tebur = 0, evequ = 0}, volume_measurement =>, #volume_measurement{, total = 964526,uplink = 56346,downlink = 908180, total_pkts = 1284,uplink_pkts = 419,downlink_pkts = 865}}}]
2021-04-26T14:19:25.269869+02:00 <0.1007.0> debug: <0.1007.0>: found response: 34: Session Deletion Response(V: v1, SEID: 7820677958247942088, Seq: 34): [#pfcp_cause{cause = 'Request accepted'}, #usage_report_sdr{, group =, #{duration_measurement => #duration_measurement{duration = 118}, end_time => #end_time{time = 3828428365}, start_time => #start_time{time = 3828428247}, time_of_first_packet => #time_of_first_packet{time = 3828428248}, time_of_last_packet => #time_of_last_packet{time = 3828428364}, tp_now => #tp_now{now = -4.249350616909769e174}, tp_start_time => #tp_start_time{start = -4.248904891549682e174}, tp_stop_time => #tp_stop_time{stop = -4.249350219115811e174}, ur_seqn => #ur_seqn{number = 0}, urr_id => #urr_id{id = 1}, usage_report_trigger =>, #usage_report_trigger{, immer = 0,droth = 0,stopt = 0,start = 0,quhti = 0, timth = 0,volth = 0,perio = 0,eveth = 0,macar = 0, envcl = 0,monit = 0,termr = 1,liusa = 0,timqu = 0, volqu = 0,emrre = 0,quvti = 0,ipmjl = 0,tebur = 0, evequ = 0}, volume_measurement =>, #volume_measurement{, total = 964526,uplink = 56346,downlink = 908180, total_pkts = 1284,uplink_pkts = 419,downlink_pkts = 865}}}]
2021-04-26T14:19:25.270143+02:00 <0.1007.0> debug: send_request_reply: {send_req,{192,168,1,1},30000,5,<<33,54,0,12,108,136,161,42,212,24,135,200,0,0,34,0>>,{pfcp,v1,session_deletion_request,7820677958247942088,34,[]},undefined,{ergw_sx_node,response,[<0.1008.0>,{call,{<0.1134.0>,#Ref<0.3885035724.1325137929.219164>}}]},-576460617572626536}
2021-04-26T14:19:25.270378+02:00 <0.1134.0> debug: GxSessionOpts: #{'3GPP-SGSN-MCC-MNC' => <<"20893">>,'Calling-Station-Id' => <<"03632303839333330333">>,'3GPP-MS-TimeZone' => {128,0},'Service-Type' => 'Framed-User','Framed-Protocol' => 'GPRS-PDP-Context','Called-Station-Id' => <<"Open5GS">>,'ECGI' => <<2,248,57,0,1,155,1>>,'3GPP-Selection-Mode' => 0,'TAI' => <<2,248,57,0,7>>,'3GPP-Charging-Id' => 2110551595,owner => <0.1134.0>,'MS-Primary-DNS-Server' => <<"\b\b\b\b">>,'3GPP-IMEISV' => <<"3526411107747901">>,'3GPP-RAT-Type' => 6,'Password' => <<"ergw">>,credits => #{},'MS-Secondary-DNS-Server' => <<8,8,4,4>>,'Framed-Pool' => <<"pool-A">>,'Username' => <<"ergw">>,'Node-Id' => <<"PGW-001">>,'3GPP-NSAPI' => 5,'Session-Start' => -576460735682083398,'3GPP-PDP-Type' => 'IPv4','3GPP-GGSN-MCC-MNC' => <<"20893">>,'Charging-Rule-Base-Name' => <<"rb-0001">>,'MS-Secondary-NBNS-Server' => <<127,0,0,1>>,'Session-Id' => 58493679401027239492922607821821703684,'Framed-IPv6-Pool' => <<"pool-A">>,'Diameter-Session-Id' => <<"96d79ed31a3f;738294030;481173760;647151613443">>,'Accounting-Start' => -576460735680494057,'3GPP-Charging-Characteristics' => <<84,0>>,'Framed-IP-Address' => {10,0,180,166},'3GPP-IMSI' => <<"208930110000303">>,'3GPP-User-Location-Info' => <<130,2,248,57,0,7,2,248,57,0,1,155,1>>,'3GPP-MSISDN' => <<"03632303839333330333">>,'NAS-Identifier' => <<"NAS-Identifier">>,'MS-Primary-NBNS-Server' => <<127,0,0,1>>,'QoS-Information' => #{'APN-Aggregate-Max-Bitrate-DL' => 600000,'APN-Aggregate-Max-Bitrate-UL' => 600000,'Allocation-Retention-Priority' => #{'Pre-emption-Capability' => 0,'Pre-emption-Vulnerability' => 0,'Priority-Level' => 15},'Guaranteed-Bitrate-DL' => 0,'Guaranteed-Bitrate-UL' => 0,'Max-Requested-Bandwidth-DL' => 0,'Max-Requested-Bandwidth-UL' => 0,'QoS-Class-Identifier' => 9},'Requested-IP-Address' => {10,0,180,166},'Multi-Session-Id' => 58493679401027239492922607821821703683,'Event-Trigger' => 18,'3GPP-GGSN-Address' => {172,20,16,2},now => 
-576460735682058609,'Bearer-Operation' => 1,'Idle-Timeout' => 28800000,'3GPP-IMSI-MCC-MNC' => <<"20893">>,'3GPP-SGSN-Address' => {192,168,56,2}}
2021-04-26T14:19:25.270981+02:00 <0.1135.0> error: Received DOWN information for <0.1134.0> with info normal

Then, when I properly disconnect the UE from the eNodeB, erGW returns this

2021-04-26T14:19:57.076948+02:00 <0.1168.0> debug: unable to find context {request,{epc,{172,20,16,1},60259,delete_session_request,{v2,20}}}
2021-04-26T14:19:57.077046+02:00 <0.1168.0> debug: unable to find context {socket_teid_key,epc,'gtp-c',1877507345}
2021-04-26T14:19:57.077119+02:00 <0.1168.0> error: handler failed with: not_found

I only need to restart erGW (not VPP) to fix this problem temporarily. How long the UEs stay connected to erGW varies, but it is generally less than a minute.

Side notes

I also noticed these two points related to the distribution of IP addresses by erGW.

Config

The address range for the UEs is 10.0.180.100/24 - 10.0.180.200/24

This is my erGW configuration ergw-c-node.zip

Is my problem linked to a configuration problem in erGW? Thank you in advance.

RoadRunnr commented 3 years ago

Most likely you are hitting the GTP-C idle timeout. That timeout fires when there has been no GTP-C activity for a given context for a while. The default is 8 hours, but it can be configured and also disabled.

This is a per-APN setting; here is an example of how to configure it: https://github.com/travelping/ergw/blob/release/2.8.11/test/pgw_SUITE.erl#L175-L178

The value is in milliseconds, or the keyword infinity to disable it.
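For reference, a per-APN idle-timeout setting along the lines of the linked test suite might look like the fragment below. This is an illustrative sketch only: the APN name and the other options are placeholders, not taken from the attached configuration.

```erlang
%% excerpt of an APN definition (illustrative; the APN name and
%% the vrf option are placeholders, not from the attached config)
{apns,
 [{[<<"example">>, <<"apn">>],
   [{vrf, sgi},
    %% value in milliseconds, or the atom 'infinity' to disable
    {'Idle-Timeout', 28800000}
   ]}
 ]}
```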

The duplicate IP happens when the UE has not yet been informed about the terminated session. It may then believe it still has an IP address that has in fact already been reused by the PGW. The PGW handler sends a Delete Bearer Request to the SGW when the timeout is triggered. The SGW should then tell the UE to release the association and reconnect, but it seems that in your case that isn't working correctly.

gantonie commented 3 years ago

I just tried with {'Idle-Timeout', infinity} and it worked fine for a while, then crashed like before.

RoadRunnr commented 3 years ago

What does "a while" mean? Minutes, Hours? Can you provide a full log and complete GTP-C trace?

gantonie commented 3 years ago

It generally crashes after one or two minutes. The zip folder ergw_gtpc.zip contains two files:

gantonie commented 3 years ago

I did some more testing about this issue.

This problem only seems to happen when the MME is in a separate VM. It doesn't occur when the MME is installed locally on the host machine.

Just before erGW sends the error message, the Open5GS MME also returns the following error several times:

[mme] ERROR: Unknown SGW : 192.168.56.1 (../src/mme/mme-gtp-path.c:58)

The address 192.168.56.1 corresponds to the machine hosting the VM and the erGW/VPP docker. When I add this address to the SGW list in the MME configuration, this error stops appearing, but erGW still disconnects the UE one or two minutes after the UE connects.

I also have this problem with erGW version 2.8.8. erGW 2.8.9 returns the following error at launch and doesn't seem to work.

error: crasher: initial call: ergw_cluster:'-fun.startup/0-'/0, pid: <0.1137.0>, registered_name: [], exit: {ok,[{ergw_cluster,startup,0,[{file,"/ergw/apps/ergw_cluster/src/ergw_cluster.erl"},{line,170}]},{proc_lib,init_p,3,[{file,"proc_lib.erl"},{line,211}]}]}, ancestors: [ergw_cluster,ergw_cluster_sup,<0.1134.0>], message_queue_len: 0, messages: [], links: [<0.1136.0>], dictionary: [], trap_exit: false, status: running, heap_size: 376, stack_size: 28, reductions: 185; neighbours:

RoadRunnr commented 3 years ago

The ergw_cluster error cannot come from any of the 2.8.x releases. It was introduced on master a while back, but those changes have not made it into any release.

You are on your own (for the moment) if you try to run ergw from master.

On master the application structure has been reorganized and the configuration is now read from a JSON or YAML file. Somewhat working config files can be found at https://github.com/travelping/ergw/tree/master/apps/ergw/test/smc_SUITE_data. The ergw config should only contain the sections for kernel and jobs, the ergw_aaa section needs to be removed, and the ergw section should only contain something like:

{ergw, [{type, json}, {file, "file"}]}

RoadRunnr commented 3 years ago

Found the problem in the GTP-C trace.

This is the Fq-TEID that the MME sends in the Create Session Request:

Fully Qualified Tunnel Endpoint Identifier (F-TEID) : S11 MME GTP-C interface, TEID/GRE Key: 0x00000008, IPv4 192.168.56.2
    IE Type: Fully Qualified Tunnel Endpoint Identifier (F-TEID) (87)
    IE Length: 9
    0000 .... = CR flag: 0
    .... 0000 = Instance: 0
    1... .... = V4: IPv4 address present
    .0.. .... = V6: IPv6 address not present
    ..00 1010 = Interface Type: S11 MME GTP-C interface (10)
    TEID/GRE Key: 0x00000008 (8)
    F-TEID IPv4: 192.168.56.2
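The decode above follows the F-TEID layout from 3GPP TS 29.274: one flag byte (V4/V6 bits plus the interface type), a 4-byte TEID/GRE key, then the address(es). As a rough sketch of that wire format (simplified to the V4/V6 cases and not an ergw API, just an independent illustration), the IE payload from the trace can be parsed like this:

```python
import struct
import ipaddress

def parse_fteid(payload: bytes) -> dict:
    """Parse the payload of a GTPv2-C F-TEID IE (IE type 87).

    Simplified layout per 3GPP TS 29.274: 1 flag byte
    (bit 8 = V4, bit 7 = V6, low 6 bits = interface type),
    4-byte TEID/GRE key, then the IPv4 and/or IPv6 address.
    """
    flags = payload[0]
    has_v4 = bool(flags & 0x80)
    has_v6 = bool(flags & 0x40)
    iface_type = flags & 0x3F
    teid = struct.unpack_from("!I", payload, 1)[0]
    offset = 5
    ipv4 = ipv6 = None
    if has_v4:
        ipv4 = str(ipaddress.IPv4Address(payload[offset:offset + 4]))
        offset += 4
    if has_v6:
        ipv6 = str(ipaddress.IPv6Address(payload[offset:offset + 16]))
    return {"interface_type": iface_type, "teid": teid,
            "ipv4": ipv4, "ipv6": ipv6}

# The 9-byte payload matching the decode above: flags 0x8A
# (V4 set, interface type 10 = S11 MME GTP-C), TEID 0x00000008,
# IPv4 192.168.56.2
print(parse_fteid(bytes([0x8A, 0x00, 0x00, 0x00, 0x08, 192, 168, 56, 2])))
# → {'interface_type': 10, 'teid': 8, 'ipv4': '192.168.56.2', 'ipv6': None}
```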

When ergw tries to send GTP-C Echo Requests to that IP, there is no response. ergw then has to assume that the MME is down and therefore terminates all contexts on that MME. You either have to make sure that the MME is reachable at that IP and responds to GTP-C Echo Requests, or change the IP in that IE to one where the MME is reachable (probably 172.20.16.1).

gantonie commented 3 years ago

This was indeed the problem: the MME didn't process the packets sent by erGW because it considered them to come from the host machine rather than from the SGW/SMF.

I solved this by redirecting the packets destined for port 2123 of the host machine to the Docker container and changing the MME configuration accordingly.
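A host-side redirect of that kind could be set up with iptables NAT rules roughly like the fragment below. This is illustrative only: the container address 172.20.16.2 is inferred from the '3GPP-GGSN-Address' in the log above and may differ in your setup, and the exact rules depend on your Docker networking mode.

```shell
# Illustrative only: forward GTP-C (UDP 2123) arriving on the host
# to the container. 172.20.16.2 is an assumed container address.
iptables -t nat -A PREROUTING -p udp --dport 2123 \
         -j DNAT --to-destination 172.20.16.2:2123
iptables -t nat -A POSTROUTING -p udp -d 172.20.16.2 --dport 2123 \
         -j MASQUERADE
```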

Thank you for your help.