Is this AMF completely fault-tolerant ?

dark-astra commented 3 months ago

The 5G UE Registration call flow consists of multiple exchanges of Uplink and Downlink Transport messages. If the AMF pod fails at some stage between these message exchanges, will the UE registration fail and restart as a new registration request with the new AMF, or can the UE resume the registration with the new AMF at the very step where it failed?

I see you are saving the context in MongoDB, but I wanted to know: is the statelessness procedural, where the context is saved after the registration procedure, or is it saved after each message exchange with the RAN throughout the registration procedure, so that the registration need not be started from the beginning if it fails somewhere in between?

I did a simple experiment to test this:

I installed SD-Core using Aether OnRamp and configured gnbsim to run the REGISTRATION procedure for 10 UEs. When gnbsim starts sending requests, I delete the AMF pod, and this causes the registrations to fail.

ok: [node1] => {
"gNbsimPod.stdout_lines": [
"time="2024-08-18T17:50:22Z" level=info msg="Profile Name: profile1 , Profile Type: register" category=Summary component=GNBSIM",
"time="2024-08-18T17:50:22Z" level=info msg="Ue's Passed: 6 , Ue's Failed: 4" category=Summary component=GNBSIM",
"time="2024-08-18T17:50:22Z" level=info msg="Profile Errors:" category=Summary component=GNBSIM",
"time="2024-08-18T17:50:22Z" level=error msg="imsi , profile timeout" category=Summary component=GNBSIM",
"time="2024-08-18T17:50:22Z" level=error msg="imsi , profile timeout" category=Summary component=GNBSIM",
"time="2024-08-18T17:50:22Z" level=error msg="imsi , profile timeout" category=Summary component=GNBSIM",
"time="2024-08-18T17:50:22Z" level=error msg="imsi , profile timeout" category=Summary component=GNBSIM",
"time="2024-08-18T17:50:22Z" level=info msg="Profile Status: FAIL" category=Summary component=GNBSIM"
]
}

My gnbsim-default.yaml looks like the following:

profiles:
  - profileType: register # UE Registration
    profileName: profile1
    enable: true
    gnbName: gnb1
    execInParallel: false
    startImsi: 208930100007487
    ueCount: 10
    defaultAs: "192.168.250.1"
    perUserTimeout: 10
    plmnId:
      mcc: 208
      mnc: 93
    opc: "981d464c7c52eb6e5036234984ad0bcf"
    key: "5122250214c33e723a5dd523fc145fc0"
    sequenceNumber: "16f3b3f70fc2"

thakurajayL commented 3 months ago

Hi @dark-astra. Thank you for trying out SD-Core/Aether OnRamp.

"I see you are saving the context in MongoDB, but I wanted to know: is the statelessness procedural, where the context is saved after the registration procedure, or is it saved after each message exchange with the RAN throughout the registration procedure, so that the registration need not be started from the beginning if it fails somewhere in between?"

Details are stored only after the complete registration procedure, not after each message exchange.
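
To make the difference between the two persistence strategies concrete, here is a minimal Python sketch. It is not the SD-Core AMF code: the step names, the `ContextStore` class (an in-memory stand-in for MongoDB), and the helper functions are all hypothetical. It only illustrates that when the context is saved at the end of the procedure, a replacement AMF finds nothing for a UE whose registration was interrupted and has to start from the first step, whereas saving after every message would let it resume.

```python
from typing import Dict, List, Optional

# Hypothetical, simplified registration steps (not the full 3GPP call flow).
STEPS: List[str] = [
    "RegistrationRequest",
    "AuthenticationResponse",
    "SecurityModeComplete",
    "RegistrationComplete",
]


class ContextStore:
    """In-memory stand-in for the database that outlives an AMF pod."""

    def __init__(self) -> None:
        self._done: Dict[str, int] = {}

    def save(self, imsi: str, steps_done: int) -> None:
        self._done[imsi] = steps_done

    def load(self, imsi: str) -> int:
        # 0 means "no context found": the procedure starts from scratch.
        return self._done.get(imsi, 0)


def run_registration(
    store: ContextStore,
    imsi: str,
    save_each_step: bool,
    crash_after: Optional[int] = None,
) -> bool:
    """Process the remaining steps for one UE on one AMF instance.

    Returns True if registration completed, False if the AMF "crashed"
    after handling `crash_after` steps in this run.
    """
    start = store.load(imsi)
    handled = 0
    for index in range(start, len(STEPS)):
        if crash_after is not None and handled == crash_after:
            return False  # in-memory progress is lost with the pod
        handled += 1
        if save_each_step:
            store.save(imsi, index + 1)  # checkpoint after every message
    store.save(imsi, len(STEPS))  # save once the procedure is complete
    return True


def steps_replayed_after_crash(save_each_step: bool) -> int:
    """Steps a replacement AMF must handle after a crash two steps in."""
    store = ContextStore()
    imsi = "imsi-example"  # placeholder identifier
    run_registration(store, imsi, save_each_step, crash_after=2)
    return len(STEPS) - store.load(imsi)
```

With these four illustrative steps, `steps_replayed_after_crash(False)` returns 4 (the registration starts over), while `steps_replayed_after_crash(True)` returns 2 (the registration resumes where it stopped).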

dark-astra commented 3 months ago

@thakurajayL Deleting the AMF pod makes all the subsequent UE registrations fail. gnbsim is not able to connect to the new AMF after the old AMF is deleted. Shouldn't gnbsim be able to process the subsequent UE registrations with the new AMF pod?

thakurajayL commented 3 months ago

Good question. There are multiple things involved here:

If you are running the default gnbsim config, it sends one request at a time, and if gnbsim does not get a response it gets stuck. The execInParallel setting exists at two levels. Set it to true at the top level to run all profiles in parallel; set it to true within a profile to run all the subscribers in that profile in parallel.
Regardless of whether execInParallel is true or false, if the AMF does not respond then signalling stops. There is a PR available for reconnecting to a new AMF, but it needs some corrections: either the code needs correcting or the old PR needs updating.
You can enable sctplb in the deployment. sctplb is stateless, so it can handle the crash: if the AMF crashes, the newly restarted AMF works with sctplb as is.
Of course, in some cases sctplb needs to resend the message. We have support for retrying the service request; similar support needs to be added for other messages as well. We would be happy if you want to add the code. Thanks.
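
The retry idea in the last point can be sketched generically. The following Python sketch is not the sctplb or AMF implementation; `send_with_retry`, its parameters, and the `FlakyPeer` test double are hypothetical names chosen for illustration. It shows a sender that resends a message a bounded number of times when no response arrives, which is the behaviour that would cover the window in which a replacement AMF is still starting.

```python
import time
from typing import Callable, Optional


def send_with_retry(
    send: Callable[[bytes], Optional[bytes]],
    message: bytes,
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> Optional[bytes]:
    """Resend `message` until `send` returns a response or attempts run out.

    `send` is expected to return None when no response arrived in time.
    """
    for attempt in range(1, max_attempts + 1):
        response = send(message)
        if response is not None:
            return response
        if attempt < max_attempts and backoff_seconds > 0:
            time.sleep(backoff_seconds)  # give a restarted peer time to come up
    return None


class FlakyPeer:
    """Test double: drops the first `failures` messages, then answers."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.received = 0

    def handle(self, message: bytes) -> Optional[bytes]:
        self.received += 1
        if self.received <= self.failures:
            return None  # simulates the window where the AMF is restarting
        return b"ack:" + message
```

For example, `send_with_retry(FlakyPeer(2).handle, b"req")` succeeds on the third attempt, while a peer that drops five messages exhausts the default three attempts and the call returns `None`.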

dark-astra commented 3 months ago

@thakurajayL, Thanks for the detailed response.

Enabling the SCTP load balancer using the configuration has resolved the issue where the new AMF was not handling subsequent requests. However, I'm still encountering a problem: when the old AMF fails, a few UEs (around 3-4) experience timeouts or failures before the new AMF starts registering them again.

Even though I'm running the UE registrations sequentially, I'm puzzled as to why there are multiple timeouts or failures before the new AMF takes over.

Is there native support in gnbsim itself for retrying the service request?
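
One possible explanation for the multiple timeouts, offered as an assumption rather than a confirmed diagnosis: in a sequential run, each UE attempted while no AMF is reachable may wait for the full perUserTimeout (10 seconds in the profile above) before gnbsim moves on to the next UE. Under that assumption, the number of failed UEs grows with the time the replacement AMF needs to become ready. The helper name and the example durations in this Python sketch are hypothetical.

```python
import math


def expected_timeouts(amf_downtime_s: float, per_user_timeout_s: float) -> int:
    """Estimate UEs that time out during an AMF outage in a sequential run.

    Assumes each UE started during the outage blocks for the full
    per-user timeout and that the next UE starts immediately afterwards.
    """
    if per_user_timeout_s <= 0:
        raise ValueError("per_user_timeout_s must be positive")
    if amf_downtime_s <= 0:
        return 0
    return math.ceil(amf_downtime_s / per_user_timeout_s)
```

If this model holds, `expected_timeouts(25, 10)` gives 3 and `expected_timeouts(35, 10)` gives 4, so 3-4 timeouts would correspond to the AMF being unavailable for roughly 20 to 40 seconds.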

omec-project / amf

Is this AMF completely fault-tolerant ? #283