feedback requested: PFCP synchronization?

Hi everyone,

We have an architecture where the UPS (sgwu and upf) runs on a separate machine from the CPS (hss, pcrf, mme, sgwc, smf). Occasionally the UPS (or components on it) will crash and restart, and when this happens it creates all sorts of problems caused by the state mismatch between the CPS side (sgwc and smf) and the UPS side (sgwu and upf). This generally stems from the fact that when the UPS side restarts, it loses all PFCP state. Already-attached UEs cannot send traffic (since no session exists on the UPS), UE disconnects fail, and so on. I often end up having to restart the entire CPS to get everything back in sync.

I have dug through the relevant 3GPP docs for PFCP and GTPv2 and can't seem to find a clean, 3GPP-specified way to handle this situation. I think I have an okay idea for a solution, but I want to hear ideas and collect as much input as possible before I start working on a fix.

0) My main idea for a fix is to treat the CPS side (sgwc and smf) as authoritative. It handles all GTP traffic normally and stays in sync with the mme, etc. Whenever there is a discrepancy, we assume that the CPS side is accurate. All fixes/modifications that follow are to help the UPS learn the CPS's state and get in sync with it.

1) From the CPS side, treat a PFCP de-association as if the UPS has crashed. "Save" all active sessions for that UPS element in a list somewhere. If we get any GTPv2 messages relating to these sessions (e.g. modify or delete), we handle them directly: update the saved sessions list accordingly and send a success response, but don't forward these messages to the UPS, since we have no active association. New associations fail (as standard) due to lack of PFCP context.

2) From the UPS side (sgwu and upf), when you PFCP de-associate, destroy all active sessions that are associated with that CPS. This way, no matter what, when a UPS entity successfully creates a new PFCP association (after an extended network outage or after a crash, it doesn't matter), the CPS knows that the UPS holds zero active sessions from it.

3) When a UPS re-associates with the CPS, the CPS sends session-creation messages (PFCP Session Establishment Requests) for all the "saved" sessions mentioned in (1). Based on (2), we know that at association time the UPS has no active sessions, so this should get us in sync.

4) Relatedly, I would like to add some CPS/UPS code to handle other edge cases by deferring to the CPS whenever possible and judging requests by the resulting state. For example, if the CPS sends a Delete Session Request and the UPS has no such session, the UPS should not error out but respond with success, since the outcome (the session does not exist) is exactly what was requested.
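
To make steps 0 to 4 concrete, here is a minimal Python sketch of the state handling. It is not Open5GS code: the class and method names (`ControlPlane`, `UserPlane`, `install`, `on_associate`, ...) are hypothetical, sessions are plain dicts keyed by an assumed integer session ID, and PFCP/GTPv2 messages are modelled as direct method calls. Rejecting new sessions while de-associated is my reading of the last sentence of (1).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UserPlane:
    """Hypothetical stand-in for sgwu/upf."""
    sessions: dict = field(default_factory=dict)  # session id -> rules

    def on_deassociate(self):
        # Step 2: losing the association destroys every session from that peer.
        self.sessions.clear()

    def install(self, seid, rules):
        # Create the session, or overwrite it if it already exists.
        self.sessions[seid] = dict(rules)

    def delete(self, seid):
        # Step 4: deleting an unknown session still succeeds, because the
        # end state (no such session) is what the control plane asked for.
        self.sessions.pop(seid, None)
        return "accepted"


@dataclass
class ControlPlane:
    """Hypothetical stand-in for sgwc/smf: the authoritative side (step 0)."""
    sessions: dict = field(default_factory=dict)  # session id -> rules
    peer: Optional[UserPlane] = None              # None while de-associated

    def on_deassociate(self):
        # Step 1: keep the session table, only the association is gone.
        self.peer = None

    def create(self, seid, rules):
        # Assumption: new sessions are rejected without an association.
        if self.peer is None:
            raise RuntimeError("no PFCP association")
        self.sessions[seid] = dict(rules)
        self.peer.install(seid, rules)

    def modify(self, seid, rules):
        # Always update the authoritative copy; forward only if associated.
        self.sessions[seid].update(rules)
        if self.peer is not None:
            self.peer.install(seid, self.sessions[seid])
        return "accepted"

    def delete(self, seid):
        self.sessions.pop(seid, None)
        if self.peer is not None:
            self.peer.delete(seid)
        return "accepted"

    def on_associate(self, peer):
        # Step 3: the peer starts empty (step 2), so replay every saved session.
        self.peer = peer
        for seid, rules in self.sessions.items():
            peer.install(seid, rules)


upf = UserPlane()
smf = ControlPlane()
smf.on_associate(upf)
smf.create(1, {"qos": "default"})
smf.create(2, {"qos": "default"})

# Simulated crash: both sides drop the association.
upf.on_deassociate()
smf.on_deassociate()
smf.modify(1, {"qos": "premium"})  # handled locally, not forwarded
smf.delete(2)                      # handled locally, not forwarded

# The restarted user plane is brought back in sync by replay.
restarted = UserPlane()
smf.on_associate(restarted)
```

In this toy model the restarted user plane ends up holding only session 1 with the modified rules, without the control plane ever being restarted. A real implementation would also have to deal with replay failures and with requests arriving while the replay is still in progress, which the sketch ignores.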

I believe this will take some work, but not too much. Please let me know what you think, or if there's another way to handle this that I haven't considered or found online - I'm super interested in finding the best way to fix this problem.
