Open hmcts-platform-operations opened 6 months ago
The scope of this may change, but for now there is a lot of pushback against implementing the changes, so this has been withdrawn.
Richard R: "had an email about 10 mins ago asking us to pause. Pushback is pretty strong, so I think it’s 90%+ probable that we’ll need to cancel it. Happy for you guys to pull the next thing on the roadmap instead"
DTSPO-17470
Summary
Currently we have a pair of Palo Alto VM-Series firewalls running behind a load balancer to give us an "active-active" setup. Unfortunately this only works with stateless networking flows. For stateful flows, i.e. anything involving NATs, this does not work because the NAT XLATES table (essentially a translation table) is not shared/synced between the two firewalls. This means that if your return/subsequent traffic ends up going to the Palo that didn't originally issue the NAT, it will be dropped. The source and destination then retry constantly, so traffic takes >30 seconds in the best case and times out otherwise.
The current workaround is to create specific routes that send traffic via a single Palo (rather than the load balancer). This obviously introduces a single point of failure and will cause confusion in the future during BAU activities like maintenance and failover of the Palos.
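To make the failure mode concrete, here is a minimal toy model in Python. It is only a sketch: the `Firewall` class, the `round_trip` helper, the firewall names and the IP addresses are all hypothetical and do not reflect the real Palo configuration or any Palo Alto API. It assumes each firewall keeps its own translation table and that return traffic may land on either firewall.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Endpoint = Tuple[str, int]  # (ip, port)


@dataclass
class Firewall:
    """Toy firewall with a private (unsynced) NAT translation table."""
    name: str
    nat_ip: str  # hypothetical address used for translated traffic
    xlates: Dict[Endpoint, Endpoint] = field(default_factory=dict)
    _next_port: int = 40000

    def translate_outbound(self, source: Endpoint) -> Endpoint:
        # Allocate a translated endpoint and remember it locally only.
        translated = (self.nat_ip, self._next_port)
        self._next_port += 1
        self.xlates[translated] = source
        return translated

    def translate_return(self, translated: Endpoint) -> Optional[Endpoint]:
        # Return traffic is only recognised if THIS firewall issued the NAT.
        return self.xlates.get(translated)  # None models a dropped packet


def round_trip(outbound_fw: Firewall, return_fw: Firewall,
               source: Endpoint) -> Optional[Endpoint]:
    """Send a flow out via one firewall and bring the reply back via another."""
    translated = outbound_fw.translate_outbound(source)
    return return_fw.translate_return(translated)


# Hypothetical pair of firewalls sharing one NAT address behind a load balancer.
palo_a = Firewall("palo-a", "203.0.113.10")
palo_b = Firewall("palo-b", "203.0.113.10")
client: Endpoint = ("10.1.0.10", 51515)

# Load-balanced case: the reply lands on the firewall that did not issue the NAT.
asymmetric = round_trip(palo_a, palo_b, client)

# Workaround case: a specific route pins both directions to a single firewall.
pinned = round_trip(palo_a, palo_a, client)

print("asymmetric path:", "dropped" if asymmetric is None else asymmetric)
print("pinned path:", "dropped" if pinned is None else pinned)
```

In this model the asymmetric path is dropped because `palo_b` has no entry for the translation that `palo_a` created, while the pinned path succeeds at the cost of depending on a single firewall.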
The intention is to (see additional information for other options that were considered):
Remove the ingress/egress NATs for SDS AKS, delete and recreate the routes for SDS AKS to point to the trust interface of the Palos, recreate the NATs on MyCloudGateway, and do a final tidy-up once complete.
Intended Outcome
Investigation has taken place and a decision has been made: modifying the current Hub Firewall setup to support this scenario would result in significantly increased cost and reduced flexibility, and the MyCloudGateway is best positioned to implement these NATs in a stateful fashion.
Rough implementation plan in order:
Heritage NSGs oracle-azure-infrastructure
Impact on Teams
Production work to be done OOH
Investigation should be done in non-prod to determine what the downtime would be. If we can set up the NATs in CloudGateway prior to removing them from the Palos, we may be able to do this with zero downtime.
Known impacted teams/services: