Closed by conradludgate 5 hours ago
Our NLB setup round-robins between proxies.
https://www.linkedin.com/pulse/hash-flow-algorithm-aws-network-load-balancer-nlb-in-depth-mishra/
Cool, although it looks like it's not public yet, and I doubt it would work out of the box for postgres with TLS SNI.
It doesn't go into that level of detail, but it's likely better than plain round-robin from a caching perspective.
The next rather big thing would be DNS. There are surprisingly strict limits on the number of records across providers, and a fixed record order isn't even possible with all of them, so that's another thing to implement and manage one way or another. Then I suppose it comes down to the potential savings compared to the cost of running such a thing.
I have wild ideas.
Write our own DNS load balancing system.
PageServer, for a given tenant, has a set AZ (can change over time with rebalancing). Compute has a preference for the same AZ. Proxy has no preference.
Current flow:
ep-foo-bar.region.aws.neon.tech
-> returns 3 IP addresses in a random order.

Worst case:
Customer in us-east-2a connects to NLB in us-east-2b.
NLB connects to proxy in us-east-2c (no proxy currently running in us-east-2b).
Proxy connects to compute in us-east-2a.
Suggested improvement:
The DNS server is aware that ep-foo-bar should be mapped to us-east-2a by preference, so it puts that IP first in the order.

Write our own Load Balancer.
Our NLB setup round-robins between proxies. If we have 100 instances of proxy and 100 million endpoints in a region, we might end up caching all 100 million endpoints across all 100 proxy instances. It would be much better if we could use consistent hashing to pack that cache far more efficiently.
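To illustrate the idea (this is a rough sketch, not what we'd actually ship; `ConsistentHashRing` and the proxy/endpoint names are made up), consistent hashing would pin each endpoint to one proxy, so its cached state lives in one place instead of everywhere, and adding or removing a proxy only reshuffles a small fraction of endpoints:

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 64-bit hash, independent of Python's per-process hash seed.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class ConsistentHashRing:
    """Map endpoint IDs to proxy instances via a hash ring."""

    def __init__(self, proxies, vnodes=100):
        # Virtual nodes smooth out the key distribution across proxies.
        self.ring = sorted(
            (_hash(f"{p}#{i}"), p) for p in proxies for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    def proxy_for(self, endpoint: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect_right(self.keys, _hash(endpoint)) % len(self.keys)
        return self.ring[idx][1]
```

With 10 proxies and a few thousand endpoints, adding an 11th proxy moves only roughly 1/11th of the endpoints to a new proxy; round-robin gives no such stability.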
Additionally, we have issues with long-lived connections. Currently the pipeline is as listed above, eg
Because of this, we keep proxy alive for a week after rollout.
It might be better if we have
This means we can deploy proxy (the authentication system) without interrupting our long lived connections.
This load balancer would need to be postgres- and TLS-aware for both cases: it needs to read the SNI, which for now is unencrypted. If it wants to talk to compute directly, then it needs the TLS keys.
This load balancer should have much stronger network stack integration. For instance, it should not handle TCP keepalives itself, but should forward them through to the compute directly.
We would need to keep the LB dumb so we don't need to deploy it so often.