nginxinc / nginx-gateway-fabric

NGINX Gateway Fabric provides an implementation for the Gateway API using NGINX as the data plane.

Ability to set upstream zone size and keepalive settings #483

Open kate-osborn opened 1 year ago

kate-osborn commented 1 year ago

As a user of NGF I want NGF to update the upstream zone size for NGINX So that if I run into errors due to exceeding my zone size, I can fix them.

As a user of NGF I want NGF to enable keepalive connections on my route So that I can optimize the performance of my application.
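Not a spec, just a minimal sketch of the kind of upstream block these two stories refer to; the upstream name, zone size, and keepalive count below are illustrative, not values NGF would necessarily emit:

```nginx
upstream default_backend_80 {
    # Shared-memory zone that holds this upstream's servers; if it is too
    # small for the number of servers, adding servers fails with errors.
    zone default_backend_80 512k;

    server 10.0.0.10:8080;
    server 10.0.0.11:8080;

    # Number of idle keepalive connections to upstream servers cached per worker.
    keepalive 32;
}
```

Note that for upstream keepalive connections to actually be used, the proxied location also needs `proxy_http_version 1.1;` and the `Connection` header cleared; that part is omitted here.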

Acceptance

Dev Notes:

### Tasks
- [ ] https://github.com/nginxinc/nginx-gateway-fabric/issues/2809
- [ ] https://github.com/nginxinc/nginx-gateway-fabric/issues/2810
- [ ] https://github.com/nginxinc/nginx-gateway-fabric/issues/2811
- [ ] https://github.com/nginxinc/nginx-gateway-fabric/issues/2812
mpstefan commented 1 year ago

Would the user expect the system to do this? What is the impact on the user's system if the zone size is not dynamically updated?

brianehlert commented 1 year ago

Dynamic calculation of the upstream zone accomplishes one thing: as an upstream service grows, it ensures that NGINX can handle that growth. The positive impact is that the customer can scale upstream services at will to 1000s of pods and the system will dynamically adapt. The negative impact is the growth in memory utilization, since the zone sizes can grow to high limits.

I have an entire write-up around this for NIC to do the same. This would be a later and advanced capability that has the impact of optimizing the system.

If not dynamically calculated, it needs to be exposed in a ConfigMap, for example, or however system tuning is exposed. Whether auto-magic is a requirement for v1 should be discussed.

mpstefan commented 1 year ago

Today we discussed how this would be valuable for NGINX Plus, as it does not require a reload when upstreams are added or removed.

brianehlert commented 1 year ago

NIC has run into a number of customer situations where customers set their limits so lean that even a backend service scaling event can cause OOM or CPU throttling as a result of the configuration change alone, without even the conscious memory consumption increases that a feature like this would introduce.

While I think this capability is highly valuable, as I have learned more about how customers are leaning into Quality of Service and other K8s platform requirements that force the setting of limits, I am hesitant to introduce something like this due to the feared impact on the system as a whole, and a situation where the Gateway is unable to start because the configuration alone forces the pod into an OOM state.

mpstefan commented 1 year ago

blocked by #929

pleshakov commented 1 year ago

@brianehlert

> NIC has run into a number of customer situations where customers set their limits so lean that even a backend service scaling event can cause OOM or CPU throttling as a result of the configuration change alone, without even the conscious memory consumption increases that a feature like this would introduce.

> While I think this capability is highly valuable, as I have learned more about how customers are leaning into Quality of Service and other K8s platform requirements that force the setting of limits, I am hesitant to introduce something like this due to the feared impact on the system as a whole, and a situation where the Gateway is unable to start because the configuration alone forces the pod into an OOM state.

The amount of configuration does affect the memory consumption of NGINX: the more config you have (including TLS secrets), the more memory it will consume.

Also note that our architecture includes running the control plane alongside the data plane, where the control plane keeps a cache of cluster resources in memory. This means that the number of those resources (including HTTPRoutes, Secrets, Endpoints...) also directly affects the memory consumption of the NGF pod, without even considering the data plane.

However, traffic affects memory much more, as each connection requires memory.

Additionally, configuration changes (reloading NGINX) temporarily increase memory consumption, since during a reload the old and new worker processes coexist.

Supporting dynamic calculation of zone sizes will reduce NGINX's overall memory usage, because each upstream will use a zone size tuned to its number of upstream servers, rather than some large value chosen to accommodate any number of servers in most cases.

Considering all that, I think dynamic calculation of zone sizes will be beneficial and will not lead to OOMs; other things will lead to OOMs first.

kate-osborn commented 4 weeks ago

When possible, configuration updates with NGINX Plus should be made using the NGINX Plus API so NGINX is not reloaded.

- zone size
- keepalive connections

It doesn't look like it is possible to set zone size or keepalive connections using the N+ API. The API doesn't support updating directives for an upstream group; you can only add/modify/delete servers in an upstream: https://demo.nginx.com/swagger-ui/
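For context, a hedged sketch of where these settings live relative to the NGINX Plus API (names and sizes are illustrative): the `zone` directive is what allows per-server changes through the API, while the zone size and `keepalive` themselves remain file-level directives that only a reload can change.

```nginx
http {
    upstream backend {
        # zone is required for the NGINX Plus API to add/modify/delete servers
        # in this group at runtime (.../http/upstreams/backend/servers).
        zone backend 512k;   # directive-level: changing the size requires a reload
        keepalive 16;        # directive-level: changing the count requires a reload
        server 10.0.0.10:8080;
    }

    server {
        listen 8080;
        location /api {
            # read-write NGINX Plus API endpoint
            api write=on;
        }
    }
}
```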

kate-osborn commented 4 weeks ago

Also note:

> When using load balancing methods other than the default round-robin method, it is necessary to activate them before the keepalive directive.
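For illustration, a minimal sketch of that ordering constraint (upstream name and values are arbitrary):

```nginx
upstream backend {
    zone backend 512k;

    # A non-default load balancing method must be activated
    # before the keepalive directive.
    least_conn;

    server 10.0.0.10:8080;
    server 10.0.0.11:8080;

    keepalive 16;
}
```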