Dynamic Configuration does not balance to all backends #3290

rlees85 closed 5 years ago

rlees85 commented 5 years ago

Similar https://github.com/kubernetes/ingress-nginx/issues/2797 but I am NOT using an external service.

Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG

NGINX Ingress controller version: 0.18.0

I understand this is not the latest version, but 0.19.0 and 0.20.0 are broken in other ways (missing Prometheus metrics). Looking at the changelogs for those versions, I don't see any fixes related to missing backends.

Kubernetes version (use kubectl version): 1.10.9


What happened:

Not all backends receive traffic when Dynamic Configuration is enabled. I have tried both round_robin and ewma.

What you expected to happen:

All backends to receive traffic.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know:

With verbose logging enabled I can see that all of my backends are being 'seen', just not used. As Dynamic Configuration is implemented entirely in Lua, troubleshooting past this point is pretty much impossible.

I1025 09:05:41.105222       6 endpoints.go:120] Endpoints found for Service "cloud-lte1/hybris-storefront": 
[{ 8081 0 0 &ObjectReference{Kind:Pod,Namespace:cloud-lte1,Name:hybris-storefront-74cf495467-gcbpp,UID:7f4f8857-d814-11e8-9af9-06463cbe1d92,APIVersion:,ResourceVersion:2903526,FieldPath:,}} 
 { 8081 0 0 &ObjectReference{Kind:Pod,Namespace:cloud-lte1,Name:hybris-storefront-74cf495467-gtl4z,UID:547b2e00-d812-11e8-9af9-06463cbe1d92,APIVersion:,ResourceVersion:2898420,FieldPath:,}}
 { 8081 0 0 &ObjectReference{Kind:Pod,Namespace:cloud-lte1,Name:hybris-storefront-74cf495467-8cbph,UID:b2416464-d80f-11e8-9af9-06463cbe1d92,APIVersion:,ResourceVersion:2892367,FieldPath:,}}
 { 8081 0 0 &ObjectReference{Kind:Pod,Namespace:cloud-lte1,Name:hybris-storefront-74cf495467-br9tt,UID:b24446e6-d80f-11e8-9af9-06463cbe1d92,APIVersion:,ResourceVersion:2892473,FieldPath:,}}]

After reverting to --enable-dynamic-configuration=false with least_conn balancing, everything works fine.

The graphs show CPU usage on the backends. With Dynamic Configuration, only two ever get load. Without it, many more get load and the HPA starts scaling up as expected.

Slightly concerned by the fact Dynamic Configuration will be mandatory in the next release....

rlees85 commented 5 years ago

I have since tried this with version 0.20.0 and the balancing behaviour still seems very strange. Sometimes only one backend never gets traffic, sometimes 2 or 3. There seems to be no pattern to it.

ElvinEfendi commented 5 years ago

@rlees85 given you get the above uneven load balancing, can you provide your Nginx configuration, output of kubectl get pods -owide for your app and output of kubectl exec <an ingress nginx pod> -n <namespace where ingress-nginx is deployed> -- curl -s localhost:18080/configuration/backends | jq .[]

Also are you seeing any Nginx error/warning in the logs when this happens?


I can not reproduce this, as you can see all 1000 requests are distributed almost evenly across all 10 available replicas.

> ingress-nginx (master)$ ruby count.rb
my-echo-579c44c48f-b5ffz => 99
my-echo-579c44c48f-dvgtk => 102
my-echo-579c44c48f-pzfx6 => 99
my-echo-579c44c48f-r4w2w => 99
my-echo-579c44c48f-xxc9h => 101
my-echo-579c44c48f-rvh48 => 101
my-echo-579c44c48f-v8zh6 => 101
my-echo-579c44c48f-kjxt5 => 99
my-echo-579c44c48f-slhhd => 99
my-echo-579c44c48f-sqpzg => 100
> ingress-nginx (master)$
> ingress-nginx (master)$ k get pods
NAME                       READY     STATUS    RESTARTS   AGE
my-echo-579c44c48f-b5ffz   1/1       Running   0          50m
my-echo-579c44c48f-dvgtk   1/1       Running   0          31m
my-echo-579c44c48f-kjxt5   1/1       Running   0          31m
my-echo-579c44c48f-pzfx6   1/1       Running   0          50m
my-echo-579c44c48f-r4w2w   1/1       Running   0          31m
my-echo-579c44c48f-rvh48   1/1       Running   0          31m
my-echo-579c44c48f-slhhd   1/1       Running   0          31m
my-echo-579c44c48f-sqpzg   1/1       Running   0          50m
my-echo-579c44c48f-v8zh6   1/1       Running   0          31m
my-echo-579c44c48f-xxc9h   1/1       Running   0          31m

This is using 0.18.0 and Round Robin. I could not reproduce this with latest master either.
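The count.rb script is not shown in this thread. A minimal Python sketch of the same kind of check is below, assuming a hypothetical echo service whose response body is the name of the pod that answered. The function names and the URL handling are illustrative and not taken from the original script.

```python
from collections import Counter
from typing import Callable, Iterable, List
from urllib.request import urlopen


def fetch_pod_name(url: str) -> str:
    """Request the URL and return the body, assumed to be the pod name."""
    with urlopen(url) as response:
        return response.read().decode("utf-8").strip()


def tally_backends(fetch: Callable[[], str], requests: int) -> Counter:
    """Call fetch() `requests` times and count the responses per pod."""
    return Counter(fetch() for _ in range(requests))


def unused_backends(counts: Counter, expected_pods: Iterable[str]) -> List[str]:
    """Return the expected pods that never served a request."""
    return sorted(pod for pod in expected_pods if counts[pod] == 0)
```

To use it against a real ingress, pass something like `lambda: fetch_pod_name(url)` as `fetch`, then compare the result with the pod names from `kubectl get pods`. Any pod listed by `unused_backends` is one that the balancer never selected.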

rlees85 commented 5 years ago

Thanks for the response! The extra debug step to show the backends is going to be really useful. I'm away at the moment but going to get all the requested information on Monday.

With a bit of luck I'll have just done something stupid. If that is the case, I will give details and close.

rlees85 commented 5 years ago

I've re-setup this environment and am still having problems.

curl -s localhost:18080/configuration/backends from nginx:

  "name": "cloud-dt1-hybris-storefront-8081",
  "service": {
    "metadata": {
      "creationTimestamp": null
    "spec": {
      "ports": [
          "name": "hybris-http",
          "protocol": "TCP",
          "port": 8081,
          "targetPort": 8081
      "selector": {
        "app.kubernetes.io/instance": "storefront",
        "app.kubernetes.io/name": "hybris",
        "app.kubernetes.io/part-of": "hybris"
      "clusterIP": "",
      "type": "ClusterIP",
      "sessionAffinity": "None"
    "status": {
      "loadBalancer": {}
  "port": 8081,
  "secure": false,
  "secureCACert": {
    "secret": "",
    "caFilename": "",
    "pemSha": ""
  "sslPassthrough": false,
  "endpoints": [
      "address": "",
      "port": "8081",
      "maxFails": 0,
      "failTimeout": 0
      "address": "",
      "port": "8081",
      "maxFails": 0,
      "failTimeout": 0
      "address": "",
      "port": "8081",
      "maxFails": 0,
      "failTimeout": 0
      "address": "",
      "port": "8081",
      "maxFails": 0,
      "failTimeout": 0
  "sessionAffinityConfig": {
    "name": "cookie",
    "cookieSessionAffinity": {
      "name": "route",
      "hash": "sha1",
      "locations": {
        "_": [
  "name": "upstream-default-backend",
  "service": {
    "metadata": {
      "creationTimestamp": null
    "spec": {
      "ports": [
          "protocol": "TCP",
          "port": 80,
          "targetPort": 8080
      "selector": {
        "app.kubernetes.io/instance": "storefront",
        "app.kubernetes.io/name": "default-http-backend",
        "app.kubernetes.io/part-of": "nginx"
      "clusterIP": "",
      "type": "ClusterIP",
      "sessionAffinity": "None"
    "status": {
      "loadBalancer": {}
  "port": 0,
  "secure": false,
  "secureCACert": {
    "secret": "",
    "caFilename": "",
    "pemSha": ""
  "sslPassthrough": false,
  "endpoints": [
      "address": "",
      "port": "8080",
      "maxFails": 0,
      "failTimeout": 0
  "sessionAffinityConfig": {
    "name": "",
    "cookieSessionAffinity": {
      "name": "",
      "hash": ""

Get Pods (restricted to one namespace, as I am running nginx in namespace-restricted mode)

NAME                                               READY     STATUS    RESTARTS   AGE       IP             NODE
default-http-backend-backoffice-847c84b95f-jq9hn   1/1       Running   0          1h   ip-10-81-124-154.eu-west-1.compute.internal
default-http-backend-staging-cc964d9bf-mxvl6       1/1       Running   0          1h    ip-10-81-124-154.eu-west-1.compute.internal
default-http-backend-storefront-98fc778d4-hlf6z    1/1       Running   0          1h    ip-10-81-125-145.eu-west-1.compute.internal
hybris-backoffice-566fd6fc76-4csjs                 1/1       Running   0          1h    ip-10-81-123-112.eu-west-1.compute.internal
hybris-storefront-7f9c64c9f8-9ks8j                 1/1       Running   0          1h    ip-10-81-124-178.eu-west-1.compute.internal
hybris-storefront-7f9c64c9f8-c8wxb                 1/1       Running   0          1h    ip-10-81-124-178.eu-west-1.compute.internal
hybris-storefront-7f9c64c9f8-q6rfc                 1/1       Running   0          1h    ip-10-81-123-102.eu-west-1.compute.internal
hybris-storefront-7f9c64c9f8-zhrfq                 1/1       Running   0          1h    ip-10-81-123-112.eu-west-1.compute.internal
nginx-backoffice-7467f64f7d-tb7kl                  1/1       Running   0          1h   ip-10-81-124-154.eu-west-1.compute.internal
nginx-staging-59b7bc79d6-mtnq6                     2/2       Running   0          1h   ip-10-81-124-154.eu-west-1.compute.internal
nginx-storefront-668f646b69-wkqf5                  2/2       Running   0          1h   ip-10-81-124-154.eu-west-1.compute.internal
qas-5968567cc8-q5rfb                               1/1       Running   0          1h    ip-10-81-125-145.eu-west-1.compute.internal
solrcloud-0                                        1/1       Running   1          1h    ip-10-81-124-241.eu-west-1.compute.internal
zookeeper-0                                        1/1       Running   0          1h    ip-10-81-123-244.eu-west-1.compute.internal

Ignore Hybris Backoffice, that is handled by a separate ingress.

nginx.conf (some bits omitted with the word omitted)

Additional Information

Before and After from a few cURLs (only using the first two pods, the second two only got hit by the Kubernetes healthcheck twice)

ElvinEfendi commented 5 years ago
"sessionAffinityConfig": {
    "name": "cookie",
    "cookieSessionAffinity": {
      "name": "route",
      "hash": "sha1",
      "locations": {
        "_": [

You seem to have session affinity enabled for your app (there's a known load balancing issue with the current implementation). Is that intentional? When session affinity is enabled, the load-balance annotation is ignored.
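For anyone checking their own setup, the following is a minimal sketch that scans the output of the /configuration/backends endpoint for backends with session affinity enabled. It assumes the endpoint returns a JSON array of backend objects with the "name" and "sessionAffinityConfig" fields seen in the output above. The function name is hypothetical.

```python
import json
from typing import Any, Dict, List


def backends_with_affinity(backends: List[Dict[str, Any]]) -> List[str]:
    """Return names of backends whose sessionAffinityConfig.name is non-empty."""
    flagged = []
    for backend in backends:
        # A missing or null config is treated the same as "no affinity".
        affinity = backend.get("sessionAffinityConfig") or {}
        if affinity.get("name"):
            flagged.append(backend.get("name", "<unnamed>"))
    return flagged


def load_backends(path: str) -> List[Dict[str, Any]]:
    """Load a saved copy of the endpoint output from a file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Saving the curl output to a file and passing the result of `load_backends` to `backends_with_affinity` lists every backend for which the load-balance annotation would be ignored.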

ElvinEfendi commented 5 years ago

anyone having this issue please try

RedVortex commented 5 years ago

We have the same issue and are using the same configs for stickiness. We're testing the above dev fix and we'll let you know how it goes.

RedVortex commented 5 years ago


Before and after the dev fix for sticky sessions, for 2 pods. Before 16h40 we were using 0.20 and only one of the pods was receiving traffic (besides health checks) and thus using CPU. After 16h40, we are running the dev version and traffic is well balanced, and so is CPU.

We'll continue to run this dev version for now and see how it behaves for the next few days. Any idea when this will make it to a final/production release?

Thanks !

aledbf commented 5 years ago

Any idea when this will make it to a final/production release?

The code is already merged in master (the dev image is built from master). The next release is scheduled in approximately two weeks.

RedVortex commented 5 years ago

That is good news.

To give more details on the queries distribution before and after (queries per second on 2 different pods part of the same service)


You can see that the second pod was only getting the health check queries, not the client queries before the dev fix. Afterwards, the queries were well distributed.

What I cannot confirm so far (looking into this as we speak) is whether stickiness is still respected by the new code. I'm unsure if there is an automated test in the build that checks that stickiness works properly, so I'm checking this manually to be safe.

RedVortex commented 5 years ago

Cookie-based stickiness still works well.

All incoming connections with a cookie presented are directed to the same pod, as it should be. Incoming connections without a cookie are load-balanced between the 2 pods, as it should be.
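A manual check like this can be written down as a small sketch. It assumes you have recorded, for each test request, the affinity cookie value sent (or None when no cookie was sent) and the pod that answered. The names here are hypothetical and this is not the project's own test.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple

# (cookie value or None for a cookieless request, pod that answered)
Observation = Tuple[Optional[str], str]


def pods_by_cookie(observations: Iterable[Observation]) -> Dict[Optional[str], Set[str]]:
    """Group the pods that answered by the cookie value that was sent."""
    seen: Dict[Optional[str], Set[str]] = defaultdict(set)
    for cookie, pod in observations:
        seen[cookie].add(pod)
    return dict(seen)


def sticky_violations(observations: Iterable[Observation]) -> List[str]:
    """Return cookie values that were served by more than one pod."""
    grouped = pods_by_cookie(observations)
    return sorted(
        cookie
        for cookie, pods in grouped.items()
        if cookie is not None and len(pods) > 1
    )
```

An empty result from `sticky_violations` means every cookie stayed on one pod, and the set under the `None` key in `pods_by_cookie` shows which pods the cookieless requests were spread across.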

I'll follow up again after a few days to confirm everything still works well, but so far so good!

rlees85 commented 5 years ago

Great news. Yes, my stickiness is intentional... Hybris seems to work better if sessions are not passed around nodes any more than they need to be. I didn't know there was a known issue with stickiness, but I am happy to hear it is known and likely fixed.