openshift / router

Ingress controller for OpenShift
Apache License 2.0

OCPBUGS-29690: Count active services before setting weight to 1 #576

Closed - gcs278 closed this 6 months ago

gcs278 commented 6 months ago

The previous logic, which reduced the weight to 1 when there was only one service, didn't filter out inactive services. If there's only one active service, there's no need for a weight greater than 1 because traffic is directed only to active services.

Additionally, the template logic correctly accounted for active services when setting the algorithm, resulting in "random" being set in situations where our logic didn't reduce the active service's weight to 1.

While this isn't inherently problematic, using the "random" algorithm with higher weights significantly increases memory usage on HAProxy startup. This can lead to excessive memory usage in scenarios where a user has many inactive services or route backends using weight 0.
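
For illustration, here is a minimal Go sketch of the idea described above. The `ServiceUnit` type and field names are hypothetical stand-ins, not the router's actual code: the point is to count only the service units that are active (nonzero weight and at least one endpoint) before deciding to collapse the remaining weight to 1.

```go
package main

import "fmt"

// ServiceUnit is a hypothetical stand-in for the router's service unit type.
type ServiceUnit struct {
	Name          string
	Weight        int32
	EndpointCount int
}

// scaleWeights reduces the weight to 1 only when exactly one *active* service
// unit remains; units with weight 0 or no endpoints are ignored.
func scaleWeights(units []ServiceUnit) []ServiceUnit {
	active, activeIdx := 0, -1
	for i, u := range units {
		if u.Weight > 0 && u.EndpointCount > 0 {
			active++
			activeIdx = i
		}
	}
	if active == 1 {
		units[activeIdx].Weight = 1
	}
	return units
}

func main() {
	units := []ServiceUnit{
		{Name: "service-a", Weight: 100, EndpointCount: 2},
		{Name: "service-b", Weight: 0, EndpointCount: 1},  // inactive: weight 0
		{Name: "service-c", Weight: 50, EndpointCount: 0}, // inactive: no endpoints
	}
	// Only service-a is active, so its weight is collapsed to 1.
	fmt.Println(scaleWeights(units))
}
```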

openshift-ci-robot commented 6 months ago

@gcs278: This pull request references Jira Issue OCPBUGS-29690, which is invalid:

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/router/pull/576):

> The previous logic, which reduced the weight to 1 when there was only one service, didn't filter out inactive services. If there's only one active service, there's no need for a weight greater than 1 because traffic is directed only to active services.
>
> Additionally, the template logic correctly accounted for active services when setting the algorithm, resulting in "random" being set in situations where our logic didn't reduce the active service's weight to 1.
>
> While this isn't inherently problematic, using the "random" algorithm with higher weights significantly increases memory usage on HaProxy startup. This can lead to excessive memory usage in scenarios where a user has many inactive services or routes backends using weight 0.

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Frouter). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
gcs278 commented 6 months ago

/assign @frobware

candita commented 6 months ago

/assign

Miciah commented 6 months ago

If you have a route that specifies spec.port and spec.alternateBackends with one service, and the alternate backend service has a port that doesn't match spec.port.to, then do we erroneously use roundrobin instead of random? Do we also need to modify createServiceAliasConfig to filter out service units without matching ports when calculating activeServiceUnits?

gcs278 commented 6 months ago

> If you have a route that specifies spec.port and spec.alternateBackends with one service, and the alternate backend service has a port that doesn't match spec.port.to, then do we erroneously use roundrobin instead of random? Do we also need to modify createServiceAliasConfig to filter out service units without matching ports when calculating activeServiceUnits?

@Miciah Yes, I just tried it out, and you are indeed correct. Conversely, if you have a route where the spec.to service has a port that doesn't match spec.port.to but the alternate backend has a valid port, it erroneously sets roundrobin as well. It's the same issue with createServiceAliasConfig. I can file a bug with a reproducer.
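
As a rough sketch of the follow-up filtering Miciah suggests, again with hypothetical types and a made-up hasMatchingPort helper rather than the real createServiceAliasConfig code: a service unit would only count as active if it has nonzero weight and at least one endpoint on the route's target port.

```go
package main

import "fmt"

// Endpoint and ServiceUnit are hypothetical stand-ins, not the router's types.
type Endpoint struct {
	IP   string
	Port string
}

type ServiceUnit struct {
	Name      string
	Weight    int32
	Endpoints []Endpoint
}

// hasMatchingPort reports whether any endpoint serves the route's target port.
// An empty target port matches anything.
func hasMatchingPort(u ServiceUnit, targetPort string) bool {
	for _, ep := range u.Endpoints {
		if targetPort == "" || ep.Port == targetPort {
			return true
		}
	}
	return false
}

// activeServiceUnits keeps only units with a nonzero weight and at least one
// endpoint on the target port.
func activeServiceUnits(units []ServiceUnit, targetPort string) []ServiceUnit {
	var active []ServiceUnit
	for _, u := range units {
		if u.Weight > 0 && hasMatchingPort(u, targetPort) {
			active = append(active, u)
		}
	}
	return active
}

func main() {
	units := []ServiceUnit{
		{Name: "spec.to service", Weight: 100, Endpoints: []Endpoint{{IP: "10.0.0.1", Port: "8080"}}},
		{Name: "alternate backend", Weight: 100, Endpoints: []Endpoint{{IP: "10.0.0.2", Port: "9090"}}},
	}
	// With spec.port.to = "8080", only the first unit counts as active, so the
	// backend should again collapse to a single weight-1 server.
	fmt.Println(activeServiceUnits(units, "8080"))
}
```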

candita commented 6 months ago

/retest-required

candita commented 6 months ago

/lgtm

gcs278 commented 6 months ago

/jira refresh

openshift-ci-robot commented 6 months ago

@gcs278: This pull request references Jira Issue OCPBUGS-29690, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug:

* bug is open, matching expected state (open)
* bug target version (4.16.0) matches configured target version for branch (4.16.0)
* bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @ShudiLi

In response to [this](https://github.com/openshift/router/pull/576#issuecomment-2067192496):

> /jira refresh

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Frouter). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
frobware commented 6 months ago

/approve
/hold

I want to get some confirmation from the customer who is running the proposed 4.12 change.

frobware commented 6 months ago

/retest

frobware commented 6 months ago

Do we need to resolve https://github.com/openshift/router/pull/576#issuecomment-2065372101 in this PR too?

openshift-ci[bot] commented 6 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: frobware

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/openshift/router/blob/master/OWNERS)~~ [frobware]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment
gcs278 commented 6 months ago

> Do we need to resolve #576 (comment) in this PR too?

I don't think so; it seems like a separate bug, as I don't think the two have any dependency on each other. In some situations where a service's ports don't match the route's spec.port.to, you'll get roundrobin in the HAProxy config file with only 1 server line, when we actually want random. With #576, we scale the weight of the single "active" endpoint down to 1. If you only have 1 server line, a weight > 1 doesn't matter, regardless of the balance algorithm (except that random uses a lot more memory...).

In other words, I don't see any harm in changing a single-server backend from weight 256 to weight 1 while using roundrobin; it's a no-op.
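
A toy Go illustration (not HAProxy's implementation) of why the weight is a no-op for a single-server backend: with one server line, every request lands on that server regardless of whether its weight is 1 or 256.

```go
package main

import (
	"fmt"
	"math/rand"
)

// server is a toy model of one HAProxy server line.
type server struct {
	name   string
	weight int
}

// pickWeighted selects a server with probability proportional to its weight,
// loosely mimicking a weighted balance pick.
func pickWeighted(servers []server, r *rand.Rand) server {
	total := 0
	for _, s := range servers {
		total += s.weight
	}
	n := r.Intn(total)
	for _, s := range servers {
		if n < s.weight {
			return s
		}
		n -= s.weight
	}
	return servers[len(servers)-1]
}

func main() {
	r := rand.New(rand.NewSource(1))
	for _, w := range []int{1, 256} {
		backend := []server{{name: "pod:web-server", weight: w}}
		counts := map[string]int{}
		for i := 0; i < 1000; i++ {
			counts[pickWeighted(backend, r).name]++
		}
		// With a single server, every request goes to it whether weight is 1 or 256.
		fmt.Printf("weight=%d: %v\n", w, counts)
	}
}
```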

ShudiLi commented 6 months ago

The weights were changed to 1:

1.
% oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.ci.test-2024-04-23-081430-ci-ln-fgiq082-latest   True        False         28m     Cluster version is 4.16.0-0.ci.test-2024-04-23-081430-ci-ln-fgiq082-latest

2.
% oc get route
NAME               HOST/PORT                                                                             PATH   SERVICES           PORT    TERMINATION   WILDCARD
reencrypt1         reencrypt1-default.apps.ci-ln-fgiq082-72292.origin-ci-int-gce.dev.rhcloud.com                service-secure     https   reencrypt     None
service-unsecure   service-unsecure-default.apps.ci-ln-fgiq082-72292.origin-ci-int-gce.dev.rhcloud.com          service-unsecure   http                  None

3.
% oc -n openshift-ingress rsh router-default-7dd68dd75f-jjtlb 
sh-5.1$ grep -Rn weight haproxy.config 
127:  server fe_sni unix@/var/lib/haproxy/run/haproxy-sni.sock weight 1 send-proxy
169:  server fe_no_sni unix@/var/lib/haproxy/run/haproxy-no-sni.sock weight 1 send-proxy
227:  server pod:web-server-rc-nnbf8:service-secure:https:10.131.0.14:8443 10.131.0.14:8443 cookie 3be72d2f6b0917638bbddc5a4d310a6a weight 1 ssl verifyhost service-secure.default.svc verify required ca-file /var/run/configmaps/service-ca/service-ca.crt
245:  server pod:web-server-rc-nnbf8:service-unsecure:http:10.131.0.14:8080 10.131.0.14:8080 cookie bf59e111822d7c46847171b9fe117bd4 weight 1
253:  server pod:oauth-openshift-6d759cf-ngtlk:oauth-openshift:https:10.128.0.60:6443 10.128.0.60:6443 weight 1 check inter 5000ms
254:  server pod:oauth-openshift-6d759cf-tx6fw:oauth-openshift:https:10.129.0.62:6443 10.129.0.62:6443 weight 1 check inter 5000ms
255:  server pod:oauth-openshift-6d759cf-tsxhm:oauth-openshift:https:10.130.0.76:6443 10.130.0.76:6443 weight 1 check inter 5000ms
274:  server pod:console-6967998f67-p56d7:console:https:10.128.0.64:8443 10.128.0.64:8443 cookie 8679ace67e2fa40ee6c30f71b8d6788f weight 1 ssl verifyhost console.openshift-console.svc verify required ca-file /var/run/configmaps/service-ca/service-ca.crt check inter 5000ms
275:  server pod:console-6967998f67-mr7c9:console:https:10.129.0.68:8443 10.129.0.68:8443 cookie 370fef5aeed56e1697df5ffc07859b86 weight 1 ssl verifyhost console.openshift-console.svc verify required ca-file /var/run/configmaps/service-ca/service-ca.crt check inter 5000ms
293:  server pod:downloads-cdb6f8949-6mgjk:downloads:http:10.128.0.42:8080 10.128.0.42:8080 cookie 9b7c18f36c60c4656fb144de59d5b2a3 weight 1 check inter 5000ms
294:  server pod:downloads-cdb6f8949-lhgck:downloads:http:10.130.0.62:8080 10.130.0.62:8080 cookie ac2f8ba61d894025df89059782bc1968 weight 1 check inter 5000ms
312:  server pod:ingress-canary-spdv5:ingress-canary:8080-tcp:10.128.2.11:8080 10.128.2.11:8080 cookie a275d1f255efa0180a7ea1f40e79c48c weight 1 check inter 5000ms
313:  server pod:ingress-canary-8wg9n:ingress-canary:8080-tcp:10.129.2.7:8080 10.129.2.7:8080 cookie 1dd0c54b43aa597e9c4a84dc786204ab weight 1 check inter 5000ms
314:  server pod:ingress-canary-qjbdz:ingress-canary:8080-tcp:10.131.0.8:8080 10.131.0.8:8080 cookie 401cc917bfb3e9a7b212f1406777f231 weight 1 check inter 5000ms
332:  server pod:alertmanager-main-0:alertmanager-main:web:10.128.2.22:9095 10.128.2.22:9095 cookie 359fec7c5c15177e0aec87fe172a0b9c weight 1 ssl verifyhost alertmanager-main.openshift-monitoring.svc verify required ca-file /var/run/configmaps/service-ca/service-ca.crt check inter 5000ms
333:  server pod:alertmanager-main-1:alertmanager-main:web:10.131.0.13:9095 10.131.0.13:9095 cookie 5dcb2e178c26b447649da3439ed7484b weight 1 ssl verifyhost alertmanager-main.openshift-monitoring.svc verify required ca-file /var/run/configmaps/service-ca/service-ca.crt check inter 5000ms
351:  server pod:prometheus-k8s-1:prometheus-k8s:web:10.128.2.21:9091 10.128.2.21:9091 cookie 9bcaae492092f3606e7ae8848a9fe3d7 weight 1 ssl verifyhost prometheus-k8s.openshift-monitoring.svc verify required ca-file /var/run/configmaps/service-ca/service-ca.crt check inter 5000ms
352:  server pod:prometheus-k8s-0:prometheus-k8s:web:10.129.2.12:9091 10.129.2.12:9091 cookie 9c5b8f6d0b9d1ff464d16c9a6f2f0a83 weight 1 ssl verifyhost prometheus-k8s.openshift-monitoring.svc verify required ca-file /var/run/configmaps/service-ca/service-ca.crt check inter 5000ms
370:  server pod:prometheus-k8s-1:prometheus-k8s:web:10.128.2.21:9091 10.128.2.21:9091 cookie 9bcaae492092f3606e7ae8848a9fe3d7 weight 1 ssl verifyhost prometheus-k8s.openshift-monitoring.svc verify required ca-file /var/run/configmaps/service-ca/service-ca.crt check inter 5000ms
371:  server pod:prometheus-k8s-0:prometheus-k8s:web:10.129.2.12:9091 10.129.2.12:9091 cookie 9c5b8f6d0b9d1ff464d16c9a6f2f0a83 weight 1 ssl verifyhost prometheus-k8s.openshift-monitoring.svc verify required ca-file /var/run/configmaps/service-ca/service-ca.crt check inter 5000ms
389:  server pod:thanos-querier-5c89b7859-b9dkj:thanos-querier:web:10.128.2.17:9091 10.128.2.17:9091 cookie 95776b0f70ae90779ad25efd8cc21ad5 weight 1 ssl verifyhost thanos-querier.openshift-monitoring.svc verify required ca-file /var/run/configmaps/service-ca/service-ca.crt check inter 5000ms
390:  server pod:thanos-querier-5c89b7859-dwsp5:thanos-querier:web:10.131.0.11:9091 10.131.0.11:9091 cookie 76f5757f419b4a722d6065673971b937 weight 1 ssl verifyhost thanos-querier.openshift-monitoring.svc verify required ca-file /var/run/configmaps/service-ca/service-ca.crt check inter 5000ms
ShudiLi commented 6 months ago

/label qe-approved

thanks

openshift-ci-robot commented 6 months ago

@gcs278: This pull request references Jira Issue OCPBUGS-29690, which is valid.

3 validation(s) were run on this bug:

* bug is open, matching expected state (open)
* bug target version (4.16.0) matches configured target version for branch (4.16.0)
* bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @ShudiLi

In response to [this](https://github.com/openshift/router/pull/576):

> The previous logic, which reduced the weight to 1 when there was only one service, didn't filter out inactive services. If there's only one active service, there's no need for a weight greater than 1 because traffic is directed only to active services.
>
> Additionally, the template logic correctly accounted for active services when setting the algorithm, resulting in "random" being set in situations where our logic didn't reduce the active service's weight to 1.
>
> While this isn't inherently problematic, using the "random" algorithm with higher weights significantly increases memory usage on HaProxy startup. This can lead to excessive memory usage in scenarios where a user has many inactive services or routes backends using weight 0.

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Frouter). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
ShudiLi commented 6 months ago

The weight is 256, which is larger than the configured weight:

backend be_http:e2e-test-router-env-n6zjv:service-unsecure01
  mode http
  option redispatch
  option forwardfor
  balance roundrobin

  timeout check 5000ms
  http-request add-header X-Forwarded-Host %[req.hdr(host)]
  http-request add-header X-Forwarded-Port %[dst_port]
  http-request add-header X-Forwarded-Proto http if !{ ssl_fc }
  http-request add-header X-Forwarded-Proto https if { ssl_fc }
  http-request add-header X-Forwarded-Proto-Version h2 if { ssl_fc_alpn -i h2 }
  http-request add-header Forwarded for=%[src];host=%[req.hdr(host)];proto=%[req.hdr(X-Forwarded-Proto)]
  cookie e393e394aa38b3591557660f1e226508 insert indirect nocache httponly
  server pod:web-server-rc01-dll2p:service-unsecure01:http:10.131.0.27:8080 10.131.0.27:8080 cookie 0ea43df433f7c28f54614f5e49bbee26 weight 256 check inter 5000ms
  server pod:web-server-rc02-pc96q:service-unsecure02:http:10.131.0.28:8080 10.131.0.28:8080 cookie 1edbf08e519de047bb6e49c7901b78aa weight 256 check inter 5000ms
  server pod:web-server-rc03-xxjcg:service-unsecure03:http:10.131.0.29:8080 10.131.0.29:8080 cookie 2b78972471dc2752949b8fa0be506378 weight 256 check inter 5000ms
frobware commented 6 months ago

> The weight is 256 which is larger than the configured
>
> backend be_http:e2e-test-router-env-n6zjv:service-unsecure01
>   mode http
>   option redispatch
>   option forwardfor
>   balance roundrobin
>
>   timeout check 5000ms
>   http-request add-header X-Forwarded-Host %[req.hdr(host)]
>   http-request add-header X-Forwarded-Port %[dst_port]
>   http-request add-header X-Forwarded-Proto http if !{ ssl_fc }
>   http-request add-header X-Forwarded-Proto https if { ssl_fc }
>   http-request add-header X-Forwarded-Proto-Version h2 if { ssl_fc_alpn -i h2 }
>   http-request add-header Forwarded for=%[src];host=%[req.hdr(host)];proto=%[req.hdr(X-Forwarded-Proto)]
>   cookie e393e394aa38b3591557660f1e226508 insert indirect nocache httponly
>   server pod:web-server-rc01-dll2p:service-unsecure01:http:10.131.0.27:8080 10.131.0.27:8080 cookie 0ea43df433f7c28f54614f5e49bbee26 weight 256 check inter 5000ms
>   server pod:web-server-rc02-pc96q:service-unsecure02:http:10.131.0.28:8080 10.131.0.28:8080 cookie 1edbf08e519de047bb6e49c7901b78aa weight 256 check inter 5000ms
>   server pod:web-server-rc03-xxjcg:service-unsecure03:http:10.131.0.29:8080 10.131.0.29:8080 cookie 2b78972471dc2752949b8fa0be506378 weight 256 check inter 5000ms

For roundrobin this is OK - no huge memory growth comes from the combination of roundrobin and weight 256.

frobware commented 6 months ago

> /approve
> /hold
>
> I want to get some confirmation from the customer who is running the proposed 4.12 change.

We had confirmation that the 4.12 PR works for the customer. Removing the /hold.

/hold cancel

openshift-ci[bot] commented 6 months ago

@gcs278: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
openshift-ci-robot commented 6 months ago

@gcs278: Jira Issue OCPBUGS-29690: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull requests must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-29690 has not been moved to the MODIFIED state.

In response to [this](https://github.com/openshift/router/pull/576):

> The previous logic, which reduced the weight to 1 when there was only one service, didn't filter out inactive services. If there's only one active service, there's no need for a weight greater than 1 because traffic is directed only to active services.
>
> Additionally, the template logic correctly accounted for active services when setting the algorithm, resulting in "random" being set in situations where our logic didn't reduce the active service's weight to 1.
>
> While this isn't inherently problematic, using the "random" algorithm with higher weights significantly increases memory usage on HaProxy startup. This can lead to excessive memory usage in scenarios where a user has many inactive services or routes backends using weight 0.

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Frouter). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
frobware commented 6 months ago

/cherry-pick release-4.15

openshift-cherrypick-robot commented 6 months ago

@frobware: new pull request created: #586

In response to [this](https://github.com/openshift/router/pull/576#issuecomment-2077241374):

> /cherry-pick release-4.15

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
openshift-bot commented 6 months ago

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-haproxy-router-base-container-v4.17.0-202404251439.p0.g8762aa4.assembly.stream.el9 for distgit ose-haproxy-router-base. All builds following this will include this PR.

Miciah commented 6 months ago

/jira refresh

openshift-ci-robot commented 6 months ago

@Miciah: Jira Issue OCPBUGS-29690: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-29690 has been moved to the MODIFIED state.

In response to [this](https://github.com/openshift/router/pull/576#issuecomment-2090545176):

> /jira refresh

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Frouter). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.