OCPBUGS-38490: Increase connection limit for cluster loadbalancer

mkowalski commented 2 months ago

It has been reported that for certain deployments in a hub-spoke topology (a single metal cluster with 3500+ managed clusters attached), the current limit of 20k connections to the loadbalancer is too low during upgrades, and clusters report connection timeouts.

As the current limit of 20k is an arbitrarily selected number, and tests show that increasing it to 40k helps in the scenario described above, we should increase the current default limit.

This PR does not change the overall recommendation to use an external enterprise-grade loadbalancer for such a resource-consuming workload.
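For a rough sense of scale, the two limits can be divided by the number of managed clusters. This is only a back-of-the-envelope sketch: it assumes connections are spread evenly across spokes and ignores every other client of the loadbalancer, neither of which is stated in the report. The helper name `connections_per_cluster` is made up for illustration.

```python
def connections_per_cluster(maxconn: int, managed_clusters: int) -> float:
    """Average concurrent connections available per managed cluster.

    Assumption: an even split across clusters, which real traffic will not follow.
    """
    if managed_clusters <= 0:
        raise ValueError("managed_clusters must be positive")
    return maxconn / managed_clusters


SPOKES = 3500  # lower bound taken from the "3500+" figure in the report

for limit in (20_000, 40_000):
    budget = connections_per_cluster(limit, SPOKES)
    print(f"maxconn={limit}: {budget:.1f} connections per spoke")
# prints:
# maxconn=20000: 5.7 connections per spoke
# maxconn=40000: 11.4 connections per spoke
```

Under that assumption the old limit leaves fewer than six concurrent connections per spoke, so a burst of reconnects during an upgrade could plausibly exhaust it.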

mkowalski commented 2 months ago

/cc @cybertron
/cc @akrzos

cybertron commented 2 months ago

/retest-required
/test e2e-openstack

This can't possibly have broken most of those jobs. Let's try again.

cybertron commented 2 months ago

Oh, and /lgtm

openshift-ci[bot] commented 2 months ago

@mkowalski: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-vsphere-ovn-zones | 9092db2001470cd13b2a5d429930182d436dce2f | link | false | `/test e2e-vsphere-ovn-zones` |
| ci/prow/e2e-azure-ovn-upgrade-out-of-change | 9092db2001470cd13b2a5d429930182d436dce2f | link | false | `/test e2e-azure-ovn-upgrade-out-of-change` |
| ci/prow/e2e-vsphere-ovn-upi | 9092db2001470cd13b2a5d429930182d436dce2f | link | false | `/test e2e-vsphere-ovn-upi` |


mkowalski commented 2 months ago

/retitle OCPBUGS-38490: Increase connection limit for cluster loadbalancer

openshift-ci-robot commented 2 months ago

@mkowalski: This pull request references Jira Issue OCPBUGS-38490, which is invalid:

expected the bug to target the "4.18.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


mkowalski commented 2 months ago

/jira refresh

openshift-ci-robot commented 2 months ago

@mkowalski: This pull request references Jira Issue OCPBUGS-38490, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

* bug is open, matching expected state (open)
* bug target version (4.18.0) matches configured target version for branch (4.18.0)
* bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @sergiordlr


ptalgulk01 commented 2 months ago

Pre-merge verified: Built the image using clusterbot and deployed the cluster using the template private-templates/functionality-testing/aos-4_16/ipi-on-baremetal/versioned-installer-packet_libvirt-bootstrap_static-ci

$ oc get clusterversions
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.0-0.ci.test-2024-08-29-102444-ci-ln-bzpchy2-latest   True        False         109m    Cluster version is 4.17.0-0.ci.test-2024-08-29-102444-ci-ln-bzpchy2-latest

$ oc -n openshift-kni-infra rsh haproxy-master-0
Defaulted container "haproxy" out of: haproxy, haproxy-monitor, verify-api-int-resolvable (init)
sh-5.1$  cat /etc/haproxy/haproxy.cfg
global
  stats socket /var/lib/haproxy/run/haproxy.sock  mode 600 level admin expose-fd listeners
defaults
  maxconn 40000
  mode    tcp

We can see that maxconn is 40k here.

Adding label qe-approved

/label qe-approved
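The manual check above could also be scripted. The following is a minimal sketch, not part of the PR: it reads HAProxy configuration text and returns the `maxconn` value found in the `defaults` section. The function name `defaults_maxconn`, the shortened list of section keywords, and the naive `#` comment stripping are all assumptions made for illustration; the sample text is the excerpt from the output above.

```python
from typing import Optional

# Assumption: a simplified subset of HAProxy section keywords, enough for this file.
SECTION_KEYWORDS = {"global", "defaults", "frontend", "backend", "listen"}


def defaults_maxconn(cfg_text: str) -> Optional[int]:
    """Return the maxconn value set in the 'defaults' section, or None if absent."""
    section = None
    for raw in cfg_text.splitlines():
        # Naive comment stripping; does not handle '#' inside quoted strings.
        tokens = raw.split("#", 1)[0].split()
        if not tokens:
            continue
        if tokens[0] in SECTION_KEYWORDS:
            section = tokens[0]
            continue
        if section == "defaults" and tokens[0] == "maxconn" and len(tokens) >= 2:
            return int(tokens[1])
    return None


SAMPLE = """\
global
  stats socket /var/lib/haproxy/run/haproxy.sock  mode 600 level admin expose-fd listeners
defaults
  maxconn 40000
  mode    tcp
"""

print(defaults_maxconn(SAMPLE))  # prints 40000
```

In practice the text would come from the `cat /etc/haproxy/haproxy.cfg` output captured from the haproxy container, and the result would be compared against the expected 40000.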

openshift-ci-robot commented 2 months ago

@mkowalski: This pull request references Jira Issue OCPBUGS-38490, which is valid.

3 validation(s) were run on this bug

* bug is open, matching expected state (open)
* bug target version (4.18.0) matches configured target version for branch (4.18.0)
* bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @sergiordlr


openshift-ci[bot] commented 1 week ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cybertron, mkowalski, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/openshift/machine-config-operator/blob/master/OWNERS)~~ [yuqi-zhang]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment

openshift-ci-robot commented 1 week ago

/retest-required

Remaining retests: 0 against base HEAD b6d267419cac5c8a580836d7d3555b8006faeb40 and 2 for PR HEAD 9092db2001470cd13b2a5d429930182d436dce2f in total

openshift-ci-robot commented 1 week ago

@mkowalski: Jira Issue OCPBUGS-38490: All pull requests linked via external trackers have merged:

openshift/machine-config-operator#4531

Jira Issue OCPBUGS-38490 has been moved to the MODIFIED state.


openshift-bot commented 1 week ago

[ART PR BUILD NOTIFIER]

Distgit: ose-machine-config-operator

This PR has been included in build ose-machine-config-operator-container-v4.18.0-202410230310.p0.g54144b3.assembly.stream.el9. All builds following this will include this PR.
