Scale and Performance testing for 1.27 k8s clusters

vivek-shilimkar commented 1 year ago

We need to do performance testing regarding large clusters to see how it performs..

Specific usecases:

Local and downstream 1.27 RKE1 and RKE2 clusters - individual cluster performance.
Large number of downstream RKE1/RKE2 Rancher provisioned clusters.

git-ival commented 1 year ago

Blocked by https://github.com/rancher/rancher/issues/43389, rancher-monitoring is relied upon for monitoring metrics

git-ival commented 1 year ago

RKE1 Perf Checks Summary of Results

Rancher Cluster Config:

Rancher version: 2.7.8 Rancher cluster type: RKE1 HA (3 all-roles nodes, 1 worker node tainted to run rancher-monitoring) Cluster Size: Medium Docker version: 20.10 K8s version: v1.26.8-rancher1-1 Log level: debug certmanager version: 1.11.1

Downstream Cluster Config:

Downstream cluster types: (3 workers, 1 etcd, 1 control plane)

RKE1 Node Driver
RKE1 Custom Cluster
RKE2 Node Driver
RKE2 Custom Cluster
K3s Node Driver
K3s Custom Cluster Installed apps: rancher-monitoring Deployments: helm-foldingathome rke1 version = v1.26.8-rancher1-1 (default) rke2 version = v1.26.8+rke2r1 (default) k3s version = v1.26.8+k3s1

Method:

Setup HA Local (k8s version) cluster with Rancher 2.7.8
1. Install rancher-monitoring chart from Rancher catalog
Create HA Downstream clusters:
- RKE1 Node Driver
- RKE1 Custom
- RKE2 Node Driver
- RKE2 Custom
- K3s Node Driver
- K3s Custom
Load each downstream cluster with a set of "bulk components"
1. 100 secrets
2. 10 projects
3. 12 namespaces
4. 300 users (users are created in the Local cluster)
5. 1 project Role Template
6. 300 project Role Template bindings
Deploy rancher-monitoring and helm-foldingathome to each downstream cluster
1. Install rancher-monitoring chart from Rancher catalog
2. Add chart repo for helm-foldingathome
3. Install helm-foldingathome
4. Finetune the # of replicas for each individual cluster so that the total CPU usage hovers around ~70%
Let Idle for ~30 minutes or more
Run parallel soak test and object count script on a schedule, every X minutes
- The typical schedule is every hour, assuming the minimum soak period of 30 minutes
Allow the environment to continue with the scheduled object counts and soak tests.
- Ideally the environment will have 1-3 days of activity before continuing with the next steps
Collect Grafana and Prometheus metrics for each cluster
Collect soak test and object count artifacts
Complete rancher upgrade to latest rc (2.8.0-rc1)
1. Upgrade Rancher
2. Upgrade K8s on Local to 1.27.6
3. Upgrade rancher-monitoring on Local
4. Upgrade K8s on Downstreams to 1.27.6
5. Upgrade rancher-monitoring on Downstreams
Repeat steps 6-9
Verify that resource usage and controller metrics are meeting expectations based on the known changes to the product
- Ideally this should improve overall, but this has not always been the case
  1. The pre- and post- upgrade metrics are either equivalent, or have improved post-upgrade. Showing an improvement or maintenance of expected performance
    - If the release adds known overhead this criterion can be ignored
  2. There are no new CPU or Memory leaks
  3. There are no noticeable slowdowns or other issues with UI performance
    Results:
- Ran into this RKE cli issue when running rke up in order to upgrade the local K8s version
- After retrying 3 times, the local cluster upgraded successfully
- Seemingly ran into one (or both) of the following issues:
- https://www.suse.com/support/kb/doc/?id=000020910
- https://github.com/rancher/rancher/issues/43096 (see my comment on this issue)
- Had a # of clusters showing cluster agent is not connected error
- managed to fix all but the rke1-custom downstream cluster
- Observed lots of instances of "client-side throttling" during soak tests
- See this blog post for information about why this log line appears
- During this performance check the soak test script was fixed so that it correctly runs iterations on each node for a configured maximum duration of time (30 minutes)
- This max duration is not a hard cutoff, but it will stop any further iterations as soon as the current iteration completes
- The diff for all object counts was automatically generated for this run
- CPU and Memory utilization remained roughly the same pre- and post-upgrade
- Overall spikes in resource usage align with periods where downstreams were being loaded up with "objects" and when soak tests/object counts were being collected
- Additionally, smaller and more frequent spikes can be seen which align with the foldingathome workloads and the slightly wider periods of activity can be attributed to the soak test periods
- Disk Utilization is noticeably higher than observed during pre-upgrade phase by about 10%
- Disk I/O largest spike is roughly 2x smaller than observed during pre-upgrade phase
- Network Traffic had slightly smaller spikes than observed during pre-upgrade phase
- Network I/O slightly smaller and the largest spike is about 1.5x smaller than observed during pre-upgrade phase
- Load Average spikes had slightly less amplitude on average than observed during pre-upgrade phase

See RKE1 perf check screenshots and other artifacts here: Local RKE1.zip

git-ival commented 1 year ago

RKE2 Perf Checks Summary of Results

Rancher Cluster Config:

Rancher version: 2.7.8 Rancher cluster type: RKE2 HA (3 all-roles nodes, 1 worker node tainted to run rancher-monitoring) Cluster Size: Medium Docker version: 20.10 K8s version: v1.26.8+rke2r1 Log level: debug certmanager version: 1.11.1

Downstream Cluster Config:

Downstream cluster types: (3 workers, 1 etcd, 1 control plane)

RKE1 Node Driver
RKE1 Custom Cluster
RKE2 Node Driver
RKE2 Custom Cluster
K3s Node Driver
K3s Custom Cluster Installed apps: rancher-monitoring Deployments: helm-foldingathome rke1 version = v1.26.8-rancher1-1 (default) rke2 version = v1.26.8+rke2r1 (default) k3s version = v1.26.8+k3s1

Method:

Setup HA Local (k8s version) cluster with Rancher 2.7.8
1. Install rancher-monitoring chart from Rancher catalog
Create HA Downstream clusters:
- RKE1 Node Driver
- RKE1 Custom
- RKE2 Node Driver
- RKE2 Custom
- K3s Node Driver
- K3s Custom
Load each downstream cluster with a set of "bulk components"
1. 100 secrets
2. 10 projects
3. 12 namespaces
4. 300 users (users are created in the Local cluster)
5. 1 project Role Template
6. 300 project Role Template bindings
Deploy rancher-monitoring and helm-foldingathome to each downstream cluster
1. Install rancher-monitoring chart from Rancher catalog
2. Add chart repo for helm-foldingathome
3. Install helm-foldingathome
4. Finetune the # of replicas for each individual cluster so that the total CPU usage hovers around ~70%
Let Idle for ~30 minutes or more
Run parallel soak test and object count script on a schedule, every X minutes
- The typical schedule is every hour, assuming the minimum soak period of 30 minutes
Allow the environment to continue with the scheduled object counts and soak tests.
- Ideally the environment will have 1-3 days of activity before continuing with the next steps
Collect Grafana and Prometheus metrics for each cluster
Collect soak test and object count artifacts
Complete rancher upgrade to latest rc (2.8.0-rc1)
1. Upgrade Rancher
2. Upgrade K8s on Local
3. Upgrade rancher-monitoring on Local
4. Upgrade K8s on Downstreams
5. Upgrade rancher-monitoring on Downstreams
Repeat steps 6-9
Verify that resource usage and controller metrics are meeting expectations based on the known changes to the product
- Ideally this should improve overall, but this has not always been the case
  1. The pre- and post- upgrade metrics are either equivalent, or have improved post-upgrade. Showing an improvement or maintenance of expected performance
    - If the release adds known overhead this criterion can be ignored
  2. There are no new CPU or Memory leaks
  3. There are no noticeable slowdowns or other issues with UI performance
    Results:
- Seemingly ran into the following issue: https://github.com/rancher/rancher/issues/38033
- Attempted to resolve by upgrading the cluster via the system-upgrade-controller method (cluster status remains as Updating)
- The cluster was still accessible via the Rancher UI, however the etcd node was not successfully upgraded
- Observed lots of instances of "client-side throttling" during soak tests
- See this blog post for information about why this log line appears
- During this performance check the soak test script was fixed so that it correctly runs iterations on each node for a configured maximum duration of time (30 minutes)
- This max duration is not a hard cutoff, but it will stop any further iterations as soon as the current iteration completes
- The diff for all object counts was automatically generated for this run
- CPU utilization was very slightly lower post-upgrade, with less spiking
- Memory utilization has about a 20% increase post-upgrade
- Overall spikes in resource usage align with periods where downstreams were being loaded up with "objects" and when soak tests/object counts were being collected
- Additionally, smaller and more frequent spikes can be seen which align with the foldingathome workloads and the slightly wider periods of activity can be attributed to the soak test periods
- Disk Utilization is roughly 2x higher (for 2 of the nodes) than observed during pre-upgrade phase
- Disk I/O largest spike is over 2x smaller than observed during pre-upgrade phase
- Network Traffic had slightly larger spikes than observed during pre-upgrade phase
- Network I/O average traffic is about the same with the largest spike being about 2x smaller than observed during pre-upgrade phase
- Load Average spikes had slightly less amplitude on average than observed during pre-upgrade phase
- Overall 409s (primarily around the clusters and etcdsnapshots resources) were more frequent post-upgrade

See RKE2 perf check screenshots and other artifacts here: Local RKE2.zip

git-ival commented 1 year ago

Reopening for completion of scalability testing

git-ival commented 1 year ago

RKE2 scale check has been completed, summary writeup still in-progress.

RKE1 scale check is blocked by https://github.com/rancher/terraform-provider-rancher2/issues/1258, working on an alternative solution.

git-ival commented 1 year ago

RKE2 Scale Checks Summary of Results

Rancher Cluster Config:

Rancher version: v2.8.0-rc1 Rancher cluster type: RKE2 HA , 3 all-roles nodes, 1 worker node (tainted to run rancher-monitoring) Cluster Size: Medium Docker version: 20.10 K8s version: v1.27.6+rke2r1

Downstream Cluster Config:

Downstream cluster type: AWS Nodedriver RKE2, 1 all-roles node Installed apps: rancher-monitoring K8s version: v1.27.6+rke2r1

Results:

Reached # of clusters: 287
Experienced multiple instances of Rancher UI freezing/failing to load
Started hitting 503 errors around ~265 clusters
Was not able to collect the final set of remote dialer or K8s api server metrics due to 503 errors

See RKE2 scale check screenshots and other artifacts here: RKE2.zip

Rancher Cluster: local-rke2-rancher-metrics

K8s API Server: N/A

Rancher Perf: local-rke2-perf-metrics1

local-rke2-perf-metrics2

local-rke2-perf-metrics3

rancher / qa-tasks