Baseline scenario for performance of Rancher

git-ival commented 11 months ago

Large sub-tasks:

[ ] Automate a mechanism to pull the k8s distro's kubeconfig from the cluster/node (this is different from Rancher's generated kubeconfig)
- Input: Some combination of an IP address, a cluster ID or name, a string representing the underlying K8s distro, an SSH key or auth token
- Assume: The cluster is a provisioned cluster and is still connected to Rancher
- Output: a function that SSHes into a control-plane node of the provisioned cluster, runs a command that prints the kubeconfig, and returns it as a string, or an error if any step fails
- Can be expanded to support other types of clusters if the level of effort (LOE) is low enough
[ ] Utilize the Shepherd helm client to scale the cattle-cluster-agent deployment of a given downstream cluster up and down.
- Input: the desired number of replicas, a cluster object or cluster ID, the desired CATTLE_AGENT_IMAGE string
- Assume: namespace=cattle-system, ACE is enabled on all downstream clusters
- Output: a simple function that can accomplish this and return an error if any
- Set the CATTLE_AGENT_IMAGE env var on the Rancher deployment in order to ensure that it will use the desired image + repo
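
A minimal Go sketch of the two sub-tasks above, under stated assumptions. It shells out to the local `ssh` and `kubectl` binaries rather than using the Shepherd helm client, whose API is not shown in this issue. The function names (`FetchKubeconfig`, `ScaleClusterAgent`, `SetAgentImage`) are hypothetical, only the k3s and RKE2 kubeconfig paths are covered, and the Rancher deployment is assumed to be named `rancher` in `cattle-system`, as in a default Helm install.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"strconv"
)

// agentNamespace comes from the "Assume" line of the sub-task.
const agentNamespace = "cattle-system"

// kubeconfigPaths maps a distro name to the admin kubeconfig path on a
// control-plane node. Only k3s and RKE2 are listed; other distros are
// intentionally left unsupported in this sketch.
var kubeconfigPaths = map[string]string{
	"k3s":  "/etc/rancher/k3s/k3s.yaml",
	"rke2": "/etc/rancher/rke2/rke2.yaml",
}

// kubeconfigCommand returns the remote command that prints the kubeconfig.
func kubeconfigCommand(distro string) (string, error) {
	path, ok := kubeconfigPaths[distro]
	if !ok {
		return "", fmt.Errorf("unsupported distro %q", distro)
	}
	return "sudo cat " + path, nil
}

// FetchKubeconfig (hypothetical name) runs the command over SSH and returns
// the kubeconfig as a string, or an error if any step fails.
func FetchKubeconfig(host, user, keyPath, distro string) (string, error) {
	remoteCmd, err := kubeconfigCommand(distro)
	if err != nil {
		return "", err
	}
	var stdout, stderr bytes.Buffer
	cmd := exec.Command("ssh", "-i", keyPath, "-o", "BatchMode=yes", user+"@"+host, remoteCmd)
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return "", fmt.Errorf("ssh to %s failed: %w: %s", host, err, stderr.String())
	}
	return stdout.String(), nil
}

// scaleArgs builds the kubectl arguments that scale cattle-cluster-agent.
func scaleArgs(kubeconfig string, replicas int) []string {
	return []string{"--kubeconfig", kubeconfig, "-n", agentNamespace,
		"scale", "deployment/cattle-cluster-agent", "--replicas=" + strconv.Itoa(replicas)}
}

// setImageArgs builds the kubectl arguments that set CATTLE_AGENT_IMAGE on
// the Rancher deployment (deployment name "rancher" is an assumption).
func setImageArgs(kubeconfig, image string) []string {
	return []string{"--kubeconfig", kubeconfig, "-n", agentNamespace,
		"set", "env", "deployment/rancher", "CATTLE_AGENT_IMAGE=" + image}
}

// runKubectl executes kubectl and wraps its combined output into the error.
func runKubectl(args []string) error {
	out, err := exec.Command("kubectl", args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("kubectl %v: %w: %s", args, err, out)
	}
	return nil
}

// ScaleClusterAgent (hypothetical name) scales the agent on a downstream
// cluster, addressed through that cluster's own kubeconfig.
func ScaleClusterAgent(downstreamKubeconfig string, replicas int) error {
	if replicas < 0 {
		return fmt.Errorf("replicas must be >= 0, got %d", replicas)
	}
	return runKubectl(scaleArgs(downstreamKubeconfig, replicas))
}

// SetAgentImage (hypothetical name) sets the env var on the Rancher
// deployment of the local (upstream) cluster.
func SetAgentImage(rancherKubeconfig, image string) error {
	return runKubectl(setImageArgs(rancherKubeconfig, image))
}
```

The kubeconfig on a k3s or RKE2 node typically points at a loopback server address, so a caller outside the node would likely need to rewrite the `server` field before using the returned string.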

Scenarios:

Rancher managing no downstream Kubernetes cluster and a single simulated user using the UI
- Should measure around 0.5 CPU and 6GB Memory usage
Rancher managing 5 downstream Kubernetes clusters with 10 worker nodes each and 5 parallel simulated users using the UI
- Measure while making 700 requests per second
- Should measure around 4 CPU and 13GB Memory usage
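
The two scenarios could be encoded as data with a pass/fail check, sketched below in Go. The issue only says "around", so the relative tolerance is a hypothetical parameter, and reading "CPU" as cores is an assumption; the type and function names are illustrative only.

```go
package main

import "math"

// Baseline holds the expected resource usage for one scenario.
// CPU is interpreted as cores, which is an assumption.
type Baseline struct {
	Name     string
	CPUCores float64
	MemoryGB float64
}

// baselines mirrors the two scenarios listed in the issue.
var baselines = []Baseline{
	{Name: "no downstream clusters, 1 UI user", CPUCores: 0.5, MemoryGB: 6},
	{Name: "5 downstream clusters x 10 workers, 5 UI users, 700 req/s", CPUCores: 4, MemoryGB: 13},
}

// withinTolerance reports whether measured is within a relative
// tolerance (for example 0.2 for 20%) of expected.
func withinTolerance(measured, expected, tolerance float64) bool {
	return math.Abs(measured-expected) <= tolerance*expected
}

// Check passes only if both CPU and memory are within tolerance.
func (b Baseline) Check(cpuCores, memoryGB, tolerance float64) bool {
	return withinTolerance(cpuCores, b.CPUCores, tolerance) &&
		withinTolerance(memoryGB, b.MemoryGB, tolerance)
}
```

With a tolerance of 0.2, a run of Scenario 1 that measures 0.55 cores and 6.3 GB would pass, while a run of Scenario 2 that measures 6 cores would fail on CPU.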

git-ival commented 6 months ago

[ ] What framework(s) are expected to be used here?
- Is there any known pre-existing code that can help in the test effort?
- What functionality is needed in order to automate this testing as part of a regression suite?
[ ] Are there any particular cluster configurations that should be targeted for this testing?
- HA, single-node, # of CPUs, amount of RAM, K8s distro, K8s version, etc.
- Cloud provider (AWS, Azure, etc?)
[ ] What metrics must be tracked?
- What are the cutoff points for each metric that needs to be tracked? (When should we mark a given test as pass/fail based on a given metric's performance?)
- What tool(s) can be used to track the listed metrics? If Prometheus/Grafana: which queries or dashboards should be used for tracking?
[ ] What benchmark testing, if any, needs to be accounted for?
- # of clusters, # of nodes, # of nodes per cluster, etc.
- # of rolebindings, # of secrets, # of namespaces, etc.

git-ival commented 6 months ago

Realistically, we can use dartboard for Scenario 1, and more generally for collecting metrics and data from rancher-monitoring or from Prometheus directly. This will take some implementation time, depending on whether metrics beyond those currently supported are desired.

In general, we will likely need to rely on k6 to simulate load and user activity, as well as to collect metrics during that load. There will be some learning curve around k6, as Scenario 2 will rely on it heavily. Designing the simulated user workflows will be the key challenge and could become very complex. As a baseline, we can outline a "simple" workflow that focuses on lists and pagination across some number of downstream clusters per user.
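
The "simple" list/pagination workflow can be sketched as a loop that keeps requesting pages until the server stops returning a continuation token. This is written in Go for illustration and is not a k6 script; the `Page` shape, the `FetchPage` callback, and token-based continuation are assumptions, not a confirmed description of the Rancher API.

```go
package main

import "fmt"

// Page is a hypothetical shape for one page of a list response.
type Page struct {
	Items    []string
	Continue string // empty when there are no more pages
}

// FetchPage abstracts the HTTP call so the loop can be exercised
// without a live Rancher instance.
type FetchPage func(continueToken string) (Page, error)

// ListAll walks every page, returning the collected items and the number
// of requests issued. maxPages guards against an endless token loop.
func ListAll(fetch FetchPage, maxPages int) ([]string, int, error) {
	var items []string
	token := ""
	for requests := 1; requests <= maxPages; requests++ {
		page, err := fetch(token)
		if err != nil {
			return items, requests, fmt.Errorf("page %d: %w", requests, err)
		}
		items = append(items, page.Items...)
		if page.Continue == "" {
			return items, requests, nil
		}
		token = page.Continue
	}
	return items, maxPages, fmt.Errorf("gave up after %d pages", maxPages)
}
```

Returning the request count alongside the items makes it easier to relate one simulated user's workflow to a requests-per-second target.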

As a baseline, we can assume a Rancher configuration of 4 nodes (3 all-roles, 1 worker-only for rancher-monitoring), RKE1, AWS, Rancher v2.8-latest.

This effort is primarily focused on the raw number of requests per second, so other benchmark testing is not up for consideration here. We will target more specific types of requests and related metrics as part of future efforts.

git-ival commented 6 months ago

As part of our Baseline environments, we should force a number of clusters to be "disconnected". This will take some implementation work, but should be feasible.

rancher / qa-tasks

Baseline scenario for performance of Rancher #1057