kedacore / keda

KEDA is a Kubernetes-based Event Driven Autoscaling component. It provides event driven scale for any container running in Kubernetes
https://keda.sh
Apache License 2.0

Provide automated load testing of KEDA #4411

Open tomkerkhove opened 1 year ago

tomkerkhove commented 1 year ago

A frequent ask during our CNCF Graduation discussions is whether we have load testing / performance benchmarking of KEDA, which we do not have today.

There are a few reasons for that:

However, the latter is purely informative and not something which we manage so I have opened https://github.com/kedacore/keda/discussions/4410 so that end-users can chime in and share if they are comfortable.

I do believe this is a reasonable ask but it is not something that we can easily do; however, I like challenges.

Let's brainstorm approaches!

tomkerkhove commented 1 year ago

First idea:

Benchmark metric server by cutting out our scalers / HPA

Scenario: We can introduce a new external scaler for testing purposes. By doing so, we could create 1000(s) of ScaledObjects for 1000(s) of test apps and trigger scaling while we load test the metrics server, as if it were Kubernetes attempting to fetch metrics.

An automated service such as k6/Artillery/Azure Load Testing can be used to generate traffic against our metrics server and produce a report on its performance. I have used Azure Load Testing, which is based on JMeter and offers good reporting/comparison between runs, but I'm open to ideas.

Pro:

Con:

JorTurFer commented 1 year ago

I have been talking about this with @mrsdaehin (she is our performance expert, and she is really good at this):

As we have a single endpoint to be tested (/apis/external.metrics.k8s.io/v1beta1/namespaces/X/Y), the proposal is to generate some test/benchmark cases based on 3 parameters: ScaledObject count, triggers per ScaledObject, and concurrent requests (the HPA controller is currently single-threaded, but a multi-threaded version will be released soon). This could give us a view of how KEDA performs in different scenarios.

To achieve this, we have talked about using a tool like go-wrk plus a Go script, since everything is already written in Go (to have a single language for everything). However, other tools like Grafana k6 could be more useful than go-wrk if we want to extend those test cases in the future to cover other features like admission webhooks, chaos tests, etc. In any case, as we want to test the metrics server, we propose executing the tests inside the cluster rather than from any external service, because externally we'd have to deal with authentication, whereas internally we can use a service account token.

To avoid being affected by external scalers (and measure only KEDA, not its dependencies), we have some options:

Depending on the scope that we want to cover, options can be different.

Regarding compute, I think we can reduce the cost by creating/destroying the infrastructure on demand and running the test daily/weekly/biweekly. As you said, performance testing has been a frequent ask, and I think having these metrics would be useful for everybody.

Maybe I have left something out from our conversation, @mrsdaehin, so feel free to correct me where you think I'm wrong, or to extend whatever you think is necessary.

MrsDaehin commented 1 year ago

Two ways of approaching the problem

1. Benchmarking: measuring the performance of a component under load, comparing different configurations.
2. Load testing: finding bottlenecks/limits in order to dimension a certain system in a "fixed" state (configuration, system under test, etc.).

So standards:

For benchmarking: httperf is an old benchmark tool, more or less like Apache Bench. wrk/wrk2 are "modern" benchmark tools, along with all the ones developed after them, like the go-wrk I told @JorTurFer about.

For Load Testing:

Advantages of k6 vs JMeter: k6 is code-friendly, has chaos engineering integrations (such as Litmus http-faults), and is far more scalable than JMeter, since goroutines scale much better than the JVM.

So consider this, and we can set up a test whenever :)

Eldarrin commented 1 year ago

Just an idea: what about jobs rather than objects, since they avoid the HPA? A job can be a simple stamper to interface with the metric gatherer. The last time I did smoke/load testing was about 20 years ago lol, so I may be off base.

tomkerkhove commented 1 year ago

k6 is nice, but the only managed offering of it that I know of is https://k6.io/cloud. If we go with JMeter, I'm sure we can use a managed offering for it that gives the same level of insight/reporting, which is not part of the CLI output.

I really want to keep the load testing infra reduced to the minimum: the less we have to manage, the better. Hence my suggestion of Azure Load Testing, since it's a managed offering and we already have an Azure subscription (and this is unrelated to me working for Microsoft).

javaducky commented 1 year ago

K6 is nice, but the only managed offering of it that I know is https://k6.io/cloud.

We do now have Grafana Cloud k6 as a managed offering so you can have the metrics results there in your Grafana dashboards.

MrsDaehin commented 1 year ago

I am using Azure Load Testing and JMeter on a daily basis, but I am not really sure I can run an OS sampler (I have tried with a Python script and wasn't able to) to run the az CLI to access the Key Vault. I'm not really sure how to solve that access from the Azure test engines, to be fair. It's the only problem I see with using JMeter/Azure Load Testing. But again, I am not really sure you need a full load test rather than a benchmark.

tomkerkhove commented 1 year ago

K6 is nice, but the only managed offering of it that I know is k6.io/cloud.

We do now have Grafana Cloud k6 as a managed offering so you can have the metrics results there in your Grafana dashboards.

Correct, @javaducky, but we do not have a subscription for it unless Grafana wants to sponsor one for us?

I am using Azure Load Testing and JMeter on a daily basis, but I am not really sure I can run an OS sampler (I have tried with a Python script and wasn't able to) to run the az CLI to access the Key Vault. I'm not really sure how to solve that access from the Azure test engines, tbf.

Can you elaborate on what you mean please? I'm not sure I get what limitation you are facing.

JorTurFer commented 1 year ago

AFAIK, we can use k6 as an open-source CLI tool, exporting the results somewhere else such as Grafana Cloud. I guess that's enough for us, am I right @MrsDaehin?

MrsDaehin commented 1 year ago

We are using a Grafana dashboard to show the results of k6 in real time. So it should be more than enough if we have a Grafana instance available.

Can you elaborate on what you mean please? I'm not sure I get what limitation you are facing.

As Jorge told me, at some point, in order to authenticate, we need to get a value from a Key Vault? And for that we need to run a script "somehow". In JMeter the solution for running commands is the OS sampler, but if you run the tests in Azure Load Testing there is no way to reach the OS to execute a command on the test engine.

JorTurFer commented 1 year ago

As Jorge told me, at some point, in order to authenticate, we need to get a value from a Key Vault? And for that we need to run a script "somehow". In JMeter the solution for running commands is the OS sampler, but if you run the tests in Azure Load Testing there is no way to reach the OS to execute a command on the test engine.

No no, we don't need any value from any Key Vault xD. We need a token (from a service account) with enough permissions in the cluster RBAC to request metrics. We can get that token using kubectl or otherwise, but that token isn't static, so it needs to be retrieved on every execution as part of test setup. If we run as a pod inside the cluster, we can bind the required role to the service account and read the service account token from the file system.

Another option could be (if Azure Load Testing supports it) a bash script that gets AKS credentials using some kind of Azure authentication and then uses them to execute the tests or retrieve the required token.

JorTurFer commented 1 year ago

About Azure Load Testing: is it something that we can automate somehow? We cannot assume that the portal will be available to everyone (it isn't, indeed), so we need to create/manage/execute the tests through an API, including getting the results, as only MSFT folks can access the portal.

tomkerkhove commented 1 year ago

Everything is possible, but the most important point for me is that it has to be simple, with as little infrastructure as we can manage.

Using one tool and sending its output to another tool that we have to spin up, plus a Grafana instance, is already something I want to avoid: that means the data has to be stored somewhere, so we'd probably need Prometheus as well. These are all constantly running resources that we can't use only when we need them, no?

Hence the proposal to keep it simple and use a PaaS/SaaS such as Azure Load Testing; if we can get cloud-based k6, that's fine for me as well.

About Azure Load Testing: is it something that we can automate somehow? We cannot assume that the portal will be available to everyone (it isn't, indeed), so we need to create/manage/execute the tests through an API, including getting the results, as only MSFT folks can access the portal.

This gives reporting in the Azure Portal indeed, but it can be called from GitHub Actions, so the integration should be simple: https://github.com/Azure/load-testing

I'm not sure what customization you want to have, because everything is based on JMeter configuration + defined thresholds.

Results can be exported, and we would only run load testing/benchmarking once a week or month, so I think that should be fine.
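For reference, a minimal workflow step using that action might look like the following sketch; the resource and file names are placeholders, and the exact input names should be verified against the action's documentation:

```yaml
# Hypothetical GitHub Actions step; all names below are placeholders.
- name: Run KEDA load test
  uses: azure/load-testing@v1
  with:
    loadTestConfigFile: tests/keda-metrics-server.yaml
    loadTestResource: keda-load-test     # Azure Load Testing resource
    resourceGroup: keda-performance-rg   # resource group that holds it
```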

JorTurFer commented 1 year ago

I'm not sure what customization you want to have, because everything is based on JMeter configuration + defined thresholds.

My only concern about using it is that we should be able to do whatever we need without the Azure Portal; if we can achieve that using Terraform + GitHub Actions, it's totally okay for me 😄
I want to avoid bottlenecks related to having to access something in the portal.

JorTurFer commented 1 year ago

BTW, I have seen that Grafana has a free tier that could cover our requirements: [screenshot of the Grafana Cloud free tier]

javaducky commented 1 year ago

we do not have a subscription for it unless Grafana wants to sponsor one for us?

Thanks to @JorTurFer for providing the info about the Grafana Free Tier. My bad for not being explicit, @tomkerkhove: this Free Tier is what I meant to convey.

ppcano commented 1 year ago

Another alternative is detailed in this post: it stores the k6 test summary using the Azure PublishTestResults task.

JorTurFer commented 1 year ago

I have been talking with Nicole (from the k6 team at Grafana Labs) about our use case during KubeCon. She told me that Grafana Labs has an open source program we can apply to if we hit limits with the free tier, and they will provide us more resources. I also asked whether we can run the agents on our own infrastructure and push the results to Grafana Cloud (to have a place to store the information), and she told me that it's possible. So I'd explore k6 instead of building our own system from scratch. There are multiple tools we can use for running the benchmarks, but we need to be able to consume their output easily.

tomkerkhove commented 1 year ago

What would be the value of running our own agents? Can you expand on what agents you mean here?

JorTurFer commented 1 year ago

What would be the value of running our own agents? Can you expand on what agents you mean here?

The principal value I see in running our own agents is that we can use a cluster service account to access the cluster, which makes things easier: we don't need to expose anything, since something running inside the cluster already has access to the KEDA endpoints. But the point I wanted to share is that I had been talking with people from the k6 team, and the account for using k6 is not a problem (the free tier should be enough, and we can request an increase as an open source project).

I have been checking Azure Load Testing, and I'm not totally sure how we could access the metrics server from Azure without exposing it externally. Do you have any idea, @MrsDaehin?

We started this issue a month ago and we haven't decided anything yet. I'd not like to see this go stale, as it's important information about how KEDA performs IMO; maybe we can discuss this during the standup...

tomkerkhove commented 1 year ago

Most probably we will need to run an agent to be able to access the metrics server (or use a VNET-based service), but we will start with k6 and Grafana Cloud k6 to get going.

We will ensure that there is enough documentation for contributors to use it as well.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

JorTurFer commented 1 year ago

@MrsDaehin and I are working on this

arvinder06 commented 1 year ago

This will be a great feature to have. :)

JorTurFer commented 1 year ago

FYI https://github.com/kedacore/keda-performance The work is in progress 😄