grafana / k6

A modern load testing tool, using Go and JavaScript - https://k6.io
GNU Affero General Public License v3.0

No KeepAlive timeout set in load test clients #3466

Open szuecs opened 10 months ago

szuecs commented 10 months ago

Brief summary

We use the k6 nodejs http client to do load tests, and as we can see in CPU graphs, it sticks with only one socket to the target. This of course saturates only a single target even if the target is autoscaling, which is a severe bug for a load testing tool. I am not very experienced in nodejs, but you can likely set the keep-alive timeout to something like 60s on the http client to fix the issue. See also this Go issue that shows the problem in more detail: https://github.com/golang/go/issues/23427

Example load test client:

import http from "k6/http";

function getLimits(order, baseUrl, token) {

    let url = `${baseUrl}/purchase-quantity-restrictions`
    let payload = JSON.stringify({
...
    });

    let params = {
        headers: {
            "Authorization": `Bearer ${token}`,
            "Content-Type": "application/json",
...
        },
    };

    return http.post(url, payload, params);
}

k6 version

latest

OS

Ubuntu 22.04 LTS

Docker version and image (if applicable)

FROM loadimpact/k6:latest AS k6official

Steps to reproduce the problem

Run this client and increase the load against an autoscaled target.

Expected behaviour

all targets should have the same CPU usage

Actual behaviour

Newly started targets do not get the same CPU usage as the others.

kullanici0606 commented 9 months ago

Hi,

Could you please explain how autoscaling works in your case? For example, in AWS, I think autoscaling works by adding new nodes to the load balancer definitions; in that case, DNS is not changed.

In your case, are new nodes' IP addresses added to DNS when autoscaling launches new nodes? Is load balancing handled by DNS?

In the load balancer case, sticky sessions may cause unbalanced node loads. Also, load balancers sometimes use the source IP address in their load-balancing strategy, which is not ideal for a load testing scenario if the load is generated from one IP address.

Could you give more information about your setup, please?

codebien commented 9 months ago

Hi @szuecs, I can imagine some use cases, but as @kullanici0606 said, it would be helpful to get a better explanation of your specifics. It would be nice to know what limits you're hitting with the currently available DNS options: https://grafana.com/docs/k6/latest/using-k6/k6-options/reference/#dns.

Why do you consider it a bug? It doesn't sound like one. I don't think we say anywhere that we support this case. It sounds to me like simply a feature to add. I removed the label for now; we might re-add it later if required.

szuecs commented 9 months ago

@codebien @kullanici0606 The problem is not DNS! I referenced the Go issue to make you aware that TCP connections are not closed and recreated, but held forever.

There are two problems. First, if you run, for example, an NLB, packets are forwarded rather than proxied, which means your load test's connection pool won't connect to new nodes attached to the NLB TargetGroup. Second, you can take down an ALB because you never connect to the scaled-out ALB instances. An ALB spanning 3 AZs initially has only 3 EC2 instances, one in each zone. If you hit their limit, they need to scale out, adding more EC2 instances and later more IPs to the ALB. These IPs you will never look up, as the Go issue shows.

In both cases you limit the throughput to whatever was available at the start of the load test. That means it's not a scalable solution, and scaling out upfront is not really a realistic load test scenario.

I hope this clarifies your questions and why it's a bug and not a feature.

Our load test will run in one of our Kubernetes clusters, which run on nodes with public IPs attached, such that all pods on one node share the same source IP, but that seems to be fine. The ingress setup is best explained in https://opensource.zalando.com/skipper/kubernetes/ingress-controller/#aws-deployment but replace ALB with NLB. The second-layer load balancer is skipper, which is the only member of the NLB TargetGroup and is backed by an HPA, as shown in https://opensource.zalando.com/skipper/kubernetes/ingress-controller/#deployment

szuecs commented 6 months ago

@codebien I tried to do my due diligence, but I got no response, and the label "awaiting user" seems incorrect given that I responded and received no comment.

mohitk05 commented 6 months ago

Hello @codebien @kullanici0606, I work with Sandor, and I have looked into this issue since we are planning to use k6 to perform large load tests. I wanted to highlight a few points here and propose solutions for us to discuss.

As @szuecs has pointed out, the mentioned use case is indeed an issue, but it might be tricky to place it in the right category (bug/feature) because it only occurs in specific situations. I looked into projects other than k6 and checked public articles to see whether others face this issue, and I found the following:

The above links show that this is a common issue faced when persistent TCP connections are used against a target that is auto-scaling, though the discussions seem to be open and there is no single accepted solution.

I believe k6 should provide ways to solve this, since the use case will be faced frequently when running load tests. Disabling persistent connections can adversely affect k6's performance, hence I'm avoiding that as a potential solution.

To solve this issue, I can think of the following solutions:

  1. Provide an option in k6 to set the maximum number of requests per persistent TCP connection
  2. Provide an option in k6 to close TCP connections every X seconds (a TTL for connections, similar to Sandor's suggestion of a goroutine)
  3. Send a Connection: close header from the load balancer at regular intervals (difficult to implement).

Solving this in the load test client seems like the better option given the probability of facing this issue. Other load test tools that solve this:

codebien commented 5 months ago

Hi @mohitk05 @szuecs, we met today with the team to discuss this issue.

In summary, our position remains similar to what we previously communicated: we might add support for this use case as a feature in the next version of the HTTP module, which we're currently developing, where we expect to introduce a more configurable solution to this specific issue.

To give you more context:

Unfortunately, as we explained previously in another issue, any configuration that forces connections to close has to be as predictable as possible; otherwise, it might skew metrics and create confusion during investigations, troubleshooting or debugging.

Yet, even if we were keen to introduce a well-documented, configurable change that would solve this, we would still encounter a significant technical challenge: there is currently no way for k6 to control the HTTP connection pool directly. This is mostly the same reason as in @szuecs's original issue on the Go repository. The API Go's HTTP package provides for enforcing connection closing is CloseIdleConnections, which does not guarantee any closing of active connections, as the name clearly indicates.

Supporting this use case would effectively involve building custom solutions that do not rely on the standard Go HTTP library, whereas relying on it is a practice we aim to follow as much as possible. Departing from it is a difficult decision that has to be properly tested and evaluated, considering all the corner cases HTTP, TCP and networking have in general. Continuing to use the Go standard library for the HTTP module is, for us, the guarantee of high standards for one of the most critical modules of k6.

Of course, as with all engineering decisions, it has trade-offs; unfortunately, this issue is one of them.

At this point, the alternative we can offer, other than the already available workaround of the noVUConnectionReuse option, is for you to make use of the extensions model, where you can build your own implementation. We can offer our support there, helping to find solutions and reviewing code; for us, it could serve as an exploration phase for httpv2.

Eventually, if you build something generic, it could even be considered as the first version of HTTP v2, and we may consider including it as an experimental module in k6 core, if that is something you are interested in benefiting from. But any evaluation has to be deferred until your extension is available. If, instead, this is already a high priority for you right now, feel free to start discussing a potential proposal here and how you would like to implement it.

I hope it helps.

szuecs commented 5 months ago

Thanks for coming back to us!

In general, I think we can try the noVUConnectionReuse option. @mohitk05, can you test it and provide feedback on whether it helps? I am happy to review all kinds of data on the skipper side.

For us (I run skipper as an ingress router with 1M RPS every day across 100 production clusters), the CloseIdleConnections loop has never shown a negative impact on performance. I have reviewed hundreds of latency issues, and connection establishment was never really the problem. If you stay inside an AWS region, I have never seen connect times above 1ms as measured by the Go httptrace package. Here's a deep link to our client's httptrace instrumentation: https://github.com/zalando/skipper/blob/master/net/httpclient.go#L404 The Go httptrace package is pure gold!

Let me know if I can help more.

RafaMalicia commented 4 months ago

Hi,

Could you please explain how autoscaling works in your case? For example, in AWS, I think autoscaling works by adding new nodes to the load balancer definitions; in that case, DNS is not changed.

In your case, are new nodes' IP addresses added to DNS when autoscaling launches new nodes? Is load balancing handled by DNS?

In the load balancer case, sticky sessions may cause unbalanced node loads. Also, load balancers sometimes use the source IP address in their load-balancing strategy, which is not ideal for a load testing scenario if the load is generated from one IP address.


Hello, I have exactly this issue. It seems that the sticky session option on AAG v2 is causing unbalanced application calls. Is there any workaround I can use to prevent this? I'm using a single VM (one IP address) to run a local test against 2 web servers controlled by an AAG. Thanks in advance!

codebien commented 4 months ago

Hi @RafaMalicia,

seems to be that the sticky session option on AAG v2 is causing unbalanced application calls

I'm not an expert on the Azure platform, but AWS seems to have a dedicated section on how to work around this case. The suggestions seem platform-agnostic, so you could consider applying the same strategy on Azure as well.

Check the "Manual rebalancing" section of the sticky sessions page in the AWS ALB docs.

mohitk05 commented 3 weeks ago

Hello again 👋 We have been using open-source k6 deployed in Kubernetes internally and have successfully run large-scale load tests with it. First of all, I want to say I appreciate the tooling available and the amount of customization and control we get over the tests! Here are a few learnings regarding the use of noConnectionReuse:

I explored k6 extensions as you suggested, and it looks like we might have a solution there. I created this extension and will be testing it out; I'm leaving it here in case anyone faces similar issues: https://github.com/mohitk05/xk6-close-idle-connections/blob/main/close_idle_conn.go. My assumption is that this extension will also have some performance impact, but much less than running with no keep-alives at all. I will share results from our testing soon.

szuecs commented 3 weeks ago

@mohitk05 yes, it reduces performance a bit, because every few seconds unused idle connections are closed and the system needs to resolve DNS and connect (TCP/IP handshake) to the endpoint again. However, if we do this every 30s and we have 100 RPS, then at least 3k requests can pass through one connection before we need to create a new one (in practice it's likely a bit different, but for the purpose of a simple calculation let's assume it works like this). So instead of 1 request per handshake we get 3k requests per handshake.