lojies opened 1 year ago
/sig api-machinery
/assign /triage accepted
@jiahuif: The label(s) triage/accepged cannot be applied, because the repository doesn't have them.
/cc @wojtek-t @MikeSpreitzer
/triage accepted
then send a lot of requests with a tool

Could you elaborate on what tools you were using? Were you using a Kubernetes client, kubectl, an HTTP benchmarking tool, or a (D)DoS tool?
We used an HTTP benchmarking tool: vegeta.
I don't fully understand this issue
You can set --max-mutating-requests-inflight=10 and --max-requests-inflight=10, then send a lot of requests with a tool. The kube-apiserver's CPU usage gets very high and some requests may not be responded to or may time out.
In order to prevent HTTP DDoS attacks you'll have to rate limit, which means that requests are going to be dropped.
Can you expand a bit more on what experiments you did and, based on your results, what improvements you are suggesting?
In order to prevent HTTP DDoS attacks you'll have to rate limit, which means that requests are going to be dropped.
Yes, indeed, many requests are being denied. However, I have limited in-flight requests to 10, so in theory the API server's CPU usage should not be excessively high. But from what I can see, the API server is still consuming a significant amount of CPU processing these denied requests before they are rejected.
Can you expand a bit more on what experiments you did and, based on your results, what improvements you are suggesting?
I have a cluster with 3 API servers, about 2,000 nodes, and 100,000 pods. Many of these pods need to list some resources, including pods, nodes, CRs, and so on, during their startup. When I tested restarting all 3 API servers, I observed that the CPU usage of the API servers spiked significantly, almost reaching the upper limit, even though I had set the in-flight request limits to 10. I expected the CPU usage of the API servers not to spike so high, but in reality it remains quite high.
Subsequently, we conducted performance testing on the API server using both tools and custom programs. We observed that the CPU utilization of the API server remained consistently high in both cases. However, when we increased the retryAfter value, we noticed a significant reduction in CPU usage.
You can write a simple program that continuously lists resources from many concurrent workers while setting --max-mutating-requests-inflight=10 and --max-requests-inflight=10, and then observe the CPU usage of the API server.
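As a rough illustration of such a program, here is a minimal client-go sketch; the kubeconfig path, worker count, and client-side QPS/Burst values are arbitrary assumptions for the sketch, not something taken from this issue:

```go
// Hypothetical reproducer: many goroutines issuing full LIST requests in a tight
// loop, to drive load while the apiserver runs with low in-flight limits.
package main

import (
	"context"
	"flag"
	"fmt"
	"sync"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := flag.String("kubeconfig", "", "path to kubeconfig")
	workers := flag.Int("workers", 500, "number of concurrent list loops")
	flag.Parse()

	cfg, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err)
	}
	// Raise client-side throttling so the load actually reaches the server.
	cfg.QPS = 1000
	cfg.Burst = 2000

	client := kubernetes.NewForConfigOrDie(cfg)

	var wg sync.WaitGroup
	for i := 0; i < *workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				// Each iteration is a full, unpaginated LIST of pods in all namespaces.
				_, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
				if err != nil {
					fmt.Println("list error:", err) // expect 429s once APF queues fill up
				}
			}
		}()
	}
	wg.Wait()
}
```

Running something along these lines against a test cluster with --max-requests-inflight=10 should let you watch the apiserver CPU while most requests are being rejected.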
What version are you using?
Based on your comments it seems you are hitting a known scalability problem that will be addressed by https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/3157-watch-list. @wojtek-t and @p0lyn0mial are the best people to judge whether this is related.
I bet that the problem is opening TCP connections; with HTTPS that may consume a lot of resources, and it happens before any kube-apiserver logic actually fires.
Assuming this is the case, there is not much we can do in kube-apiserver itself - it would have to be solved either in the Go HTTP server or in some layer in front of it (like a load balancer).
Yes, I think you got what I want to show. We can limit at the load balancer, but most of the connections come from pods through the Kubernetes service, which cannot be limited by the load balancer.
I'm not sure if we can limit this with kube-proxy.
A flame graph may tell the truth.
Yes, I think that CPU and memory profiles from the test could reveal the potential issue.
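For reference, one way to grab such a CPU profile is the apiserver's pprof endpoint. Below is a minimal client-go sketch; it assumes profiling is enabled (the default) and that the caller has access to the /debug/pprof/* non-resource URLs, and the kubeconfig handling and output file name are arbitrary choices for the sketch:

```go
// Hypothetical sketch: fetch a 30s CPU profile from the apiserver's pprof endpoint,
// roughly equivalent to: kubectl get --raw '/debug/pprof/profile?seconds=30'
package main

import (
	"context"
	"os"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Uses $KUBECONFIG if set; otherwise falls back to in-cluster config.
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	data, err := client.CoreV1().RESTClient().Get().
		AbsPath("/debug/pprof/profile").
		Param("seconds", "30").
		DoRaw(context.TODO())
	if err != nil {
		panic(err)
	}

	// Inspect with: go tool pprof -http=:8080 cpu.pprof
	if err := os.WriteFile("cpu.pprof", data, 0o644); err != nil {
		panic(err)
	}
}
```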
--max-mutating-requests-inflight=10 and --max-requests-inflight=10 (otherwise default rate limiting), after restarting the 3 API servers:
If I am interpreting the first graph correctly, it appears that the CPU spends most of its time in the authentication filter (an HTTP handler) specifically on verifying a certificate signature. This makes sense since this is a cryptographic operation which is CPU-bound.
It seems that the authentication filter is placed before the APF filter. In this case it means that we did some processing just to place the request into a queue. I don't know why the APF filter is placed after the authentication filter. It might be because we require some authentication information, or we don't want to have unauthenticated requests sitting in the queue. Does anyone know why?
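To make the ordering concrete, here is a generic Go middleware sketch, illustrative only and not the actual kube-apiserver filter code: the authentication wrapper is applied outermost, so every request pays its cost before the flow-control wrapper can queue or reject it.

```go
// Illustrative handler chain: the outermost wrapper runs first, so putting
// authentication outside flow control means crypto work happens before queuing.
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

func withAuthentication(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Println("authentication (certificate verification happens here)")
		next.ServeHTTP(w, r)
	})
}

func withFlowControl(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Println("flow control (APF queuing / max-in-flight happens here)")
		next.ServeHTTP(w, r)
	})
}

func main() {
	api := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Println("actual request handling")
	})

	// Flow control wraps the API handler; authentication wraps both,
	// so authentication runs first for every incoming request.
	handler := withAuthentication(withFlowControl(api))

	req := httptest.NewRequest(http.MethodGet, "/api/v1/pods", nil)
	handler.ServeHTTP(httptest.NewRecorder(), req)
}
```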
Now I can also see that the api server could have some sort of protection even on the L4 layer - before a TLS handshake or even earlier. The question here is whether the community would support integrating such a mechanism into the api server.
I think a serious server setup requires multiple layers of control. If your earliest control is after crypto then your attacker has an easy time. OTOH, control on un-authenticated request attributes is something that is only safe to be done one-sided: you can disfavor based on that but not favor.
Agree with Mike.
Following up on my previous comment above - I think the mechanism that we need to protect against this problem is different:
To squarely answer the question about why the APF filter is after authentication: it is a deliberate choice based on not trusting unauthenticated stuff.
I think that it is legitimate to be concerned about controlling the load on the crypto in authentication. To me, that sounds like a distinct feature from APF, as it would be designed with somewhat different concerns in mind. It inherently is dealing with untrusted stuff. It slides into DOS protection, which is something you want to push as far "out" in front of the server as possible. Got a load balancer in front of your servers? You probably want to do this there. Running on a cloud with DOS protection? You probably want to use that.
What happened?
Does kube-apiserver have internal rate limiting measures, apart from API Priority and Fairness (APF), which seems to control only the number of simultaneous requests being processed? Suppose a large number of requests are sent to the kube-apiserver by pods accessing the Kubernetes service; this can result in very high CPU load on the apiserver, subsequently causing timeouts and problems with processing regular requests as well as health probes.
What did you expect to happen?
The apiserver should be able to limit the quantity of requests, not only those that need processing but all external access, in order to prevent a surge in CPU usage under high traffic volumes, which could impact functionality.
How can we reproduce it (as minimally and precisely as possible)?
You can set --max-mutating-requests-inflight=10 and --max-requests-inflight=10, then send a lot of requests with a tool. The kube-apiserver's CPU usage gets very high and some requests may not be responded to or may time out.
Anything else we need to know?
No response
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)