kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Defrag on etcd node causing API 500s #93280

Open shyamjvs opened 4 years ago

shyamjvs commented 4 years ago

On one of our large clusters (k8s 1.16) under a steady load of write calls, there was a brief period of ~1 min when all write requests were being heavily stalled. Meaning, the write call latencies spiked to upwards of 5s and many of those API calls got back HTTP 500 failures due to kube-apiserver seeing “timed out” or “too many requests” errors from etcd. See:

[screenshot: write call latency spike with HTTP 500s, 2020-07-15]

From the kube-apiserver logs during that time, there were a bunch of such errors:

E0630 11:18:05.681970       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0x8, desc:"etcdserver: too many requests"}
E0630 11:18:05.783734       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0x8, desc:"etcdserver: too many requests"}
...
I0630 11:18:09.607116       1 trace.go:116] Trace[1288751478]: "GuaranteedUpdate etcd3" type:*core.Node (started: 2020-06-30 11:18:02.604604723 +0000 UTC m=+21828.371115177) (total time: 7.002491708s):
Trace[1288751478]: [7.002491708s] [7.001801958s] END
E0630 11:18:09.607144       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}
I0630 11:18:09.607265       1 trace.go:116] Trace[275938462]: "Patch" url:/api/v1/nodes/ip-10-3-252-116.ec2.internal/status (started: 2020-06-30 11:18:02.604455155 +0000 UTC m=+21828.370965584) (total time: 7.002799756s):
Trace[275938462]: [7.002799756s] [7.002149015s] END

The 7s above comes from the default request timeout set on etcdserver. And from my understanding, since etcd takes a lock for writes, slowness in writes leads to head-of-line blocking for reads that are waiting behind those writes. Consequently, we also observed very high read latencies during the above spike:

[screenshot: read call latency spike during the same period]

There were 2 main problems to dig into here:

Why were writes timing out in etcd?

There was a spike in etcd_server_slow_apply_total during that time, which confirms that etcd was taking a long time to apply the incoming requests:

[graph: spike in etcd_server_slow_apply_total]

I spoke to @gyuho about this issue and found out that the "etcdserver: too many requests" errors come from here, which is when etcd's applied index lags behind the committed index by more than 5k requests. There was no CPU starvation for etcd, but there was a large spike in WriteBytes to the underlying disk:

[graph: spike in WriteBytes to the underlying disk]
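For reference, a minimal Go sketch of the kind of gap check described above (names and structure are illustrative, based on the ~5k figure mentioned; this is not the exact etcd source):

package sketch

import "errors"

// Illustrative only: the gap beyond which new proposals get rejected.
const maxGapBetweenApplyAndCommitIndex = 5000

var errTooManyRequests = errors.New("etcdserver: too many requests")

// checkIndexGap rejects a new request when the applied index has fallen too
// far behind the committed index, which surfaces to clients as the
// "too many requests" error seen in the kube-apiserver logs above.
func checkIndexGap(committedIndex, appliedIndex uint64) error {
	if committedIndex > appliedIndex+maxGapBetweenApplyAndCommitIndex {
		return errTooManyRequests
	}
	return nil
}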

Digging a bit more, I observed that both the pink and the blue spike corresponded to a defrag operation on their respective etcd instances. The pink one was a follower instance and its defrag happened a bit earlier, and there were no errors seen around that time. However, the blue one was a leader and its defrag caused write requests to be heavily stalled (which is expected given etcd seems to lock the whole DB while doing the defrag). So,

Should kube-apiserver indeed return 500s in such cases?

There seem to be two concerns at the apiserver level here IMO:

@wojtek-t @liggitt @lavalamp @xiang90 @jpbetz - would love to hear your thoughts on the above

/sig scalability /sig api-machinery

xiang90 commented 4 years ago

Which etcd server version are you using? We hit this in our internal clusters. But I think we have upstreamed the fixes around relaxing the lock during compaction.

shyamjvs commented 4 years ago

Etcd version is v3.4.7

But I think we have upstreamed the fixes around relaxing the lock during compaction.

To clarify, the above issue wasn't due to compaction. It was during defrag.

xiang90 commented 4 years ago

@shyamjvs

Oh... I misread it. For defrag, it is probably easier to just trigger a leader transfer before triggering it. From the k8s side, doing some rate limiting and backoff is probably also beneficial.

We could fix the root cause here https://github.com/etcd-io/etcd/blob/master/mvcc/backend/backend.go#L352. But it might take some effort.
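As a rough sketch of that operational approach (using etcd's clientv3 maintenance API; the endpoint and transferee member ID are placeholders, and error/leader-routing handling is simplified):

package sketch

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// defragMember moves leadership off the target member first (if it is the
// current leader) and only then defragments it, so the defrag lock never
// stalls the leader. This is an illustration, not a production tool.
func defragMember(cli *clientv3.Client, endpoint string, transfereeID uint64) error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	status, err := cli.Status(ctx, endpoint)
	if err != nil {
		return err
	}
	if status.Header.MemberId == status.Leader {
		// Hand leadership to another member before defragmenting this one.
		if _, err := cli.MoveLeader(ctx, transfereeID); err != nil {
			return err
		}
	}
	// Defragment only this endpoint; the other members keep serving.
	_, err = cli.Defragment(ctx, endpoint)
	return err
}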

fedebongio commented 4 years ago

/assign @lavalamp for the api-machinery discussion around whether 500 is the right response code when this is happening
/assign @jingyih for the discussion around defrag

fedebongio commented 4 years ago

(options discussed: 503, 429, 507... :) )

lavalamp commented 4 years ago

Meaning, the write call latencies spiked to upwards of 5s and many of those API calls got back HTTP 500 failures due to kube-apiserver seeing “timed out” or “too many requests” errors from etcd

This implies that apiserver is configured to have more pending requests than etcd is? I think that's wrong, you should consider reducing max-in-flight (or increasing the etcd equivalent, if there is one). You want apiserver to be more restrictive than etcd.


503 seems like the ideal response code in this condition, but if etcd is returning generic errors, it will be very hard to know that we're in this situation as opposed to some other problem.

wojtek-t commented 4 years ago

This implies that apiserver is configured to have more pending requests than etcd is? I think that's wrong, you should consider reducing max-in-flight (or increasing the etcd equivalent, if there is one). You want apiserver to be more restrictive than etcd.

In theory that sounds desired. But I think there is also a problem that on timeouts of k8s API calls, we're not cancelling requests to etcd, which can in theory contribute to this problem? Also, things like DeleteCollection can produce more than one call at a time.

But the biggest problem imho is the multi-apiserver setup. In kube-apiserver, we have in-flight limits per instance. But there is a single etcd cluster, so all requests from all kube-apiservers are hitting the same etcd. We would have to take that into account.

Not to mention requests served from the cache (though I'm not sure if read requests matter here - do they?).

So I think it's a bit more tricky than what you wrote above, Daniel.

lavalamp commented 4 years ago

But I think there is also a problem that on timeouts of k8s API calls, we're not cancelling requests to etcd, which can in theory contribute to this problem?

Yes, we should absolutely fix that, now that the client uses context we have absolutely no excuse!
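A minimal sketch of that idea (illustrative only, not the actual apiserver storage code): derive the etcd call's context from the incoming request context, so that when the API request times out or is cancelled, the pending etcd request is cancelled as well rather than continuing to occupy etcd.

package sketch

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// putWithRequestContext issues an etcd Put whose lifetime is tied to the
// incoming request's context: if the API request times out or the client
// goes away, the etcd call is cancelled too.
func putWithRequestContext(reqCtx context.Context, cli *clientv3.Client, key, val string) error {
	// The 5s cap is an illustrative per-call timeout, not a k8s default.
	ctx, cancel := context.WithTimeout(reqCtx, 5*time.Second)
	defer cancel()

	_, err := cli.Put(ctx, key, val)
	return err
}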

Also, things like DeleteCollection can produce more than one call at a time.

Yeah, although that's bounded by the parallelism amount and there shouldn't be too many of those in flight at once.

But the biggest problem imho is the multi-apiserver setup. In kube-apiserver, we have in-flight limits per instance. But there is a single etcd cluster, so all requests from all kube-apiservers are hitting the same etcd. We would have to take that into account.

Well, I think that is a system constraint! If (max number of concurrent requests) * (number of apiservers) is greater than the number of concurrent requests the etcd leader permits, there is a problem when the system becomes fully loaded! What is worse, since we permit self requests above the configured max-in-flight limit, if the limit is exactly at etcd's capacity, it's possible that we will send additional self-requests which will exceed etcd's capacity.

So, not only must the max-in-flight limit * # apiservers be less than etcd's capacity, there must also be some headroom, I'd say maybe 10% as a first guess since self requests would usually cause sequential and not parallel requests to etcd.
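As an illustration with made-up numbers: if the etcd leader comfortably handles about 600 concurrent requests and there are 3 kube-apiservers, each apiserver's max-in-flight would need to be at most roughly 600 * 0.9 / 3 = 180, so that the aggregate stays about 10% below etcd's capacity even when every apiserver is fully loaded.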

The upcoming APF feature will work similarly, except we'll drop the least important traffic (assuming a correct configuration) and certain other traffic (in particular, reentrant calls from webhooks) could also be "exempt" and proceed even if we're theoretically at the limit.

jingyih commented 4 years ago

@xiang90 When defrag happens on the etcd leader, it stalls the leader's apply stage, but the leader's log replication should still be functional? So under such circumstances, if the etcd client (apiserver) sends a write request to the leader, the leader will reject this request because its apply is too far behind. But if the etcd client sends a write request to a follower (which is not performing defrag), the follower's raft instance should be able to forward this request to the leader. And after the leader performs log replication, the follower should be able to apply this write. Is my understanding correct? Or did I miss anything?

wojtek-t commented 4 years ago

@lavalamp - that makes sense at the high level, but the number of requests doesn't have to be equally spread across masters. So if we would set in-flights for an individual master to be e.g. equal to (90% of etcd capacity / number of masters), and most requests will be coming to a single master (e.g. because of http2 and reusing existing connections) we may potentially lose a lot of capacity. So I guess what I'm saying is: what you're suggesting makes a lot of sense, but it has to be accompanied by additional things (like rebalancing of connections across masters, etc.) + things we all know about (cancelling requests to etcd).

@xiang90 @jingyih - I also wanted to clarify one thing; what is happening to "quorum read" requests? do they also get rejected in such case? If so, the math here is much more complicated, because we have separate filters for mutating and non-mutating in-flights (and reads can be served from cache too).

jingyih commented 4 years ago

what is happening to "quorum read" requests? do they also get rejected in such case?

Yes. Quorum read is also affected (blocked) by defrag.

shyamjvs commented 4 years ago

Sorry.. just caught up with the discussion.

But the biggest problem imho is multi-apiserver setup. In kube-apiserver, we have in-flight limits per instance.

Fully agree with Wojtek's point here. In fact, this is something we did run into in practice while horizontally scaling out apiservers. And IMO it's far from obvious how to set the inflight limits based on different factors like etcd disk IOPS, network latency across etcd peers, etc. Further, dynamically readjusting those inflight limits as we're scaling apiservers in/out is not a very straightforward action to take IMO in a production-like setup (especially given that it has to be changed even on currently running apiserver instances).

Also, client imbalance is a real issue we keep running into at scale. A simple apiserver restart (which could happen as a consequence of adjusting inflight requests itself, or any other reason for that matter) effectively renders that apiserver instance useless:

# kube-apiserver-1
apiserver_registered_watchers{group="",kind="Node",version="v1"} 8
apiserver_registered_watchers{group="",kind="Pod",version="v1"} 8

# kube-apiserver-2
apiserver_registered_watchers{group="",kind="Node",version="v1"} 1628
apiserver_registered_watchers{group="",kind="Pod",version="v1"} 3640

# kube-apiserver-3
apiserver_registered_watchers{group="",kind="Node",version="v1"} 2251
apiserver_registered_watchers{group="",kind="Pod",version="v1"} 4232

Even using something like goaway-chance, I suppose it'll take a while to restore the balance on a cluster with a large number of clients and a small number of apiservers? In general, it seems to me like apiserver inflight limits playing the catch-up game with etcd capacity can be hard, and we should think about at least the following:

gyuho commented 4 years ago

@xiang90 When defrag happens on the etcd leader, it stalls the leader's apply stage, but the leader's log replication should still be functional? So under such circumstances, if the etcd client (apiserver) sends a write request to the leader, the leader will reject this request because its apply is too far behind. But if the etcd client sends a write request to a follower (which is not performing defrag), the follower's raft instance should be able to forward this request to the leader. And after the leader performs log replication, the follower should be able to apply this write. Is my understanding correct? Or did I miss anything?

@xiang90 @jingyih brings up a good point.

Maybe we should move this discussion to https://github.com/etcd-io/etcd/issues/9222?

An important optimization in etcd is the separation of log replication (i.e., Raft) and the state machine (i.e., key-value storage), where a slow apply in the state machine only affects the local key-value storage without blocking replication itself. This presents a set of challenges for the defrag operation:
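To illustrate that separation (a toy sketch, not etcd's actual code): the replication side keeps enqueuing committed entries while a separate loop applies them, so a slow apply (e.g. while the backend is locked by defrag) only grows the applied/committed gap instead of blocking replication.

package sketch

// entry stands in for a committed raft log entry.
type entry struct{ index uint64 }

// applyLoop drains committed entries and applies them to the local key-value
// store. If apply is slow, entries simply pile up and the applied index falls
// behind the committed index (which is what the ~5k gap check guards against).
func applyLoop(committed <-chan entry, applyToStore func(entry)) {
	for e := range committed {
		applyToStore(e) // may be slow; replication keeps going regardless
	}
}

// replicate models the raft side: it hands off committed entries without
// waiting for them to be applied. The buffer stands in for etcd's apply queue.
func replicate(committed chan<- entry, n uint64) {
	for i := uint64(1); i <= n; i++ {
		committed <- entry{index: i}
	}
	close(committed)
}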

lavalamp commented 4 years ago

@wojtek-t

So if we would set in-flights for an individual master to be e.g. equal to (90% of etcd capacity / number of masters), and most requests will be coming to a single master (e.g. because of http2 and reusing existing connections) we may potentially lose a lot of capacity.

Yes, we decided we wouldn't try to coordinate APF across apiservers. That means we have to try to get even load to the apiservers. (Which is probably good anyway.) We've so far got one concrete mechanism, which is the probabilistic client drop (#88567). (I think we need a golang fix before we turn that back on.)

For APF, the intention is to eventually auto-tune the global concurrency limit, rather than set it by command line flag. So, e.g. etcd request latency might be a factor.

Note that the max-in-flight limit is not safe to lower, because it drops requests indiscriminately. APF is much safer, because important requests can be preferentially served.

jingyih commented 4 years ago

Thanks @gyuho, your summary in https://github.com/kubernetes/kubernetes/issues/93280#issuecomment-663719178 aligns with my understanding. I see 2 possible improvements to the encountered issue:

  1. making defrag not block quorum reads
  2. making the etcd client's load balancer switch to a follower if the leader is doing defrag (I believe today we only have a round-robin LB, am I correct?)

wojtek-t commented 4 years ago

(Which is probably good anyway.) We've so far got one concrete mechanism, which is the probabilistic client drop (#88567). (I think we need a golang fix before we turn that back on.)

+1. I actually think the golang fix has rolled out (at least the patch was supposed to be rolled out at the beginning of July).

Note that the max-in-flight limit is not safe to lower, because it drops requests indiscriminately. APF is much safer, because important requests can be preferentially served.

Agree.

answer1991 commented 4 years ago

+1. I actually think the golang fix has rolled out (at least the patch was supposed to be rolled out at the beginning of July).

Yes, the fix is done.

Kubernetes v1.19 (or the master branch) is now using golang 1.15, which has fixed the issue. And the backport of golang 1.15 to Kubernetes v1.18 is ongoing.

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten

fejta-bot commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community. /close

k8s-ci-robot commented 3 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/kubernetes/issues/93280#issuecomment-818755541):

>Rotten issues close after 30d of inactivity.
>Reopen the issue with `/reopen`.
>Mark the issue as fresh with `/remove-lifecycle rotten`.
>
>Send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

ahrtr commented 1 year ago

Link to https://github.com/etcd-io/etcd/issues/16276

shyamjvs commented 1 year ago

/reopen

Reopening the issue. There's an ongoing proposal around etcd livez/readyz probes on the server-side, and gRPC healthchecks from the client side. Let's use this issue to track the change needed on the k8s side for the latter, i.e. update the etcd client version with the fix. That should address the defrag concern.
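For context, a generic gRPC-Go sketch of what client-side health checking looks like (this is general gRPC configuration, not the specific etcd client change being tracked here): the client enables the health-check service in its service config, and the round-robin balancer then skips endpoints that report themselves as not serving.

package sketch

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	// Blank import registers the client-side health check function with gRPC.
	_ "google.golang.org/grpc/health"
)

// dialWithHealthChecking connects with round-robin balancing plus client-side
// health checking, so endpoints reporting NOT_SERVING (e.g. while a member is
// defragmenting, once servers expose that) are skipped by the balancer.
// Insecure credentials are used only to keep the sketch short.
func dialWithHealthChecking(target string) (*grpc.ClientConn, error) {
	serviceConfig := `{
	  "loadBalancingConfig": [{"round_robin":{}}],
	  "healthCheckConfig": {"serviceName": ""}
	}`
	return grpc.Dial(target,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(serviceConfig),
	)
}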

k8s-ci-robot commented 1 year ago

@shyamjvs: Reopened this issue.

In response to [this](https://github.com/kubernetes/kubernetes/issues/93280#issuecomment-1730144492):

>/reopen
>
>Reopening the issue. There's an ongoing proposal around [etcd livez/readyz probes](https://docs.google.com/document/d/1PaUAp76j1X92h3jZF47m32oVlR8Y-p-arB5XOB7Nb6U/edit?usp=sharing) on the server-side, and [gRPC healthchecks](https://github.com/etcd-io/etcd/issues/16276) from the client side. Let's use this issue to track the change needed on the k8s side for the latter, i.e. update the etcd client version with the fix. That should address the defrag concern.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

shyamjvs commented 1 year ago

/remove-lifecycle rotten /triage accepted

k8s-triage-robot commented 3 days ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted