Open shyamjvs opened 4 years ago
Which etcd server version are you using? We hit this in our internal clusters. But I think we have upstreamed the fixes around relaxing the lock during compaction.
Etcd version is v3.4.7
But I think we have upstreamed the fixes around relaxing the lock during compaction.
To clarify, the above issue wasn't due to compaction. It was during defrag.
@shyamjvs
Oh... I misread it. For defrag, it is probably easier to just trigger a leader transfer before triggering it. From the k8s side, some rate limiting and backoff would probably also be beneficial.
We could fix the root cause here https://github.com/etcd-io/etcd/blob/master/mvcc/backend/backend.go#L352. But it might take some effort.
/assign @lavalamp for the api-machinery discussion around whether 500 is the right response code when this is happening.
/assign @jingyih for the discussion around defrag.
(options discussed: 503, 429, 507... :) )
Meaning, the write call latencies spiked to upwards of 5s and many of those API calls got back HTTP 500 failures due to kube-apiserver seeing “timed out” or “too many requests” errors from etcd
This implies that apiserver is configured to have more pending requests than etcd is? I think that's wrong, you should consider reducing max-in-flight (or increasing the etcd equivalent, if there is one). You want apiserver to be more restrictive than etcd.
503 seems like the ideal response code in this condition, but if etcd is returning generic errors, it will be very hard to know that we're in this situation as opposed to some other problem.
This implies that apiserver is configured to have more pending requests than etcd is? I think that's wrong, you should consider reducing max-in-flight (or increasing the etcd equivalent, if there is one). You want apiserver to be more restrictive than etcd.
In theory that sounds desirable. But I think there is also a problem that on timeouts of k8s API calls, we're not cancelling the corresponding requests to etcd, which can in theory contribute to this problem. Also, things like DeleteCollection can produce more than one call at a time.
But the biggest problem IMHO is the multi-apiserver setup. In kube-apiserver, we have in-flight limits per instance, but there is a single etcd cluster, so all requests from all kube-apiservers are hitting the same etcd. We would have to take that into account.
Not even mentioning requests served from cache (though I'm not sure if read requests matter here - do they?).
So I think it's a bit more tricky than what you wrote above, Daniel.
But I think there is also a problem that on timeouts of k8s API calls, we're not cancelling requests to etcd, which can in theory contribute to this problem?
Yes, we should absolutely fix that, now that the client uses context we have absolutely no excuse!
Also, things like DeleteCollection can produce more than one call at a time.
Yeah, although that's bounded by the parallelism amount and there shouldn't be too many of those in flight at once.
But the biggest problem IMHO is the multi-apiserver setup. In kube-apiserver, we have in-flight limits per instance, but there is a single etcd cluster, so all requests from all kube-apiservers are hitting the same etcd. We would have to take that into account.
Well, I think that is a system constraint! If (max number of concurrent requests) * (number of apiservers) is greater than the number of concurrent requests the etcd leader permits, there is a problem when the system becomes fully loaded! What is worse, since we permit self requests above the configured max-in-flight limit, if the limit is exactly at etcd's capacity, it's possible that we will send additional self-requests which will exceed etcd's capacity.
So, not only must the max-in-flight limit * # apiservers be less than etcd's capacity, there must also be some headroom, I'd say maybe 10% as a first guess since self requests would usually cause sequential and not parallel requests to etcd.
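The headroom arithmetic above can be sketched as a tiny helper. The numbers are illustrative, and the 10% headroom figure is just the first guess from this comment, not a measured value; integer percent is used to keep the math exact.

```go
package main

import "fmt"

// perServerLimit sketches the capacity math discussed above: divide
// etcd's concurrent-request budget across apiservers, after reserving
// headroom for self-requests, which bypass the max-in-flight filter.
func perServerLimit(etcdCapacity, numAPIServers, headroomPct int) int {
	usable := etcdCapacity * (100 - headroomPct) / 100
	return usable / numAPIServers
}

func main() {
	// e.g. etcd tolerates ~1000 concurrent requests, 3 apiservers, 10% headroom
	fmt.Println(perServerLimit(1000, 3, 10)) // 300
}
```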
The upcoming APF feature will work similarly, except we'll drop the least important traffic (assuming a correct configuration) and certain other traffic (in particular, reentrant calls from webhooks) could also be "exempt" and proceed even if we're theoretically at the limit.
@xiang90 When defrag happens on the etcd leader, it stalls the leader's apply stage, but the leader's log replication should still be functional? So under such circumstances, if an etcd client (apiserver) sends a write request to the leader, the leader will reject this request because its apply is too far behind. But if the etcd client sends a write request to a follower (which is not performing defrag), the follower's raft instance should be able to forward this request to the leader. And after the leader performs log replication, the follower should be able to apply this write. Is my understanding correct? Or did I miss anything?
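The "apply is too far behind" rejection mentioned here works roughly like the check below. The constant name and the 5000 threshold mirror etcd's `maxGapBetweenApplyAndCommitIndex` as I understand its source; treat this as a sketch of the condition, not etcd's actual implementation.

```go
package main

import "fmt"

// Assumed to mirror etcd's internal threshold for rejecting writes
// with "etcdserver: too many requests".
const maxGapBetweenApplyAndCommitIndex = 5000

// tooManyRequests models the backpressure check: a write is rejected
// once the applied index trails the committed index by more than the
// allowed gap, i.e. the state machine can't keep up with raft.
func tooManyRequests(committedIndex, appliedIndex uint64) bool {
	return committedIndex > appliedIndex &&
		committedIndex-appliedIndex > maxGapBetweenApplyAndCommitIndex
}

func main() {
	fmt.Println(tooManyRequests(10001, 5000)) // gap 5001: rejected (true)
	fmt.Println(tooManyRequests(6000, 5000))  // gap 1000: accepted (false)
}
```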
@lavalamp - that makes sense at the high level, but the number of requests doesn't have to be equally spread across masters. So if we were to set in-flights for an individual master to, e.g., (90% of etcd capacity / number of masters), and most requests were coming to a single master (e.g. because of http2 and reusing existing connections), we may potentially lose a lot of capacity. So I guess what I'm saying is: what you're suggesting makes a lot of sense, but it has to be accompanied by additional things (like rebalancing of connections across masters, etc.) plus the things we all know about (cancelling requests to etcd).
@xiang90 @jingyih - I also wanted to clarify one thing; what is happening to "quorum read" requests? do they also get rejected in such case? If so, the math here is much more complicated, because we have separate filters for mutating and non-mutating in-flights (and reads can be served from cache too).
what is happening to "quorum read" requests? do they also get rejected in such case?
Yes. Quorum read is also affected (blocked) by defrag.
Sorry.. just caught up with the discussion.
But the biggest problem IMHO is the multi-apiserver setup. In kube-apiserver, we have in-flight limits per instance.
Fully agree with Wojtek's point here. In fact, this is something we did run into in practice while horizontally scaling out apiservers. And IMO it's far from obvious how to set the in-flight limits based on different factors like etcd disk IOPS, network latency across etcd peers, etc. Further, dynamically readjusting those in-flight limits as we're scaling apiservers in/out is IMO not a very straightforward action to take in a production-like setup (especially given that it has to be changed even on currently running apiserver instances).
Also, client imbalance is a real issue we keep running into at scale. A simple apiserver restart (which could happen as a consequence of adjusting inflight requests itself, or any other reason for that matter) effectively renders that apiserver instance useless:
```
# kube-apiserver-1
apiserver_registered_watchers{group="",kind="Node",version="v1"} 8
apiserver_registered_watchers{group="",kind="Pod",version="v1"} 8
# kube-apiserver-2
apiserver_registered_watchers{group="",kind="Node",version="v1"} 1628
apiserver_registered_watchers{group="",kind="Pod",version="v1"} 3640
# kube-apiserver-3
apiserver_registered_watchers{group="",kind="Node",version="v1"} 2251
apiserver_registered_watchers{group="",kind="Pod",version="v1"} 4232
```
Even using something like goaway-chance, I suppose it'll take a while to restore the balance on a cluster with a large number of clients and only a few apiservers? In general, it seems to me like apiserver in-flight limits playing the catch-up game with etcd can be hard, and we should think about at least the following:
@xiang90 When defrag happens on the etcd leader, it stalls the leader's apply stage, but the leader's log replication should still be functional? So under such circumstances, if an etcd client (apiserver) sends a write request to the leader, the leader will reject this request because its apply is too far behind. But if the etcd client sends a write request to a follower (which is not performing defrag), the follower's raft instance should be able to forward this request to the leader. And after the leader performs log replication, the follower should be able to apply this write. Is my understanding correct? Or did I miss anything?
@xiang90 @jingyih brings up a good point.
Maybe we should move this discussion to https://github.com/etcd-io/etcd/issues/9222?
An important optimization in etcd is the separation of log replication (i.e., Raft) and the state machine (i.e., key-value storage), where a slow apply in the state machine only affects the local key-value storage without blocking replication itself. This presents a set of challenges for the defrag operation:
@wojtek-t
So if we would set in-flights for an individual master to be e.g. equal to (90% of etcd capacity / number of masters), and most requests will be coming to a single master (e.g. because of http2 and reusing existing connections) we may potentially loose a lot of capacity.
Yes, we decided we wouldn't try to coordinate APF across apiservers. That means we have to try to get even load to the apiservers. (Which is probably good anyway.) We've so far got one concrete mechanism, which is the probabilistic client drop (#88567). (I think we need a golang fix before we turn that back on.)
For APF, the intention is to eventually auto-tune the global concurrency limit, rather than set it by command line flag. So, e.g. etcd request latency might be a factor.
Note that the max-in-flight limit is not safe to lower, because it drops requests indiscriminately. APF is much safer, because important requests can be preferentially served.
Thanks @gyuho, your summary in https://github.com/kubernetes/kubernetes/issues/93280#issuecomment-663719178 aligns with my understanding. I see two possible improvements to the encountered issue:
(Which is probably good anyway.) We've so far got one concrete mechanism, which is the probabilistic client drop (#88567). (I think we need a golang fix before we turn that back on.)
+1. I actually think the golang fix has rolled out (at least the patch was supposed to roll out at the beginning of July).
Note that the max-in-flight limit is not safe to lower, because it drops requests indiscriminately. APF is much safer, because important requests can be preferentially served.
Agree.
+1. I actually think the golang fix has rolled out (at least the patch was supposed to roll out at the beginning of July).
Yes, the fix is done.
Kubernetes v1.19 (and the master branch) is now using golang 1.15, which has fixed the issue. Backporting golang 1.15 to Kubernetes v1.18 is ongoing.
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten. Send feedback to sig-contributor-experience at kubernetes/community.
/close
@fejta-bot: Closing this issue.
/reopen
Reopening the issue. There's an ongoing proposal around etcd livez/readyz probes on the server side, and gRPC health checks on the client side. Let's use this issue to track the change needed on the k8s side for the latter, i.e. updating the etcd client version with the fix. That should address the defrag concern.
@shyamjvs: Reopened this issue.
/remove-lifecycle rotten /triage accepted
This issue has not been updated in over 1 year, and should be re-triaged.
You can:
/triage accepted (org members only)
/close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
On one of our large clusters (k8s 1.16) under a steady load of write calls, there was a brief period of ~1 min when all write requests were being heavily stalled. Meaning, the write call latencies spiked to upwards of 5s and many of those API calls got back HTTP 500 failures due to kube-apiserver seeing “timed out” or “too many requests” errors from etcd. See:
From the kube-apiserver logs during that time, there were a bunch of such errors:
The 7s above comes from the default request timeout set on etcdserver. And from my understanding, as etcd does locking for writes, slowness in writes leads to head-of-line blocking for reads that are waiting behind those writes. Consequently, we also observed very high read latencies during the above spike:
There were 2 main problems to dig into here:
Why were writes timing out in etcd?
There was a spike in etcd_server_slow_apply_total during that time, which confirms that etcd was taking a long time to apply the incoming requests. I spoke to @gyuho about this issue and found out that the "etcdserver: too many requests" error comes from here, which is when etcd's applied index lags behind the committed index by more than 5k requests. There was no CPU starvation for etcd, but there was a large spike in WriteBytes to the underlying disk. Digging a bit more, I observed that both the pink and the blue spike corresponded to a defrag operation on their respective etcd instances. The pink one was a follower instance and its defrag happened a bit earlier, and there were no errors seen from around that time. However, the blue one was a leader and its defrag caused write requests to be heavily stalled (which is expected, given etcd seems to lock the whole DB while doing the defrag). So:
Why were API errors seen only during leader defrag and not follower defrag? This is because the leader is necessary for writes and is the one responsible for propagating them to the followers. When the follower was having issues, writes could still proceed through quorum on the other two instances.
Should we trigger a defrag on a leader instance? Sooner or later we will have to call defrag on each instance, but triggering defrag on the leader in a busy cluster causes the performance issue above. One approach is to force a leader change before calling defrag on that instance. Not sure if this is the best way.. any suggestions?
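One way to sketch the "transfer leadership, then defrag" ordering being proposed: defrag followers first, move leadership off the leader, then defrag it last. This is a hypothetical orchestration plan; in practice the steps would map to operations like etcdctl move-leader and etcdctl defrag, and the member type here is illustrative, not etcd's API.

```go
package main

import "fmt"

// member is a minimal stand-in for an etcd cluster member
// (illustrative only, not the etcd client's types).
type member struct {
	ID       uint64
	IsLeader bool
}

// defragPlan returns the proposed ordering: defrag every follower first,
// then transfer leadership to a follower, then defrag the old leader,
// so the apply stall from defrag never sits on the current leader.
func defragPlan(members []member) []string {
	var steps []string
	var leader *member
	for i := range members {
		if members[i].IsLeader {
			leader = &members[i]
		} else {
			steps = append(steps, fmt.Sprintf("defrag member %d", members[i].ID))
		}
	}
	if leader != nil {
		// hypothetical choice: transfer to the first follower found
		for _, m := range members {
			if !m.IsLeader {
				steps = append(steps, fmt.Sprintf("move-leader to %d", m.ID))
				break
			}
		}
		steps = append(steps, fmt.Sprintf("defrag member %d", leader.ID))
	}
	return steps
}

func main() {
	for _, s := range defragPlan([]member{{1, false}, {2, true}, {3, false}}) {
		fmt.Println(s)
	}
}
```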
Should kube-apiserver indeed return 500s in such cases?
There seem to be two concerns at the apiserver level here IMO:
Is HTTP 500 (Internal Server Error) the right response code to send to clients in such cases? If we argue that the reason for the above issue is mismanagement of defrag calls, then HTTP 500 makes sense. However, a similar issue might happen if etcd was bombarded with a ton of writes or had a very slow disk (only a hypothesis atm, I haven't tested); in such cases it might make more sense for the apiserver to return 429 (Too Many Requests) when it is itself receiving such an error from etcd.
Should it be throttling new requests as soon as it sees such errors from etcd? Again, does it make sense for kube-apiserver to throttle write requests in some form (maybe for a few seconds) after it receives the initial few "etcdserver: too many requests" errors, to avoid bombarding etcd further? (Disclaimer: this is again something I'm not sure can happen outside of the defrag case we saw above, but I can try reproducing it with a heavy write load if you think it's worth it.)
@wojtek-t @liggitt @lavalamp @xiang90 @jpbetz - would love to hear your thoughts on the above
/sig scalability /sig api-machinery