Open shyamjvs opened 4 years ago
Which etcd server version are you using? We hit this in our internal clusters. But I think we have upstreamed the fixes around relaxing the lock during compaction.
Etcd version is v3.4.7
But I think we have upstreamed the fixes around relaxing the lock during compaction.
To clarify, the above issue wasn't due to compaction. It was during defrag.
@shyamjvs
Oh... I misread it. For defrag, it is probably easier to just trigger a leader transfer before triggering it. From the k8s side, some rate limiting and backoff would probably also be beneficial.
We could fix the root cause here https://github.com/etcd-io/etcd/blob/master/mvcc/backend/backend.go#L352. But it might take some effort.
/assign @lavalamp for the api-machinery discussion around whether 500 is the right response code when this is happening.
/assign @jingyih for the discussion around defrag.
(options discussed: 503, 429, 507... :) )
Meaning, the write call latencies spiked to upwards of 5s and many of those API calls got back HTTP 500 failures due to kube-apiserver seeing “timed out” or “too many requests” errors from etcd
This implies that apiserver is configured to have more pending requests than etcd is? I think that's wrong, you should consider reducing max-in-flight (or increasing the etcd equivalent, if there is one). You want apiserver to be more restrictive than etcd.
503 seems like the ideal response code in this condition, but if etcd is returning generic errors, it will be very hard to know that we're in this situation as opposed to some other problem.
This implies that apiserver is configured to have more pending requests than etcd is? I think that's wrong, you should consider reducing max-in-flight (or increasing the etcd equivalent, if there is one). You want apiserver to be more restrictive than etcd.
In theory that sounds desirable. But I think there is also a problem that on timeouts of k8s API calls, we're not cancelling the corresponding requests to etcd, which can in theory contribute to this problem. Also, things like DeleteCollection can produce more than one call at a time.
But the biggest problem IMHO is the multi-apiserver setup. In kube-apiserver, we have in-flight limits per instance, but there is a single etcd cluster, so all requests from all kube-apiservers are hitting the same etcd. We would have to take that into account.
Not even mentioning requests served from cache (though I'm not sure if read requests matter here - do they?).
So I think it's a bit more tricky than what you wrote above, Daniel.
But I think there is also a problem that on timeouts of k8s API calls, we're not cancelling requests to etcd, which can in theory contribute to this problem?
Yes, we should absolutely fix that, now that the client uses context we have absolutely no excuse!
Also, things like DeleteCollection can produce more than one call at a time.
Yeah, although that's bounded by the parallelism amount and there shouldn't be too many of those in flight at once.
But the biggest problem IMHO is the multi-apiserver setup. In kube-apiserver, we have in-flight limits per instance, but there is a single etcd cluster, so all requests from all kube-apiservers are hitting the same etcd. We would have to take that into account.
Well, I think that is a system constraint! If (max number of concurrent requests) * (number of apiservers) is greater than the number of concurrent requests the etcd leader permits, there is a problem when the system becomes fully loaded! What is worse, since we permit self requests above the configured max-in-flight limit, if the limit is exactly at etcd's capacity, it's possible that we will send additional self-requests which will exceed etcd's capacity.
So, not only must the max-in-flight limit * # apiservers be less than etcd's capacity, there must also be some headroom, I'd say maybe 10% as a first guess since self requests would usually cause sequential and not parallel requests to etcd.
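The headroom arithmetic above can be sketched as a tiny helper. The numbers are illustrative, and the 10% headroom figure is just the first guess from this comment, not a measured value; integer percent is used to keep the math exact.

```go
package main

import "fmt"

// perServerLimit sketches the capacity math discussed above: divide
// etcd's concurrent-request budget across apiservers, after reserving
// headroom for self-requests, which bypass the max-in-flight filter.
func perServerLimit(etcdCapacity, numAPIServers, headroomPct int) int {
	usable := etcdCapacity * (100 - headroomPct) / 100
	return usable / numAPIServers
}

func main() {
	// e.g. etcd tolerates ~1000 concurrent requests, 3 apiservers, 10% headroom
	fmt.Println(perServerLimit(1000, 3, 10)) // 300
}
```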
The upcoming APF feature will work similarly, except we'll drop the least important traffic (assuming a correct configuration) and certain other traffic (in particular, reentrant calls from webhooks) could also be "exempt" and proceed even if we're theoretically at the limit.
@xiang90 When defrag happens on the etcd leader, it stalls the leader's apply stage, but the leader's log replication should still be functional? So under such circumstances, if an etcd client (apiserver) sends a write request to the leader, the leader will reject this request because its apply is too far behind. But if the etcd client sends a write request to a follower (which is not performing defrag), the follower's raft instance should be able to forward this request to the leader. And after the leader performs log replication, the follower should be able to apply this write. Is my understanding correct? Or did I miss anything?
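The "apply is too far behind" rejection mentioned here works roughly like the check below. The constant name and the 5000 threshold mirror etcd's `maxGapBetweenApplyAndCommitIndex` as I understand its source; treat this as a sketch of the condition, not etcd's actual implementation.

```go
package main

import "fmt"

// Assumed to mirror etcd's internal threshold for rejecting writes
// with "etcdserver: too many requests".
const maxGapBetweenApplyAndCommitIndex = 5000

// tooManyRequests models the backpressure check: a write is rejected
// once the applied index trails the committed index by more than the
// allowed gap, i.e. the state machine can't keep up with raft.
func tooManyRequests(committedIndex, appliedIndex uint64) bool {
	return committedIndex > appliedIndex &&
		committedIndex-appliedIndex > maxGapBetweenApplyAndCommitIndex
}

func main() {
	fmt.Println(tooManyRequests(10001, 5000)) // gap 5001: rejected (true)
	fmt.Println(tooManyRequests(6000, 5000))  // gap 1000: accepted (false)
}
```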
@lavalamp - that makes sense at the high level, but the number of requests doesn't have to be equally spread across masters. So if we were to set in-flights for an individual master to, e.g., (90% of etcd capacity / number of masters), and most requests were coming to a single master (e.g. because of http2 and reusing existing connections), we may potentially lose a lot of capacity. So I guess what I'm saying is: what you're suggesting makes a lot of sense, but it has to be accompanied by additional things (like rebalancing of connections across masters, etc.) plus the things we all know about (cancelling requests to etcd).
@xiang90 @jingyih - I also wanted to clarify one thing; what is happening to "quorum read" requests? do they also get rejected in such case? If so, the math here is much more complicated, because we have separate filters for mutating and non-mutating in-flights (and reads can be served from cache too).
what is happening to "quorum read" requests? do they also get rejected in such case?
Yes. Quorum read is also affected (blocked) by defrag.
Sorry.. just caught up with the discussion.
But the biggest problem IMHO is the multi-apiserver setup. In kube-apiserver, we have in-flight limits per instance.
Fully agree with Wojtek's point here. In fact, this is something we did run into in practice while horizontally scaling out apiservers. And IMO it's far from obvious how to set the in-flight limits based on different factors like etcd disk IOPS, network latency across etcd peers, etc. Further, dynamically readjusting those in-flight limits as we're scaling apiservers in/out is IMO not a very straightforward action to take in a production-like setup (especially given that it has to be changed even on currently running apiserver instances).
Also, client imbalance is a real issue we keep running into at scale. A simple apiserver restart (which could happen as a consequence of adjusting inflight requests itself, or any other reason for that matter) effectively renders that apiserver instance useless:
```
# kube-apiserver-1
apiserver_registered_watchers{group="",kind="Node",version="v1"} 8
apiserver_registered_watchers{group="",kind="Pod",version="v1"} 8
# kube-apiserver-2
apiserver_registered_watchers{group="",kind="Node",version="v1"} 1628
apiserver_registered_watchers{group="",kind="Pod",version="v1"} 3640
# kube-apiserver-3
apiserver_registered_watchers{group="",kind="Node",version="v1"} 2251
apiserver_registered_watchers{group="",kind="Pod",version="v1"} 4232
```
Even using something like goaway-chance, I suppose it'll take a while to restore the balance on a cluster with a large number of clients and only a few apiservers? In general, it seems to me like apiserver in-flight limits playing the catch-up game with etcd can be hard, and we should think about at least the following:
@xiang90 When defrag happens on the etcd leader, it stalls the leader's apply stage, but the leader's log replication should still be functional? So under such circumstances, if an etcd client (apiserver) sends a write request to the leader, the leader will reject this request because its apply is too far behind. But if the etcd client sends a write request to a follower (which is not performing defrag), the follower's raft instance should be able to forward this request to the leader. And after the leader performs log replication, the follower should be able to apply this write. Is my understanding correct? Or did I miss anything?
@xiang90 @jingyih brings up a good point.
Maybe we should move this discussion to https://github.com/etcd-io/etcd/issues/9222?
An important optimization in etcd is the separation of log replication (i.e., Raft) and the state machine (i.e., key-value storage), where a slow apply in the state machine only affects the local key-value storage without blocking replication itself. This presents a set of challenges for the defrag operation:
@wojtek-t
So if we would set in-flights for an individual master to be e.g. equal to (90% of etcd capacity / number of masters), and most requests will be coming to a single master (e.g. because of http2 and reusing existing connections) we may potentially loose a lot of capacity.
Yes, we decided we wouldn't try to coordinate APF across apiservers. That means we have to try to get even load to the apiservers. (Which is probably good anyway.) We've so far got one concrete mechanism, which is the probabilistic client drop (#88567). (I think we need a golang fix before we turn that back on.)
For APF, the intention is to eventually auto-tune the global concurrency limit, rather than set it by command line flag. So, e.g. etcd request latency might be a factor.
Note that the max-in-flight limit is not safe to lower, because it drops requests indiscriminately. APF is much safer, because important requests can be preferentially served.
Thanks @gyuho, your summary in https://github.com/kubernetes/kubernetes/issues/93280#issuecomment-663719178 aligns with my understanding. I see two possible improvements to the encountered issue:
(Which is probably good anyway.) We've so far got one concrete mechanism, which is the probabilistic client drop (#88567). (I think we need a golang fix before we turn that back on.)
+1. I actually think the golang fix has rolled out (at least the patch was supposed to roll out at the beginning of July).
Note that the max-in-flight limit is not safe to lower, because it drops requests indiscriminately. APF is much safer, because important requests can be preferentially served.
Agree.
+1. I actually think the golang fix has rolled out (at least the patch was supposed to roll out at the beginning of July).
Yes, the fix is done.
Kubernetes v1.19 (and the master branch) is now using golang 1.15, which has fixed the issue. Backporting golang 1.15 to Kubernetes v1.18 is ongoing.
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten. Send feedback to sig-contributor-experience at kubernetes/community.
/close
@fejta-bot: Closing this issue.
/reopen
Reopening the issue. There's an ongoing proposal around etcd livez/readyz probes on the server side, and gRPC health checks on the client side. Let's use this issue to track the change needed on the k8s side for the latter, i.e. updating the etcd client version with the fix. That should address the defrag concern.
@shyamjvs: Reopened this issue.
/remove-lifecycle rotten /triage accepted
This issue has not been updated in over 1 year, and should be re-triaged.
You can:
/triage accepted (org members only)
/close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
On one of our large clusters (k8s 1.16) under a steady load of write calls, there was a brief period of ~1 min when all write requests were being heavily stalled. Meaning, the write call latencies spiked to upwards of 5s and many of those API calls got back HTTP 500 failures due to kube-apiserver seeing “timed out” or “too many requests” errors from etcd. See:
From the kube-apiserver logs during that time, there were a bunch of such errors:
The 7s above comes from the default request timeout set on etcdserver. And from my understanding, as etcd does locking for writes, slowness in writes leads to head-of-line blocking for reads that are waiting behind those writes. Consequently, we also observed very high read latencies during the above spike:
There were 2 main problems to dig into here:
Why were writes timing out in etcd?
There was a spike in etcd_server_slow_apply_total during that time, which confirms that etcd was taking a long time to apply the incoming requests. I spoke to @gyuho about this issue and found out that the "etcdserver: too many requests" error comes from here, which is when etcd's applied index lags behind the committed index by more than 5k requests. There was no CPU starvation for etcd, but there was a large spike in WriteBytes to the underlying disk. Digging a bit more, I observed that both the pink and the blue spike corresponded to a defrag operation on their respective etcd instances. The pink one was a follower instance and its defrag happened a bit earlier, and there were no errors seen from around that time. However, the blue one was a leader and its defrag caused write requests to be heavily stalled (which is expected, given etcd seems to lock the whole DB while doing the defrag). So:
Why were API errors seen only during leader defrag and not follower defrag? This is because the leader is necessary for writes and is the one responsible for propagating them to the followers. When the follower was having issues, writes could still proceed through quorum on the other two instances.
Should we trigger a defrag on a leader instance? Sooner or later we will have to call defrag on each instance, but triggering defrag on the leader in a busy cluster causes the performance issue above. One approach is to force a leader change before calling defrag on that instance. Not sure if this is the best way.. any suggestions?
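One way to sketch the "transfer leadership, then defrag" ordering being proposed: defrag followers first, move leadership off the leader, then defrag it last. This is a hypothetical orchestration plan; in practice the steps would map to operations like etcdctl move-leader and etcdctl defrag, and the member type here is illustrative, not etcd's API.

```go
package main

import "fmt"

// member is a minimal stand-in for an etcd cluster member
// (illustrative only, not the etcd client's types).
type member struct {
	ID       uint64
	IsLeader bool
}

// defragPlan returns the proposed ordering: defrag every follower first,
// then transfer leadership to a follower, then defrag the old leader,
// so the apply stall from defrag never sits on the current leader.
func defragPlan(members []member) []string {
	var steps []string
	var leader *member
	for i := range members {
		if members[i].IsLeader {
			leader = &members[i]
		} else {
			steps = append(steps, fmt.Sprintf("defrag member %d", members[i].ID))
		}
	}
	if leader != nil {
		// hypothetical choice: transfer to the first follower found
		for _, m := range members {
			if !m.IsLeader {
				steps = append(steps, fmt.Sprintf("move-leader to %d", m.ID))
				break
			}
		}
		steps = append(steps, fmt.Sprintf("defrag member %d", leader.ID))
	}
	return steps
}

func main() {
	for _, s := range defragPlan([]member{{1, false}, {2, true}, {3, false}}) {
		fmt.Println(s)
	}
}
```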
Should kube-apiserver indeed return 500s in such cases?
There seem to be two concerns at the apiserver level here IMO:
Is HTTP 500 (Internal Server Error) the right response code to send to clients in such cases? If we argue that the reason for the above issue is mismanagement of defrag calls, then HTTP 500 makes sense. However, a similar issue might happen if etcd was bombarded with a ton of writes or had a very slow disk (only a hypothesis atm, I haven't tested); in such cases it might make more sense for the apiserver to return 429 (Too Many Requests) when it is itself receiving such an error from etcd.
Should it be throttling new requests as soon as it sees such errors from etcd? Again, does it make sense for kube-apiserver to throttle write requests in some form (maybe for a few seconds) after it receives the initial few "etcdserver: too many requests" errors, to avoid bombarding etcd further? (Disclaimer: this is again something I'm not sure can happen outside of the defrag case we saw above, but I can try reproducing it with a heavy write load if you think it's worth it.)
@wojtek-t @liggitt @lavalamp @xiang90 @jpbetz - would love to hear your thoughts on the above
/sig scalability /sig api-machinery