kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Watch shouldn't break every 5 minutes #6513

Closed · wojtek-t closed this issue 10 years ago

wojtek-t commented 10 years ago

Currently, the watch is broken every 5 minutes. It is then "recreated", which sometimes requires listing all elements under watch (if there were more than 1000 "uninteresting" etcd writes since then).

I dug deeper into it and it seems this is not related to etcd itself. What happens is:

  1. a component (e.g. the scheduler) starts a watch by sending an HTTP request to the apiserver
  2. the apiserver initiates a watch on etcd
  3. after 5 minutes, the request to the apiserver finishes (this is something I don't fully understand - is this a timeout?)
  4. this triggers "canceling" the request to etcd (see the sketch below)
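
A rough sketch of that chain (illustrative Go only, not the actual apiserver code; `fakeEtcdWatch` and the `/watch/pods` path are stand-ins): the handler streams events until the client request ends, and returning from the handler is what cancels the underlying etcd watch.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

// fakeEtcdWatch stands in for the watch the apiserver holds against etcd;
// it stops as soon as stop is closed (a stand-in, not the real etcd client).
func fakeEtcdWatch(stop <-chan struct{}) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for i := 0; ; i++ {
			select {
			case <-stop:
				return // step 4: the request to etcd gets "canceled" here
			case out <- fmt.Sprintf("event %d", i):
				time.Sleep(time.Second)
			}
		}
	}()
	return out
}

func watchHandler(w http.ResponseWriter, r *http.Request) {
	stop := make(chan struct{})
	defer close(stop) // returning from the handler tears down the etcd watch

	events := fakeEtcdWatch(stop)
	flusher, _ := w.(http.Flusher)
	for {
		select {
		case <-r.Context().Done():
			// Step 3: the request ended (server-side timeout or client gone).
			return
		case ev, ok := <-events:
			if !ok {
				return
			}
			fmt.Fprintln(w, ev)
			if flusher != nil {
				flusher.Flush()
			}
		}
	}
}

func main() {
	http.HandleFunc("/watch/pods", watchHandler) // step 1: the client GETs this URL
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```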

cc @fgrzadkowski

wojtek-t commented 10 years ago

I think that 3 is because of: https://github.com/GoogleCloudPlatform/kubernetes/blob/master/cmd/kube-apiserver/app/server.go#L386

I think that we should change it so that all "Watch" requests have a much bigger timeout (but don't change the timeout of all other GETs). Is it possible to specify it at the client level by customizing its "transport" part?
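
For reference, at the plain net/http level the knobs look roughly like this (a sketch using only the standard library, not our actual client; the URL and numbers are placeholders): the client's overall timeout is left at zero so the streaming body is never cut off on the client side, while the transport still bounds connection setup. This would not help with a timeout enforced on the server side.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net"
	"net/http"
	"time"
)

func main() {
	watchClient := &http.Client{
		// Timeout is left at 0 (the default): no deadline on the whole
		// response, which is what a long-running watch needs.
		Transport: &http.Transport{
			DialContext: (&net.Dialer{
				Timeout:   30 * time.Second, // bounds connection setup only
				KeepAlive: 30 * time.Second,
			}).DialContext,
			TLSHandshakeTimeout:   10 * time.Second,
			ResponseHeaderTimeout: 30 * time.Second, // time to first byte, not the stream
		},
	}

	// URL is illustrative; the apiserver address and path are assumptions here.
	resp, err := watchClient.Get("http://localhost:8080/api/v1/pods?watch=true")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Stream watch events line by line until the server closes the connection
	// (e.g. when its own 5-minute timeout fires).
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
}
```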

What do you think about it? It shouldn't conflict with #6207, because we're ignoring watches there, right?

cc @brendanburns @davidopp @lavalamp

ghost commented 10 years ago

Yes, I can confirm that we're ignoring watches in #6207:

fs.StringVar(&s.LongRunningRequestRE, "long_running_request_regexp", "[.*\/watch$][^\/proxy.*]", "A regular expression matching long running requests which should be excluded from maximum inflight request handling.")

And yes, I think that watches should similarly be excluded from API server timeouts, but would like @bgrant0607 to comment on what the intended semantics and implementation of long-term watches are. Do they have any timeout? Is there intended to be a limit on the number of concurrent long-term watches open against the API server?

timothysc commented 10 years ago

@wojtek-t could you keep me in the loop on new items created, and possibly label them as perf, because I think we are chasing the same things.

bgrant0607 commented 10 years ago

cc @smarterclayton

wojtek-t commented 10 years ago

@timothysc will do

smarterclayton commented 10 years ago

I think we need to fix the etcd problem. Working around it on our end is just too painful, and doesn't help us (short of aggregating all watches on our end, which is a larger chunk of work).

My recommendation is to work with upstream etcd to propose the minimal change for https://github.com/coreos/etcd/issues/2048 required to fix the issue on our end (a client option that requests that etcd indicate the window has closed and provide the latest etcd index at that time). That fixes the short-term problem in a way that benefits everyone.

timothysc commented 10 years ago

@wojtek-t FWIW, we had seen that when the cluster was in a stalled state, other actions would continue only after a watch was reset by a timeout, but we have not checked whether the recent etcd modifications re: #6059 have had an effect.

wojtek-t commented 10 years ago

@smarterclayton I don't understand your comment - I didn't observe any problem at the etcd level. The watch on etcd doesn't break by itself; the only problem I observe is the "scheduler -> apiserver" HTTP request that is timing out [the etcd issue that you mentioned is only a consequence of our internal request timing out].

@timothysc: sure - I will look into it as soon as the new etcd release is built

smarterclayton commented 10 years ago

@wojtek-t the problem you alluded to is being outside the window of watchable events. The disconnect would not be a problem if the client knew to ask for resource version X+10000; it is only a problem because the client only knows to ask for resource version X.

> It is then "recreated", which sometimes requires listing all elements under watch (if there were more than 1000 "uninteresting" etcd writes since then).

This is only a problem because clients don't get an updated resource version after 1000 uninteresting writes. #2048 is intended to fix that problem.

smarterclayton commented 10 years ago

The 5 minute timeout is our maximum http.Server timeout. I don't think we should increase this, because it doesn't fix the problem from #2048, which affects every watcher.
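
For context, this is the kind of setting being referred to (a sketch only; the exact values and handler wiring in kube-apiserver's server.go are not reproduced here). net/http's WriteTimeout covers the whole response, so any streaming watch is closed once it elapses, regardless of how the client configured itself.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	// Handlers elided; the point is the server-wide timeouts below.
	srv := &http.Server{
		Addr:        ":8080",
		Handler:     mux,
		ReadTimeout: 5 * time.Minute,
		// WriteTimeout applies to the entire response, so any streaming watch
		// is closed after at most 5 minutes, whatever the client asked for.
		WriteTimeout: 5 * time.Minute,
	}
	log.Fatal(srv.ListenAndServe())
}
```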

bgrant0607 commented 10 years ago

cc @hchoudh

wojtek-t commented 10 years ago

OK - I agree that the issue you mentioned is more important to fix. However, I think that having a longer timeout for watches (or for some special requests) might be beneficial [or, in other words, having shorter timeouts for other requests].

smarterclayton commented 10 years ago

On the other hand, dead clients are going to be consuming resources, especially now that we have limit pools. I do not want someone to be able to mount a DOS attack against the API server by opening watches and starving traffic. Having to reestablish their connection periodically is a part of ensuring that clients that exceed their rate limit are throttled (vs being able to open a watch forever).


davidopp commented 10 years ago

@smarterclayton can you explain this attack mechanism a bit more? Regardless of watch timeout, can't someone successfully attack by running a lot of clients that keep renewing? Statistically, some non-attack clients will get through if the attackers keep having to renew, but if the attacker runs enough clients then I think they can keep that number close to zero.

davidopp commented 10 years ago

Also, can you send a pointer to the PR/issue that introduced "limit pools"?

smarterclayton commented 10 years ago


If you have a rate limiter based on source IP or client credentials, you could open a lot of watches at once and refresh them within your rate window. The longer watches are, the lower the rate given to individual clients is. On the other hand, if reestablishing a watch is cheap (because clients update their window), then a lower timeout turns over the connected clients more often.


davidopp commented 10 years ago

I see, so this helps if you have a rate limiter based on something that the attacker cannot easily generate a lot of. Do we have this in the system currently?

smarterclayton commented 10 years ago

We do not, although I would like to extend Brendan's initial work to that eventually.


lavalamp commented 10 years ago

We have a time limit in the apiserver; it could be lengthened. Note there's also a time limit in nginx.

I don't really think that a 5 minute timeout should be a problem. My thinking is that to get a substantial gain here, say 10x, we have to extend this period to 50 minutes. That seems clearly too long to me.

lavalamp commented 10 years ago

Really, to be correct, clients need to re-list occasionally, and they have to be able to handle the watch closing at any time. I think the five minute forced close is fine because it forces clients to do it right.
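
A sketch of that client behavior (illustrative Go; `listPods`, `watchPods`, `Event`, and `errVersionTooOld` are hypothetical stand-ins, not the real client library): list once to get a resource version, watch from it, and on any close or "version too old" error fall back to a fresh list.

```go
package main

import (
	"errors"
	"time"
)

// Event, listPods, watchPods and errVersionTooOld are hypothetical stand-ins
// so the sketch compiles; a real client would call the apiserver instead.
type Event struct {
	ResourceVersion string
}

var errVersionTooOld = errors.New("requested resource version is too old")

func listPods() (pods []string, resourceVersion string, err error) {
	return nil, "0", nil
}

func watchPods(fromVersion string) (<-chan Event, error) {
	ch := make(chan Event)
	go func() {
		time.Sleep(2 * time.Second) // pretend the server held the stream open...
		close(ch)                   // ...and then closed it (the forced timeout)
	}()
	return ch, nil
}

// watchForever is the loop every watcher is expected to run: re-list when it
// has to, otherwise keep resuming the watch from the last version it saw.
func watchForever() {
	for {
		_, resourceVersion, err := listPods() // the expensive full re-list
		if err != nil {
			time.Sleep(time.Second)
			continue
		}

		for {
			events, err := watchPods(resourceVersion)
			if errors.Is(err, errVersionTooOld) {
				break // fell out of etcd's event window: back to a full list
			}
			if err != nil {
				break // some other failure: re-list to be safe
			}
			for ev := range events {
				resourceVersion = ev.ResourceVersion // remember where we are
			}
			// Stream closed cleanly (e.g. the forced 5-minute close); resume
			// the watch from the last version seen instead of re-listing.
		}
	}
}

func main() { watchForever() }
```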

lavalamp commented 10 years ago

Essentially I think this behavior is by design, not a bug, and shouldn't adversely affect performance.

timothysc commented 10 years ago

@wojtek-t @lavalamp - the lumpiness in the response curve is what our old graph showed:

[image: graph of the old response curve]

It shows overall bursts of up to 30 pods/sec, and our goal was to eliminate the lumps.
What is interesting is the slope difference between the old graph here and the new one in #6207.

davidopp commented 10 years ago

@lavalamp thinks the system should be able to handle a re-list every 5 minutes, so this is WAI (working as intended).

smarterclayton commented 10 years ago

I would clarify that based on etcd #2048: the system should be able to handle a re-list every five minutes, but it should not require all clients to re-list every five minutes. Re-listing is a client responsibility.


wojtek-t commented 10 years ago

@smarterclayton @davidopp I agree that fixing the original issue at the etcd level is a much better (and cleaner) solution. But IIUC, this is not a high priority for etcd and it will not happen any time soon. However, in a 100-node cluster, with each Kubelet having at least a watch on all pods, this may generate significant load. So I still think that we might need to consider a temporary workaround for it in the near future.
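
For rough scale (the per-node pod count is just an assumption for illustration): 100 Kubelets re-listing every 5 minutes means a full pod LIST hitting the apiserver about every 3 seconds, and at, say, 30 pods per node each of those responses carries on the order of 3,000 objects - far more data than the handful of incremental watch events it replaces.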

cc @fgrzadkowski

smarterclayton commented 10 years ago

If it's a high priority for us, we should help them fix it. I suspect that it's a fairly easy fix once the API details are sorted out (the server can return 204 No Content with the updated etcd index when the watch window is exceeded, the client can recognize that and return an error, we read the error and return a typed error back to clients, who update their resource watch version).
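
A hypothetical sketch of what that could look like on our side (the 204 handling, the header name, and the error type are all assumptions, since the etcd change doesn't exist yet):

```go
package watchclient

import (
	"fmt"
	"net/http"
	"strconv"
)

// ErrWindowExceeded is the typed error callers would get back when etcd
// reports that the watch window was exceeded (hypothetical, see above).
type ErrWindowExceeded struct {
	LatestIndex uint64 // the index etcd reported when it closed the window
}

func (e ErrWindowExceeded) Error() string {
	return fmt.Sprintf("watch window exceeded; latest etcd index is %d", e.LatestIndex)
}

// checkWatchResponse turns the proposed "204 No Content + current index"
// reply into a typed error so callers can fast-forward their watch version
// instead of doing a full re-list.
func checkWatchResponse(resp *http.Response) error {
	if resp.StatusCode != http.StatusNoContent {
		return nil // normal watch stream, nothing to do
	}
	idx, err := strconv.ParseUint(resp.Header.Get("X-Etcd-Index"), 10, 64)
	if err != nil {
		return fmt.Errorf("watch window exceeded but no usable index: %v", err)
	}
	return ErrWindowExceeded{LatestIndex: idx}
}

// A caller would then do something like:
//
//	if werr, ok := err.(ErrWindowExceeded); ok {
//		resourceVersion = strconv.FormatUint(werr.LatestIndex, 10) // fast-forward, no re-list
//	}
```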


gatici commented 5 years ago

Dear all, I am doing a big installation with Helm charts and my installation is failing every time with the error below. I am using an RKE cluster. Is it possible to increase the timeout of the watch? How can I do it?

Thanks,