Closed: smarterclayton closed this issue 6 years ago.
This is a must-fix for shipping aggregated APIs - this wedged me while trying to debug a cluster. @openshift/sig-master
/cc @deads2k @sttts
It was the metrics server, which was down because it was on a node with no network connectivity.
Ouch.
Do we know which discovery endpoint blocked? I would guess one of the /group/version/serverresources.json files?
For kubectl, I think we try to solve the "slow apiserver problem" client-side with dialer timeouts and TLS connection timeouts (roughly the kind of bounds in the sketch below). In the aggregator, I don't think we set an explicit timeout; we expect clients to decide when to give up.
Once the last endpoint is removed, the status controller will pull that APIService out of rotation, but a dangling, bad endpoint could still wedge.
Do you want the aggregator to force the "how slow can you be" decision for all clients globally, or would you expect clients to choose how long they'd like to wait?
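For reference, roughly what those client-side bounds look like with plain net/http - a sketch, not the actual kubectl wiring, and the timeout values are illustrative:

```go
package main

import (
	"net"
	"net/http"
	"time"
)

// newBoundedClient returns an HTTP client with explicit dial and TLS
// handshake timeouts so a hung backend can't wedge the caller.
func newBoundedClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			// Fail fast if the TCP connection can't be established,
			// e.g. because packets to the backend are blackholed.
			DialContext: (&net.Dialer{
				Timeout:   5 * time.Second,
				KeepAlive: 30 * time.Second,
			}).DialContext,
			// Bound the TLS handshake separately.
			TLSHandshakeTimeout: 5 * time.Second,
		},
		// Overall request deadline as a last resort.
		Timeout: 30 * time.Second,
	}
}

func main() {
	_ = newBoundedClient()
}
```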
kubectl can't tell the difference between this and a slow API server. I'll see if I can find a timeout on the discovery request code. It'll be smaller than the other timeout options.
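Something like this sketch with client-go - not the actual change, and the 10s value is illustrative:

```go
package main

import (
	"time"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// newDiscoveryClient builds a discovery client with a tighter timeout than
// the rest of the config, so a wedged aggregated apiserver can't hang
// resource resolution indefinitely.
func newDiscoveryClient(kubeconfig string) (discovery.DiscoveryInterface, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	// Copy the config so the shorter timeout only affects discovery calls.
	discoveryCfg := rest.CopyConfig(cfg)
	discoveryCfg.Timeout = 10 * time.Second
	return discovery.NewDiscoveryClientForConfig(discoveryCfg)
}

func main() {
	if _, err := newDiscoveryClient(clientcmd.RecommendedHomeFile); err != nil {
		panic(err)
	}
}
```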
If the metrics server service has had no running pod, why didn't the connect fail right away?
I.e. either we have a retry loop somewhere, or this sounds like a TCP problem.
If the metrics server service has had no running pod, why didn't the connect fail right away?
Dial was trying to route to an IP that didn't exist, and some part of the networking stack didn't fail quickly? The pod still shows as existing when the node dies, so the endpoint is still listed.
I'd like a discovery bypass flag or something in kubectl that allows you to not rely on discovery (i.e. bypass all the smarts). The smarts are what broke this - kubectl delete apiservice should have worked without needing to get metrics discovery.
kubectl delete apiservice requires discovery, right? There's no other way to know which API group or version that resource is in.
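Roughly, because resolving a bare resource name goes through a RESTMapper built from discovery data - a sketch with client-go's restmapper, not the actual kubectl code path:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/restmapper"
)

// resolve maps a short resource name like "apiservice" to a concrete
// group/version/resource. The mapper is built from discovery data, which is
// why the command has to fetch every group's discovery doc first.
func resolve(dc discovery.DiscoveryInterface, resource string) (schema.GroupVersionResource, error) {
	groupResources, err := restmapper.GetAPIGroupResources(dc) // walks all discovered groups
	if err != nil {
		return schema.GroupVersionResource{}, err
	}
	mapper := restmapper.NewDiscoveryRESTMapper(groupResources)
	return mapper.ResourceFor(schema.GroupVersionResource{Resource: resource})
}

func main() {
	fmt.Println("see resolve(); requires a live discovery client")
}
```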
But if you can discover an API group that has apiservices, I shouldn't have to block on discovering all other groups.
Discovery needs to be either AP or CP (in the CAP sense: prefer availability or prefer consistency); I think it needs to be AP, because otherwise kubectl dies. Or it gets a --skip-discovery or --best-effort-discovery flag which lets you opt into AP.
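The best-effort variant could look something like this sketch - client-go already surfaces partial discovery failures as a distinct error; wiring it into a kubectl flag is the hypothetical part:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/discovery"
)

// bestEffortResources returns whatever discovery data is reachable and only
// warns about groups whose backing apiserver failed, instead of failing the
// whole command.
func bestEffortResources(dc discovery.DiscoveryInterface) ([]*metav1.APIResourceList, error) {
	resources, err := dc.ServerPreferredResources()
	if discovery.IsGroupDiscoveryFailedError(err) {
		// e.g. metrics.k8s.io unreachable: keep going with the groups we got.
		fmt.Printf("warning: partial discovery failure: %v\n", err)
		return resources, nil
	}
	return resources, err
}

func main() {
	fmt.Println("see bestEffortResources(); requires a live discovery client")
}
```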
But if you can discover an API group that has apiservices, I shouldn't have to block on discovering all other groups.
definitely agree with that
@liggitt didn't you have a lazy download PR for discovery?
@liggitt didn't you have a lazy download PR for discovery?
You're thinking of https://github.com/kubernetes/kubernetes/pull/53303, which was for specific run/create commands, not commands where we rely on the full discovery tree for resource resolution.
@sttts @deads2k see full client logs in this comment: https://github.com/openshift/origin/issues/17159#issuecomment-341559718 (click to expand)
@stevekuznetsov thanks for the logs. Besides @smarterclayton's suggestion to have an ignore-discovery option, I still think https://github.com/openshift/origin/issues/17159#issuecomment-341716620 is the main issue: why didn't the Dial fail quickly? Do we have a test cluster to check what happens to a TCP connection attempt from the master to a just-failed pod IP?
Dial failed because packets from the aggregator to that pod were being blackholed - in that case only the dial timeout matters until the OS gives up.
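You can reproduce that failure mode locally without a cluster - a quick sketch, with 203.0.113.1 (a documentation address) standing in for the dead pod IP:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// 203.0.113.1 (TEST-NET-3) stands in for the dead pod IP: SYNs are
	// dropped, nothing answers, and without a dial timeout the connect
	// attempt only fails once the OS gives up retransmitting (minutes).
	d := net.Dialer{Timeout: 3 * time.Second}
	start := time.Now()
	_, err := d.Dial("tcp", "203.0.113.1:443")
	fmt.Printf("dial returned after %v: %v\n", time.Since(start), err)
}
```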
Did we mostly fix this?
Looks like we are hitting this again, maybe in a slightly different form on free-int: we get 429s and kubectl retries 10 times. Digging...
Verified: the service catalog was returning 429 for an unknown reason (saw some 1m timeouts, but not much more otherwise). Removing the APIService helped. /cc @mwoodson
/cc @juanvallejo
@sttts will update https://github.com/kubernetes/kubernetes/pull/60434 to reduce the number of retries, if that seems like the better approach
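For context, the client behavior under discussion is roughly this: retry on 429, honoring Retry-After, with a cap on attempts so a misbehaving aggregated apiserver can't stall the client for long. The helper and values here are illustrative, not the actual client-go code:

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// getWithRetries retries a GET on HTTP 429, honoring Retry-After, but caps
// the number of attempts.
func getWithRetries(client *http.Client, url string, maxRetries int) (*http.Response, error) {
	for attempt := 0; ; attempt++ {
		resp, err := client.Get(url)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests || attempt >= maxRetries {
			return resp, nil
		}
		// Back off for the server-suggested interval, or one second by default.
		delay := time.Second
		if s := resp.Header.Get("Retry-After"); s != "" {
			if secs, err := strconv.Atoi(s); err == nil {
				delay = time.Duration(secs) * time.Second
			}
		}
		resp.Body.Close()
		fmt.Printf("got 429, retrying in %v (attempt %d of %d)\n", delay, attempt+1, maxRetries)
		time.Sleep(delay)
	}
}

func main() {
	_ = getWithRetries // sketch only; call with a real client and URL
}
```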
Alternative approach in https://github.com/kubernetes/kubernetes/pull/62733
Scenario:
Solution: do not fail this way.