^ that's truly impressive!
It was merely a (very scary!) control plane outage. The monitoring systems were running during the outage but inaccessible since dex (the auth system) was down, so I actually did have the data to prove the apps were up, once the storm was over.
The biggest challenge in recovering from this failure was doing so without access to the monitoring systems (at some point I actually made an SSH port forward directly to the machines running the grafana/kibana pods). Fun times.
Thanks!
@pieterlange can you sort it into the right place (newest on top)?
Is there any real solution for this issue? Right now anyone can run a curl loop and bring the cluster down.
We use skipper as a kube-apiserver sidecar to do auth, and we can easily add client rate limits: https://opensource.zalando.com/skipper/reference/filters/#clientratelimit
Auth is done by tokens, and validation is done by a tokeninfo sidecar that is fast enough.
To protect your dex endpoint you can either put skipper in front of it, too, or bind it to localhost in the apiserver pod and use skipper to integrate with it. The localhost setup might need some special routes, but this can be achieved. A rough sketch of such a route is below.
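A minimal eskip sketch of what that sidecar route could look like, assuming skipper runs in the kube-apiserver pod and proxies to the apiserver on localhost:6443 (the route name, limit, and addresses here are illustrative, not our actual config):

```
// hypothetical sidecar route: cap each client (keyed on the Authorization
// header) at 20 requests per minute, then proxy to the local apiserver
kubeapi: *
  -> clientRatelimit(20, "1m", "Authorization")
  -> "https://127.0.0.1:6443";
```

Keying the limit on the Authorization header throttles one misbehaving token (e.g. a curl loop) without affecting other clients; if you omit the third argument, clientRatelimit falls back to identifying clients by IP (X-Forwarded-For), per the filter docs linked above.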
@githubrotem see also https://twitter.com/pst418/status/1216739457400999938
Slides for my failure story related to the default dex configuration storing authrequests as CustomResources and its potential for nuking your Kubernetes control plane.
The link: https://pieterlange.github.io/failure-stories/2019-06.dex.html
Ref: https://github.com/dexidp/dex/issues/1292
Shared at: https://www.meetup.com/Dutch-Kubernetes-Meetup/events/262313920/