pomerium / ingress-controller

Pomerium Kubernetes Ingress Controller
https://pomerium.com
Apache License 2.0

Support zero downtime rollout restart of target deployments. #529

Open dangarthwaite opened 1 year ago

dangarthwaite commented 1 year ago

What happened?

Doing a rollout restart of the verify service results in a small window of downtime.

What did you expect to happen?

Rollout restarts of a targeted application should result in zero failed requests.

How'd it happen?

$ kubectl rollout restart -n ingress deploy/verify &&
  while sleep .25; do 
  curl -sv https://verify.example.com/healthcheck 2>&1 | grep -E '^< '; 
done
deployment.apps/verify restarted
< HTTP/2 302
< date: Mon, 13 Feb 2023 22:55:54 GMT
< content-type: text/html; charset=UTF-8
< content-length: 1411
< location: https://sso.example.com/.pomerium/sign_in?pomerium_expiry=1676329254&pomerium_idp_id=C3HESNcvnqS4eqcjUna9rVax8spGEWe2sC8F65GTt2ip&pomerium_issued=1676328954&pomerium_redirect_uri=https%3A%2F%2Fverify.example.com%2Fhealthcheck&pomerium_signature=3-t-KBAxkDVUDmG29m5yUifNAnatGs_qwvB5cmnl58w%3D
< x-pomerium-intercepted-response: true
< x-frame-options: SAMEORIGIN
< x-xss-protection: 1; mode=block
< server: envoy
< x-request-id: 6eda7b91-d88f-46e0-be2c-b265fcbebe88
<
< HTTP/2 404
< date: Mon, 13 Feb 2023 22:55:54 GMT
< content-type: text/html; charset=UTF-8
< content-length: 1414
< x-pomerium-intercepted-response: true
< x-frame-options: SAMEORIGIN
< x-xss-protection: 1; mode=block
< server: envoy
< x-request-id: b53bb012-9725-4174-b7a1-b566934fd3a4
<
< HTTP/2 302
< date: Mon, 13 Feb 2023 22:55:55 GMT
< content-type: text/html; charset=UTF-8
< content-length: 1411
< location: https://sso.example.com/.pomerium/sign_in?pomerium_expiry=1676329255&pomerium_idp_id=C3HESNcvnqS4eqcjUna9rVax8spGEWe2sC8F65GTt2ip&pomerium_issued=1676328955&pomerium_redirect_uri=https%3A%2F%2Fverify.example.com%2Fhealthcheck&pomerium_signature=NsZdyX7CDZJjKGpro7tebeQoGmrI2r53jZzSn_av2Dc%3D
< x-pomerium-intercepted-response: true
< x-frame-options: SAMEORIGIN
< x-xss-protection: 1; mode=block
< server: envoy
< x-request-id: 17ab805d-9fac-4eb5-8e3a-8608dc855d36

What's your environment like?

$ kubectl -n ingress get deploy/pomerium -o yaml | yq '.spec.template.spec.containers[0].image'
pomerium/ingress-controller:sha-cdc389c
$ kubectl get nodes -o yaml | yq '.items[-1] | .status.nodeInfo'
architecture: arm64
bootID: 247a7d1c-b579-4f89-b1b6-2b98883e4150
containerRuntimeVersion: docker://20.10.17
kernelVersion: 5.4.226-129.415.amzn2.aarch64
kubeProxyVersion: v1.21.14-eks-fb459a0
kubeletVersion: v1.21.14-eks-fb459a0
machineID: ec2d3bd9b9ea252972a242d7da68e233
operatingSystem: linux
osImage: Amazon Linux 2
systemUUID: ec2d3bd9-b9ea-2529-72a2-42d7da68e233

What's your config.yaml?

apiVersion: ingress.pomerium.io/v1
kind: Pomerium
metadata:
  name: global
spec:
  authenticate:
    url: https://sso.example.com
  certificates:
  - ingress/tls-wildcards
  identityProvider:
    provider: google
    secret: ingress/google-idp-creds
  secrets: ingress/bootstrap
status:
  ingress:
    ingress/verify:
      observedAt: "2023-02-13T22:55:54Z"
      observedGeneration: 2
      reconciled: true
    sandbox/example-ingress:
      observedAt: "2023-02-13T22:28:14Z"
      observedGeneration: 6
      reconciled: true
  settingsStatus:
    observedAt: "2023-02-10T15:33:57Z"
    observedGeneration: 5
    reconciled: true
    warnings:
    - 'storage: please specify a persistent storage backend, please see https://www.pomerium.com/docs/topics/data-storage#persistence'

What did you see in the logs?

{
  "level": "info",
  "service": "envoy",
  "upstream-cluster": "",
  "method": "GET",
  "authority": "verify.ops.bereal.me",
  "path": "/healthcheck",
  "user-agent": "curl/7.81.0",
  "referer": "",
  "forwarded-for": "71.254.0.45,10.123.60.7",
  "request-id": "fb2011be-f4d0-47bd-9094-8ad12f583009",
  "duration": 14.398539,
  "size": 1414,
  "response-code": 404,
  "response-code-details": "ext_authz_denied",
  "time": "2023-02-14T01:34:20Z",
  "message": "http-request"
}
wasaga commented 1 year ago

Currently Pomerium uses the Service Endpoints object, which is updated once Pods are terminated or new ones become Ready. The update takes a bit of time, which is the root cause of the downtime.
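
One way to observe that lag directly (assuming the Service backing the Ingress is also named verify in the ingress namespace, which this issue doesn't show) is to watch the Endpoints object while the rollout runs:

$ kubectl get endpoints verify -n ingress -w &
$ kubectl rollout restart -n ingress deploy/verify

The window where the old Pod IP has already been removed but the new one has not yet been added should line up with the 404 / ext_authz_denied responses shown in the log above.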

One current option to avoid the short downtime window is to use the Kubernetes service proxy instead; see https://www.pomerium.com/docs/deploying/k8s/ingress#service-proxy
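
If I read that page correctly, the switch is the ingress.pomerium.io/service_proxy_upstream annotation, which tells Pomerium to send traffic to the Service's cluster IP rather than to the individual Pod endpoints. A rough sketch for the verify route follows; the backend Service name and port are guesses, since they are not shown in this issue:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: verify
  namespace: ingress
  annotations:
    # route via the Service (kube-proxy) instead of the Endpoints list
    ingress.pomerium.io/service_proxy_upstream: "true"
spec:
  ingressClassName: pomerium
  rules:
  - host: verify.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: verify   # assumed Service name
            port:
              number: 8080 # assumed port

Traffic then goes through the Service's virtual IP, so kube-proxy's own endpoint handling decides which Pods receive requests during the rollout.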

In the long term, we should probably start using the newer EndpointSlice object, which takes Pod conditions into consideration: https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/#conditions
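
For reference, EndpointSlice exposes per-endpoint ready / serving / terminating conditions, so a watcher could keep a draining Pod in rotation until its replacement is Ready. A hand-written example of what a slice might look like mid-rollout (illustrative names, addresses, and port only, not taken from this cluster):

apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: verify-abc12            # illustrative name
  namespace: ingress
  labels:
    kubernetes.io/service-name: verify
addressType: IPv4
ports:
- name: http
  protocol: TCP
  port: 8080                    # assumed port
endpoints:
- addresses:
  - 10.123.60.21                # old Pod, still draining
  conditions:
    ready: false
    serving: true
    terminating: true
- addresses:
  - 10.123.60.22                # new Pod, not Ready yet
  conditions:
    ready: false
    serving: false
    terminating: false

The live objects can be inspected with: kubectl get endpointslices -n ingress -l kubernetes.io/service-name=verify -o yaml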