ory / kratos

The most scalable and customizable identity server on the market. Replace your Homegrown, Auth0, Okta, Firebase with better UX and DX. Has all the tablestakes: Passkeys, Social Sign In, Multi-Factor Auth, SMS, SAML, TOTP, and more. Written in Go, cloud native, headless, API-first. Available as a service on Ory Network and for self-hosters.
https://www.ory.sh/?utm_source=github&utm_medium=banner&utm_campaign=kratos
Apache License 2.0
11.24k stars 963 forks source link

Kratos on k8s does not get ready with readinessProbe #1931

Closed viters closed 2 years ago

viters commented 2 years ago

Preflight checklist

Describe the bug

Extension to: https://github.com/ory/kratos/issues/1851.

Whenever I add readinessProbe to Kratos deployment, it never gets ready.

readinessProbe:
  httpGet:
    path: /health/ready
    port: http-admin
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 5

The deployment will not become ready - pod returns 503 (with no further explanation) no matter how much time passes (it gets restarted after pod gets marked as failing). Without readinessProbe, I can call /health/ready right after Kratos starts.

Reproducing the bug

I use minikube on Macbook Pro M1 2020, problem was both in Big Sur and right now in Monterey (12.0.1).

$ minikube version
minikube version: v1.23.2
commit: 0a0ad764652082477c00d51d2475284b5d39ceed
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:10:45Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:32:41Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/arm64"}
  1. Download my minimal k8s deployment Kustomization: kratos.zip
  2. Run the Kustomization, watch Kratos pod logs. After sqlite migrations are applied and readinessProbe kicks in (after 60 seconds), /health/alive requests work fine, but /health/ready returns 503 and prevents deployment from being marked as ready.
  3. When I remove readinessProbe from deployment configuration, just as after pod starts, I can forward traffic and call /health/ready myself with success:
    
    HTTP/1.1 200 OK
    Content-Type: application/json; charset=utf-8
    Date: Sat, 06 Nov 2021 19:35:38 GMT
    Content-Length: 16

{"status":"ok"}


The same problem applies to my more complicated setup with TLS, courier and postgresql connection. It's strange, because 503 response does not provide any additional information, nor it is visible in the logs.

### Relevant log output

```shell
time=2021-11-06T19:43:49Z level=debug msg=Successfully applied 589 migrations. audience=application service_name=Ory Kratos service_version=v0.8.0-alpha.3
time=2021-11-06T19:43:49Z level=debug msg=28.9989 seconds audience=application service_name=Ory Kratos service_version=v0.8.0-alpha.3
time=2021-11-06T19:43:49Z level=warning msg=Migrator: unable to dump schema audience=application error=map[message:open schema.sql: permission denied] service_name=Ory Kratos service_version=v0.8.0-alpha.3
time=2021-11-06T19:43:49Z level=info msg=Software quality assurance features are enabled. Learn more at: https://www.ory.sh/docs/ecosystem/sqa audience=application service_name=Ory Kratos service_version=v0.8.0-alpha.3
time=2021-11-06T19:43:49Z level=info msg=TLS has not been configured for admin, skipping audience=application service_name=Ory Kratos service_version=v0.8.0-alpha.3
time=2021-11-06T19:43:50Z level=info msg=Starting the admin httpd on: 0.0.0.0:4434 audience=application service_name=Ory Kratos service_version=v0.8.0-alpha.3
time=2021-11-06T19:43:50Z level=info msg=TLS has not been configured for public, skipping audience=application service_name=Ory Kratos service_version=v0.8.0-alpha.3
time=2021-11-06T19:43:50Z level=info msg=Starting the public httpd on: 0.0.0.0:4433 audience=application service_name=Ory Kratos service_version=v0.8.0-alpha.3
time=2021-11-06T19:43:50Z level=info msg=started handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/alive query:<nil> remote:172.17.0.1:57142 scheme:http]
time=2021-11-06T19:43:50Z level=info msg=started handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/ready query:<nil> remote:172.17.0.1:57140 scheme:http]
time=2021-11-06T19:43:50Z level=info msg=completed handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/alive query:<nil> remote:172.17.0.1:57142 scheme:http] http_response=map[headers:map[content-type:application/json; charset=utf-8] size:16 status:200 text_status:OK took:255.847875ms]
time=2021-11-06T19:43:51Z level=info msg=completed handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/ready query:<nil> remote:172.17.0.1:57140 scheme:http] http_response=map[headers:map[content-type:application/json; charset=utf-8] size:69 status:503 text_status:Service Unavailable took:1.053712167s]
time=2021-11-06T19:44:00Z level=info msg=started handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/alive query:<nil> remote:172.17.0.1:57258 scheme:http]
time=2021-11-06T19:44:00Z level=info msg=started handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/ready query:<nil> remote:172.17.0.1:57260 scheme:http]
time=2021-11-06T19:44:00Z level=info msg=completed handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/alive query:<nil> remote:172.17.0.1:57258 scheme:http] http_response=map[headers:map[content-type:application/json; charset=utf-8] size:16 status:200 text_status:OK took:2.650583ms]
time=2021-11-06T19:44:01Z level=info msg=completed handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/ready query:<nil> remote:172.17.0.1:57260 scheme:http] http_response=map[headers:map[content-type:application/json; charset=utf-8] size:69 status:503 text_status:Service Unavailable took:1.059639626s]
time=2021-11-06T19:44:10Z level=info msg=started handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/alive query:<nil> remote:172.17.0.1:57368 scheme:http]
time=2021-11-06T19:44:10Z level=info msg=started handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/ready query:<nil> remote:172.17.0.1:57370 scheme:http]
time=2021-11-06T19:44:10Z level=info msg=completed handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/alive query:<nil> remote:172.17.0.1:57368 scheme:http] http_response=map[headers:map[content-type:application/json; charset=utf-8] size:16 status:200 text_status:OK took:1.075042ms]
time=2021-11-06T19:44:11Z level=info msg=completed handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/ready query:<nil> remote:172.17.0.1:57370 scheme:http] http_response=map[headers:map[content-type:application/json; charset=utf-8] size:69 status:503 text_status:Service Unavailable took:1.059947959s]
time=2021-11-06T19:44:20Z level=info msg=started handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/ready query:<nil> remote:172.17.0.1:57482 scheme:http]
time=2021-11-06T19:44:20Z level=info msg=started handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/alive query:<nil> remote:172.17.0.1:57480 scheme:http]
time=2021-11-06T19:44:20Z level=info msg=completed handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/alive query:<nil> remote:172.17.0.1:57480 scheme:http] http_response=map[headers:map[content-type:application/json; charset=utf-8] size:16 status:200 text_status:OK took:2.298667ms]
time=2021-11-06T19:44:21Z level=info msg=completed handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/ready query:<nil> remote:172.17.0.1:57482 scheme:http] http_response=map[headers:map[content-type:application/json; charset=utf-8] size:69 status:503 text_status:Service Unavailable took:1.057544542s]
time=2021-11-06T19:44:30Z level=info msg=started handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/ready query:<nil> remote:172.17.0.1:57576 scheme:http]
time=2021-11-06T19:44:30Z level=info msg=started handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/alive query:<nil> remote:172.17.0.1:57578 scheme:http]
time=2021-11-06T19:44:30Z level=info msg=completed handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/alive query:<nil> remote:172.17.0.1:57578 scheme:http] http_response=map[headers:map[content-type:application/json; charset=utf-8] size:16 status:200 text_status:OK took:1.213667ms]
time=2021-11-06T19:44:31Z level=info msg=completed handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/ready query:<nil> remote:172.17.0.1:57576 scheme:http] http_response=map[headers:map[content-type:application/json; charset=utf-8] size:69 status:503 text_status:Service Unavailable took:1.057995376s]
time=2021-11-06T19:44:40Z level=info msg=started handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/alive query:<nil> remote:172.17.0.1:57700 scheme:http]
time=2021-11-06T19:44:40Z level=info msg=started handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/ready query:<nil> remote:172.17.0.1:57702 scheme:http]
time=2021-11-06T19:44:40Z level=info msg=completed handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/alive query:<nil> remote:172.17.0.1:57700 scheme:http] http_response=map[headers:map[content-type:application/json; charset=utf-8] size:16 status:200 text_status:OK took:2.021209ms]
time=2021-11-06T19:44:41Z level=info msg=completed handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/ready query:<nil> remote:172.17.0.1:57702 scheme:http] http_response=map[headers:map[content-type:application/json; charset=utf-8] size:69 status:503 text_status:Service Unavailable took:1.057815042s]
time=2021-11-06T19:44:49Z level=info msg=started handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/ready query:<nil> remote:172.17.0.1:57798 scheme:http]
time=2021-11-06T19:44:50Z level=info msg=started handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/alive query:<nil> remote:172.17.0.1:57800 scheme:http]
time=2021-11-06T19:44:50Z level=info msg=completed handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/alive query:<nil> remote:172.17.0.1:57800 scheme:http] http_response=map[headers:map[content-type:application/json; charset=utf-8] size:16 status:200 text_status:OK took:2.373458ms]
time=2021-11-06T19:44:50Z level=info msg=started handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/ready query:<nil> remote:172.17.0.1:57806 scheme:http]
time=2021-11-06T19:44:50Z level=info msg=completed handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/ready query:<nil> remote:172.17.0.1:57798 scheme:http] http_response=map[headers:map[content-type:application/json; charset=utf-8] size:69 status:503 text_status:Service Unavailable took:1.040701917s]
time=2021-11-06T19:44:51Z level=info msg=completed handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/ready query:<nil> remote:172.17.0.1:57806 scheme:http] http_response=map[headers:map[content-type:application/json; charset=utf-8] size:69 status:503 text_status:Service Unavailable took:1.053959459s]
time=2021-11-06T19:45:00Z level=info msg=started handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/alive query:<nil> remote:172.17.0.1:57906 scheme:http]
time=2021-11-06T19:45:00Z level=info msg=started handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/ready query:<nil> remote:172.17.0.1:57908 scheme:http]
time=2021-11-06T19:45:00Z level=info msg=completed handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/alive query:<nil> remote:172.17.0.1:57906 scheme:http] http_response=map[headers:map[content-type:application/json; charset=utf-8] size:16 status:200 text_status:OK took:1.439916ms]
time=2021-11-06T19:45:01Z level=info msg=completed handling request http_request=map[headers:map[accept:*/* connection:close user-agent:kube-probe/1.22] host:172.17.0.24:4434 method:GET path:/health/ready query:<nil> remote:172.17.0.1:57908 scheme:http] http_response=map[headers:map[content-type:application/json; charset=utf-8] size:69 status:503 text_status:Service Unavailable took:1.057572834s]

Relevant configuration

Everything is in Kustomization.

Version

oryd/kratos:v0.8.0-alpha.3-sqlite

On which operating system are you observing this issue?

macOS

In which environment are you deploying?

Kubernetes

aeneasr commented 2 years ago

The readiness probe takes one second to reply, according to logs. This can have many reasons but a slow DB being the most likely cause. Your probe probably cancels the request after a second, causing the pod to be killed!

viters commented 2 years ago

@aeneasr Yeah, you are right, silly me. I actually had similar problem with other service, but that service did inform in logs about a problem:

storage in WARN state, the observed value 1116.7 is above the threshold of 750ms

I am closing issue, because it definitely is not a bug. Just, maybe some additional warning in logs could be helpful.

sazzle2611 commented 2 years ago

Just wanted to add a comment here just in case it helps anyone as we had this exact same issue and although I saw this thread in my search I didn't believe it was the fix to our problem (the problem being a new pod would take up to an hour to start)

The reason I didn't pay more attention to start with was mainly down to my inexperience with kubernetes readiness probes and partly because we had been running kratos for several months and with 5 different installations of it, most on different database clusters and one at a completely different location but this started happening to all of them at the same time.

The fix was just the same as mentioned above, add

timeoutSeconds: 2

to the readiness probe, now the pods start in about 50 seconds.

Thanks @aeneasr and @viters