radius-project / radius

Radius is a cloud-native, portable application platform that makes app development easier for teams building cloud-native apps.
https://radapp.io
Apache License 2.0

Detect pod readiness check failing condition to report failure on "Waiting" deployments #5936

Open vinayada1 opened 1 year ago

vinayada1 commented 1 year ago

Overview of feature request

If the readinessProbe fails, the pod is not ready but remains in the Running state. Because "pod not ready" is not a terminal state, we cannot report a failure.

We could improve this behavior by detecting failures when the deployment is stuck in the "Waiting" state because the readiness check keeps failing. The pod events report health check failures, so we could read these events to produce better error reporting when the deployment is stuck in this state.
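A rough sketch of the detection idea above, not the actual Radius implementation: the pod status and events are modelled as plain dictionaries shaped like the Kubernetes API objects. The function names `is_running_but_not_ready` and `readiness_failure_messages` and the sample data are hypothetical. The `Unhealthy` event reason and the `Readiness probe failed` message prefix are what the kubelet normally emits, but they are worth verifying against the cluster version in use.

```python
def is_running_but_not_ready(pod: dict) -> bool:
    """Return True when the pod phase is Running but its Ready condition is not True."""
    status = pod.get("status", {})
    if status.get("phase") != "Running":
        return False
    for cond in status.get("conditions", []):
        if cond.get("type") == "Ready":
            # Condition status is a string: "True", "False", or "Unknown".
            return cond.get("status") != "True"
    # No Ready condition reported yet: treat the pod as not ready.
    return True


def readiness_failure_messages(events: list[dict]) -> list[str]:
    """Collect messages of Warning events that report readiness probe failures."""
    return [
        e.get("message", "")
        for e in events
        if e.get("type") == "Warning"
        and e.get("reason") == "Unhealthy"
        and e.get("message", "").startswith("Readiness probe failed")
    ]


# Hypothetical sample data, for illustration only.
pod = {
    "status": {
        "phase": "Running",
        "conditions": [{"type": "Ready", "status": "False"}],
    }
}
events = [
    {"type": "Normal", "reason": "Started", "message": "Started container app"},
    {
        "type": "Warning",
        "reason": "Unhealthy",
        "message": "Readiness probe failed: HTTP probe failed with statuscode: 500",
    },
]

if is_running_but_not_ready(pod):
    for msg in readiness_failure_messages(events):
        print(f"deployment waiting: {msg}")
```

A check like this only identifies the "Waiting" condition and surfaces the probe message; whether the condition should count as a failure is a separate decision.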

Acceptance criteria

If the readiness check fails, we should detect this state and report a failure to the user with an appropriate error message, rather than letting the deployment time out.

Additional context

https://github.com/project-radius/design-notes/pull/14

AB#8811

vinayada1 commented 1 year ago

I observed the following behavior in the container logs: the readiness check fails several times and then passes. So from the "readiness check failed" event alone, we cannot conclude that the pod will not recover from this condition, and we cannot fail the deployment on that basis.

2023/07/27 17:14:37 Server running at http://localhost:3000
2023/07/27 17:14:37 Check http://localhost:3000/healthz for status
2023/07/27 17:14:37 Starting magpie in http mode
2023/07/27 17:14:37 Starting Status Check...
2023/07/27 17:14:37 Container Name: magpiego-8811744666191772359
2023/07/27 17:14:37 Failed to create container: FromAssertion(): http call(https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/v2.0/token)(POST) error: reply status code was 401: {"error":"unauthorized_client","error_description":"AADSTS70021: No matching federated identity record found for presented assertion. Assertion Issuer: 'https://radiusoidc.blob.core.windows.net/kubeoidc/'. Assertion Subject: 'system:serviceaccount:azstorage-workload-app:azstorage-ctnr'. Assertion Audience: 'api://AzureADTokenExchange'. https://docs.microsoft.com/en-us/azure/active-directory/develop/workload-identity-federation\r\nTrace ID: 895e2f45-3bbd-499e-9ffe-5f691f761300\r\nCorrelation ID: 0272860c-8def-4e89-bb06-b623b48692a6\r\nTimestamp: 2023-07-27 17:14:37Z","error_codes":[70021],"timestamp":"2023-07-27 17:14:37Z","trace_id":"895e2f45-3bbd-499e-9ffe-5f691f761300","correlation_id":"0272860c-8def-4e89-bb06-b623b48692a6","error_uri":"https://login.microsoftonline.com/error?code=70021"}
2023/07/27 17:14:37 The readiness check failed
2023/07/27 17:14:38 Starting Status Check...
2023/07/27 17:14:38 Container Name: magpiego-426646045800814077
2023/07/27 17:14:38 Failed to create container: FromAssertion(): http call(https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/v2.0/token)(POST) error: reply status code was 401: {"error":"unauthorized_client","error_description":"AADSTS70021: No matching federated identity record found for presented assertion. Assertion Issuer: 'https://radiusoidc.blob.core.windows.net/kubeoidc/'. Assertion Subject: 'system:serviceaccount:azstorage-workload-app:azstorage-ctnr'. Assertion Audience: 'api://AzureADTokenExchange'. https://docs.microsoft.com/en-us/azure/active-directory/develop/workload-identity-federation\r\nTrace ID: f3c8b589-48ab-4307-91f7-16df87780f00\r\nCorrelation ID: ed186c74-a8dd-4be3-a9ca-5375438282c2\r\nTimestamp: 2023-07-27 17:14:38Z","error_codes":[70021],"timestamp":"2023-07-27 17:14:38Z","trace_id":"f3c8b589-48ab-4307-91f7-16df87780f00","correlation_id":"ed186c74-a8dd-4be3-a9ca-5375438282c2","error_uri":"https://login.microsoftonline.com/error?code=70021"}
2023/07/27 17:14:38 The readiness check failed
2023/07/27 17:14:46 Starting Status Check...
2023/07/27 17:14:46 Container Name: magpiego-2139734057240668151
2023/07/27 17:14:46 Failed to create container: FromAssertion(): http call(https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/v2.0/token)(POST) error: reply status code was 401: {"error":"unauthorized_client","error_description":"AADSTS70021: No matching federated identity record found for presented assertion. Assertion Issuer: 'https://radiusoidc.blob.core.windows.net/kubeoidc/'. Assertion Subject: 'system:serviceaccount:azstorage-workload-app:azstorage-ctnr'. Assertion Audience: 'api://AzureADTokenExchange'. https://docs.microsoft.com/en-us/azure/active-directory/develop/workload-identity-federation\r\nTrace ID: eb22c53c-fce1-4b23-968a-6a2ffabc1400\r\nCorrelation ID: 296e5264-a3cb-4279-afcc-c39f2cdeb64c\r\nTimestamp: 2023-07-27 17:14:46Z","error_codes":[70021],"timestamp":"2023-07-27 17:14:46Z","trace_id":"eb22c53c-fce1-4b23-968a-6a2ffabc1400","correlation_id":"296e5264-a3cb-4279-afcc-c39f2cdeb64c","error_uri":"https://login.microsoftonline.com/error?code=70021"}
2023/07/27 17:14:46 The readiness check failed
2023/07/27 17:14:56 Starting Status Check...
2023/07/27 17:14:56 Container Name: magpiego-1264835165545342164
2023/07/27 17:14:57 Successfully created a blob container "magpiego-1264835165545342164". Response: 8764da52-401e-0062-7bad-c0802a000000
2023/07/27 17:14:57 Successfully marked the container for deletion "magpiego-1264835165545342164". Response: 8764dc40-401e-0062-21ad-c0802a000000
2023/07/27 17:14:57 The readiness check passed
2023/07/27 17:15:06 Starting Status Check...
2023/07/27 17:15:06 Container Name: magpiego-3096723338942779695
2023/07/27 17:15:06 Successfully created a blob container "magpiego-3096723338942779695". Response: 876503ce-401e-0062-56ad-c0802a000000
2023/07/27 17:15:06 Successfully marked the container for deletion "magpiego-3096723338942779695". Response: 87650430-401e-0062-31ad-c0802a000000
2023/07/27 17:15:06 The readiness check passed
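To make the timing in this log concrete, here is a small sketch that extracts the readiness results and measures how long the pod took to recover. It embeds only the "The readiness check failed" and "The readiness check passed" lines from the log; the helper name `parse_readiness` is hypothetical.

```python
from datetime import datetime

# Readiness result lines copied from the container log in this comment.
LOG = """\
2023/07/27 17:14:37 The readiness check failed
2023/07/27 17:14:38 The readiness check failed
2023/07/27 17:14:46 The readiness check failed
2023/07/27 17:14:57 The readiness check passed
2023/07/27 17:15:06 The readiness check passed
"""


def parse_readiness(log: str) -> list[tuple[datetime, bool]]:
    """Return (timestamp, passed) pairs for every readiness result line."""
    results = []
    for line in log.splitlines():
        if "The readiness check" not in line:
            continue
        # The first 19 characters hold the "YYYY/MM/DD HH:MM:SS" timestamp.
        stamp = datetime.strptime(line[:19], "%Y/%m/%d %H:%M:%S")
        results.append((stamp, line.endswith("passed")))
    return results


results = parse_readiness(LOG)
failed_count = sum(1 for _, ok in results if not ok)
first_failure = next(t for t, ok in results if not ok)
first_success = next(t for t, ok in results if ok)
recovery_seconds = (first_success - first_failure).total_seconds()

print(f"failed checks: {failed_count}")
print(f"recovered after {recovery_seconds:.0f}s")
```

In this run the probe failed three times and passed 20 seconds after the first failure, which is why a single failure event cannot be treated as terminal.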

youngbupark commented 1 year ago

This container uses workload identity to access blob storage. Workload identity in AKS relies on a mutating webhook controller, so the information needed to use workload identity (such as the app ID and token) may only become ready after some time. That is why we see this problem.
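Since the identity information can arrive late, one possible approach is to report a failure only when the pod stays not ready beyond a grace period. The sketch below is an assumption about how such a rule might look, not Radius behavior: `should_report_failure` and the 60-second default are hypothetical, and `not_ready_since` is assumed to come from something like the `lastTransitionTime` of the pod's Ready condition.

```python
from datetime import datetime, timedelta


def should_report_failure(
    not_ready_since: datetime,
    now: datetime,
    grace_period: timedelta = timedelta(seconds=60),  # hypothetical default
) -> bool:
    """Report a failure only if the pod has been not ready for longer than the grace period."""
    return now - not_ready_since > grace_period


# Timestamps taken from the log in this thread: first failure at 17:14:37.
start = datetime(2023, 7, 27, 17, 14, 37)

# 20 seconds in (when the check first passed in the log): within the grace period.
early = should_report_failure(start, datetime(2023, 7, 27, 17, 14, 57))

# 83 seconds in: past the grace period, so a failure would be reported.
late = should_report_failure(start, datetime(2023, 7, 27, 17, 16, 0))

print(early, late)
```

The right grace period depends on how long the webhook and identity setup typically take, which this thread does not quantify.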