pulumi / pulumi-kubernetes

A Pulumi resource provider for Kubernetes to manage API resources and workloads in running clusters
https://www.pulumi.com/docs/reference/clouds/kubernetes/
Apache License 2.0

Feature request: Show error logs when container goes into CrashLoopBackOff during Deployment #535

Open geekflyer opened 5 years ago

geekflyer commented 5 years ago

I observed a frequent annoyance when deploying with Pulumi: every few days someone deploys a Kubernetes Deployment using Pulumi and, for any number of reasons, the deployment fails with the container going into CrashLoopBackOff. The error message Pulumi gives in this case, e.g. `kubernetes:apps:Deployment graphql-gateway updating [diff: ~metadata,spec]; warning: 1 Pods failed to run because: [CrashLoopBackOff] Back-off 2m40s restarting failed container=main pod=graphql-gateway-77f5697bf-vspjd_solvvy-staging(2d072fc3-5d56-11e9-95f2-42010a8a0fcf)`, is honestly not very helpful, and an engineer has to manually run `kubectl logs <name_of_pod_which_is_crashing>` to figure out what's really going on. I think it would be much better DX to show the logs from just before the crash happened (even if it's just the last few lines) instead of simply showing a generic CrashLoopBackOff message.

This is probably somewhat related to https://github.com/pulumi/pulumi-kubernetes/issues/165 but I explicitly want to see those logs automatically when a deployment fails and not via a separate pulumi logs command.

lblackstone commented 5 years ago

I'm making progress on this, but won't have time to finish in M23.

hausdorff commented 5 years ago

This might not be necessary for Q3. And, I'm not sure how we'd make it work in practice.

blampe commented 5 months ago

It would be helpful to hear a little about how folks would like this to work.

We can probably surface these as "status" logs, but those are ephemeral and sometimes truncated if the message is too long. Is it helpful just to see that something is in CrashLoopBackOff during the update?

Or would it be more helpful to, at the end of the update, show the full CrashLoopBackOff error? We could try to de-dupe this if a bunch of pods are affected, but that could potentially dump a lot of logs.

> I'm making progress on this, but won't have time to finish in M23.

@lblackstone it's been a long time but do you happen to have any of this work still lying around, or do you remember how you envisioned this would work?

Daniel-Vetter-Coverwhale commented 5 months ago

Hello, I'm really excited to see movement on this! Thank you guys!

I think having logs in both places would be great!

Just seeing that something is in a crash loop backoff isn't very useful. The reason why is very important, and being able to show that reason (or at least logs we think will explain that reason) to folks who don't necessarily have access to the kubernetes cluster is very useful.

We have pulumi doing deployments on pushes to repos, and we have contributors to those repos who don't have (nor do we necessarily want to give them) access to the kubernetes clusters. So they push some code, it deploys, and in the normal case all is well. However when it fails to deploy, they currently have to ask someone with access to the cluster why the deployment has failed, which generally is just going and looking at the pod logs.

Recently we've been having some problems with some team members forgetting to add packages such that they actually wind up in the Docker build, or not setting up appropriate secrets access, both of which thankfully show up pretty quickly in the crashing pods logs. So having them as status logs that you can watch while the update is attempting to go through would be good for when folks are watching their deployments, as they can quickly see and get a fix queued up before the deployment even finishes failing.

Keeping the message around (or a sample of messages) for longer-term display on the failed update would also be really good, and probably more useful overall, as sometimes folks deploy on a Friday and don't get around to actually checking on the deployment until Monday.

lblackstone commented 5 months ago

@blampe I can't find any WIP code for this.

I think that printing status logs would be useful if you can surface feedback early rather than waiting for a timeout, and these logs will show up in CI systems and the Console. However, this can be a transient condition, so you don't want to exit immediately.

When the await logic eventually times out and terminates, you could grab the latest container logs and include them with the error message.

https://github.com/pulumi/pulumi-kubernetes/tree/master/provider/pkg/logging can be used to dedupe logs.
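
For illustration, grabbing those last container logs on timeout could look roughly like the sketch below using client-go (the helper name and wiring are assumptions, not existing provider code):

```go
package await

import (
	"context"
	"fmt"
	"io"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// tailCrashingContainerLogs fetches the last few lines the container logged
// before it crashed, so they could be appended to the await timeout error.
// Hypothetical helper, not part of the provider today.
func tailCrashingContainerLogs(ctx context.Context, cs kubernetes.Interface, pod *corev1.Pod, container string) (string, error) {
	tail := int64(20) // keep only a few lines to avoid dumping huge logs into the error
	req := cs.CoreV1().Pods(pod.Namespace).GetLogs(pod.Name, &corev1.PodLogOptions{
		Container: container,
		Previous:  true, // logs of the crashed instance, not the one currently restarting
		TailLines: &tail,
	})
	stream, err := req.Stream(ctx)
	if err != nil {
		return "", err
	}
	defer stream.Close()
	logs, err := io.ReadAll(stream)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("[Pod %s/%s] container %q last logs:\n%s", pod.Namespace, pod.Name, container, logs), nil
}
```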

D3-LucaPiombino commented 4 months ago

@lblackstone, @blampe Although I do understand the difficulty from a technical standpoint, this currently provides a lacklustre experience and IMHO should be given a bit more priority.

From a usability standpoint, the problem is that, depending on the segregation of responsibilities in a company, there are cases in which not everyone has direct access to the cluster. For example, pushing a small change to the application configuration is not the same as changing the topology of the deployment.

Even if we say this should go into the log subsystem (e.g. Datadog, New Relic, etc.), again, in some cases not everyone has access to those either.

Having the ability to at least provide a small indication of what the problem is directly after the operation completes would remove the need to escalate everything to the "devops team".

I took a look at the await logic/code and I believe the situation can be improved by adding some logic to also consider the last state and, in general, integrate better with what terminationMessagePath and terminationMessagePolicy surface in the container status.

For example, we have deployments where the container reports failures via logs on startup and is configured with terminationMessagePolicy: FallbackToLogsOnError. The application performs an initial validation of the provided configuration to gate the completion of rolling deployments.

This works as expected from a functional standpoint. The problem is that there is no easy (that is, not convoluted or excessively complicated) way to provide that feedback to the user.
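
For reference, the relevant container configuration looks roughly like the sketch below, expressed with the upstream `k8s.io/api` types (the name and image here are placeholders):

```go
package example

import corev1 "k8s.io/api/core/v1"

// crashReportingContainer illustrates the two fields discussed above: when the
// container exits with an error and did not write anything to the termination
// message file, the kubelet copies the tail of its logs into
// status.containerStatuses[*].lastState.terminated.message.
func crashReportingContainer() corev1.Container {
	return corev1.Container{
		Name:  "bg-worker",                              // placeholder
		Image: "registry.example.com/bg-worker:latest",  // placeholder
		// Default path; the application can also write an explicit message here.
		TerminationMessagePath: "/dev/termination-log",
		// Fall back to the last chunk of the container logs on error exits.
		TerminationMessagePolicy: corev1.TerminationMessageFallbackToLogsOnError,
	}
}
```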

The worst thing is that I can see the expected info when I query the status of the pod:

```yaml
# ...
  podIPs:
    - ip: 10.0.46.215
  startTime: '2024-06-29T21:27:14Z'
  containerStatuses:
    - name: bg-worker
      state:
        waiting:
          reason: CrashLoopBackOff
          message: >-
            back-off 5m0s restarting failed container=bg-worker
            pod=bg-worker-8db7fdfd9-z5dz5_env-lab-dev-01(5c196fc5-f24c-4fcd-a3a3-d873d020f413)
      lastState:
        terminated:
          exitCode: 255
          reason: Error
          message: >
            [22:14:34 INF] Validate configuration

            [22:14:34 ERR] [Configuration][config][accounts:main]
            The account name 'acc-main-WRONG' cannot be found in the list of
            available accounts (model.Accounts).

            [22:14:34 ERR] [Configuration][config] validation failed
          startedAt: '2024-06-29T22:14:33Z'
          finishedAt: '2024-06-29T22:14:34Z'
          containerID: >-
            containerd://e09eb7180037d3bbc5382408b216600fc76aeccde0cc61594b61fe73c9da8fc0
      ready: false
# ...
```

Note that K8s already takes care of ensuring that `containerStatuses[*].lastState.message` is never too big (2048 bytes or 80 lines when falling back to logs), so it is already somewhat safe to show without worrying about adding a ton of noise.

And, on top of that, I can see that the same information is available to the engine from the pulumi CLI logs:

`pulumi up --verbose=9 -y --skip-preview --logflow --logtostderr --non-interactive &> logverb.log`:


```
...

I0629 22:29:24.751253    2943 eventsink.go:78] eventSink::Infoerr(I0629 22:29:24.745489    3539 deployment.go:704] Template change detected for replicaset "bg-worker-7b559d44c6" - last revision: "16", current revision: "17"

    pulumi:pulumi:Stack LAB-DEV-STACK running I0629 22:29:24.745484    3539 deployment.go:699] ReplicaSet "bg-worker-7b559d44c6" requests '1' replicas, but has '0' ready
    pulumi:pulumi:Stack LAB-DEV-STACK running I0629 22:29:24.745489    3539 deployment.go:704] Template change detected for replicaset "bg-worker-7b559d44c6" - last revision: "16", current revision: "17"
@ updating........I0629 22:29:29.452299    2943 eventsink.go:94] eventSink::Warning([Pod env-lab-dev-01/bg-worker-7b559d44c6-bxv5j]: containers with unready status: [bg-worker] -- [22:29:23 INF] Validate configuration
[22:29:24 ERR] [Configuration][config][accounts:main] The account name 'acc-main-WRONG' cannot be found in the list of available accounts (model.Accounts).
[22:29:24 ERR] [Configuration][config] validation failed
 ~  kubernetes:apps/v1:Deployment bg-worker-deployment updating (12s) [diff: ~spec]; warning: [Pod env-lab-dev-01/bg-worker-7b559d44c6-bxv5j]: containers with unready status: [bg-worker] -- [22:29:23 INF] Validate configuration
@ updating.............I0629 22:29:39.448897    2943 eventsink.go:94] eventSink::Warning([Pod env-lab-dev-01/bg-worker-7b559d44c6-bxv5j]: containers with unready status: [bg-worker] -- [CrashLoopBackOff] back-off 10s restarting failed container=bg-worker pod=bg-worker-7b559d44c6-bxv5j_env-lab-dev-01(b4a1ea44-f8b3-48e9-95d9-6e6787b27c6d)
 ~  kubernetes:apps/v1:Deployment bg-worker-deployment updating (22s) [diff: ~spec]; warning: [Pod env-lab-dev-01/bg-worker-7b559d44c6-bxv5j]: containers with unready status: [bg-worker] -- [CrashLoopBackOff] back-off 10s restarting failed container=bg-worker pod=bg-worker-7b559d44c6-bxv5j_env-lab-dev-01(b4a1ea44-f8b3-48e9-95d9-6e6787b27c6d)
I0629 22:29:39.744117    2943 eventsink.go:78] eventSink::Infoerr(I0629 22:29:39.744018    3539 deployment.go:587] Received update for ReplicaSet "bg-worker-7b559d44c6"

...
```

Note the line `@ updating........I0629 22:29:29.452299 2943 eventsink.go:94] eventSink::Warning([Pod env-lab-dev-01/bg-worker-7b559d44c6-bxv5j]: containers with unready status: [bg-worker] -- [22:29:23 INF] Validate configuration [22:29:24 ERR] [Configuration][config][accounts:main] The account name 'acc-main-WRONG' cannot be found in the list of available accounts (model.Accounts).`

The current implementation kinda shows some of that information, but only transiently while the update is in progress, and it never makes it into the final error output.


Also, if the update is executed another time (without changing anything related to what makes the update fail, or with another change that still causes the update to fail), **not even the first line of the actual error is reported again**:

```
...

 ~  kubernetes:apps/v1:Deployment bg-worker-deployment updating (0s) [diff: ~spec]
@ updating.....
 ~  kubernetes:apps/v1:Deployment bg-worker-deployment updating (1s) [diff: ~spec]; Waiting for app ReplicaSet to be available (0/1 Pods available)
@ updating.............
 ~  kubernetes:apps/v1:Deployment bg-worker-deployment updating (41s) [diff: ~spec]; warning: [Pod env-lab-dev-01/bg-worker-7b559d44c6-bxv5j]: containers with unready status: [bg-worker] -- [CrashLoopBackOff] back-off 20s restarting failed container=bg-worker pod=bg-worker-8db7fdfd9-z5dz5_env-lab-dev-01(f3eb5e27-e64b-489e-a388-185dad2e0307)
@ updating........
 ~  kubernetes:apps/v1:Deployment bg-worker-deployment updating (46s) [diff: ~spec]; error: 3 errors occurred:
 ~  kubernetes:apps/v1:Deployment bg-worker-deployment updating failed [diff: ~spec]; error: 3 errors occurred:
@ updating.....
    pulumi:pulumi:Stack env-lab-dev-01 failed 1 error

Diagnostics:
  kubernetes:apps/v1:Deployment (bg-worker-deployment):
    error: 3 errors occurred:
```



I believe this is because at the moment there is no logic that checks `containerStatuses[*].lastState` when the deployment is attempted a second time.
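
For illustration, the kind of check I have in mind could look roughly like the sketch below inside the await logic (the helper name and where it would hook in are my assumptions, not existing provider code):

```go
package await

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// lastTerminationMessages collects the (already size-capped) termination
// messages the kubelet stored on the pod, so a retried update could surface
// the original failure reason even while the container is sitting in
// CrashLoopBackOff. Hypothetical helper, not current provider code.
func lastTerminationMessages(pod *corev1.Pod) []string {
	var msgs []string
	for _, cs := range pod.Status.ContainerStatuses {
		waiting := cs.State.Waiting
		// lastState in the YAML above maps to LastTerminationState in the Go API.
		term := cs.LastTerminationState.Terminated
		if waiting != nil && waiting.Reason == "CrashLoopBackOff" && term != nil && term.Message != "" {
			msgs = append(msgs, fmt.Sprintf("[%s] exit code %d: %s", cs.Name, term.ExitCode, term.Message))
		}
	}
	return msgs
}
```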

> Note that here I am only considering the experience when using non-interactive logs.
> For the interactive logs there is also the issue that, even if the message were available, it would not be very visible until the CLI exits.
> Still, in my opinion, that is a CLI/presentation issue.
> If the logic were updated to properly handle the additional available info when present, it would also be a bit better than now, where at the end you always only get: `[Pod ..]: containers with unready status: [...] -- [CrashLoopBackOff] back-off 20s restarting failed container=...)`

I know this would still not be the perfect solution (and *perfect is the enemy of good*? 😄) and does not cover ALL scenarios, but I do think it would be a reasonable improvement.

Hope this can be helpful in some way.

(Sorry, I tried to come up with a working POC, but I lack too much knowledge about the inner workings of the engine to provide a PR.)

blampe commented 4 months ago

@D3-LucaPiombino

> Having the ability to at least provide a small indication of what the problem is directly after the operation completes would remove the need to escalate everything to the "devops team".

I've been responsible for a NOC or two and can definitely sympathize :)

I agree that the termination message is something we can and should be able to surface today. I've put together https://github.com/pulumi/cloud-ready-checks/pull/17 to that effect. Thank you for the excellent suggestion!

D3-LucaPiombino commented 4 months ago

@blampe That's brilliant!

blampe commented 2 months ago

We started logging the pod's termination message in v4.16. See the release notes for details about how to set up that termination message.

Please let us know how this has worked for you!