stackrox / kube-linter

KubeLinter is a static analysis tool that checks Kubernetes YAML files and Helm charts to ensure the applications represented in them adhere to best practices.
https://docs.kubelinter.io/
Apache License 2.0

[FEATURE_REQUEST] Invalid memory limit #189

Open noahlz opened 3 years ago

noahlz commented 3 years ago

Description of the problem/feature request

We are evaluating kube-linter to address a recent error we experienced. We deployed a pod with an invalid memory limit:

          resources:
            limits:
              cpu: 500m
              memory: 3500m

Note the typo: 3500m instead of 3500M.

Description of the existing behavior vs. expected behavior

We expected kube-linter to flag the invalid resource memory limit. Per the documentation for this setting:

https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

The memory resource is measured in bytes. You can express memory as a plain integer or a fixed-point integer with one of these suffixes: E, P, T, G, M, K, Ei, Pi, Ti, Gi, Mi, Ki.
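
To make the M-versus-m difference concrete, here is a minimal sketch (assuming the standard k8s.io/apimachinery/pkg/api/resource package; this example is illustrative and not part of the original report) of how those suffixes parse:

    package main

    import (
    	"fmt"

    	"k8s.io/apimachinery/pkg/api/resource"
    )

    func main() {
    	// All three strings are syntactically valid Kubernetes quantities, but
    	// lowercase "m" means "milli" (thousandths), not "mega".
    	for _, s := range []string{"3500M", "3500Mi", "3500m"} {
    		q := resource.MustParse(s)
    		// Value() reports the quantity in whole bytes (fractions round up):
    		//   3500M  -> 3500000000 (3.5 GB)
    		//   3500Mi -> 3670016000 (3500 MiB, roughly 3.67 GB)
    		//   3500m  -> 4          (3.5 bytes, rounded up)
    		fmt.Printf("%-7s = %d bytes\n", s, q.Value())
    	}
    }
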
viswajithiii commented 3 years ago

Thanks @noahlz for filing this issue!

This is an interesting one. What actually happens in this case? Does kube refuse to create the pod? Or does it create the pod, but with a memory limit equal to "3500m" -- 3.5 bytes (obviously way too low to allow anything to happen)?

noahlz commented 3 years ago

Great question! In this case, the pods failed to launch, with FailedCreatePodSandBox as the reported reason. We also saw the ProgressDeadlineExceeded and MinimumReplicasUnavailable events (but did not have Alertmanager triggering PagerDuty or sending Slack messages for them...so they slipped by).

The main challenge was that there was no obvious "you got the memory limit wrong!" error, so we had to dig and google a bit before finding the typo.

The actual pods failed to launch with reason:

Message: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "analytics-worker-786f77c764-4kmz9": Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:338: getting the final child's pid from pipe caused: read init-p: connection reset by peer: unknown

which wasn't exactly helpful.

Thanks to the upgrade process keeping old pods around (an excellent feature), this issue actually went unnoticed for several days, until the old deploy's image was finally purged from our repo (ECR) and the existing ReplicaSet from the last good deploy could no longer scale due to ImagePullBackOff.

We're on Kubernetes v1.18.9-eks-d1db3c and Helm v3.5.0.

viswajithiii commented 3 years ago

Got it. Interesting. That definitely isn't a helpful error message.

The best way to do this in kube-linter is probably to add a built-in check for absurdly low memory limits -- say, anything less than 50MB probably means there was a typo. Does this make sense? We can also probably do something similar for absurdly high CPU requests/limits, to catch the reverse error.
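
As a rough illustration of the comparison such a check would make (a standalone sketch only; the 50MB threshold and the function name are assumptions, not kube-linter's actual check or template code):

    package main

    import (
    	"fmt"

    	"k8s.io/apimachinery/pkg/api/resource"
    )

    // Illustrative threshold: a memory limit below ~50MB very likely indicates
    // a unit typo such as "3500m" instead of "3500M".
    var suspiciouslyLowMemory = resource.MustParse("50M")

    // looksLikeMemoryTypo reports whether a non-empty memory limit is absurdly low.
    func looksLikeMemoryTypo(limit resource.Quantity) bool {
    	// Cmp returns -1 when limit is strictly less than the threshold.
    	return !limit.IsZero() && limit.Cmp(suspiciouslyLowMemory) < 0
    }

    func main() {
    	for _, s := range []string{"3500m", "3500M"} {
    		fmt.Printf("%-6s -> typo suspected: %v\n", s, looksLikeMemoryTypo(resource.MustParse(s)))
    	}
    	// 3500m  -> typo suspected: true
    	// 3500M  -> typo suspected: false
    }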

noahlz commented 3 years ago

It's not clear to me whether kube launched the pod with the low memory limit, or simply failed to launch it at all due to the invalid configuration. It seems like the former? We would have to investigate how it parses 3500m.

viswajithiii commented 3 years ago

> It's not clear to me whether kube launched the pod with the low memory limit, or simply failed to launch it at all due to the invalid configuration. It seems like the former? We would have to investigate how it parses 3500m.

I think it's the former. Kube just parses these as "quantities" (code here), and isn't opinionated about which unit is used for which resource. So 3500m is just parsed as 3.5, which means the memory limit is 3.5 bytes.
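
For example, decoding that limits block with the usual Kubernetes types shows the quantity the API ends up with (a sketch assuming the k8s.io/api/core/v1 and sigs.k8s.io/yaml packages; this is not kube-linter code):

    package main

    import (
    	"fmt"

    	corev1 "k8s.io/api/core/v1"
    	"sigs.k8s.io/yaml"
    )

    // The limits block from the manifest in the issue, in YAML flow style.
    const resourcesYAML = `{limits: {cpu: 500m, memory: 3500m}}`

    func main() {
    	var res corev1.ResourceRequirements
    	if err := yaml.Unmarshal([]byte(resourcesYAML), &res); err != nil {
    		panic(err)
    	}
    	mem := res.Limits[corev1.ResourceMemory]
    	// The parser accepts "3500m" for memory without complaint: it is simply
    	// 3500 milli-units, i.e. 3.5 bytes.
    	fmt.Printf("memory limit %s = %d milli-bytes\n", mem.String(), mem.MilliValue())
    }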

The use of the appropriate unit for each resource is also an interesting idea for a check. For now, I'm just going to add one that looks for absurdly low memory limits, since that doesn't need any code changes.