spegel-org / spegel

Stateless cluster local OCI registry mirror.
MIT License
1.22k stars 66 forks source link

Current EKS instructions break kubelet memory metrics (otel, etc.) #618

Open dfsdevops opened 1 week ago

dfsdevops commented 1 week ago

Spegel version

v0.0.25

Kubernetes distribution

EKS

Kubernetes version

v1.29.8-eks-a737599

CNI

Amazon VPC CNI

Describe the bug

I have been using Karpenter (as well as terraform,) to update the user data field when launching new nodes with the following spec:

---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
  - alias: al2@v20240910
  userData: |
    set -ex
    mkdir -p /etc/containerd/config.d
    cat > /etc/containerd/config.d/spegel.toml << EOL
    [plugins."io.containerd.grpc.v1.cri".registry]
      config_path = "/etc/containerd/certs.d"
    [plugins."io.containerd.grpc.v1.cri".containerd]
      discard_unpacked_layers = false
    EOL
    /etc/eks/bootstrap.sh

I'm basing this off these instructions https://github.com/spegel-org/spegel/blob/main/docs/COMPATIBILITY.md#eks

there seems to be an issue in that all of our Otel metrics for memory usage from the kubeletstats receiver are coming back as 0 rather than the number of bytes. I suspect that the patch is clobbering a containerd config dictionary that has an effect on kubelet metrics, as per this comment, a potential reason: https://github.com/awslabs/amazon-eks-ami/issues/1628#issuecomment-2324774464

might be related to the default_runtime_name key thats being used the table here:

[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc"
discard_unpacked_layers = true

My suggestion is that the documentation switch to using sed to override the template, as this seems to avoid the issue. I confirmed that after switching to this in my userData: field in Karpenter, I begin to see memory metrics in Otel again.

sed -i 's/discard_unpacked_layers = true/discard_unpacked_layers = false/g'  /etc/eks/containerd/containerd-config.toml
phillebaba commented 1 week ago

This sounds like a problem in how Containerd merges configuration. Alternatively I have missed documentation about how overrides work. Using sed is not a recommended method as the EKS AMI can change their Containerd configuration at any time. We worked with them to get the imports working for this reason.

First step is to confirm that Containerd is omitting the original configuration when imports are used. After that we can move forward with figuring out how to solve this upstream if that is the case.

dfsdevops commented 1 week ago

I certainly empathize with the fact that this was a carefully considered solution and trying to avoid issues with upstream changes. I agree it would be ideal to ensure containerd properly merges the TOML rather than hack around it with sed. It looks like they have an open issue for this https://github.com/containerd/containerd/issues/5837

Maybe I can use the crictl command to do a before/after to see if the actual running config difference is identifiable. I didn't know about that originally.

phillebaba commented 5 days ago

Thanks for linking the issue. I understand the problem a lot better now. So the problem is that the merging is not done at a deeper level than per section. Which explains why some of the original configuration is disappearing.

Could one option be to include the original configuration in the import file? This way we include all configuration again making sure that nothing disappears. Hopefully this is something that we can fix in Containerd v2.

We could update the docs if this temporary solution works for you.