kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.

Provisions nodes as ipv4 instead of ipv6 when there are control plane issues #1384

Open pelzerim opened 1 week ago

pelzerim commented 1 week ago

Description

Observed Behavior:

Karpenter created IPv4 nodes in an IPv6 EKS cluster.

The NodeClaims were unable to reach the Initialized status:

  Conditions:
    Last Transition Time:  2024-07-01T23:14:18Z
    Message:               KnownEphemeralTaint "node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule" still exists
    Reason:                KnownEphemeralTaintsExist
    Status:                False
    Type:                  Initialized
    Last Transition Time:  2024-07-01T23:12:20Z

The cloud controller was unable to provide node information:

I0702 07:23:04.241540      12 node_controller.go:229] error syncing 'ip-10-0-100-244.ap-northeast-2.compute.internal': failed to get node modifiers from cloud provider: provided node ip for node "ip-10-0-100-244.ap-northeast-2.compute.internal" is not valid: failed to get node address from cloud provider that matches ip: 10.0.100.244, requeuing

Inspecting the user data of the created EC2 instances reveals that they are missing these flags (which are present on older nodes):

--ip-family ipv6 \
--dns-cluster-ip 'fdac:***:****::a' \
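
For comparison, a correctly bootstrapped IPv6 node's user data invokes the AL2 bootstrap script with roughly the following arguments (a sketch: the cluster name, endpoint, CA bundle, and DNS address are placeholders, and the exact flag set depends on the AMI family and Karpenter version):

# illustrative only; all values below are placeholders
/etc/eks/bootstrap.sh 'my-cluster' \
  --apiserver-endpoint 'https://<cluster-endpoint>' \
  --b64-cluster-ca '<base64-encoded-ca>' \
  --ip-family ipv6 \
  --dns-cluster-ip 'fdac:***:****::a'

On the affected instances the --ip-family and --dns-cluster-ip arguments were the ones absent, which matches the nodes coming up with IPv4 addresses.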

Killing the Karpenter pods and removing all stuck NodeClaims by hand resolved the issue.

This AWS EKS cluster is known to have performance issues with the control plane; we have seen this problem before and are working with AWS support to get it fixed.

A bit of digging reveals that the decision to add these flags is made dynamically when Karpenter generates the user data.

Expected Behavior:

Karpenter should be able to provision IPv6 nodes even if the control plane is temporarily unavailable.

Having direct control over this setting would also be very useful.

Versions:

k8s-ci-robot commented 1 week ago

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
jigisha620 commented 5 days ago

Have you configured metadataOptions on your EC2NodeClass? This gives you an option to enable the IPv6 endpoint for your instances, which is disabled by default. Can you share what your EC2NodeClass looks like?
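
For reference, enabling it would look roughly like this under spec.metadataOptions on the EC2NodeClass (a sketch showing only the relevant field; the other metadata options are omitted):

spec:
  metadataOptions:
    httpProtocolIPv6: enabled  # disabled by default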

pelzerim commented 5 days ago

@jigisha620 Thanks for the prompt response! We do not have that option enabled:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: ${name}
spec:
  amiFamily: AL2
  role: ${role}
  subnetSelectorTerms:
    %{ for subnet_id in subnet_ids }
    - id: "${subnet_id}"
    %{ endfor }
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${cluster_name}
  tags:
    karpenter.sh/discovery: ${cluster_name}
  amiSelectorTerms:
    - id: ${ami_id}
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeType: gp3
        volumeSize: ${disk_size_gi}Gi
        deleteOnTermination: true
        encrypted: true

Will the change to the metadata service affect the instance's kubelet configuration? The corresponding code seems to use the ClusterDNS IP: https://github.com/aws/karpenter-provider-aws/blob/e8a345723c8db785bd07b8595c395edbdfb9255b/pkg/providers/amifamily/bootstrap/eksbootstrap.go#L122

jigisha620 commented 2 days ago

Hi @pelzerim, you are right, I misunderstood. I'm wondering if you have specified clusterDNS via spec.kubeletConfiguration, since we rely on clusterDNS to pass --ip-family ipv6. Can you also share your Karpenter controller logs from the time this happened? I'm wondering if something prevented Karpenter from discovering the clusterDNS.
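
If you want to pin it explicitly rather than rely on discovery, something along these lines should work (a sketch assuming the v1beta1 API, where kubelet settings live under the NodePool's spec.template.spec.kubelet; the names and the DNS address are placeholders):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      kubelet:
        clusterDNS: ["fdac:***:****::a"]  # placeholder: your cluster's IPv6 DNS service IP
      nodeClassRef:
        name: default  # placeholder: your EC2NodeClass name

With clusterDNS set explicitly, the bootstrap flags should no longer depend on Karpenter resolving the DNS address from the control plane at provisioning time.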