kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.

Provisions nodes as ipv4 instead of ipv6 when there are control plane issues #1384

Open pelzerim opened 1 week ago

pelzerim commented 1 week ago

Description

Observed Behavior:

Karpenter created IPv4 nodes in an IPv6 EKS cluster.

The NodeClaims were unable to reach the Initialized status:

  Conditions:
    Last Transition Time:  2024-07-01T23:14:18Z
    Message:               KnownEphemeralTaint "node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule" still exists
    Reason:                KnownEphemeralTaintsExist
    Status:                False
    Type:                  Initialized
    Last Transition Time:  2024-07-01T23:12:20Z

The cloud controller was unable to provide node information:

I0702 07:23:04.241540      12 node_controller.go:229] error syncing 'ip-10-0-100-244.ap-northeast-2.compute.internal': failed to get node modifiers from cloud provider: provided node ip for node "ip-10-0-100-244.ap-northeast-2.compute.internal" is not valid: failed to get node address from cloud provider that matches ip: 10.0.100.244, requeuing

Inspecting the user data of the created EC2 instances reveals that they are missing these flags (which are present on older nodes):

--ip-family ipv6 \
--dns-cluster-ip 'fdac:***:****::a' \
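
For comparison, a correctly bootstrapped IPv6 node's user data invokes the AL2 bootstrap script with roughly the following arguments (a sketch: the cluster name, endpoint, CA bundle, and DNS address are placeholders, and the exact flag set depends on the AMI family and Karpenter version):

# illustrative only; all values below are placeholders
/etc/eks/bootstrap.sh 'my-cluster' \
  --apiserver-endpoint 'https://<cluster-endpoint>' \
  --b64-cluster-ca '<base64-encoded-ca>' \
  --ip-family ipv6 \
  --dns-cluster-ip 'fdac:***:****::a'

On the affected instances the --ip-family and --dns-cluster-ip arguments were the ones absent, which matches the nodes coming up with IPv4 addresses.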

Killing the Karpenter pods and removing all stuck NodeClaims by hand resolved the issue.

This AWS EKS cluster is known to have performance issues with the control plane; we have seen this problem before and are working with AWS support to get it fixed.

A bit of digging reveals that the decision to add these flags is made dynamically when Karpenter generates the user data.

Expected Behavior:

Karpenter should be able to provision IPv6 nodes even if the control plane is temporarily unavailable.

Having direct control over this setting would also be very useful.

Versions:

k8s-ci-robot commented 1 week ago

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
jigisha620 commented 5 days ago

Have you configured metadataOptions on your EC2NodeClass? This gives you an option to enable the IPv6 endpoint for your instances, which is disabled by default. Can you share what your EC2NodeClass looks like?
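
For reference, enabling it would look roughly like this under spec.metadataOptions on the EC2NodeClass (a sketch showing only the relevant field; the other metadata options are omitted):

spec:
  metadataOptions:
    httpProtocolIPv6: enabled  # disabled by default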

pelzerim commented 5 days ago

@jigisha620 Thanks for the prompt response! We do not have that option enabled:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: ${name}
spec:
  amiFamily: AL2
  role: ${role}
  subnetSelectorTerms:
    %{ for subnet_id in subnet_ids }
    - id: "${subnet_id}"
    %{ endfor }
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${cluster_name}
  tags:
    karpenter.sh/discovery: ${cluster_name}
  amiSelectorTerms:
    - id: ${ami_id}
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeType: gp3
        volumeSize: ${disk_size_gi}Gi
        deleteOnTermination: true
        encrypted: true

Will the change to the metadata service affect the instance's kubelet configuration? The corresponding code seems to use the ClusterDNS IP: https://github.com/aws/karpenter-provider-aws/blob/e8a345723c8db785bd07b8595c395edbdfb9255b/pkg/providers/amifamily/bootstrap/eksbootstrap.go#L122

jigisha620 commented 2 days ago

Hi @pelzerim, you are right, I misunderstood. I'm wondering if you have specified clusterDNS via spec.kubeletConfiguration, since we rely on clusterDNS to pass --ip-family ipv6. Can you also share your Karpenter controller logs from the time this happened? I'm wondering if something prevented Karpenter from discovering the clusterDNS.
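
If you want to pin it explicitly rather than rely on discovery, something along these lines should work (a sketch assuming the v1beta1 API, where kubelet settings live under the NodePool's spec.template.spec.kubelet; the names and the DNS address are placeholders):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      kubelet:
        clusterDNS: ["fdac:***:****::a"]  # placeholder: your cluster's IPv6 DNS service IP
      nodeClassRef:
        name: default  # placeholder: your EC2NodeClass name

With clusterDNS set explicitly, the bootstrap flags should no longer depend on Karpenter resolving the DNS address from the control plane at provisioning time.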