Support for IPv6/dualstack

james-callahan commented 1 year ago

What would you like to be added:

I'd like to start using dualstack in our Kubernetes cluster via the CloudDualStackNodeIPs feature gate. When I try, I get errors such as:

I0814 03:40:03.764807       1 node_controller.go:427] Initializing node i-0a11a57aeffb69cf7 with cloud provider
E0814 03:40:04.070450       1 node_controller.go:236] error syncing 'i-0a11a57aeffb69cf7': failed to get node modifiers from cloud provider: provided node ip for node "i-0a11a57aeffb69cf7" is not valid: failed to get node address from cloud provider that matches ip: 2600:1f10:45a5:a900:33fc:a923:65e5:9414, requeuing

Trying to debug the issue, I think it's because the code at https://github.com/kubernetes/cloud-provider-aws/blob/d0551093673e8c355db17249b8f069767c014748/pkg/providers/v2/instances.go#L216C46-L216C64 doesn't look at Ipv6Addresses. It only iterates over the IPv4 addresses in PrivateIpAddresses.

Why is this needed:

The EC2 API returns IPv6 and IPv4 addresses in different fields.
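A minimal sketch of what reading both fields could look like. The `networkInterface` type and `collectInternalIPs` function are hypothetical stand-ins for illustration, not the AWS SDK types or the provider's actual code, and the addresses are documentation examples:

```go
package main

import "fmt"

// networkInterface is a hypothetical, trimmed-down stand-in for the
// per-interface data EC2 reports. It is NOT the AWS SDK type.
type networkInterface struct {
	PrivateIPAddresses []string // IPv4, reported under "PrivateIpAddresses"
	IPv6Addresses      []string // IPv6, reported under "Ipv6Addresses"
}

// collectInternalIPs gathers addresses from both fields of every interface,
// IPv4 first, then IPv6, so neither family is silently skipped.
func collectInternalIPs(ifaces []networkInterface) []string {
	var ips []string
	for _, iface := range ifaces {
		ips = append(ips, iface.PrivateIPAddresses...)
		ips = append(ips, iface.IPv6Addresses...)
	}
	return ips
}

func main() {
	ifaces := []networkInterface{{
		PrivateIPAddresses: []string{"10.0.0.10"},
		IPv6Addresses:      []string{"2001:db8::1"},
	}}
	fmt.Println(collectInternalIPs(ifaces))
}
```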

/kind feature

k8s-ci-robot commented 1 year ago

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

olemarkus commented 1 year ago

AWS CCM has been patching in both IPv6 and IPv4 IPs for quite some time. You just have to set NodeIPFamilies to something like ipv6 and ipv4.

See https://github.com/kubernetes/cloud-provider-aws/blob/master/pkg/providers/v1/aws.go#L1599

james-callahan commented 1 year ago

> AWS CCM has been patching in both IPv6 and IPv4 IPs for quite some time. You just have to set NodeIPFamilies to something like ipv6 and ipv4.
>
> See https://github.com/kubernetes/cloud-provider-aws/blob/master/pkg/providers/v1/aws.go#L1599

I'm using v2, not v1.

james-callahan commented 1 year ago

> I'm using v2, not v1.

As found in https://github.com/kubernetes/cloud-provider-aws/issues/677, I'm using v1 after all.

I gave this another attempt, setting the feature gate CloudDualStackNodeIPs=true, and the cloud provider failed with errors such as:

I1017 02:43:36.977098       1 node_controller.go:431] Initializing node i-02de3f9b2d02feaa7 with cloud provider
E1017 02:43:37.264596       1 node_controller.go:240] error syncing 'i-02de3f9b2d02feaa7': failed to get node modifiers from cloud provider: provided node ip for node "i-02de3f9b2d02feaa7" is not valid: failed to get node address from cloud provider that matches ip: 2600:1f10:45a5:a918:5d99:c7b9:243:210f, requeuing

I realised that NodeIPFamilies defaults to only ipv4, so I added ipv6 to my cloudconfig:

[Global]
NodeIPFamilies=ipv4,ipv6

I can verify that the setting is picked up via the log line:

I1017 02:58:51.340872       1 aws.go:1433] The following IP families will be added to nodes: [ipv4,ipv6]

The controller is now failing with e.g.:

I1017 03:04:58.888797       1 node_controller.go:431] Initializing node i-083e6ed22b10ddf06 with cloud provider
E1017 03:04:59.302680       1 node_controller.go:240] error syncing 'i-083e6ed22b10ddf06': failed to get node modifiers from cloud provider: provided node ip for node "i-083e6ed22b10ddf06" is not valid: failed to get node address from cloud provider that matches ip: 10.24.152.220, requeuing
I1017 03:04:59.302717       1 node_controller.go:431] Initializing node i-083e6ed22b10ddf06 with cloud provider
E1017 03:04:59.548721       1 node_controller.go:240] error syncing 'i-083e6ed22b10ddf06': failed to get node modifiers from cloud provider: provided node ip for node "i-083e6ed22b10ddf06" is not valid: failed to get node address from cloud provider that matches ip: 10.24.152.220, requeuing
I1017 03:05:01.368647       1 node_controller.go:431] Initializing node i-083e6ed22b10ddf06 with cloud provider
E1017 03:05:01.690156       1 node_controller.go:240] error syncing 'i-083e6ed22b10ddf06': failed to get node modifiers from cloud provider: provided node ip for node "i-083e6ed22b10ddf06" is not valid: failed to get node address from cloud provider that matches ip: 10.24.152.220, requeuing
I1017 03:05:05.698132       1 node_controller.go:431] Initializing node i-083e6ed22b10ddf06 with cloud provider
E1017 03:05:06.089973       1 node_controller.go:240] error syncing 'i-083e6ed22b10ddf06': failed to get node modifiers from cloud provider: provided node ip for node "i-083e6ed22b10ddf06" is not valid: failed to get node address from cloud provider that matches ip: 10.24.152.220, requeuing
I1017 03:05:14.785853       1 node_controller.go:431] Initializing node i-083e6ed22b10ddf06 with cloud provider
E1017 03:05:15.083704       1 node_controller.go:240] error syncing 'i-083e6ed22b10ddf06': failed to get node modifiers from cloud provider: provided node ip for node "i-083e6ed22b10ddf06" is not valid: failed to get node address from cloud provider that matches ip: 10.24.152.220, requeuing

I'm not sure why it's failing to get the node address. See the output of aws ec2 describe-instances --instance-ids i-083e6ed22b10ddf06 | jq '.Reservations[].Instances[] | {PrivateIpAddress,Ipv6Address,NetworkInterfaces}':

{
  "PrivateIpAddress": "10.24.152.220",
  "Ipv6Address": "2600:1f10:45a5:a918:fd18:12af:1613:6c5d",
  "NetworkInterfaces": [
    {
      "Association": {
        "IpOwnerId": "amazon",
        "PublicDnsName": "ec2-3-85-73-150.compute-1.amazonaws.com",
        "PublicIp": "3.85.73.150"
      },
      "Attachment": {
        "AttachTime": "2023-10-17T03:03:45+00:00",
        "AttachmentId": "eni-attach-024b4933411c5f575",
        "DeleteOnTermination": true,
        "DeviceIndex": 0,
        "Status": "attached",
        "NetworkCardIndex": 0
      },
      "Description": "",
      "Groups": [
        {
          "GroupName": "internal-talos-worker-general",
          "GroupId": "sg-007b939554373cc2b"
        }
      ],
      "Ipv6Addresses": [
        {
          "Ipv6Address": "2600:1f10:45a5:a918:fd18:12af:1613:6c5d",
          "IsPrimaryIpv6": false
        }
      ],
      "MacAddress": "0e:41:8b:af:7f:5f",
      "NetworkInterfaceId": "eni-0aabf40c0e2dcd595",
      "OwnerId": "799078726966",
      "PrivateDnsName": "i-083e6ed22b10ddf06.ec2.internal",
      "PrivateIpAddress": "10.24.152.220",
      "PrivateIpAddresses": [
        {
          "Association": {
            "IpOwnerId": "amazon",
            "PublicDnsName": "ec2-3-85-73-150.compute-1.amazonaws.com",
            "PublicIp": "3.85.73.150"
          },
          "Primary": true,
          "PrivateDnsName": "i-083e6ed22b10ddf06.ec2.internal",
          "PrivateIpAddress": "10.24.152.220"
        }
      ],
      "SourceDestCheck": true,
      "Status": "in-use",
      "SubnetId": "subnet-00c5e1b9c4baddcb3",
      "VpcId": "vpc-060c91b3879fc8b83",
      "InterfaceType": "interface"
    }
  ]
}
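For context, the error text suggests a lookup of roughly the following shape: the node's IP is compared against the list of addresses the cloud provider returns. This is only a sketch under that assumption. `findNodeAddress` is a hypothetical helper, not the controller's code, and unlike the real error it includes the candidate list in the failure message:

```go
package main

import (
	"fmt"
	"net/netip"
)

// findNodeAddress returns the cloud-reported address equal to nodeIP.
// Hypothetical helper for illustration; not the cloud-provider implementation.
func findNodeAddress(nodeIP string, cloudAddrs []string) (string, error) {
	want, err := netip.ParseAddr(nodeIP)
	if err != nil {
		return "", fmt.Errorf("invalid node ip %q: %w", nodeIP, err)
	}
	for _, a := range cloudAddrs {
		got, err := netip.ParseAddr(a)
		if err != nil {
			continue // skip entries that are not IP literals, such as DNS names
		}
		if got == want {
			return a, nil
		}
	}
	// Listing the candidates makes a failed match much easier to debug.
	return "", fmt.Errorf("no address matches ip %s; cloud provider returned %v", nodeIP, cloudAddrs)
}

func main() {
	addrs := []string{"10.24.152.220", "2600:1f10:45a5:a918:fd18:12af:1613:6c5d"}
	if a, err := findNodeAddress("10.24.152.220", addrs); err == nil {
		fmt.Println("matched:", a)
	}
	// An empty candidate list reproduces the "no match" outcome.
	if _, err := findNodeAddress("10.24.152.220", nil); err != nil {
		fmt.Println("error:", err)
	}
}
```

If the lookup works like this, the failure for 10.24.152.220 would mean the list handed to the comparison did not contain that address, even though EC2 reports it.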

mmerkes commented 1 year ago

From poking around the code and seeing your info above, it's not apparent to me what went wrong yet. Would it be convenient to add additional logging? Would be curious what addresses get returned by the cloud provider given that the IP it's looking for is very apparent.

james-callahan commented 1 year ago

> Would it be convenient to add additional logging?

Not really for our configuration; would have to set up a whole custom build pipeline where we currently use the upstream image.

> Would be curious what addresses get returned by the cloud provider given that the IP it's looking for is very apparent.

Yeah that's probably a good debug log to add. Might be good to add it in any case?

mmerkes commented 1 year ago

> Not really for our configuration; would have to set up a whole custom build pipeline where we currently use the upstream image.

A repro would make it a lot easier to debug. Perhaps it could be set up via another mechanism, if it's an issue with the cloud provider.

> Yeah that's probably a good debug log to add. Might be good to add it in any case?

Ya. There's not a lot of logging in the cloud provider, though some of this could make sense to add in kubernetes/kubernetes, and seems very reasonable to add some debug level logging for exactly this kind of thing.

k8s-triage-robot commented 9 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

james-callahan commented 8 months ago

/remove-lifecycle stale

I would love it if someone could just add some more debug logging around this in the cloud provider. Then once there's another release I'd be able to share debug logs.

akunszt commented 6 months ago

We face the same issue. I created a cloud-config file, set NodeIPFamilies, and can see in the aws-cloud-controller-manager logs that it is in use. I also had to add --feature-gates=CloudDualStackNodeIPs=true to both the aws-cloud-controller-manager and the kubelet. When I set --node-ip=<IPv6 address>,<IPv4 address> on the kubelet, I received log lines like the one below, and the node was tainted with node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule.

2024-05-08T08:21:12.851616193Z E0508 08:21:12.851520       1 node_controller.go:240] error syncing 'i-08b4defa905155953.eu-west-1.compute.internal': failed to get node modifiers from cloud provider: provided node ip for node "i-08b4defa905155953.eu-west-1.compute.internal" is not valid: failed to get node address from cloud provider that matches ip: 2xxx:xxxx:xxxx:xxxx::c91a, requeuing

However, I saw both the IPv6 and the IPv4 address listed as InternalIP. Then I set --node-ip=:: for the kubelet and it suddenly started to work, but I saw only the IPv6 address as InternalIP, which is more or less expected based on the kubelet documentation. This is our test cluster, so if you tell me which logs or tests you want, I can run them.
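As a small illustration of the two --node-ip forms mentioned above, the sketch below classifies each entry of a comma-separated value by IP family. `nodeIPFamilies` is a hypothetical helper, not the kubelet's parser, and it does not reproduce the kubelet's validation rules. It only shows that `::` is itself an IPv6 literal (the unspecified address):

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// nodeIPFamilies reports the IP family of each entry in a comma-separated
// --node-ip style value. Hypothetical helper; not the kubelet's parser.
func nodeIPFamilies(value string) ([]string, error) {
	var families []string
	for _, part := range strings.Split(value, ",") {
		addr, err := netip.ParseAddr(strings.TrimSpace(part))
		if err != nil {
			return nil, fmt.Errorf("invalid IP %q: %w", part, err)
		}
		if addr.Is4() {
			families = append(families, "ipv4")
		} else {
			families = append(families, "ipv6")
		}
	}
	return families, nil
}

func main() {
	// Documentation addresses stand in for the redacted ones in the thread.
	fmt.Println(nodeIPFamilies("2001:db8::c91a,10.0.0.10"))
	fmt.Println(nodeIPFamilies("::"))
}
```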

akunszt commented 6 months ago

I think I found what caused this. I added a lot of klog.* lines to the NodeAddressesByProviderID function. This was interesting:

        for _, family := range c.cfg.Global.NodeIPFamilies {
                klog.Infof( "family: %v", family )

It generated this log line:

I0508 10:10:40.861561     881 aws.go:1676] family: ipv4,ipv6

So the configuration is parsed as the single string ipv4,ipv6 instead of being split into an array of values. I dug a little deeper and found out how to set a multi-value configuration (see https://pkg.go.dev/gopkg.in/gcfg.v1#example-ReadStringInto-Multivalue). After I changed cloud-config.conf to the following, everything started to work:

[Global]
NodeIPFamilies=ipv4
NodeIPFamilies=ipv6

I recommend including this in the documentation. It was a bit frustrating that I had to read the code, as I did not find any documentation on how to construct the cloud-config file (I even started with a YAML file at first).
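The difference between the two config forms can be sketched in plain Go. `normalizeFamilies` below is a hypothetical helper showing one way a comma-separated entry could be tolerated. Nothing in this thread indicates the provider does this. The first two prints also show why the earlier log line `[ipv4,ipv6]` is consistent with a one-element slice, assuming it was produced with Go's default slice formatting, which separates elements with spaces:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeFamilies splits comma-separated entries so that
// []string{"ipv4,ipv6"} becomes []string{"ipv4", "ipv6"}.
// Hypothetical helper for illustration only.
func normalizeFamilies(raw []string) []string {
	var out []string
	for _, entry := range raw {
		for _, f := range strings.Split(entry, ",") {
			if f = strings.TrimSpace(f); f != "" {
				out = append(out, strings.ToLower(f))
			}
		}
	}
	return out
}

func main() {
	oneLine := []string{"ipv4,ipv6"}     // NodeIPFamilies=ipv4,ipv6
	repeated := []string{"ipv4", "ipv6"} // NodeIPFamilies given twice

	fmt.Println(oneLine)  // prints [ipv4,ipv6]: a single element
	fmt.Println(repeated) // prints [ipv4 ipv6]: two elements

	fmt.Println(normalizeFamilies(oneLine)) // prints [ipv4 ipv6]
}
```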

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 month ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/cloud-provider-aws/issues/638#issuecomment-2395028144).

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
