kubernetes-sigs / cloud-provider-equinix-metal

Kubernetes Cloud Provider for Equinix Metal (formerly Packet Cloud Controller Manager)
https://deploy.equinix.com/labs/cloud-provider-equinix-metal
Apache License 2.0

IP address confusion with JuJu provider #193

Closed: thebsdbox closed this issue 2 years ago

thebsdbox commented 3 years ago

So AFAICT a Juju Kubernetes deployment uses a fan network (https://juju.is/docs/olm/fan-container-networking) which appears to be an additional network on top of the existing network to simplify address usage.

It appears that the CCM and/or Kubernetes Controller code is having trouble reconciling the FAN network address with that of the node itself?

dan@code:~$ kubectl logs -n kube-system                       cloud-provider-equinix-metal-7898974cf8-8fjmp
I0727 08:22:30.960667       1 main.go:232] authToken: '<masked>'
I0727 08:22:30.960748       1 main.go:232] projectID: '5f5f52fb-3d9a-4400-902b-cb6ca713401c'
I0727 08:22:30.960760       1 main.go:232] load balancer config: ''%!s(MISSING)
I0727 08:22:30.960768       1 main.go:232] kube-vip://
I0727 08:22:30.960775       1 main.go:232] facility: 'da11'
I0727 08:22:30.960782       1 main.go:232] local ASN: '65000'
I0727 08:22:30.960789       1 main.go:232] Elastic IP Tag: ''
I0727 08:22:30.960795       1 main.go:232] API Server Port: '6443'
I0727 08:22:30.960802       1 main.go:232] BGP Node Selector: ''
I0727 08:22:31.174279       1 serving.go:331] Generated self-signed cert in-memory
W0727 08:22:31.410237       1 client_config.go:608] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0727 08:22:31.411214       1 controllermanager.go:127] Version: v0.0.0-master+$Format:%h$
I0727 08:22:31.411750       1 secure_serving.go:197] Serving securely on [::]:10258
I0727 08:22:31.411817       1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0727 08:22:31.426201       1 loadbalancers.go:69] loadbalancer implementation enabled: kube-vip
2021/07/27 08:22:31 [DEBUG] GET https://api.equinix.com/metal/v1/projects/5f5f52fb-3d9a-4400-902b-cb6ca713401c/bgp-config?
E0727 08:22:31.547653       1 cloud.go:206] failed to update and sync node for add juju-38bb12-0 for handler: no provider ID given for node juju-38bb12-0
E0727 08:22:31.548052       1 cloud.go:206] failed to update and sync node for add juju-38bb12-0 for handler: no provider ID given
E0727 08:22:31.548076       1 cloud.go:206] failed to update and sync node for add juju-38bb12-0 for handler: control plane loadbalancer elastic ip tag is empty. Nothing to do
E0727 08:22:31.548101       1 cloud.go:206] failed to update and sync node for add juju-38bb12-1 for handler: no provider ID given for node juju-38bb12-1
E0727 08:22:31.548120       1 cloud.go:206] failed to update and sync node for add juju-38bb12-1 for handler: no provider ID given
E0727 08:22:31.548146       1 cloud.go:206] failed to update and sync node for add juju-38bb12-1 for handler: control plane loadbalancer elastic ip tag is empty. Nothing to do
E0727 08:22:31.548175       1 cloud.go:206] failed to update and sync node for add juju-38bb12-2 for handler: no provider ID given for node juju-38bb12-2
E0727 08:22:31.548213       1 cloud.go:206] failed to update and sync node for add juju-38bb12-2 for handler: no provider ID given
E0727 08:22:31.548229       1 cloud.go:206] failed to update and sync node for add juju-38bb12-2 for handler: control plane loadbalancer elastic ip tag is empty. Nothing to do
I0727 08:22:31.641414       1 cloud.go:242] nodes watcher started
2021/07/27 08:22:31 [DEBUG] GET https://api.equinix.com/metal/v1/projects/5f5f52fb-3d9a-4400-902b-cb6ca713401c/ips?
I0727 08:22:31.741831       1 cloud.go:297] services watcher started
I0727 08:22:31.743575       1 node_lifecycle_controller.go:77] Sending events to api server
W0727 08:22:31.743687       1 cloud.go:156] The Equinix Metal cloud provider does not support InstancesV2
I0727 08:22:31.743734       1 controllermanager.go:254] Started "cloud-node-lifecycle"
E0727 08:22:31.745696       1 core.go:97] Failed to start service controller: the cloud provider does not support external load balancers
W0727 08:22:31.745726       1 controllermanager.go:251] Skipping "service"
I0727 08:22:31.745740       1 core.go:108] Will not configure cloud provider routes for allocate-node-cidrs: false, configure-cloud-routes: true.
W0727 08:22:31.745752       1 controllermanager.go:251] Skipping "route"
I0727 08:22:31.747303       1 node_controller.go:108] Sending events to api server.
W0727 08:22:31.747313       1 cloud.go:156] The Equinix Metal cloud provider does not support InstancesV2
I0727 08:22:31.747319       1 controllermanager.go:254] Started "cloud-node"
I0727 08:22:31.755381       1 node_controller.go:329] Initializing node juju-38bb12-2 with cloud provider
W0727 08:22:31.762027       1 cloud.go:156] The Equinix Metal cloud provider does not support InstancesV2
2021/07/27 08:22:31 [DEBUG] GET https://api.equinix.com/metal/v1/projects/5f5f52fb-3d9a-4400-902b-cb6ca713401c/devices?include=facility
W0727 08:22:32.135989       1 cloud.go:156] The Equinix Metal cloud provider does not support InstancesV2
2021/07/27 08:22:32 [DEBUG] GET https://api.equinix.com/metal/v1/devices/89af9b0f-f606-4683-bb48-e77c66ce7bdc?include=facility
2021/07/27 08:22:32 [DEBUG] GET https://api.equinix.com/metal/v1/devices/89af9b0f-f606-4683-bb48-e77c66ce7bdc?include=facility
E0727 08:22:32.389151       1 cloud.go:262] failed to update and sync service for add kubernetes-dashboard/dashboard-metrics-scraper: elastic ip tag is empty. Nothing to do
2021/07/27 08:22:32 [DEBUG] GET https://api.equinix.com/metal/v1/projects/5f5f52fb-3d9a-4400-902b-cb6ca713401c/ips?
E0727 08:22:32.440151       1 node_controller.go:370] failed to initialize node juju-38bb12-2 at cloudprovider: failed to find kubelet node IP from cloud provider
I0727 08:22:32.440205       1 node_controller.go:329] Initializing node juju-38bb12-0 with cloud provider
W0727 08:22:32.445979       1 cloud.go:156] The Equinix Metal cloud provider does not support InstancesV2
2021/07/27 08:22:32 [DEBUG] GET https://api.equinix.com/metal/v1/projects/5f5f52fb-3d9a-4400-902b-cb6ca713401c/devices?include=facility
W0727 08:22:32.794617       1 cloud.go:156] The Equinix Metal cloud provider does not support InstancesV2
2021/07/27 08:22:32 [DEBUG] GET https://api.equinix.com/metal/v1/devices/84d83fa1-588d-4020-800b-a07928a00c38?include=facility
2021/07/27 08:22:32 [DEBUG] GET https://api.equinix.com/metal/v1/devices/84d83fa1-588d-4020-800b-a07928a00c38?include=facility
E0727 08:22:33.090545       1 node_controller.go:370] failed to initialize node juju-38bb12-0 at cloudprovider: failed to find kubelet node IP from cloud provider
I0727 08:22:33.090595       1 node_controller.go:329] Initializing node juju-38bb12-1 with cloud provider
W0727 08:22:33.103073       1 cloud.go:156] The Equinix Metal cloud provider does not support InstancesV2
2021/07/27 08:22:33 [DEBUG] GET https://api.equinix.com/metal/v1/projects/5f5f52fb-3d9a-4400-902b-cb6ca713401c/devices?include=facility
E0727 08:22:33.257858       1 cloud.go:262] failed to update and sync service for add kubernetes-dashboard/kubernetes-dashboard: elastic ip tag is empty. Nothing to do
2021/07/27 08:22:33 [DEBUG] GET https://api.equinix.com/metal/v1/projects/5f5f52fb-3d9a-4400-902b-cb6ca713401c/ips?
W0727 08:22:33.452488       1 cloud.go:156] The Equinix Metal cloud provider does not support InstancesV2
2021/07/27 08:22:33 [DEBUG] GET https://api.equinix.com/metal/v1/devices/35b9b854-ad28-4510-856d-b7f7c3ebccc0?include=facility
2021/07/27 08:22:33 [DEBUG] GET https://api.equinix.com/metal/v1/devices/35b9b854-ad28-4510-856d-b7f7c3ebccc0?include=facility
E0727 08:22:33.895424       1 node_controller.go:370] failed to initialize node juju-38bb12-1 at cloudprovider: failed to find kubelet node IP from cloud provider

The main error is "failed to find kubelet node IP from cloud provider".

If we look at the nodes:

dan@code:~$ kubectl describe node juju-38bb12-0
Name:               juju-38bb12-0
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    juju-application=kubernetes-worker
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=juju-38bb12-0
                    kubernetes.io/os=linux
Annotations:        alpha.kubernetes.io/provided-node-ip: 252.6.0.1
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 26 Jul 2021 17:56:12 +0100
Taints:             node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  juju-38bb12-0
  AcquireTime:     <unset>
  RenewTime:       Tue, 27 Jul 2021 11:24:18 +0100
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 27 Jul 2021 11:23:18 +0100   Mon, 26 Jul 2021 17:56:12 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 27 Jul 2021 11:23:18 +0100   Mon, 26 Jul 2021 17:56:12 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 27 Jul 2021 11:23:18 +0100   Mon, 26 Jul 2021 17:56:12 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Tue, 27 Jul 2021 11:23:18 +0100   Mon, 26 Jul 2021 17:57:12 +0100   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  252.6.0.1
  Hostname:    juju-38bb12-0
Capacity:
  cpu:                16
  ephemeral-storage:  459395020Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32615544Ki
  pods:               110
Allocatable:
  cpu:                16
  ephemeral-storage:  423378449732
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32513144Ki
  pods:               110
System Info:
  Machine ID:                 085c810006a14ea680085b21a4623d16
  System UUID:                03000200-0400-0500-0006-000700080009
  Boot ID:                    10eedcda-651c-44cf-938b-d5a087e7a668
  Kernel Version:             5.4.0-77-generic
  OS Image:                   Ubuntu 20.04.2 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.5.2
  Kubelet Version:            v1.21.1
  Kube-Proxy Version:         v1.21.1
Non-terminated Pods:          (0 in total)
  Namespace                   Name    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----    ------------  ----------  ---------------  -------------  ---
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests  Limits
  --------           --------  ------
  cpu                0 (0%)    0 (0%)
  memory             0 (0%)    0 (0%)
  ephemeral-storage  0 (0%)    0 (0%)
  hugepages-1Gi      0 (0%)    0 (0%)
  hugepages-2Mi      0 (0%)    0 (0%)
Events:              <none>

It looks like alpha.kubernetes.io/provided-node-ip: 252.6.0.1 is being compared with what is expected from the EM API?

At the moment the Juju clusters are stuck, as the nodes are all tainted as unready and the CCM/controller code can't proceed.

For testing I removed the annotation and the following happened:

I0727 08:31:59.331527       1 node_controller.go:329] Initializing node juju-38bb12-2 with cloud provider
W0727 08:31:59.339143       1 cloud.go:156] The Equinix Metal cloud provider does not support InstancesV2
2021/07/27 08:31:59 [DEBUG] GET https://api.equinix.com/metal/v1/projects/5f5f52fb-3d9a-4400-902b-cb6ca713401c/devices?include=facility
W0727 08:32:00.409276       1 cloud.go:156] The Equinix Metal cloud provider does not support InstancesV2
2021/07/27 08:32:00 [DEBUG] GET https://api.equinix.com/metal/v1/devices/89af9b0f-f606-4683-bb48-e77c66ce7bdc?include=facility
2021/07/27 08:32:00 [DEBUG] GET https://api.equinix.com/metal/v1/devices/89af9b0f-f606-4683-bb48-e77c66ce7bdc?include=facility
2021/07/27 08:32:00 [DEBUG] GET https://api.equinix.com/metal/v1/devices/89af9b0f-f606-4683-bb48-e77c66ce7bdc?include=facility
W0727 08:32:00.842533       1 cloud.go:156] The Equinix Metal cloud provider does not support InstancesV2
2021/07/27 08:32:00 [DEBUG] GET https://api.equinix.com/metal/v1/devices/89af9b0f-f606-4683-bb48-e77c66ce7bdc?include=facility
I0727 08:32:00.997496       1 node_controller.go:397] Successfully initialized node juju-38bb12-2 with cloud provider
W0727 08:32:31.771450       1 cloud.go:156] The Equinix Metal cloud provider does not support InstancesV2
2021/07/27 08:32:31 [DEBUG] GET https://api.equinix.com/metal/v1/devices/89af9b0f-f606-4683-bb48-e77c66ce7bdc?include=facility
E0727 08:32:31.916478       1 node_controller.go:281] Specified Node IP not found in cloudprovider for node "juju-38bb12-2"
deitch commented 3 years ago

I spent the last hour digging into the source for cloud-provider, trying to see why this happens, or even how it can work at all. In the end, I am not sure it should.

Look at this description of the annotation:

When kubelet is started with the "external" cloud provider, it sets this annotation on the Node to denote an IP address set from the command line flag (--node-ip). This IP is verified with the cloud provider as valid by the cloud-controller-manager

If you set an explicit IP address using --node-ip, then it must be verified with the cloud provider via CCM. As I read this, it is illegitimate to set an IP that isn't valid with the cloud provider. To be fair, I don't know why you would, as that would make the node unroutable and useless.

If juju is setting --node-ip to something invalid for the cloud provider (in this case, EQXM), then it is doing something wrong.

From a different perspective, having read the fan networking docs, my understanding is that this should affect the containers only, not the underlying nodes. So why are we setting --node-ip to anything?

grebennikov commented 3 years ago

@deitch The reason is the following: I have to deploy the control plane components into Linux containers so that I don't over-consume physical nodes by running only one piece of the control plane (kube-controller, etcd, etc.) on each. Each container needs an IP address, and it can only be assigned a fan address. Because there is no concept of spaces for now, all subnets available on the node fall under one default space, and the "ingress_address" array of the worker contains only the fan address. I can't force it to contain multiple addresses manually.

The charm reads the contents of that array and finds only one address, which happens to be the fan address. This address becomes the "node-ip" for kubelet.

deitch commented 3 years ago

You actually can deploy into a container, so you can manage resources, while sharing the IP space with the underlying host. Since containers really are just constructs built from chroot, namespaces, and cgroups, you could put the components in containers that do not have a dedicated network namespace. In docker terms, it is just --network host, but it is the same thing here.

You may have a good reason, but I don't know that we can do anything with it. It sounds to me like the usage is in conflict with how Kubernetes expects things to work.

displague commented 2 years ago

I was looking at a cluster configuration with @cprivitere where each node had:

    • an EM-assigned public IP
    • an EM-assigned private IP (10/8)
    • a user-managed VLAN IP (192.168/16)

The VLAN is used for Node addressing in order to stretch the cluster to other environments.

When the CPEM registers node addresses it currently overrides the InternalIP address (192.168/16) with the unused EM Project IP (10/8).

In this environment, it would be desirable to disable the part of the cloud provider that sets the node's InternalIP:

https://github.com/equinix/cloud-provider-equinix-metal/blob/cdbe540a3b6132e3d1c8a9df3be48a2eb2478302/metal/devices.go#L110-L113

An argument like --without-node-internal-ip (named to fit our current naming practice) could be how we deliver this.
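
A rough sketch of how that flag could gate the address handling (the flag name, the simplified address type, and the helper below are hypothetical, not the current devices.go code):

    package main

    import (
        "fmt"

        v1 "k8s.io/api/core/v1"
    )

    // deviceIP is a simplified stand-in for an address record returned by the
    // Equinix Metal API; only the fields this sketch needs are included.
    type deviceIP struct {
        Address string
        Public  bool
    }

    // nodeAddresses sketches how a hypothetical --without-node-internal-ip flag
    // could work: public addresses are still reported as ExternalIP, while
    // EM-assigned private addresses are skipped so that an existing InternalIP
    // on the node (for example a VLAN address) is left untouched.
    func nodeAddresses(ips []deviceIP, withoutNodeInternalIP bool) []v1.NodeAddress {
        var addrs []v1.NodeAddress
        for _, ip := range ips {
            switch {
            case ip.Public:
                addrs = append(addrs, v1.NodeAddress{Type: v1.NodeExternalIP, Address: ip.Address})
            case !withoutNodeInternalIP:
                addrs = append(addrs, v1.NodeAddress{Type: v1.NodeInternalIP, Address: ip.Address})
            }
        }
        return addrs
    }

    func main() {
        ips := []deviceIP{
            {Address: "203.0.113.10", Public: true}, // EM-assigned public IP
            {Address: "10.66.0.5", Public: false},   // EM-assigned project-private IP
        }
        // With the flag set, only the public address is reported, so the VLAN
        // InternalIP already on the node would not be overridden.
        fmt.Println(nodeAddresses(ips, true))
    }

With the flag set, CPEM would still report the public address, but would stop replacing whatever InternalIP the node already carries.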

Thoughts, @deitch?

deitch commented 2 years ago

I am not totally sure I understand what you are asking for, @displague.

The way it works, CPEM just gets all of the addresses and appends them to the list provided to InstanceMetadata(). In the case you described, it should find 1 of type v1.NodeExternalIP and 2 of type v1.NodeInternalIP. That assumes that the VLAN address is returned from the EQXM API with address.Public == false.

So k8s gets 3 addresses. What it chooses to do with them is largely its choice, is it not? Is there some official way to tell k8s, "I am giving you multiple private addresses; this is the one you should pick for InternalIP"?

displague commented 2 years ago

@deitch in the case of a VLAN address, EM API doesn't manage the address and knows nothing about it. The user defined the VLAN address space as the InternalIP range during kubeadm install, for example. InternalIP is set to 192.168.x.x before CPEM is installed.

These nodes are using Hybrid bonding modes, so the EM API is going to include a 10.x.x.x address and a public address if one was requested, or by default.

In this Hybrid case, the user could prefer to keep the original InternalIP, the one that works over the VLAN, attached to stretch cluster nodes.

The feature request, then, is to have a way to tell CPEM to skip any public=false (InternalIP) overrides it would otherwise have done.

deitch commented 2 years ago

Ah, I didn't think about that part. If we update my earlier comment, then it looks like this.

Normal Scenario

  1. Device has 1 EQXM-assigned private IP and 1 EQXM-assigned public IP
  2. cloud-provider calls CPEM InstanceMetadata()
  3. CPEM uses EQXM API and discovers 1 EQXM-assigned private IP and 1 EQXM-assigned public IP
  4. CPEM returns EQXM-API assigned private and public IPs that it found
  5. k8s uses the private as InternalIP

All is good

VLAN scenario

(new compared to the Normal Scenario: step 2, plus the trailing clauses in steps 4-6)

  1. Device has 1 EQXM-assigned private IP and 1 EQXM-assigned public IP
  2. User adds VLAN with private IP
  3. cloud-provider calls CPEM InstanceMetadata()
  4. CPEM uses the EQXM API and discovers 1 EQXM-assigned private IP and 1 EQXM-assigned public IP; because the VLAN IP is customer-assigned, the EQXM API doesn't know about it, and so neither does CPEM
  5. CPEM returns the EQXM-API-assigned private and public IPs that it found, not including the VLAN IP, which it did not find
  6. k8s uses the EQXM private IP as InternalIP, missing the VLAN IP

Is that it?

The problem, then, is how we get one of the following:

  1. CPEM learns about the VLAN IP and includes it in the addresses it returns
  2. CPEM leaves the node's existing private InternalIP alone instead of overriding it

I don't know how we do the first; if EQXM API doesn't know about it, how would CPEM discover it? We could get into all sorts of interesting logic using DaemonSets or other APIs, but that appears to be a bad path to walk down. We could tell CPEM about it, but as there is just one CPEM per cluster, we would need a big map of all of them.

For the second - which is what I think you were asking about - we could have some config that tells it to ignore private IPs. I don't know if that would create issues with elastic IPs and such, or multiple private IPs (which you can have). Would we ignore all private IPs found via EQXM API?

displague commented 2 years ago

@deitch I think the solution is closer to the second path. I think the cluster already has the internal IP in the node IP address list before CPEM starts. CPEM currently overrides the IP Address list. If CPEM patched the existing list, adding only what it discovered from the EM API, then all of the desired and possible addresses would be available.

So, the questions I see are:

I think the first option is a good option and first step.

deitch commented 2 years ago

Yeah, I don't at all want to get into the business of teaching CPEM how to look elsewhere for addresses. It is meant to bridge between k8s and EQXM, and that is what it should do. Injecting additional addresses also becomes very messy, not to mention that it would not know where to find addresses for new nodes that were added. Which would bring us back into some 3rd API (after k8s and EQXM) that we would need to teach it to query. Not somewhere we want to go.

The signature of InstanceMetadata() is here:

InstanceMetadata(ctx context.Context, node *v1.Node) (*InstanceMetadata, error)

Since the v1.Node is passed in as a parameter, whatever was set beforehand can be augmented rather than simply replaced.

The docs would look like:

For each node, CPEM adds any private and public IPs found via the EQXM API to the existing IPs, while ensuring no duplicates occur. If you have additional IPs that should be on the node that the EQXM API is not aware of (for example, a customer-managed VLAN address), make sure these are set on the node at initialization; CPEM will augment them without replacing them.
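
A minimal sketch of that augment-and-dedupe behaviour (the helper is hypothetical; the real change would be in how CPEM builds the node address list it returns):

    package metal

    import v1 "k8s.io/api/core/v1"

    // augmentAddresses starts from the addresses already on the node (for
    // example a kubelet-provided VLAN InternalIP) and appends the addresses
    // discovered via the EQXM API, skipping any value that is already present,
    // so existing entries are kept rather than replaced.
    func augmentAddresses(node *v1.Node, discovered []v1.NodeAddress) []v1.NodeAddress {
        merged := append([]v1.NodeAddress{}, node.Status.Addresses...)
        seen := make(map[string]bool, len(merged))
        for _, a := range merged {
            seen[a.Address] = true
        }
        for _, a := range discovered {
            if seen[a.Address] {
                continue // already on the node; keep the existing entry and its type
            }
            seen[a.Address] = true
            merged = append(merged, a)
        }
        return merged
    }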

The question is, would this work? kubelet has an option called --node-ip whose description is:

IP address (or comma-separated dual-stack IP addresses) of the node. If unset, kubelet will use the node's default IPv4 address, if any, or its default IPv6 address if it has no IPv4 addresses. You can pass '::' to make it prefer the default IPv6 address rather than the default IPv4 address.

Is that the private (aka "Internal") IP?

deitch commented 2 years ago

@displague we have been dealing with this one for quite some time. I would like to get this closed out. Do you think the solution I proposed in the previous message is correct?

displague commented 2 years ago

That approach sounds great, @deitch.

deitch commented 2 years ago

@displague I am no longer sure. The current behaviour actually should work correctly. Let me walk through it.

When cloud-provider calls InstancesV2.InstanceMetadata(), it gets the addresses. That happens here. It eventually uses that data to call updateNodeAddress():

        cnc.updateNodeAddress(ctx, newNode, instanceMetadata)

Note that it passes the instanceMetadata and the existing node information.

The implementation of updateNodeAddress is here.

Before applying those addresses, it checks if the node already has --node-ip provided here, and then uses that address (if any), along with the instance metadata addresses, to construct the final address list, preferring the node-provided IP:

    // If kubelet provided a node IP, prefer it in the node address list
    nodeIP, err := getNodeProvidedIP(node)
...
    if nodeIP != nil {
        nodeAddresses, err = cloudnodeutil.PreferNodeIP(nodeIP, nodeAddresses)
        if err != nil {
            klog.Errorf("Failed to update node addresses for node %q: %v", node.Name, err)
            return
        }
    }

PreferNodeIP() is defined here, but the really relevant part is this comment (and its implementation):

    // For every address supplied by the cloud provider that matches nodeIP, nodeIP is the enforced node address for
    // that address Type (like InternalIP and ExternalIP), meaning other addresses of the same Type are discarded.
    // See #61921 for more information: some cloud providers may supply secondary IPs, so nodeIP serves as a way to
    // ensure that the correct IPs show up on a Node object.

So if we already are supplying --node-ip, the comment implies that it should override. I am going to open an issue there, see this issue.
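
To see why the fan/VLAN case fails, here is a rough re-implementation of the behaviour that comment describes (a sketch of the documented logic, not the upstream helper):

    package main

    import (
        "fmt"

        v1 "k8s.io/api/core/v1"
    )

    // preferNodeIP mimics the documented behaviour: if the kubelet-provided
    // node IP matches an address supplied by the cloud provider, it becomes
    // the enforced address for that type and other addresses of the same type
    // are dropped; if it matches nothing, the address update fails.
    func preferNodeIP(nodeIP string, cloudAddrs []v1.NodeAddress) ([]v1.NodeAddress, error) {
        var enforcedType v1.NodeAddressType
        for _, a := range cloudAddrs {
            if a.Address == nodeIP {
                enforcedType = a.Type
                break
            }
        }
        if enforcedType == "" {
            return nil, fmt.Errorf("no cloud provider address matches node IP %s", nodeIP)
        }
        out := []v1.NodeAddress{{Type: enforcedType, Address: nodeIP}}
        for _, a := range cloudAddrs {
            if a.Type != enforcedType {
                out = append(out, a)
            }
        }
        return out, nil
    }

    func main() {
        cloud := []v1.NodeAddress{
            {Type: v1.NodeInternalIP, Address: "10.66.0.5"},    // EQXM private IP
            {Type: v1.NodeExternalIP, Address: "203.0.113.10"}, // EQXM public IP
        }
        // The fan address from the original report is not in the cloud list,
        // so the update fails and the node never gets initialized.
        _, err := preferNodeIP("252.6.0.1", cloud)
        fmt.Println(err)
    }

That failure path lines up with the node IP errors in the original report: the fan address supplied via --node-ip never appears in the list CPEM returns, so the address update never succeeds.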

deitch commented 2 years ago

@displague can you confirm that we actually are providing --node-ip?

deitch commented 2 years ago

@displague we should try and move this one out. Do you have any updates?

displague commented 2 years ago

I wasn't aware that node-ip already had special treatment in the cloud-provider address list. Thanks for looking into that, @deitch.

I think https://github.com/kubernetes/cloud-provider/blob/5f25396ae5208d459715ab7642ec6b9a9144616e/node/helpers/address.go#L100-L103 confirms my suspicion that the node-ip is not preserved when the cloud-provider IP list is processed.

In the case of Equinix Metal, the user-supplied --node-ip should be given special treatment because the EM API is unaware of VLAN IP configurations. We have to trust that the user-supplied node-ip is valid and preferred.

deitch commented 2 years ago

Going back to the core of the issue. The problem definition was: given the use case of L2/VLAN, with private IP unknown to EQXM, how do we get Kubernetes to use the private VLAN address as the InternalIP of the node?

In order for that to work, the CCM has to be told about the IP, and it only has access to the v1.Node struct as provided by Kubernetes and to the EQXM API. This IP is not known to the EQXM API, so it has to come from Kubernetes itself prior to invoking the CCM. Either way, the CCM is a Deployment with replicas=1, not a DaemonSet, so it has no way of investigating the local IPs on each node; we have no interest in creating a DaemonSet, and in any case it is difficult to think of a sane algorithm that would do so without causing more problems than it solves.

This then boils down to:

  1. How does this IP get told to Kubernetes, specifically the kubelet on the node?
  2. How does CPEM get told about this IP via Kubernetes such that the response would cause Kubernetes to make it the primary InternalIP? That would be one of:
    • kubernetes/cloud-provider already handles it such that it is the InternalIP no matter what CPEM does
    • kubernetes/cloud-provider passes it to CPEM, which needs to account for it in the response

I will run some experiments - unrelated to JuJu - to see what happens when you set --node-ip (need to think how to construct them without breaking anything; easiest probably is an extra private IP from the range).

At the same time, it would be helpful to know what the actual use case was that triggered this. Was it --node-ip? Is JuJu still relevant? Or are we after the generic, "let's get --node-ip working with CPEM"?

displague commented 2 years ago
  1. How does this IP get told to Kubernetes, specifically the kubelet on the node?

Was it --node-ip? Is JuJu still relevant? Or are we after the generic, "let's get --node-ip working with CPEM"?

It looks like @thebsdbox provided the hint in the description. --external sets this in motion. https://kubernetes.io/docs/reference/labels-annotations-taints/#alpha-kubernetes-io-provided-node-ip

  1. How does CPEM get told about this IP via Kubernetes such that the response would cause Kubernetes to make it the primary InternalIP? That be one of:

I think your bullets are correct. I would imagine the v1.Node passed to the CCM pod (regardless of where or how it was running) could reflect any Node. I imagine this v1.Node object will already have the InternalIP present, matching what the Node object in Kubernetes shows before the CCM runs and changes it.

displague commented 2 years ago

It is also worth keeping in mind that the user may have a public IP address (either Elastic, perhaps global) defined as the node-ip. These additional IPs will be available in the Equinix Metal API but the one that was indicated as --node-ip should stay an InternalIP and not get moved into the ExternalIP list (which would be the behavior today, based on the Equinix Metal API private field).

deitch commented 2 years ago

It is also worth keeping in mind that the user may have a public IP address (either Elastic, perhaps global) defined as the node-ip.

Oh, let's make it more complicated.

These additional IPs will be available in the Equinix Metal API

Yes, but we have no way of knowing that it is intended for the following 3 nodes. It might be BGP announced; it might be attached using the EQXM API, but only attached to one of them now.

As far as my understanding of "external IP" in kubernetes goes, it should be an IP owned by the node, not a higher-level LB IP.

the one that was indicated as --node-ip should stay an InternalIP and not get moved into the ExternalIP list

Agreed. Going to take time to set up the right tests for this.

and not get moved into the ExternalIP list (which would be the behavior today, based on the Equinix Metal API private field).

They would?

displague commented 2 years ago

and not get moved into the ExternalIP list (which would be the behavior today, based on the Equinix Metal API private field).

They would?

Say I have 198.51.100.1 and 203.0.113.1 (both from RFC-5735) assigned to my device, one from provisioning and one from elastic assignment. The Equinix Metal API (EMA?) will return these IPs with {"public": true, ...}.

I may have chosen 198.51.100.1 to use as --node-ip, and so the v1.Node would include it in the InternalIP list.

CCM, as it is today, would reset the existing Internal and External IP lists, and then see these IPs as Public and assign them both to the External IP list.

https://github.com/equinix/cloud-provider-equinix-metal/blob/8c363dab030d0bf8424139aae946fd316a14a145/metal/devices.go#L107-L113

The correct behavior, in this case, would be to keep the existing 198.51.100.1 as an InternalIP. From a code change perspective, I think what we would do is ignore any addresses from EMA that are already found in the v1.NodeAddress list (regardless of Internal or External).
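
As a concrete illustration of that rule (a sketch only, not the actual change): the EMA-reported 198.51.100.1 is skipped because the node already carries it as an InternalIP, while 203.0.113.1 is still added as an ExternalIP.

    package main

    import (
        "fmt"

        v1 "k8s.io/api/core/v1"
    )

    func main() {
        // Node as it looks before CPEM runs: the user picked 198.51.100.1 as
        // --node-ip, so it is already recorded as an InternalIP.
        node := v1.Node{Status: v1.NodeStatus{Addresses: []v1.NodeAddress{
            {Type: v1.NodeInternalIP, Address: "198.51.100.1"},
        }}}

        // Addresses as the Equinix Metal API reports them: both public.
        discovered := []v1.NodeAddress{
            {Type: v1.NodeExternalIP, Address: "198.51.100.1"},
            {Type: v1.NodeExternalIP, Address: "203.0.113.1"},
        }

        // Skip anything already on the node, regardless of type, so the
        // existing InternalIP entry (and its type) wins.
        existing := make(map[string]bool)
        for _, a := range node.Status.Addresses {
            existing[a.Address] = true
        }
        final := append([]v1.NodeAddress{}, node.Status.Addresses...)
        for _, a := range discovered {
            if !existing[a.Address] {
                final = append(final, a)
            }
        }
        fmt.Println(final) // InternalIP 198.51.100.1, ExternalIP 203.0.113.1
    }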

deitch commented 2 years ago

I believe this is finally fixed via #297.