Closed thebsdbox closed 2 years ago
I spent the last hour digging into the source for cloud-provider, trying to see why this happens, or even how it can work at all. In the end, I am not sure it should.
Look at this description of the annotation:
When kubelet is started with the "external" cloud provider, it sets this annotation on the Node to denote an IP address set from the command line flag (--node-ip). This IP is verified with the cloud provider as valid by the cloud-controller-manager
If you set an explicit IP address using --node-ip, then it must be verified with the cloud provider via CCM. As I read this, it is illegitimate to set an IP that isn't valid with the cloud provider. To be fair, I don't know why you would, as that would make the node unroutable and useless.
If juju is setting --node-ip to something invalid for the cloud provider (in this case, EQXM), then it is doing something wrong.
From a different perspective, having read the fan networking docs, my understanding is that this should affect the containers only, not the underlying nodes. So why are we setting --node-ip to anything?
@deitch The reason is the following. I have to deploy components of the control plane into Linux containers so that I don't over-consume physical nodes by running only one piece of it on each (kube-controller, etcd, etc.). Each container should have an IP address, and it can only be assigned a fan address. Because there is no concept of spaces for now, all subnets available on the node fall under one default space, and the "ingress_address" array of the worker only contains the fan address. I can't force it to contain multiple addresses manually.
The charm reads the content of that array and finds only one address, which happens to be the fan address. This address becomes the "node-ip" for kubelet.
You actually can deploy into a container so you can manage resources while sharing the IP space with the underlying host. Since containers really are just constructs combining chroot, namespaces, and cgroups, you could put the workloads in containers that do not have a dedicated network namespace. In Docker terms, it is just --network host, but it is the same thing.
You may have a good reason, but I don't know that we can do anything with it. It sounds to me like the usage is in conflict with how Kubernetes expects things to work.
I was looking at a cluster configuration with @cprivitere where each node had:
The VLAN is used for Node addressing in order to stretch the cluster to other environments.
When the CPEM registers node addresses, it currently overrides the InternalIP address (192.168/16) with the unused EM Project IP (10/8).
In this environment, it would be desirable to disable the InternalIP-describing function of the cloud provider. An argument like --without-node-internal-ip (to fit our current naming practice) could be how we deliver this.
Thoughts, @deitch?
I am not totally sure I understand what you are asking for @displague .
The way it works, CPEM just gets all of the addresses and appends them to the list provided to InstanceMetadata(). In the case you described, it should find 1 of type v1.NodeExternalIP and 2 of type v1.NodeInternalIP. That assumes that the VLAN address is returned from the EQXM API with address.Public == false.
So k8s gets 3 addresses. What it chooses to do with them is largely its choice, is it not? Is there some official way to tell k8s, "I am giving you 3 private addresses; this is the one you pick for InternalIP"?
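For concreteness, a minimal sketch of the three addresses described above, using local stand-in types rather than the real k8s.io/api/core/v1 package (the IPs themselves are illustrative, not from any real device):

```go
package main

import "fmt"

// Minimal stand-ins for the k8s.io/api/core/v1 address types; the real
// CPEM code uses v1.NodeAddress, v1.NodeExternalIP, and v1.NodeInternalIP.
type NodeAddressType string

const (
	NodeExternalIP NodeAddressType = "ExternalIP"
	NodeInternalIP NodeAddressType = "InternalIP"
)

type NodeAddress struct {
	Type    NodeAddressType
	Address string
}

// What CPEM might hand back in the scenario above: one public IP, the
// EQXM-managed private IP (10/8), and the user-managed VLAN IP (192.168/16).
var addresses = []NodeAddress{
	{Type: NodeExternalIP, Address: "203.0.113.10"},
	{Type: NodeInternalIP, Address: "10.2.3.4"},
	{Type: NodeInternalIP, Address: "192.168.1.10"},
}

func main() {
	for _, a := range addresses {
		fmt.Printf("%s: %s\n", a.Type, a.Address)
	}
}
```

The open question in the thread is exactly this: given two InternalIP candidates in that list, nothing in the list itself tells Kubernetes which one to prefer.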
@deitch in the case of a VLAN address, EM API doesn't manage the address and knows nothing about it. The user defined the VLAN address space as the InternalIP range during kubeadm install, for example. InternalIP is set to 192.168.x.x before CPEM is installed.
These nodes are using Hybrid bonding modes, so the EM API is going to include a 10.x.x.x address and a public address if one was requested, or by default.
In this Hybrid case, the user could prefer to keep the original InternalIP, the one that works over the VLAN, attached to stretch cluster nodes.
The feature request then is to have a way to tell CPEM to skip over any public=false, InternalIP, overrides it would have done.
Ah, I didn't think about that part. If we update my earlier comment, then it looks like this:

- InstanceMetadata()
- InternalIP
- All is good

(new stuff in bold)

- InstanceMetadata()
- InternalIP, missing the VLAN IP

Is that it?
The problem, then, is how we get one of the following:
I don't know how we do the first; if the EQXM API doesn't know about it, how would CPEM discover it? We could get into all sorts of interesting logic using DaemonSets or other APIs, but that appears to be a bad path to walk down. We could tell CPEM about it, but as there is just one CPEM per cluster, we would need a big map of all of them.
For the second - which is what I think you were asking about - we could have some config that tells it to ignore private IPs. I don't know if that would create issues with elastic IPs and such, or multiple private IPs (which you can have). Would we ignore all private IPs found via EQXM API?
@deitch I think the solution is closer to the second path. I think the cluster already has the internal IP in the node IP address list before CPEM starts. CPEM currently overrides the IP Address list. If CPEM patched the existing list, adding only what it discovered from the EM API, then all of the desired and possible addresses would be available.
So, the questions I see are:
I think the first option is a good option and first step.
Yeah, I don't at all want to get into the business of teaching CPEM how to look elsewhere for addresses. It is meant to bridge between k8s and EQXM, and that is what it should do. Injecting additional addresses also becomes very messy, not to mention that it would not know where to find addresses for new nodes that were added. Which would bring us back into some 3rd API (after k8s and EQXM) that we would need to teach it to query. Not somewhere we want to go.
The signature of InstanceMetadata() is here:
InstanceMetadata(ctx context.Context, node *v1.Node) (*InstanceMetadata, error)
Since it is passed the v1.Node as a parameter, whatever was set beforehand can be augmented, not just replaced.
The docs would look like:
For each node, CPEM adds any private and public IPs found via the EQXM API to existing IPs, while ensuring no duplicates occur. If you have additional IPs that should be on the node about which the EQXM API is not aware, for example, in the context of a customer-managed VLAN, you should make sure these are on the node at initialization, and CPEM will augment them, without replacing them.
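A minimal sketch of the merge behaviour those docs describe, under the assumption that deduplication is on the full (Type, Address) pair; mergeAddresses is a hypothetical helper, not an existing CPEM function, and the stand-in NodeAddress type replaces the real v1.NodeAddress:

```go
package main

import "fmt"

// Minimal stand-in for v1.NodeAddress (the real code uses k8s.io/api/core/v1).
type NodeAddress struct {
	Type    string // "InternalIP" or "ExternalIP"
	Address string
}

// mergeAddresses appends EQXM-discovered addresses to whatever is already
// on the node, skipping any (Type, Address) pair that is already present,
// so existing addresses are augmented rather than replaced.
func mergeAddresses(existing, discovered []NodeAddress) []NodeAddress {
	seen := make(map[NodeAddress]bool)
	out := make([]NodeAddress, 0, len(existing)+len(discovered))
	for _, a := range existing {
		if !seen[a] {
			seen[a] = true
			out = append(out, a)
		}
	}
	for _, a := range discovered {
		if !seen[a] {
			seen[a] = true
			out = append(out, a)
		}
	}
	return out
}

func main() {
	// The VLAN IP was set on the node at initialization, outside EQXM's view.
	existing := []NodeAddress{{Type: "InternalIP", Address: "192.168.1.10"}}
	discovered := []NodeAddress{
		{Type: "InternalIP", Address: "10.2.3.4"},     // EQXM private IP
		{Type: "InternalIP", Address: "192.168.1.10"}, // duplicate, skipped
		{Type: "ExternalIP", Address: "203.0.113.10"}, // EQXM public IP
	}
	fmt.Println(mergeAddresses(existing, discovered))
}
```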
The question is, would this work? kubelet has an option called --node-ip whose description is:
IP address (or comma-separated dual-stack IP addresses) of the node. If unset, kubelet will use the node's default IPv4 address, if any, or its default IPv6 address if it has no IPv4 addresses. You can pass '::' to make it prefer the default IPv6 address rather than the default IPv4 address.
Is that the private (aka "Internal") IP?
@displague we have been dealing with this one for quite some time. I would like to get this closed out. Do you think the solution I proposed in the previous message is correct?
That approach sounds great, @deitch.
@displague I am no longer sure. The current behaviour actually should work correctly. Going to walk through it.
When cloud-provider calls InstancesV2.InstanceMetadata(), it gets the addresses. That happens here. It eventually uses that data to call updateNodeAddress():
cnc.updateNodeAddress(ctx, newNode, instanceMetadata)
Note that it passes the instanceMetadata and the existing node information.
The implementation of updateNodeAddress is here. Before applying those addresses, it checks whether the node already has --node-ip provided here, and then uses that address (if any), along with the instance metadata addresses, to construct the final address list, preferring the node-provided IP:
// If kubelet provided a node IP, prefer it in the node address list
nodeIP, err := getNodeProvidedIP(node)
...
if nodeIP != nil {
    nodeAddresses, err = cloudnodeutil.PreferNodeIP(nodeIP, nodeAddresses)
    if err != nil {
        klog.Errorf("Failed to update node addresses for node %q: %v", node.Name, err)
        return
    }
}
PreferNodeIP() is defined here, but the really relevant part is this comment (and its implementation):
// For every address supplied by the cloud provider that matches nodeIP, nodeIP is the enforced node address for
// that address Type (like InternalIP and ExternalIP), meaning other addresses of the same Type are discarded.
// See #61921 for more information: some cloud providers may supply secondary IPs, so nodeIP serves as a way to
// ensure that the correct IPs show up on a Node object.
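A simplified sketch of what that comment describes (this is not the real cloudnodeutil.PreferNodeIP, and the stand-in NodeAddress type replaces v1.NodeAddress): for every address Type that contains the node-provided IP, every other address of that Type is discarded, while Types that do not contain it pass through untouched.

```go
package main

import "fmt"

// Minimal stand-in for v1.NodeAddress.
type NodeAddress struct {
	Type    string
	Address string
}

// preferNodeIP enforces nodeIP for any address Type in which it appears:
// other addresses of that Type are dropped; other Types are unchanged.
func preferNodeIP(nodeIP string, addrs []NodeAddress) []NodeAddress {
	enforced := make(map[string]bool) // Types where nodeIP was found
	for _, a := range addrs {
		if a.Address == nodeIP {
			enforced[a.Type] = true
		}
	}
	out := make([]NodeAddress, 0, len(addrs))
	for _, a := range addrs {
		if !enforced[a.Type] || a.Address == nodeIP {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	addrs := []NodeAddress{
		{Type: "InternalIP", Address: "10.2.3.4"},
		{Type: "InternalIP", Address: "192.168.1.10"}, // the kubelet --node-ip
		{Type: "ExternalIP", Address: "203.0.113.10"},
	}
	// The secondary 10.2.3.4 InternalIP is dropped; the ExternalIP survives.
	fmt.Println(preferNodeIP("192.168.1.10", addrs))
}
```

Note the precondition this reveals: nodeIP only wins if the cloud provider's list contains it in the first place, which is exactly the sticking point when the VLAN IP is unknown to the EQXM API.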
So if we already are supplying --node-ip, the comment implies that it should override. I am going to open an issue there; see this issue.
@displague can you confirm that we actually are providing --node-ip?
@displague we should try and move this one out. Do you have any updates?
I wasn't aware that node-ip already had special treatment in the cloud-provider address list. Thanks for looking into that, @deitch.
I think https://github.com/kubernetes/cloud-provider/blob/5f25396ae5208d459715ab7642ec6b9a9144616e/node/helpers/address.go#L100-L103 confirms my suspicion that the node-ip is not preserved when the cloud-provider IP list is processed.
In the case of Equinix Metal, the user-supplied --node-ip should be given special treatment because the EM API is unaware of VLAN IP configurations. We have to trust that the user-supplied node-ip is valid and preferred.
Going back to the core of the issue. The problem definition was: given the use case of L2/VLAN, with a private IP unknown to EQXM, how do we get Kubernetes to use the private VLAN address as the InternalIP of the node?
In order for that to work, the CCM has to be told about the IP, yet it only has access to the v1.Node struct as provided by Kubernetes, and the EQXM API. This IP is not known to the EQXM API, so it has to come from Kubernetes itself prior to invoking the CCM. In any case, CCM is a Deployment with replicas=1, not a DaemonSet, so it has no way of investigating the local IPs on each node; we have no interest in creating a DaemonSet, and in any case it is difficult to think of a sane algorithm that would do so without causing more problems than it solves.
This then boils down to:

- How does this IP get told to Kubernetes, specifically the kubelet on the node?
- How does CPEM get told about this IP via Kubernetes such that the response would cause Kubernetes to make it the primary InternalIP? That would be one of:
  - kubernetes/cloud-provider already handles it such that it is the InternalIP no matter what CPEM does
  - kubernetes/cloud-provider passes it to CPEM, which needs to account for it in the response

I will run some experiments - unrelated to JuJu - to see what happens when you set --node-ip (need to think how to construct them without breaking anything; easiest probably is an extra private IP from the range).
At the same time, it would be helpful to know what the actual use case was that triggered this. Was it --node-ip? Is JuJu still relevant? Or are we after the generic, "let's get --node-ip working with CPEM"?
- How does this IP get told to Kubernetes, specifically the kubelet on the node?
Was it --node-ip? Is JuJu still relevant? Or are we after the generic, "let's get --node-ip working with CPEM"?
It looks like @thebsdbox provided the hint in the description. --external sets this in motion.
https://kubernetes.io/docs/reference/labels-annotations-taints/#alpha-kubernetes-io-provided-node-ip
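The annotation name on that page is alpha.kubernetes.io/provided-node-ip; a hypothetical sketch of how a controller could read it back off the Node object (getProvidedNodeIP is an invented helper, and the Node type here is a minimal stand-in for v1.Node metadata):

```go
package main

import "fmt"

// Minimal stand-in for the v1.Node object's metadata.
type Node struct {
	Annotations map[string]string
}

// Annotation name from the Kubernetes reference page linked above.
const providedNodeIPAnnotation = "alpha.kubernetes.io/provided-node-ip"

// getProvidedNodeIP recovers the kubelet's --node-ip, if the kubelet
// recorded one on the Node object.
func getProvidedNodeIP(node Node) (string, bool) {
	ip, ok := node.Annotations[providedNodeIPAnnotation]
	return ip, ok
}

func main() {
	node := Node{Annotations: map[string]string{
		providedNodeIPAnnotation: "252.6.0.1", // e.g. the fan address seen in this issue
	}}
	if ip, ok := getProvidedNodeIP(node); ok {
		fmt.Println("kubelet-provided node IP:", ip)
	}
}
```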
- How does CPEM get told about this IP via Kubernetes such that the response would cause Kubernetes to make it the primary InternalIP? That be one of:
I think your bullets are correct. I would imagine the v1.Node passed to the CCM pod (regardless of where or how it was running) could reflect any Node. I imagine this v1.Node object will already have the InternalIP present, matching what the Node object in Kubernetes shows before the CCM runs and changes it.
It is also worth keeping in mind that the user may have a public IP address (either Elastic, perhaps global) defined as the node-ip. These additional IPs will be available in the Equinix Metal API, but the one that was indicated as --node-ip should stay an InternalIP and not get moved into the ExternalIP list (which would be the behavior today, based on the Equinix Metal API private field).
It is also worth keeping in mind that the user may have a public IP address (either Elastic, perhaps global) defined as the node-ip.
Oh, let's make it more complicated.
These additional IPs will be available in the Equinix Metal API
Yes, but we have no way of knowing that it is intended for the following 3 nodes. It might be BGP announced; it might be attached using the EQXM API, but only attached to one of them now.
As far as my understanding of "external IP" in kubernetes goes, it should be an IP owned by the node, not a higher-level LB IP.
the one that was indicated as --node-ip should stay an InternalIP and not get moved into the ExternalIP list
Agreed. Going to take time to set up the right tests for this.
and not get moved into the ExternalIP list (which would be the behavior today, based on the Equinix Metal API private field).
They would?
Say I have 198.51.100.1 and 203.0.113.1 (both from RFC-5735) assigned to my device, one from provisioning and one from elastic assignment. The Equinix Metal API (EMA?) will return these IPs with {"public": true, ...}.
I may have chosen 198.51.100.1 to use as --node-ip, and so the v1.Node would include it in the InternalIP list.
CCM, as it is today, would reset the existing Internal and External IP lists, and then see these IPs as Public and assign them both to the External IP list.
The correct behavior, in this case, would be to keep the existing 198.51.100.1 as an InternalIP.
From a code change perspective, I think what we would do is ignore any addresses from EMA that are already found in the v1.NodeAddress list (regardless of Internal or External).
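A sketch of that proposed change, assuming the dedup key is the IP string alone, regardless of which list currently holds it (filterKnown is a hypothetical helper, not existing CPEM code, and NodeAddress stands in for v1.NodeAddress):

```go
package main

import "fmt"

// Minimal stand-in for v1.NodeAddress.
type NodeAddress struct {
	Type    string
	Address string
}

// filterKnown drops any EMA-discovered address whose IP already appears
// on the node, regardless of Internal/External type, so an IP the user
// chose as --node-ip (InternalIP) cannot be re-added as an ExternalIP.
func filterKnown(existing, discovered []NodeAddress) []NodeAddress {
	known := make(map[string]bool)
	for _, a := range existing {
		known[a.Address] = true
	}
	out := make([]NodeAddress, 0, len(discovered))
	for _, a := range discovered {
		if !known[a.Address] {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	// The user's --node-ip is a public (per EMA) address used as InternalIP.
	existing := []NodeAddress{{Type: "InternalIP", Address: "198.51.100.1"}}
	discovered := []NodeAddress{
		{Type: "ExternalIP", Address: "198.51.100.1"}, // EMA sees it as public; dropped
		{Type: "ExternalIP", Address: "203.0.113.1"},
	}
	fmt.Println(filterKnown(existing, discovered))
}
```

This matches the scenario above: 198.51.100.1 stays an InternalIP because the EMA-reported public entry for it is filtered out before the lists are updated.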
I believe this finally is fixed via #297
So AFAICT a Juju Kubernetes deployment uses a fan network (https://juju.is/docs/olm/fan-container-networking) which appears to be an additional network on top of the existing network to simplify address usage.
It appears that the CCM and/or Kubernetes Controller code is having trouble reconciling the FAN network address with that of the node itself?
The main error being
failed to find kubelet node IP from cloud provider
If we look at the nodes:
It looks like alpha.kubernetes.io/provided-node-ip: 252.6.0.1 is being compared with what is expected from the EM API?
At the moment the Juju clusters are stuck as they're all tainted as unready and the CCM/Controller code can't proceed.
For testing I removed the annotation and the following happened: