khappucino commented 5 years ago

Feature Request

We would like the ability to detect when nodes have valid cni plugin configurations before launching pods on that node.

What problem are you trying to solve?

Definitions:

.conf is a file that is used when you only have 1 cni plugin to apply
.conflist is a file that is used when you are chaining multiple cni plugins
The CNI system will always choose the first lexicographical ordered .conf/.conflist file and ignore the rest

Conditions: Two CNI plugins (one of them is the linkerd-cni plugin) and a launching Pod that wants to be meshed. Requirement: We would like Linkerd enabled Pods to have TLS communication Situation: We have told our service teams to communicate in plain text since Linkerd will take care of in transit TLS to and from meshed pods

The time when CNI pods run and write their files to disk is not deterministic.

Scenario 1 - Ideal Situation (Pods launched and meshed)

1) A Calico daemonset is deployed and has successfully written its .conflist to the host machine. note On the target host, no pods that require Linkerd CNI plugin have been launched 2) A Linkerd-CNI daemonset is deployed and has successfully added it's configuration to the existing .conflist on the host machine. 3) A Pod that we want the Linkerd proxy on is launched. 4) The CNI plugin chain loads the combined .conflist on the host machine. 5) The CNI chain completes successfully 6) The Pod is launched with an ipAddress (from Calico) and iptables (from Linkerd-CNI)

Scenario 2 - Non-Ideal Situation (Pods not launching)

1) A Linkerd-CNI daemonset is deployed and has successfully written it's .conf file to the host machine. 2) A Calico daemonset is deployed and has successfully written its .conflist to the host machine. note On the target host, no pods that require Linkerd CNI plugin have been launched 3) There are now two cni files on the host machine:

01-linkerd-cni.conf
10-calico.conflist 4) A Pod that we want the Linkerd proxy on is launched. 5) The CNI plugin chain loads the Linkerd-CNI .conf file (because it is first in lexicographical order) 6) The CNI chain fails since the podIps are never assigned (because the calico plugin did not run) 7) The Pod is stuck in a container-creating state =( 8) There is not a deterministic way to recover the situation on the node.
Killing daemonset pods do not guarantee the order that they write their .conf/.conflist files to the host 9) ALL new pods on this Node will get stuck in container-creating state
They may get stuck for different reasons

Scenario 3 - Scary-Non-Ideal Situation (Pods launching but with no Linkerd proxy iptables)

1) A Calico daemonset is deployed and has successfully written its conflist to the host machine. note On the target host, no pods that require Linkerd CNI plugin have been launched 2) A Linkerd-CNI daemonset is deployed BUT it has NOT written it's .conf file to the host machine yet. 3) There are is now only one cni file on the host machine:

10-calico.conflist 4) A Pod that we want the Linkerd proxy on is launched. 5) The CNI plugin chain loads the Calico .conflist file (because it is the only .conflist on the host) 6) The CNI chain completes successfully 7) The Pod is launched with an ipAddress (from Calico) BUT there are NO iptables (from Linkerd-CNI)
Any pods that launch before the Linkerd-CNI daemonset updates the .conflist file are going to be missing their iptables 8) The Linkerd-CNI daemonset finally updates the existing .conflist and creates a valid cni chain 9) Launch another Pod 10) This pod will have the correct ipAddress and iptables applied since we have a valid .conflist now.
Any pods now launched will have the ipAddress and iptables applied 11) Now we are in a scenario where we have some ACTIVE pods that do not have iptables written.
The intent is for ALL of the pods is to have the Linkerd proxy on them to apply TLS

This problem is specific to using CNI plugins to apply iptables.

It seems that these cni pods go ready before they actually write their config files to the host machines. Different cni plugins will write to the host in non-deterministic order. This would normally not be a problem.

It becomes a problem when different cni plugins do not correctly incorporate previously existing conf and conflists files (from other CNI plugins).

For example if we launch the Linkerd CNI pod and it writes to the host first, we have noticed that the Calico CNI plugin seems to ignore the existing .conf file. Both files will exist on disk but the CNI system will choose the Linkerd CNI config only. This results in a stalled Pod since there are no ipAddresses applied (Since the Calico CNI plugin gets skipped)

There can be a race condition between when the Linkerd-CNI plugin writes it's .conf file and when pods get deployed on the node. The result is that the CNI plugin chain will complete (depending on the order) without error and the pod launches. The downside is that even though we have a valid Linkerd CNI plugin installed, we did not actually write the iptables to the newly launched pod. This is problematic because if you have two meshed services that are expecting to communicate over TLS, you could end up speaking plaintext over the channel (Since the iptables were never written all inbound request end up routing to the container's listening ports and not the Linkerd proxy). I don't think that currently you wouldn't be able to detect this scenario without explicitly checking what services are meshed.

Any time there is a new Node deployed in the cluster this RACE condition between CNI plugins and pods spawning will occur.

It would be great to have the ability to detect situations like this and deny or kill existing deployments of those pods if they are not configured correctly.

How should the problem be solved?

It would be great if all CNI plugins actually took other CNI plugins into account. Even if this was true we believe that the race condition that occurs when new Nodes spin up will still be a thing.

It may make sense to have a daemon set that can monitor the pods on each node for correct iptables (assuming they are configured to be meshed). These pods would need CAP_NET_ADMIN in order to directly detect if the ip tables are written correctly.

What do you want to happen? Add any considered drawbacks.

To rule out multiple the CNI plugin race condition Cluster Administrators would manually insure that there is only 1 custom .conflist file deployed. This is not a general solution though. ~The issue is that there may still be a race condition between when pods spin up on a new Node and when the CNI plugin gets written to disk.~

It would be great in general if we could reject any pods that come up in this scary configuration.

Pods that spin up in the no iptables configuration should self heal if we bounce them since eventually the Linkerd-CNI plugin will update the correct .conflist on the host.
The scenario where the Linkerd-CNI plugin launches first is really problematic since no pods will work on the node.

Any alternatives you've considered?

I'm out of ideas for alternative solutions, the idea of using Admission controllers however this is something that manifests during the pod bootup at runtime.

Is there another way to solve this problem that isn't as good a solution? We could also log when the Linkerd CNI plugin actually gets invoked and make sure they correlate with meshed pod launches. This seems problematic for many reasons.

How would users interact with this feature?

It would be nice if pods failed to launch or are terminated after it has been determined to be in a bad configuration.

If you can, explain how users will be able to use this. Maybe some sample CLI output? It would be great if it was transparent.

grampelberg commented 5 years ago

Some thoughts, without any deep understanding of CNI ordering that I'll need to go research some more.

Scenario 1

In this instance, the kubelet has been configured with --network-plugin=cni. The daemonset starts up, looks at the config and does an in-place update. That sounds right to me. We should have a check that verifies the config has been written successfully on all the instances. Feels like this should be a readiness check on the daemonset pod itself (exec not http).

Scenario 2

In this instance, the kubelet is originally configured with --network-plugin=kubenet. The daemonset starts up, looks for a config, doesn't find one and writes its own. This feels like a weird use case as the daemonset won't actually do something in this situation. My proposal would be to do nothing and have the readiness check fail. check would error out and you'd get some warnings that you're trying to do something your cluster doesn't support.

Scenario 3

This is the scenario that scares me the most. Ignoring any timing issues re. calico, it is entirely possible for someone to inject and skip the sidecar. They'll then be totally oblivious to the lack of things like metrics and TLS. We actually ran into this previously when the initContainer would fail silently (since fixed).

Ideally, in this situation, inject would error out and explain that your config is invalid. @klingerf you're tackling the config/inject updates. What do you think?

In the non-ideal situation, that should be planned for anyways, there's either a check for this situation or an audit that explains whether the iptables rules are setup correctly.

codeman9 commented 5 years ago

Scenario 2, as described by @khappucino , is also assuming --network-plugin=cni, just that the "real" cni plugin that assigns IP addresses to pods hasn't been installed yet.

grampelberg commented 5 years ago

@codeman9 pods start up even though there's no config?

codeman9 commented 5 years ago

@grampelberg the cluster in scenario 2 is started up: 1) without a cni plugin configuration or, 2) the configuration has been deleted somehow or, 3) the kubelet has been restarted to point to a different cni configuration directory that is empty.

khappucino commented 5 years ago

One possible option may include cluster admins constructing a Node image that contains a single composite config file with all of the cni plugins in the list. This image would also have the relevant binaries to support the cni-plugins that are called out in the composite config file. This way there is no uncertainty with respect to what cni configuration is on any given node. Each node will boot up with the complete set of config file and bin files.

A more flexible alternative may be to have a single CNI daemonset that encapsulates all of the separate cni dependencies at once. ie. One daemonset pod that writes the combined config file and the associated binary files.

Another possibility is marking newly spawned nodes as tainted until we can determine if the node has the correct configuration deployed (also setting pod deployment priority even though it is best effort).

khappucino commented 5 years ago

We got around this issue by baking the cni binaries (linkerd2-cni and flannel) into the node VM base images. This removed any uncertainty of component launch order and availability of conf/binary files.

alex-berger commented 4 years ago

Actually, this is a real problem out in the wild 😞, especially when using managed kubernets cluster nodes (e.g. AWS EKS managed node groups) resp. if you don't want to maintain your own VM base images. On EKS there is a daemonset called aws-node which is the AWS VPC CNI Plugin, and that plugin might override the changes that linkerd-cni plugin wrote to the .conflist file. Note, IMHO this is clearly a serious bug and definitively not a feature request. It is not acceptable for production clusters that such "configuration race conditions" exist as it is really hard and time consuming to figure out why some pods in the mesh are not working just because some daemonset's pods have been started in different orders.

frimik commented 2 years ago

We got around this issue by baking the cni binaries (linkerd2-cni and flannel) into the node VM base images. This removed any uncertainty of component launch order and availability of conf/binary files.

We did the same, but alot of time has passed so I was now revisiting this and in the progress of upgrading all components - My hope was that the chaining issues had been fixed in both aws-cni and linkerd-cni to move towards using upstream Helm installs of both CNI components but no luck.

The aws-vpc-cni-k8s issue is not really solved, as described

frimik commented 2 years ago

So, after digging deeper.. and finding that the only good attempts at chaining properly are in the edge helm charts, so I tried linkerd2-cni edge-22.6.2;

But for aws-vpc + linkerd2-cni to break, the only thing that needs to happen is for the aws-node pod to restart after the linkerd-cni pod has started.

aws-node starts..
linkerd2-cni starts ...
(everything works)
aws-node restarts (writes a new 10-aws.conflist file ...)

Installing CNI configuration in "chained" mode for /host/etc/cni/net.d/10-aws.conflist
Created CNI config /host/etc/cni/net.d/10-aws.conflist
Setting up watches.
Watches established.
# all fine ...
# aws-node starting again, recreating its' own 10-aws.conflist
Detected change in /host/etc/cni/net.d/: DELETE 10-aws.conflist
Detected change in /host/etc/cni/net.d/: CREATE 10-aws.conflist
# linkerd2-cni ignoring ...
Ignoring event: CREATE /host/etc/cni/net.d/10-aws.conflist; no real changes detected

Quick fix is do filter DELETE action in https://github.com/linkerd/linkerd2/blob/main/cni-plugin/deployment/scripts/install-cni.sh#L311

Otherwise the sha is calculated on the DELETE event, because the DELETE+CREATE are so immediate that the file will always be instantly recreated before this step is reached:

        if [ "$action" != "DELETE" -a -e "$directory/$filename" ]; then

mateiidavid commented 2 years ago

@frimik thanks for the report! We appreciate people testing out the new changes. Slight oversight on my side for missing that the file is immediately created. I put up a fix here: https://github.com/linkerd/linkerd2/pull/8778

mateiidavid commented 1 year ago

Hi folks. This issue has been around for a very long time. We've also had similar reports filed in numerous other issues. For stable-2.13.0, the CNI plugin will handle race conditions more gracefully and efficiently. We're still looking into ways on how this can be improved, we will likely open up new issues though. I've given a more in-depth explanation of our reasoning in https://github.com/linkerd/linkerd2/issues/8120

A final update. We're on track to release stable-2.13.0. In my previous update, I mentioned that we've introduced two new mechanisms to deal with race conditions for CNI plugins:

a file watcher (to ensure linkerd's cni plugin can be installed successfully when other plugins exist in the cluster)

a network validator (to ensure iptables has been set-up correctly in a network namespace).

For now, these two mechanisms should be enough to significantly reduce any race conditions that arise when nodes are rebooted or restarted. While the network validator will not restart pods, it will block any misconfigured pod from running; moving the failure point at a different layer will allow operators or engineers to diagnose issues better and to take manual intervention when required.

While restarting misconfigured pods through a controller was part of the original scope of this issue, it won't be part of the stable release. A new controller needs a bit more thought and design before being introduced. We'll likely open up a separate issue to discuss the approach and design.

I will be closing this issue for now. As of stable-2.13.0 the handling of race conditions will be improved. After we trial out the network validator and get more feedback, we'll figure out where the automation fits in. Hopefully by then the Kubernetes community will have also figured out a better way to bootstrap CNIs before node provisioning.

Any questions, concerns, or feedback, feel free to open up a new issue or ping me in an existing one.

Thanks!

linkerd / linkerd2

Detecting misbehaving CNI plugin configurations at runtime #2219

Feature Request

What problem are you trying to solve?

Scenario 1 - Ideal Situation (Pods launched and meshed)

Scenario 2 - Non-Ideal Situation (Pods not launching)

Scenario 3 - Scary-Non-Ideal Situation (Pods launching but with no Linkerd proxy iptables)

How should the problem be solved?

What do you want to happen? Add any considered drawbacks.

Any alternatives you've considered?

How would users interact with this feature?

Scenario 1

Scenario 2

Scenario 3