linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Better documentation on the expectations for `networkValidator.connectAddr` for `linkerd-control-plane` helm chart values #12797

Open djryanj opened 3 days ago

djryanj commented 3 days ago

What is the issue?

When deploying Linkerd with Helm into an air-gapped or otherwise restricted environment, the default value of 1.1.1.1:20001 for networkValidator.connectAddr (see: https://github.com/linkerd/linkerd2/blob/main/charts/linkerd-control-plane/values.yaml#L327) causes the init container to fail. The note in the chart says:

we expect this to be rewritten

But there is no documentation anywhere that I can find on what it is expected to be rewritten to, or what the expectations of that endpoint are.

For example, is it a simple TCP connection that only needs to complete a three-way handshake (I presume so)? Would an HTTPS web server fit the bill?

As an aside, the fact that 1.1.1.1:20001 actually works is weird, because documentation on what's running there is, again, difficult to find.

Also, this is sort of (but not really) related to https://linkerd.buoyant.io/t/error-linkerd-network-validator-in-air-gapped-installation/302 and was also a factor in my troubleshooting of https://github.com/linkerd/linkerd2/issues/7945.

How can it be reproduced?

Logs, error output, etc

2024-07-02T20:03:06.514474Z  INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2024-07-02T20:03:06.514493Z DEBUG linkerd_network_validator: token="<redacted>\n"
2024-07-02T20:03:06.514500Z  INFO linkerd_network_validator: Connecting to 1.1.1.1:20001
2024-07-02T20:03:06.514929Z DEBUG connect: linkerd_network_validator: Connected client.addr=10.244.1.51:34290
2024-07-02T20:03:16.515844Z ERROR linkerd_network_validator: Failed to validate networking configuration. Please ensure iptables rules are rewriting traffic as expected. timeout=10s

Observe firewall logs blocking this.

output of linkerd check -o short

N/A

Environment

Possible solution

Additional context

No response

Would you like to work on fixing this bug?

None

djryanj commented 1 day ago

So, I was wrong; the default value is fine. If linkerd-cni is working correctly, the connection to that address is intercepted and essentially reflected back to the container, which indicates that iptables is set up correctly. In other words, the "Please ensure iptables rules are rewriting traffic as expected." error message is actually correct.
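To make the mechanism concrete, here is a minimal Python sketch of that kind of check. It is not the validator's actual implementation: the function name `validate_redirect`, its parameters, and the exact token exchange are assumptions, loosely inferred from the log lines above (a listener, a token, an outbound connection, and a timeout). The idea is that the process listens locally, connects to an arbitrary address, and only passes if its own listener answers with the expected token, which can only happen when the traffic is being rewritten.

```python
import secrets
import socket
import threading


def validate_redirect(connect_addr, listen_addr=("127.0.0.1", 0), timeout=10.0):
    """Return True if a connection to connect_addr is answered by our own listener.

    Hypothetical sketch: the token exchange is an assumption, not Linkerd's code.
    """
    token = secrets.token_hex(16).encode() + b"\n"

    # Local listener that hands the token to the first client that reaches it.
    server = socket.create_server(listen_addr)
    server.settimeout(timeout)

    def serve_once():
        try:
            conn, _ = server.accept()
        except OSError:
            return  # nothing was redirected to us before the timeout or close
        with conn:
            conn.sendall(token)

    threading.Thread(target=serve_once, daemon=True).start()

    try:
        # If redirect rules are in place, this connection lands on the listener
        # above no matter which address was dialed.
        with socket.create_connection(connect_addr, timeout=timeout) as client:
            received = b""
            while len(received) < len(token):
                chunk = client.recv(len(token) - len(received))
                if not chunk:
                    break
                received += chunk
        return received == token
    except OSError:
        # Refused, unreachable, or timed out: traffic was not reflected back.
        return False
    finally:
        server.close()
```

Outside a meshed pod there are no redirect rules, so the only way to see the success path is to point `connect_addr` at the listener's own address, which simulates what the rewrite would do. Under this model, what actually runs at the dialed address does not matter when redirection works, because the connection never gets there.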

In my case, the root cause was that Cilium was my primary CNI and I had left cni.exclusive=true (the default) in the Cilium helm chart. That setting was removing all additional CNI configurations (e.g., linkerd-cni), so linkerd-cni wasn't working at all.

However, I still think a case can be made to either remove this variable from the helm chart (in what cases is it needed?) or improve the documentation so that id-ten-ts like me don't get lost.