wamak9 opened 3 weeks ago
Hey @wamak9,
Going to walk through the errors one-by-one to explain what they mean.
1. linkerd enabled pod logs:
2024-06-06T14:03:42.566024937Z [ 61835.932270s] INFO ThreadId(01) inbound: linkerd_app_core::serve: Connection closed error=direct connections must be mutually authenticated error.sources=[direct connections must be mutually authenticated] client.addr=10.109.192.187:34180 server.addr=10.109.194.85:4143
2024-06-06T14:03:43.072400780Z [ 61836.438591s] INFO ThreadId(01) inbound: linkerd_app_core::serve: Connection closed error=direct connections must be mutually authenticated error.sources=[direct connections must be mutually authenticated] client.addr=10.109.194.225:43040 server.addr=10.109.194.85:4143
This indicates that the client of this service (10.109.192.187:34180) is not mTLS'd but is trying to connect directly to the proxy's inbound port 4143. Only another mTLS participant can initiate connections to 4143. Typically, we might see this if the server is marked as opaque or is a multicluster gateway. If it is opaque, then the client is either not injected with a proxy (in which case it shouldn't be sending directly to the proxy's inbound port anyway) or misconfigured. It's hard to tell without more information.
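If it helps, one way to narrow this down is to map the client IP from the log back to a pod and check whether it actually carries a proxy (a rough sketch; pod and namespace are placeholders, and the IP may have been reassigned since the log was written):

```sh
# find the pod behind the client IP from the log line
kubectl get pods -A -o wide | grep 10.109.192.187
# then list its containers; an injected pod includes "linkerd-proxy"
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[*].name}'
```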
2. linkerd destination logs:
2024-06-06T13:02:43.467367168Z time="2024-06-06T13:02:43Z" level=error msg="failed to find LINKERD2_PROXY_INBOUND_LISTEN_ADDR environment variable in any container for given pod spec" addr=":8086" component=endpoint-profile-translator context-ns=retail context-pod=tapi-api-7c47bcf658-bcpqb remote="10.109.194.212:47462
An endpoint is marked as injected (it probably has a control plane label) but it is not. Something went wrong: either the pod received the proxy with incomplete configuration, or the pod has been improperly annotated. This is not a typical occurrence. Are you using any other features, such as native sidecars?
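To tell which case you're in, you could check whether that pod actually has the proxy container and the environment variable the destination controller is looking for (a sketch, using the pod and namespace from your log line above):

```sh
# list the pod's containers; an injected pod includes "linkerd-proxy"
kubectl get pod tapi-api-7c47bcf658-bcpqb -n retail \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\n"}{end}'
# check whether the env var the destination controller expects is present
kubectl get pod tapi-api-7c47bcf658-bcpqb -n retail -o yaml \
  | grep -A1 LINKERD2_PROXY_INBOUND_LISTEN_ADDR
```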
3. linkerd proxy logs: these logs are benign. They're from the destination service's proxy; it simply signals that it cannot establish a connection because the socket hasn't been bound yet. The proxy starts before the destination container has warmed up, so these only happen within the first 10-20s of the proxy's lifetime.
4. policy controller logs: seems like the API server's connection is a bit wonky? I'm not sure this is necessarily related to what you're seeing. Is that a common occurrence, or does it only happen in a limited time interval?
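To tell whether it recurs, something like this works (a sketch; assumes the default layout where the policy controller runs as the policy container in the linkerd-destination pod):

```sh
# count error lines from the policy controller over the last day
kubectl logs -n linkerd deploy/linkerd-destination -c policy --since=24h \
  | grep -ci error
```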
I think it's helpful to isolate the logs that are relevant here, and that's probably the first set of logs. errno 111 simply means a connection cannot be established; typically, it's because the socket isn't listening. Sometimes you might see these out in the wild, but if it's not recurring, or if it doesn't directly impact your traffic and services, it's safe to ignore.
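If you want to rule out the application side, you can check whether anything is actually bound on the expected port (a sketch; assumes your app image ships ss, and uses port 8086 from the destination logs above):

```sh
# inside the pod (replace <app-container> with your app container's name)
kubectl exec -n retail tapi-api-7c47bcf658-bcpqb -c <app-container> -- ss -tln
# or from your machine, via a port-forward
kubectl port-forward -n retail tapi-api-7c47bcf658-bcpqb 8086:8086 &
curl -v http://localhost:8086/
```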
From your original description:
> IPs seen in the logs are related to self-hosted Prometheus.
Where exactly? The one not being mutually authenticated, or...
For the connection refused in your linkerd enabled pod, is the server listening? Can you confirm that? It would also be useful to know what is supposed to happen in that pod.
annotations:
  config.linkerd.io/default-inbound-policy: all-unauthenticated
  config.linkerd.io/image-pull-policy: Always
  config.linkerd.io/proxy-outbound-connect-timeout: "5"
On your self-hosted Prometheus? Is it injected or not? If it's not, then the annotation won't have anything to configure. Can you run kubectl get pods so we can see what your setup looks like?
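Something like this would give a quick overview (a sketch; injected pods carry the linkerd.io/control-plane-ns label, so the first command counts meshed pods):

```sh
# count meshed pods across the cluster
kubectl get pods -A -l linkerd.io/control-plane-ns --no-headers | wc -l
# and the pods in the namespace where you're seeing the errors (placeholder ns)
kubectl get pods -n <your-app-namespace> -o wide
```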
> On your self-hosted Prometheus? Is it injected or not? If it's not, then the annotation won't have anything to configure. Can you run kubectl get pods so we can see what your setup looks like?
My self-hosted Prometheus is running on the same cluster, and Linkerd is not injected on it. There are 300 pods in total with Linkerd injected, and we see these errors on almost every single one of them.
k get pods -n linkerd
NAME READY STATUS RESTARTS AGE
linkerd-destination-7fbd959544-f6kpk 4/4 Running 0 7d4h
linkerd-identity-7cf66888f7-bb68b 2/2 Running 0 7d12h
linkerd-proxy-injector-64c5976b47-h7hg7 2/2 Running 0 7d20h
k get pods -n linkerd-viz
NAME READY STATUS RESTARTS AGE
metrics-api-55c8db4654-zsvt2 2/2 Running 2 (7d4h ago) 7d4h
prometheus-68ddcbf849-x4kxs 2/2 Running 0 7d4h
tap-5c6788c47d-pz5qm 2/2 Running 2 (7d4h ago) 7d4h
tap-injector-674d8d486f-mq4zx 2/2 Running 0 7d4h
web-5f8cd8d88f-xbnq6 2/2 Running 0 7d4h
> For the connection refused in your linkerd enabled pod, is the server listening? Can you confirm that? It would also be useful to know what is supposed to happen in that pod.
I am not sure what this means. When you say "server listening", are we talking about the proxy and init containers?
klogs e-yw0qc1c3x6kl5n1ifx-78854d4c58gcnxx -n default -c linkerd-init
2024-06-12T06:35:56.821771350Z time="2024-06-12T06:35:56Z" level=info msg="/sbin/iptables-save -t nat"
2024-06-12T06:35:56.918594899Z time="2024-06-12T06:35:56Z" level=info msg="# Generated by iptables-save v1.8.8 on Wed Jun 12 06:35:56 2024\n*nat\n:PREROUTING ACCEPT [0:0]\n:INPUT ACCEPT [0:0]\n:OUTPUT ACCEPT [0:0]\n:POSTROUTING ACCEPT [0:0]\nCOMMIT\n# Completed on Wed Jun 12 06:35:56 2024\n"
2024-06-12T06:35:56.918749699Z time="2024-06-12T06:35:56Z" level=info msg="/sbin/iptables -t nat -N PROXY_INIT_REDIRECT"
2024-06-12T06:35:56.920661900Z time="2024-06-12T06:35:56Z" level=info msg="/sbin/iptables -t nat -A PROXY_INIT_REDIRECT -p tcp --match multiport --dports 4190,4191,8090 -j RETURN -m comment --comment proxy-init/ignore-port-4190,4191,8090/1718174156"
2024-06-12T06:35:56.922512901Z time="2024-06-12T06:35:56Z" level=info msg="/sbin/iptables -t nat -A PROXY_INIT_REDIRECT -p tcp -j REDIRECT --to-port 4143 -m comment --comment proxy-init/redirect-all-incoming-to-proxy-port/1718174156"
2024-06-12T06:35:57.017541950Z time="2024-06-12T06:35:57Z" level=info msg="/sbin/iptables -t nat -A PREROUTING -j PROXY_INIT_REDIRECT -m comment --comment proxy-init/install-proxy-init-prerouting/1718174156"
2024-06-12T06:35:57.019529751Z time="2024-06-12T06:35:57Z" level=info msg="/sbin/iptables -t nat -N PROXY_INIT_OUTPUT"
2024-06-12T06:35:57.021036252Z time="2024-06-12T06:35:57Z" level=info msg="/sbin/iptables -t nat -A PROXY_INIT_OUTPUT -m owner --uid-owner 2102 -j RETURN -m comment --comment proxy-init/ignore-proxy-user-id/1718174156"
2024-06-12T06:35:57.022927253Z time="2024-06-12T06:35:57Z" level=info msg="/sbin/iptables -t nat -A PROXY_INIT_OUTPUT -o lo -j RETURN -m comment --comment proxy-init/ignore-loopback/1718174156"
2024-06-12T06:35:57.025502354Z time="2024-06-12T06:35:57Z" level=info msg="/sbin/iptables -t nat -A PROXY_INIT_OUTPUT -p tcp --match multiport --dports 8090 -j RETURN -m comment --comment proxy-init/ignore-port-8090/1718174156"
2024-06-12T06:35:57.118326702Z time="2024-06-12T06:35:57Z" level=info msg="/sbin/iptables -t nat -A PROXY_INIT_OUTPUT -p tcp -j REDIRECT --to-port 4140 -m comment --comment proxy-init/redirect-all-outgoing-to-proxy-port/1718174156"
2024-06-12T06:35:57.120459303Z time="2024-06-12T06:35:57Z" level=info msg="/sbin/iptables -t nat -A OUTPUT -j PROXY_INIT_OUTPUT -m comment --comment proxy-init/install-proxy-init-output/1718174156"
2024-06-12T06:35:57.122458304Z time="2024-06-12T06:35:57Z" level=info msg="/sbin/iptables-save -t nat"
2024-06-12T06:35:57.218168653Z time="2024-06-12T06:35:57Z" level=info msg="# Generated by iptables-save v1.8.8 on Wed Jun 12 06:35:57 2024\n*nat\n:PREROUTING ACCEPT [0:0]\n:INPUT ACCEPT [0:0]\n:OUTPUT ACCEPT [0:0]\n:POSTROUTING ACCEPT [0:0]\n:PROXY_INIT_OUTPUT - [0:0]\n:PROXY_INIT_REDIRECT - [0:0]\n-A PREROUTING -m comment --comment \"proxy-init/install-proxy-init-prerouting/1718174156\" -j PROXY_INIT_REDIRECT\n-A OUTPUT -m comment --comment \"proxy-init/install-proxy-init-output/1718174156\" -j PROXY_INIT_OUTPUT\n-A PROXY_INIT_OUTPUT -m owner --uid-owner 2102 -m comment --comment \"proxy-init/ignore-proxy-user-id/1718174156\" -j RETURN\n-A PROXY_INIT_OUTPUT -o lo -m comment --comment \"proxy-init/ignore-loopback/1718174156\" -j RETURN\n-A PROXY_INIT_OUTPUT -p tcp -m multiport --dports 8090 -m comment --comment \"proxy-init/ignore-port-8090/1718174156\" -j RETURN\n-A PROXY_INIT_OUTPUT -p tcp -m comment --comment \"proxy-init/redirect-all-outgoing-to-proxy-port/1718174156\" -j REDIRECT --to-ports 4140\n-A PROXY_INIT_REDIRECT -p tcp -m multiport --dports 4190,4191,8090 -m comment --comment \"proxy-init/ignore-port-4190,4191,8090/1718174156\" -j RETURN\n-A PROXY_INIT_REDIRECT -p tcp -m comment --comment \"proxy-init/redirect-all-incoming-to-proxy-port/1718174156\" -j REDIRECT --to-ports 4143\nCOMMIT\n# Completed on Wed Jun 12 06:35:57 2024\n"
> This is not a typical occurrence. Are you using any other features such as native sidecars?

I am not aware of such a thing.
What is the issue?
I keep seeing the os error 111 issue on Linkerd and have no idea how to fix it. More logs below. The IPs seen in the logs are related to self-hosted Prometheus. There are two sets of Prometheus running: one is from Linkerd and one is self-hosted by us. I tried adding a bunch of annotations.
How can it be reproduced?
The version currently running is stable-2.14.9. Enable Linkerd on one of the namespaces and then deploy. Deploy Prometheus in a different namespace with no Linkerd annotation.
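A minimal version of that setup looks roughly like this (namespace names are placeholders; linkerd.io/inject=enabled is the standard injection annotation):

```sh
# mesh the application namespace and restart pods so they get injected
kubectl annotate namespace retail linkerd.io/inject=enabled
kubectl -n retail rollout restart deploy
# self-hosted Prometheus lives in a separate, un-annotated namespace
kubectl create namespace monitoring
```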
Logs, error output, etc
Linkerd enabled POD logs.
linkerd-destination logs
linkerd Proxy logs
Linkerd Policy Logs
output of linkerd check -o short
Environment
Possible solution
N/A
Additional context
So, I tried adding Linkerd to the Monitoring namespace, which is running Prometheus. But that did not help, and I am seeing way too many error logs in Prometheus with the same OS error.
I did add an authorization policy so our self-installed Prometheus can scrape metrics.
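A policy of roughly this shape is what lets an unmeshed scraper through (an illustrative sketch, not the exact manifest applied here; the label selector is assumed, and port 8086 is taken from the destination logs above):

```sh
kubectl apply -f - <<'EOF'
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: tapi-api-metrics
  namespace: retail
spec:
  podSelector:
    matchLabels:
      app: tapi-api            # hypothetical label
  port: 8086                   # metrics port from the destination logs
  proxyProtocol: HTTP/1
---
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  name: tapi-api-metrics-unauthed
  namespace: retail
spec:
  server:
    name: tapi-api-metrics
  client:
    unauthenticated: true      # allow non-mTLS clients like an unmeshed Prometheus
EOF
```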
Would you like to work on fixing this bug?
maybe