Closed klingerf closed 1 month ago
@mateiidavid Thanks for the feedback here! And sorry for the delay. I just tried to replicate and got the same results. But then I realized that the config.linkerd.io/debug-image-version annotation isn't being honored.
I have an existing cluster that's running edge-24.7.5. For one of my deployments, I added these annotations:
annotations:
  config.linkerd.io/debug-image-version: git-b4aa3532
  config.linkerd.io/enable-debug-sidecar: "true"
  linkerd.io/inject: enabled
But when I look at the running pod, I see:
- image: cr.l5d.io/linkerd/debug:edge-24.7.5
  imagePullPolicy: IfNotPresent
  livenessProbe:
    exec:
      command:
      - "true"
    failureThreshold: 3
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
  name: linkerd-debug
So that explains why it works in an existing cluster -- we're not actually running it there :-/
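A quick way to confirm which debug image a pod is actually running is to compare the image tag against the value set in the annotation. The following is a minimal sketch; the helper names `image_tag` and `check_debug_image` are made up for illustration, and it assumes the image reference ends in a plain `:tag` (no digest).

```shell
#!/bin/sh
# Hypothetical helpers, not part of Linkerd or kubectl.

# image_tag IMAGE: print the tag portion of a reference like repo/name:tag.
# Assumes the reference ends in ":tag" (no "@sha256:..." digest).
image_tag() {
  echo "${1##*:}"
}

# check_debug_image IMAGE EXPECTED_TAG: report whether the running tag
# matches the tag requested via the debug-image-version annotation.
check_debug_image() {
  actual=$(image_tag "$1")
  if [ "$actual" = "$2" ]; then
    echo "ok: running $actual"
  else
    echo "mismatch: wanted $2, running $actual"
  fi
}

# Values taken from the pod spec and annotation shown above.
check_debug_image "cr.l5d.io/linkerd/debug:edge-24.7.5" "git-b4aa3532"
```

In a live cluster the image string could come from something like `kubectl get pod <pod> -o jsonpath='{.spec.containers[?(@.name=="linkerd-debug")].image}'`, where `<pod>` is a placeholder for the pod name.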
And when I do actually run it, it fails:
$ kubectl logs deploy/hello linkerd-debug
Capturing on 'any'
tshark: Couldn't run dumpcap in child process: Operation not permitted
0 packets captured
It looks like that container is missing the NET_ADMIN and NET_RAW capabilities. If I add this to the container definition, it works:
securityContext:
  capabilities:
    add:
    - NET_ADMIN
    - NET_RAW
Which... makes sense? Those are the same capabilities that we give to proxy-init. I'm not surprised that tshark needs them as well.
I can update the debug container template to include those capabilities, but I wonder why the previous version of this image didn't need them? I can also try to track down why the config.linkerd.io/debug-image-version annotation isn't working, if that all works for you, @mateiidavid?
@klingerf nooo problem, thanks for the update!
I tried to do some research to see why Alpine would require NET_ADMIN and NET_RAW. Didn't really get anything useful. Alpine is a bit more restrictive, so perhaps there is some piece of configuration that's just not the same between the two distros (Debian and Alpine). There are some differences in the groups created, for example.
I wanted to see if we can avoid elevating the container by granting permissions only for dumpcap, but that didn't work either. The root user should also be part of the wireshark group, which is sometimes a prerequisite to running tshark. I don't really see a way out of this, so we might just have to add in the capabilities. For reference, here's a modified Dockerfile I used to poke around the container's runtime.
diff --git a/Dockerfile-debug b/Dockerfile-debug
index 522c1638b..5c0af7a8e 100644
--- a/Dockerfile-debug
+++ b/Dockerfile-debug
@@ -1,5 +1,5 @@
FROM alpine:3.20.1
-RUN apk add \
+RUN apk update && apk upgrade && apk add \
bind-tools \
curl \
iptables \
@@ -10,6 +10,11 @@ RUN apk add \
iproute2 \
lsof \
conntrack-tools \
+ procps \
+ coreutils \
+ libcap \
tshark
-ENTRYPOINT [ "tshark", "-i", "any" ]
+RUN setcap cap_net_raw,cap_net_admin=eip /usr/bin/dumpcap
+
+ENTRYPOINT ["tail", "-f", "/dev/null"]
I added procps for ps, coreutils for tail, and libcap to do setcap and getcap directly on the dumpcap binary. If I try to also add the capabilities in the image build process for tshark, I simply get an error that doesn't point to anything specific:
sh: tshark: Operation not permitted
I guess that's all to say, seems like granting the container caps through the k8s API is the simplest way forward.
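To check whether a process really holds NET_ADMIN and NET_RAW once the capabilities are granted, one option is to decode the effective capability mask from /proc. This is a rough sketch rather than anything from the Linkerd repo: the `has_cap` helper is hypothetical, and the bit positions (12 for NET_ADMIN, 13 for NET_RAW) are the ones defined in linux/capability.h.

```shell
#!/bin/sh
# Bit positions from linux/capability.h.
CAP_NET_ADMIN=12
CAP_NET_RAW=13

# has_cap MASK_HEX BIT: print "yes" if BIT is set in the hex mask, else "no".
# MASK_HEX is given without a 0x prefix, as it appears in /proc/<pid>/status.
has_cap() {
  mask=$((0x$1))
  if [ $(( (mask >> $2) & 1 )) -eq 1 ]; then
    echo yes
  else
    echo no
  fi
}

# Report on the current process when the CapEff line is available (Linux only).
if [ -r /proc/self/status ]; then
  eff=$(awk '/^CapEff:/ {print $2}' /proc/self/status)
  if [ -n "$eff" ]; then
    echo "NET_ADMIN: $(has_cap "$eff" "$CAP_NET_ADMIN")"
    echo "NET_RAW:   $(has_cap "$eff" "$CAP_NET_RAW")"
  fi
fi
```

Running this inside the debug container (for example with the `tail -f /dev/null` entrypoint from the Dockerfile above, so the container stays up) should show whether the securityContext change took effect.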
Given the capabilities issue mentioned in previous comments, I don't think it's worth moving forward with this change right now. I'm going to close it, but we can always revisit later.
The debug container was previously built using Debian 12 (bookworm) as its base image, but snyk reports this vulnerability for that image: https://security.snyk.io/vuln/SNYK-DEBIAN12-ZLIB-6008963
By switching to the latest version of Alpine as a base image, the debug container scans cleanly with snyk and has access to new versions of installed utilities.
Here's a rundown of some of the debug utilities included in the edge-24.7.3 debug container:
Compare that to the debug container build from this branch:
It's also the case that the debug container built from this branch is about half the size of the edge-24.7.3 version: