linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Switch debug container to alpine base image #12863

Closed klingerf closed 1 month ago

klingerf commented 2 months ago

The debug container was previously built using Debian 12 (bookworm) as its base image, but snyk reports this vulnerability for that image:

https://security.snyk.io/vuln/SNYK-DEBIAN12-ZLIB-6008963

By switching to the latest version of Alpine as a base image, the debug container scans cleanly with snyk and has access to new versions of installed utilities.

Here's a rundown of some of the debug utilities included in the edge-24.7.3 debug container:

# curl --version
curl 7.88.1
# dig -v
DiG 9.18.24-1-Debian
# iptables --version
iptables v1.8.9 (legacy)
# iptables-nft --version
iptables v1.8.9 (nf_tables)
# jq --version
jq-1.6
# nghttp --version
nghttp nghttp2/1.52.0
# tcpdump --version
tcpdump version 4.99.3
libpcap version 1.10.3 (with TPACKET_V3)
OpenSSL 3.0.13 30 Jan 2024
# nstat --version
nstat utility, iproute2-6.1.0
# lsof -v
revision: 4.95.0
# conntrack --version
conntrack v1.4.7 (conntrack-tools)
# tshark --version
TShark (Wireshark) 4.0.11 (Git v4.0.11 packaged as 4.0.11-1~deb12u1).

Compare that to the debug container build from this branch:

# curl --version
curl 8.8.0
# dig -v
DiG 9.18.27
# iptables-legacy --version
iptables v1.8.10 (legacy)
# iptables --version
iptables v1.8.10 (nf_tables)
# jq --version
jq-1.7.1
# nghttp --version
nghttp nghttp2/1.62.1
# tcpdump --version
tcpdump version 4.99.4
libpcap version 1.10.4 (with TPACKET_V3)
OpenSSL 3.3.1 4 Jun 2024
# nstat --version
nstat utility, iproute2-6.9.0
# lsof -v
revision: 4.99.3
# conntrack --version
conntrack v1.4.8 (conntrack-tools)
# tshark --version
TShark (Wireshark) 4.2.5 (Git commit 798e06a0f7be).

It's also the case that the debug container built from this branch is about half the size of the edge-24.7.3 version:

cr.l5d.io/linkerd/debug         git-b4aa3532            7817231cb383   22 minutes ago   159MB
cr.l5d.io/linkerd/debug         edge-24.7.3             429fb197e718   25 hours ago     314MB
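
As a quick check of the "about half the size" figure, the ratio can be computed from the two sizes in the listing above. This is only a small sketch; the variable names are illustrative and not part of the build.

```shell
# Sizes in MB, copied from the image listing above.
new_mb=159   # git-b4aa3532
old_mb=314   # edge-24.7.3

# awk handles the floating-point division that plain sh arithmetic lacks.
ratio=$(awk -v n="$new_mb" -v o="$old_mb" 'BEGIN { printf "%.1f", n / o * 100 }')
echo "new image is ${ratio}% of the old size"
```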

klingerf commented 2 months ago

@mateiidavid Thanks for the feedback here, and sorry for the delay. I just tried to replicate and got the same results. But then I realized that the config.linkerd.io/debug-image-version annotation doesn't appear to be honored.

I have an existing cluster that's running edge-24.7.5. For one of my deployments, I added these annotations:

      annotations:
        config.linkerd.io/debug-image-version: git-b4aa3532
        config.linkerd.io/enable-debug-sidecar: "true"
        linkerd.io/inject: enabled

But when I look at the running pod, I see:

  - image: cr.l5d.io/linkerd/debug:edge-24.7.5
    imagePullPolicy: IfNotPresent
    livenessProbe:
      exec:
        command:
        - "true"
      failureThreshold: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: linkerd-debug

So that explains why it works in an existing cluster -- we're not actually running it there :-/
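
The comparison above (requested tag in the annotation versus the tag on the running container) can be sketched as a small shell check. The function names `image_tag` and `check_debug_image` are made up for illustration, and the sketch assumes the image reference carries a tag and no digest.

```shell
# image_tag prints the tag portion of an image reference (text after the last ':').
# Assumption: the reference has a tag and no digest.
image_tag() {
  printf '%s\n' "${1##*:}"
}

# check_debug_image REQUESTED_TAG RUNNING_IMAGE
# Compares the tag requested via the annotation with the image actually running.
# Both values are supplied by the caller; nothing here talks to the cluster.
check_debug_image() {
  requested=$1
  running=$(image_tag "$2")
  if [ "$requested" = "$running" ]; then
    echo "ok: running $running"
  else
    echo "mismatch: requested $requested, running $running"
    return 1
  fi
}

# Values taken from the annotation and pod spec quoted above.
check_debug_image git-b4aa3532 cr.l5d.io/linkerd/debug:edge-24.7.5 || true
```

The second argument could come from something like `kubectl get pod <pod> -o jsonpath='{.spec.containers[?(@.name=="linkerd-debug")].image}'`, where `<pod>` is a placeholder for the pod name.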

And when I do actually run it, it fails:

$ kubectl logs deploy/hello linkerd-debug
Capturing on 'any'
tshark: Couldn't run dumpcap in child process: Operation not permitted
0 packets captured

It looks like that container is missing the NET_ADMIN and NET_RAW capabilities. If I add this to the container definition, it works:

        securityContext:
          capabilities:
            add:
            - NET_ADMIN
            - NET_RAW

Which... makes sense? Those are the same capabilities that we give to proxy-init. I'm not surprised that tshark needs them as well.
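
One way to confirm from inside a container whether those two capabilities are present is to decode the effective capability mask in /proc/self/status. The following is a rough sketch, not something from the debug image itself: `has_cap` and `report_cap` are hypothetical helper names, and the bit positions (12 for NET_ADMIN, 13 for NET_RAW) are the values defined in linux/capability.h.

```shell
# Bit positions from linux/capability.h.
CAP_NET_ADMIN=12
CAP_NET_RAW=13

# has_cap MASK BIT: succeeds if BIT is set in the hex capability MASK.
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# report_cap MASK NAME BIT: prints whether the named capability is in MASK.
report_cap() {
  if has_cap "$1" "$3"; then
    echo "$2: present"
  else
    echo "$2: missing"
  fi
}

# Effective capability mask of the current process, as hex (Linux only).
cap_eff=$(awk '/^CapEff:/ { print $2 }' /proc/self/status 2>/dev/null) || cap_eff=""

if [ -n "$cap_eff" ]; then
  report_cap "$cap_eff" NET_ADMIN "$CAP_NET_ADMIN"
  report_cap "$cap_eff" NET_RAW "$CAP_NET_RAW"
fi
```

Running this in the linkerd-debug container before and after adding the securityContext would show whether the change took effect.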

I can update the debug container template to include those capabilities, but I wonder why the previous version of this image didn't need them? I can also try to track down why the config.linkerd.io/debug-image-version annotation isn't working, if that all works for you, @mateiidavid?

mateiidavid commented 2 months ago

@klingerf nooo problem, thanks for the update!

I tried to do some research to see why Alpine would require NET_ADMIN and NET_RAW. Didn't really get anything useful. Alpine is a bit more restrictive, so perhaps there is some piece of configuration that's just not the same between the two distros (Debian and Alpine). There are some differences in the groups created, for example.

I wanted to see if we can avoid elevating the container by granting permissions only for dumpcap, but that didn't work either. The root user should also be part of the wireshark group, which is sometimes a prerequisite to running tshark. I don't really see a way out of this, so we might just have to add in the capabilities. For reference, here's a modified Dockerfile I used to poke around the container's runtime.

diff --git a/Dockerfile-debug b/Dockerfile-debug
index 522c1638b..5c0af7a8e 100644
--- a/Dockerfile-debug
+++ b/Dockerfile-debug
@@ -1,5 +1,5 @@
 FROM alpine:3.20.1
-RUN apk add \
+RUN apk update && apk upgrade && apk add \
     bind-tools \
     curl \
     iptables \
@@ -10,6 +10,11 @@ RUN apk add \
     iproute2 \
     lsof \
     conntrack-tools \
+    procps \
+    coreutils \
+    libcap \
     tshark

-ENTRYPOINT [ "tshark", "-i", "any" ]
+RUN setcap cap_net_raw,cap_net_admin=eip /usr/bin/dumpcap
+
+ENTRYPOINT ["tail", "-f", "/dev/null"]

I added procps for ps, coreutils for tail and libcap to do setcap and getcap directly on the dumpcap binary. If I try to also add the capabilities in the image build process for tshark I simply get an error that doesn't point to anything specific:

sh: tshark: Operation not permitted

I guess that's all to say: it seems like granting the container caps through the k8s API is the simplest way forward.

klingerf commented 1 month ago

Given the capabilities issue mentioned in previous comments, I don't think it's worth moving forward with this change right now. I'm going to close it, but we can always revisit later.