linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

"Operation not permitted" error in linkerd-proxy on edge-24.6.4 #12796

Closed: cinderellagarage closed this issue 1 month ago

cinderellagarage commented 3 months ago

What is the issue?

Upgrading from stable-2.14.10 to edge-24.6.4 (we have not tried any versions in between, since we need the memory leak and HTTPRoute fixes).

How can it be reproduced?

This occurs when performing a helm upgrade on the deployment. I have tried regenerating certs and removing the service accounts so they get recreated, and tried running as nonRoot, as root, privileged, and non-privileged. IIRC, I have to run as root because we run Weave as our CNI.

weave-kube:2.8.1

Logs, error output, etc

Non-debug logs from linkerd-proxy container

Defaulted container "linkerd-proxy" out of: linkerd-proxy, proxy-injector, linkerd-init (init)
time="2024-07-01T19:32:50Z" level=info msg="Found pre-existing key: /var/run/linkerd/identity/end-entity/key.p8"
time="2024-07-01T19:32:50Z" level=info msg="Found pre-existing CSR: /var/run/linkerd/identity/end-entity/csr.der"
[ 0.001394s] INFO ThreadId(01) linkerd2_proxy: release 2.238.0 (99626eb) by linkerd on 2024-06-26T03:48:04Z
[ 0.004103s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[ 0.004943s] INFO ThreadId(01) linkerd2_proxy: Admin interface on [::]:4191
[ 0.004961s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on [::]:4143
[ 0.004964s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[ 0.004967s] INFO ThreadId(01) linkerd2_proxy: Tap interface on [::]:4190
[ 0.004969s] INFO ThreadId(01) linkerd2_proxy: SNI is linkerd-proxy-injector.linkerd.serviceaccount.identity.linkerd.cluster.local
[ 0.004971s] INFO ThreadId(01) linkerd2_proxy: Local identity is linkerd-proxy-injector.linkerd.serviceaccount.identity.linkerd.cluster.local
[ 0.004973s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
thread 'main' panicked at linkerd/app/src/lib.rs:460:14:
admin: Os { code: 1, kind: PermissionDenied, message: "Operation not permitted" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Debug logs from linkerd-proxy

Defaulted container "linkerd-proxy" out of: linkerd-proxy, proxy-injector, linkerd-init (init)
time="2024-07-02T15:17:01Z" level=info msg="Found pre-existing key: /var/run/linkerd/identity/end-entity/key.p8"
time="2024-07-02T15:17:01Z" level=info msg="Found pre-existing CSR: /var/run/linkerd/identity/end-entity/csr.der"
[ 0.001338s] INFO ThreadId(01) linkerd2_proxy: release 2.238.0 (99626eb) by linkerd on 2024-06-26T03:48:04Z
[ 0.002747s] DEBUG ThreadId(01) linkerd_app::env: Only allowing connections targeting `LINKERD2_PROXY_INBOUND_IPS` allowed={100.109.104.23}
[ 0.003529s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[ 0.003631s] DEBUG ThreadId(01) linkerd_app: Building app
[ 0.003655s] DEBUG ThreadId(01) linkerd_app: Building DNS client
[ 0.003795s] DEBUG ThreadId(01) linkerd_app: Building Identity client
[ 0.003879s] DEBUG ThreadId(01) identity: rustls::anchors: add_parsable_certificates processed 1 valid and 0 invalid certs
[ 0.003941s] DEBUG ThreadId(01) linkerd_app: Building Tap server config=Enabled { config: ServerConfig { addr: DualListenAddr([::]:4190, None), keepalive: Keepalive(Some(10s)), http2: ServerParams { flow_control: Some(Fixed { initial_stream_window_size: 65535, initial_connection_window_size: 1048576 }), keep_alive: Some(KeepAlive { interval: 10s, timeout: 3s }), max_concurrent_streams: None, max_frame_size: None, max_header_list_size: None, max_pending_accept_reset_streams: None, max_send_buf_size: None } }, permitted_client_ids: {ClientId(Dns(Name("tap.linkerd-viz.serviceaccount.identity.linkerd.cluster.local")))} }
[ 0.004038s] DEBUG ThreadId(01) linkerd_app: Building Destination client
[ 0.004079s] DEBUG ThreadId(01) dst:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}: linkerd_proxy_balance: Resolving
[ 0.004109s] DEBUG ThreadId(01) dst:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}: linkerd_proxy_balance: Spawning p2c pool queue capacity=1000 failfast=30s
[ 0.004152s] DEBUG ThreadId(01) linkerd_app: Building Policy client
[ 0.004177s] DEBUG ThreadId(01) policy:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}: linkerd_proxy_balance: Resolving
[ 0.004194s] DEBUG ThreadId(01) policy:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}: linkerd_proxy_balance: Spawning p2c pool queue capacity=1000 failfast=30s
[ 0.004210s] DEBUG ThreadId(01) linkerd_app: Building client config=Disabled
[ 0.004466s] DEBUG ThreadId(01) linkerd_metrics::process: Process metrics registered
[ 0.004517s] INFO ThreadId(01) linkerd2_proxy: Admin interface on [::]:4191
[ 0.004530s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on [::]:4143
[ 0.004532s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[ 0.004535s] INFO ThreadId(01) linkerd2_proxy: Tap interface on [::]:4190
[ 0.004537s] INFO ThreadId(01) linkerd2_proxy: SNI is linkerd-proxy-injector.linkerd.serviceaccount.identity.linkerd.cluster.local
[ 0.004538s] INFO ThreadId(01) linkerd2_proxy: Local identity is linkerd-proxy-injector.linkerd.serviceaccount.identity.linkerd.cluster.local
[ 0.004541s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.004543s] DEBUG ThreadId(01) linkerd_app: spawning daemon thread
thread 'main' panicked at linkerd/app/src/lib.rs:460:14:
admin: Os { code: 1, kind: PermissionDenied, message: "Operation not permitted" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

proxy-injector logs

logs -n linkerd linkerd-proxy-injector-55b7669475-8tzm6 proxy-injector
time="2024-07-02T15:17:37Z" level=info msg="running version edge-24.6.4"
time="2024-07-02T15:17:37Z" level=info msg="starting admin server on :9995"
time="2024-07-02T15:17:37Z" level=info msg="waiting for caches to sync"
time="2024-07-02T15:17:37Z" level=info msg="listening at :8443"
time="2024-07-02T15:17:38Z" level=info msg="caches synced"
time="2024-07-02T15:18:11Z" level=info msg="shutting down webhook server"
time="2024-07-02T15:18:11Z" level=error msg="failed to start webhook admin server: http: Server closed"

output of linkerd check -o short

Since the control plane does not come up healthy:

control-plane-version
---------------------
‼ control plane is up-to-date
    unsupported version channel: stable-2.14.10
    see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running stable-2.14.10 but cli running edge-24.6.4
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
    * linkerd-destination-5778b7b86-djhsq (stable-2.14.10)
    * linkerd-destination-5778b7b86-fjht9 (stable-2.14.10)
    * linkerd-destination-5778b7b86-t4qhf (stable-2.14.10)
    * linkerd-destination-5778b7b86-wjtqs (stable-2.14.10)
    * linkerd-destination-5778b7b86-z46lq (stable-2.14.10)
    * linkerd-identity-5b57fd59f7-jrz2h (stable-2.14.10)
    * linkerd-identity-5b57fd59f7-rdw7r (stable-2.14.10)
    * linkerd-identity-5b57fd59f7-sq7qz (stable-2.14.10)
    * linkerd-identity-5b57fd59f7-sw498 (stable-2.14.10)
    * linkerd-identity-5b57fd59f7-vmjc8 (stable-2.14.10)
    * linkerd-proxy-injector-68cc85d56d-6zkfx (stable-2.14.10)
    * linkerd-proxy-injector-68cc85d56d-t85xw (stable-2.14.10)
    * linkerd-proxy-injector-68cc85d56d-tpxq9 (stable-2.14.10)
    * linkerd-proxy-injector-68cc85d56d-vhjkh (stable-2.14.10)
    * linkerd-proxy-injector-68cc85d56d-z6sps (stable-2.14.10)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-5778b7b86-djhsq running stable-2.14.10 but cli running edge-24.6.4
    see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints

Environment

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

wmorgan commented 2 months ago

Thanks for this. Possibly related to the new IPv6 code. We will take a look.

alpeb commented 2 months ago

Can you try enabling RUST_BACKTRACE to get a clearer error message? In the CLI you can do that with:

linkerd upgrade --set proxy.additionalEnv[0].name=RUST_BACKTRACE --set-string proxy.additionalEnv[0].value=1

and then rollout the workload where you're observing this.
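
For Helm-based installs, the same values can be passed to helm upgrade directly; a sketch, assuming a release installed from the linkerd-control-plane chart (adjust the release and chart names to match your setup):

```shell
# Hypothetical release/chart names -- adjust to match your install.
helm upgrade linkerd-control-plane linkerd/linkerd-control-plane \
  --namespace linkerd \
  --reuse-values \
  --set proxy.additionalEnv[0].name=RUST_BACKTRACE \
  --set-string proxy.additionalEnv[0].value=1
```

The --reuse-values flag keeps the rest of the existing configuration intact while layering on the extra proxy environment variable.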

cinderellagarage commented 2 months ago

I install Linkerd with Helm, so linkerd upgrade does not work. I edited the deployment to get the backtrace:

Defaulted container "linkerd-proxy" out of: linkerd-proxy, destination, sp-validator, policy, linkerd-init (init)
time="2024-07-15T20:04:44Z" level=info msg="Found pre-existing key: /var/run/linkerd/identity/end-entity/key.p8"
time="2024-07-15T20:04:44Z" level=info msg="Found pre-existing CSR: /var/run/linkerd/identity/end-entity/csr.der"
[     0.001740s]  INFO ThreadId(01) linkerd2_proxy: release 2.238.0 (99626eb) by linkerd on 2024-06-26T03:48:04Z
[     0.003771s]  INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[     0.004443s]  INFO ThreadId(01) linkerd2_proxy: Admin interface on [::]:4191
[     0.004459s]  INFO ThreadId(01) linkerd2_proxy: Inbound interface on [::]:4143
[     0.004461s]  INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[     0.004463s]  INFO ThreadId(01) linkerd2_proxy: Tap DISABLED
[     0.004464s]  INFO ThreadId(01) linkerd2_proxy: SNI is linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
[     0.004466s]  INFO ThreadId(01) linkerd2_proxy: Local identity is linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
[     0.004467s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via localhost:8086
thread 'main' panicked at linkerd/app/src/lib.rs:460:14:
admin: Os { code: 1, kind: PermissionDenied, message: "Operation not permitted" }
stack backtrace:
   0:     0x5638dcefc531 - <unknown>
   1:     0x5638dc0127a0 - <unknown>
   2:     0x5638dced2b5e - <unknown>
   3:     0x5638dcefe08e - <unknown>
   4:     0x5638dcefd937 - <unknown>
   5:     0x5638dcefe911 - <unknown>
   6:     0x5638dcefe400 - <unknown>
   7:     0x5638dcefe356 - <unknown>
   8:     0x5638dcefe34f - <unknown>
   9:     0x5638dbf11bf4 - <unknown>
  10:     0x5638dbf12162 - <unknown>
  11:     0x5638dca0a386 - <unknown>
  12:     0x5638dc7e475c - <unknown>
  13:     0x5638dc63c09c - <unknown>
  14:     0x5638dc65d043 - <unknown>
  15:     0x5638dc6385c1 - <unknown>
  16:     0x7f0e8d44724a - <unknown>
  17:     0x7f0e8d447305 - __libc_start_main
  18:     0x5638dbf83571 - <unknown>
  19:                0x0 - <unknown>

olix0r commented 2 months ago

I believe that this error is actually being returned from std::thread::Builder::spawn. The code in question is basically:

        std::thread::Builder::new()
            .name("admin".into())
            .spawn(move || {
               ...
            })
            .expect("admin");

This indicates that the error comes from the spawn call itself, not from code running on the admin thread.

It sounds like something at the system level is preventing the creation of an additional runtime thread.

To proceed, we'd likely need to collect detailed version information about the operating system, container runtime, and Kubernetes cluster configuration.

cinderellagarage commented 2 months ago

kubernetesVersion: 1.22.17
DockerVersion: 20.10.8
Ubuntu: 20.04

olix0r commented 2 months ago

I think we're likely seeing something related to this (there are many similar issues if you search around):

I think it is quite likely that this happens because in newer glibc, the clone3 syscall is used by default when creating a new thread (using for example pthread_create). When using a newer version of glibc on an older kernel (such as running a newer debian docker image on an older ubuntu kernel (18.04.4 uses kernel 4.15)), this should cause an ENOSYS error, but older versions of docker mistakenly return an EPERM, causing glibc to not retry with the clone syscall instead, subsequently causing this error. This can be fixed by using a newer version of docker including https://github.com/moby/moby/commit/9f6b562dd12ef7b1f9e2f8e6f2ab6477790a6594 or updating your distro, as ubuntu 18.04.5 uses kernel 5.4 (and clone3 was introduced in kernel 5.3). ...

This is almost definitely what's going on:

We can probably pin down the edge version where glibc was updated, but there does seem to be a fundamental incompatibility here caused by the Docker bug.

olix0r commented 2 months ago

More here:

In summary, make sure you are using docker v20.10.10 if using docker-ce or a patched older version if using docker.io when running images with glibc v2.34+.
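
Since clone3 was introduced in kernel 5.3 and the faulty EPERM comes from older Docker versions, it can help to confirm the kernel on every node as well; a sketch using the standard nodeInfo field:

```shell
# Kernel version per node; clone3 requires kernel >= 5.3
kubectl get node -o jsonpath="{.items[*].status.nodeInfo.kernelVersion}"
```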

cinderellagarage commented 2 months ago

Thank you!

wmorgan commented 2 months ago

@cinderellagarage Out of curiosity, could you paste in the output of kubectl get node -o jsonpath="{.items[*].status.nodeInfo.containerRuntimeVersion}" on your system? We're working on providing some guidance to users about this issue.

cinderellagarage commented 2 months ago

Sure thing!

kubectl get node -o jsonpath="{.items[*].status.nodeInfo.containerRuntimeVersion}"
docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8

wmorgan commented 2 months ago

Thanks, that's exactly what I was hoping to see. We should be able to write our guidance now for the upcoming 2.16 release. In the meantime, I hope you are able to upgrade to 20.10.10 or beyond!

cinderellagarage commented 2 months ago

Thank you! This bug has actually lit a fire under us to upgrade many other things, so hopefully we can get past it soon.