Closed: cinderellagarage closed this issue 1 month ago
Thanks for this. Possibly related to the new IPv6 code. We will take a look.
Can you try enabling RUST_BACKTRACE to get a clearer error message? In the CLI you can do that with:
linkerd upgrade --set proxy.additionalEnv[0].name=RUST_BACKTRACE --set-string proxy.additionalEnv[0].value=1
and then roll out the workload where you're observing this.
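For example, if the affected workload is a Deployment (the name and namespace here are placeholders):

# Restart the rollout so the pods pick up the new proxy environment variable.
kubectl rollout restart deployment/<name> -n <namespace>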
I install Linkerd with Helm, so the linkerd upgrade does not work. I edited the deployment to get the backtrace.
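Something along these lines (the exact edit may differ; this assumes the proxy container is defined in the control-plane Deployment spec, which it is for the Helm-rendered control-plane workloads):

# Set RUST_BACKTRACE=1 on the proxy container of the destination Deployment.
kubectl -n linkerd set env deployment/linkerd-destination -c linkerd-proxy RUST_BACKTRACE=1

That gave me the following: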
Defaulted container "linkerd-proxy" out of: linkerd-proxy, destination, sp-validator, policy, linkerd-init (init)
time="2024-07-15T20:04:44Z" level=info msg="Found pre-existing key: /var/run/linkerd/identity/end-entity/key.p8"
time="2024-07-15T20:04:44Z" level=info msg="Found pre-existing CSR: /var/run/linkerd/identity/end-entity/csr.der"
[ 0.001740s] INFO ThreadId(01) linkerd2_proxy: release 2.238.0 (99626eb) by linkerd on 2024-06-26T03:48:04Z
[ 0.003771s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[ 0.004443s] INFO ThreadId(01) linkerd2_proxy: Admin interface on [::]:4191
[ 0.004459s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on [::]:4143
[ 0.004461s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[ 0.004463s] INFO ThreadId(01) linkerd2_proxy: Tap DISABLED
[ 0.004464s] INFO ThreadId(01) linkerd2_proxy: SNI is linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
[ 0.004466s] INFO ThreadId(01) linkerd2_proxy: Local identity is linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
[ 0.004467s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via localhost:8086
thread 'main' panicked at linkerd/app/src/lib.rs:460:14:
admin: Os { code: 1, kind: PermissionDenied, message: "Operation not permitted" }
stack backtrace:
0: 0x5638dcefc531 - <unknown>
1: 0x5638dc0127a0 - <unknown>
2: 0x5638dced2b5e - <unknown>
3: 0x5638dcefe08e - <unknown>
4: 0x5638dcefd937 - <unknown>
5: 0x5638dcefe911 - <unknown>
6: 0x5638dcefe400 - <unknown>
7: 0x5638dcefe356 - <unknown>
8: 0x5638dcefe34f - <unknown>
9: 0x5638dbf11bf4 - <unknown>
10: 0x5638dbf12162 - <unknown>
11: 0x5638dca0a386 - <unknown>
12: 0x5638dc7e475c - <unknown>
13: 0x5638dc63c09c - <unknown>
14: 0x5638dc65d043 - <unknown>
15: 0x5638dc6385c1 - <unknown>
16: 0x7f0e8d44724a - <unknown>
17: 0x7f0e8d447305 - __libc_start_main
18: 0x5638dbf83571 - <unknown>
19: 0x0 - <unknown>
I believe this error is actually being returned from std::thread::Builder::spawn. The code in question is basically:

std::thread::Builder::new()
    .name("admin".into())
    .spawn(move || {
        // ...
    })
    .expect("admin");

indicating that the error comes from the spawn call itself, not from inside the admin thread.
This sounds like something system-level is preventing the creation of an additional runtime thread.
To proceed we'd likely need to collect detailed version information about the operating system, container runtime, and Kubernetes cluster configuration.
kubernetesVersion: 1.22.17
DockerVersion: 20.10.8
Ubuntu: 20.04
I think we're likely seeing something related to this (there are many similar issues if you search around):
I think it is quite likely that this happens because in newer glibc, the clone3 syscall is used by default when creating a new thread (using for example pthread_create). When using a newer version of glibc on an older kernel (such as running a newer debian docker image on an older ubuntu kernel (18.04.4 uses kernel 4.15)), this should cause an ENOSYS error, but older versions of docker mistakenly return an EPERM, causing glibc to not retry with the clone syscall instead, subsequently causing this error. This can be fixed by using a newer version of docker including https://github.com/moby/moby/commit/9f6b562dd12ef7b1f9e2f8e6f2ab6477790a6594 or updating your distro, as ubuntu 18.04.5 uses kernel 5.4 (and clone3 was introduced in kernel 5.3). ...
This is almost definitely what's going on:
We can probably pin down the edge version where glibc was updated, but there does seem to be a fundamental incompatibility here caused by the Docker bug.
More here:
In summary, make sure you are using docker v20.10.10 if using docker-ce or a patched older version if using docker.io when running images with glibc v2.34+.
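A quick way to check where a given node stands (a sketch to run on the node itself; the version thresholds come from the quotes above):

# Host kernel: clone3 was introduced in kernel 5.3.
uname -r
# Docker engine: the default seccomp profile returns EPERM instead of ENOSYS
# for unknown syscalls (like clone3) before the fix that shipped in 20.10.10.
docker version --format '{{.Server.Version}}'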
Thank you!
@cinderellagarage Out of curiosity, could you paste in the output of

kubectl get node -o jsonpath="{.items[*].status.nodeInfo.containerRuntimeVersion}"

on your system? We're working on providing some guidance to users about this issue.
Sure thing!
kubectl get node -o jsonpath="{.items[*].status.nodeInfo.containerRuntimeVersion}"
docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8 docker://20.10.8
Thanks, that's exactly what I was hoping to see. We should be able to write our guidance now for the upcoming 2.16 release. In the meantime, I hope you are able to upgrade to 20.10.10 or beyond!
Thank you! This bug has actually helped light a fire under us to upgrade many other things, so hopefully we can get past it soon.
What is the issue?
Upgrading from stable-2.14.10 to edge-24.6.4 causes the linkerd-proxy container to panic on startup (we have not tried anything in between because we need the memory leak and httpRoute fixes).
How can it be reproduced?
When performing a helm upgrade to the deployment, this occurs. I have tried regenerating certs, and removing the service accounts so they get recreated. I have also tried running as non-root, as root, privileged, and non-privileged. IIRC, I have to run as root because we run Weave as our CNI:
weave-kube:2.8.1
Logs, error output, etc
- Non-debug logs from the linkerd-proxy container
- Debug logs from linkerd-proxy
- proxy-injector logs
- Output of linkerd check -o short (since the control plane does not come up healthy)
Environment
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None