Closed morremeyer closed 3 weeks ago
I think this is because of our import for the otel semconv that uses 1.12, which is pretty old https://github.com/kubernetes-sigs/descheduler/blob/748495a022b7445e3bcce0be20effc4890b8a9fc/pkg/tracing/tracing.go#L30C11-L30C51
I opened https://github.com/kubernetes-sigs/descheduler/pull/1429 to bump this to 1.24 which should help. What's annoying is that we will have to keep these in sync, so I'd like to see if there's a way we can drop the direct semconv import. I think it's being used by some of the other otel imports too, and the conflict is probably coming from one of those being at a higher version.
Also not sure why this didn't come up in tests. @morremeyer is this a fatal error or is the descheduler still running with that error? Our otel implementation should be silent if it's not enabled so an error in this case is bad
@damemi it is a fatal error, it exits directly after the last log line. Once I disable it, it works fine.
I also have the problem that if set the helm values to
deschedulerPolicy:
tracing:
collectorEndpoint: monitoring-alloy.monitoring.svc.cluster.local:4317
sampleRate: 1.0
It does not even recognize the config. I did not mention this in the initial comment since it also looks like the policy (see initial comment) does not do anything - it's running but no pods are evicted, even ones that look like they match the policy to me. I didn't comment about this because so far I think this is a user error and I'm still looking into why this is happening.
@morremeyer thanks, to help debug why your pods aren't being evicted you can check the docs which list all the reasons a pod would be skipped by default. You can also set --v=4
to increase the logging verbosity to print a message why each pod is skipped.
I'll update my PR because I don't think we should be causing fatal errors if tracing fails to start up
I agree, if tracing fails to start up, that should be a warning or error log message and then execution continues.
Thanks for looking into it this quick, I appreciate it a lot!
What version of descheduler are you using?
descheduler version: v0.31.0 helm chart version: 0.30.0
Does this issue reproduce with the latest release?
Yes
Which descheduler CLI options are you using?
What k8s version are you using (
kubectl version
)?kubectl version
OutputWhat did you do?
Deploy descheduler with the helm chart in version 0.30.0 with the following
values.yaml
:What did you expect to see?
Descheduler running with the policy specified
What did you see instead?
Descheduler exits with an error
Full log
``` {"ts":1717675604781.581,"caller":"server/secure_serving.go:57","msg":"Forcing use of http/1.1 only","v":0} {"ts":1717675604782.072,"caller":"dynamiccertificates/named_certificates.go:53","msg":"Loaded SNI cert","v":2,"index":0,"certName":"self-signed loopback","certDetail":"\"apiserver-loopback-client@1717675604\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1717675604\" (2024-06-06 11:06:44 +0000 UTC to 2025-06-06 11:06:44 +0000 UTC (now=2024-06-06 12:06:44.782019599 +0000 UTC))"} {"ts":1717675604782.1294,"caller":"server/secure_serving.go:213","msg":"Serving securely on [::]:10258","v":0} {"ts":1717675604782.1575,"caller":"tracing/tracing.go:109","msg":"Enabling trace GRPC client in insecure mode","v":0} {"ts":1717675604782.2544,"caller":"dynamiccertificates/tlsconfig.go:240","msg":"Starting DynamicServingCertificateController","v":0} {"ts":1717675604782.679,"caller":"tracing/tracing.go:130","msg":"failed to create traceable resource","err":"2 errors occurred detecting resource:\n\t* conflicting Schema URL: https://opentelemetry.io/schemas/1.12.0 and https://opentelemetry.io/schemas/1.24.0\n\t* user: Current requires cgo or $USER set in environment"} {"ts":1717675604782.7117,"caller":"app/server.go:96","msg":"descheduler server","err":"2 errors occurred detecting resource:\n\t* conflicting Schema URL: https://opentelemetry.io/schemas/1.12.0 and https://opentelemetry.io/schemas/1.24.0\n\t* user: Current requires cgo or $USER set in environment"} {"ts":1717675604782.729,"caller":"cli/run.go:74","msg":"command failed","err":"2 errors occurred detecting resource:\n\t* conflicting Schema URL: https://opentelemetry.io/schemas/1.12.0 and https://opentelemetry.io/schemas/1.24.0\n\t* user: Current requires cgo or $USER set in environment"} ```