kubernetes-sigs / descheduler

Descheduler for Kubernetes
https://sigs.k8s.io/descheduler
Apache License 2.0

otel: conflicting Schema URL #1428

Closed: morremeyer closed this issue 3 weeks ago

morremeyer commented 3 weeks ago

What version of descheduler are you using?

descheduler version: v0.31.0
helm chart version: 0.30.0

Does this issue reproduce with the latest release?

Yes

Which descheduler CLI options are you using?

--policy-config-file=/policy-dir/policy.yaml
--logging-format=json
--otel-collector-endpoint=monitoring-alloy.monitoring.svc.cluster.local:4317
--otel-sample-rate=1
--v=3

Please provide a copy of your descheduler policy config file

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
- name: RemovePodsHavingTooManyRestarts
  pluginConfig:
  - args:
      includingInitContainers: true
      podRestartThreshold: 50
    name: RemovePodsHavingTooManyRestarts
    plugins:
      deschedule:
        enabled:
        - RemovePodsHavingTooManyRestarts
```

What k8s version are you using (kubectl version)?

kubectl version output:

```console
$ kubectl version
Client Version: v1.30.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4-eks-036c24b
```

What did you do?

Deploy descheduler with the helm chart in version 0.30.0 with the following values.yaml:

```yaml
image:
  repository: registry.k8s.io/descheduler/descheduler
  tag: v0.30.1

fullnameOverride: descheduler

cmdOptions:
  logging-format: json

  otel-sample-rate: 1.0
  otel-collector-endpoint: monitoring-alloy.monitoring.svc.cluster.local:4317

deschedulerPolicy:
  ## Setting the config here does not do anything, setting it in cmdOptions above
  # tracing:
  #   collectorEndpoint: monitoring-alloy.monitoring.svc.cluster.local:4317
  #   sampleRate: 1.0

  profiles:
    - name: RemovePodsHavingTooManyRestarts
      pluginConfig:
        - name: "RemovePodsHavingTooManyRestarts"
          args:
            podRestartThreshold: 50
            includingInitContainers: true
          plugins:
            deschedule:
              enabled:
                - "RemovePodsHavingTooManyRestarts"

What did you expect to see?

Descheduler running with the policy specified

What did you see instead?

Descheduler exits with an error

conflicting Schema URL: https://opentelemetry.io/schemas/1.12.0 and https://opentelemetry.io/schemas/1.24.0

Full log:

```
{"ts":1717675604781.581,"caller":"server/secure_serving.go:57","msg":"Forcing use of http/1.1 only","v":0}
{"ts":1717675604782.072,"caller":"dynamiccertificates/named_certificates.go:53","msg":"Loaded SNI cert","v":2,"index":0,"certName":"self-signed loopback","certDetail":"\"apiserver-loopback-client@1717675604\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1717675604\" (2024-06-06 11:06:44 +0000 UTC to 2025-06-06 11:06:44 +0000 UTC (now=2024-06-06 12:06:44.782019599 +0000 UTC))"}
{"ts":1717675604782.1294,"caller":"server/secure_serving.go:213","msg":"Serving securely on [::]:10258","v":0}
{"ts":1717675604782.1575,"caller":"tracing/tracing.go:109","msg":"Enabling trace GRPC client in insecure mode","v":0}
{"ts":1717675604782.2544,"caller":"dynamiccertificates/tlsconfig.go:240","msg":"Starting DynamicServingCertificateController","v":0}
{"ts":1717675604782.679,"caller":"tracing/tracing.go:130","msg":"failed to create traceable resource","err":"2 errors occurred detecting resource:\n\t* conflicting Schema URL: https://opentelemetry.io/schemas/1.12.0 and https://opentelemetry.io/schemas/1.24.0\n\t* user: Current requires cgo or $USER set in environment"}
{"ts":1717675604782.7117,"caller":"app/server.go:96","msg":"descheduler server","err":"2 errors occurred detecting resource:\n\t* conflicting Schema URL: https://opentelemetry.io/schemas/1.12.0 and https://opentelemetry.io/schemas/1.24.0\n\t* user: Current requires cgo or $USER set in environment"}
{"ts":1717675604782.729,"caller":"cli/run.go:74","msg":"command failed","err":"2 errors occurred detecting resource:\n\t* conflicting Schema URL: https://opentelemetry.io/schemas/1.12.0 and https://opentelemetry.io/schemas/1.24.0\n\t* user: Current requires cgo or $USER set in environment"}
```
damemi commented 3 weeks ago

I think this is because our import of the otel semconv package uses 1.12, which is pretty old: https://github.com/kubernetes-sigs/descheduler/blob/748495a022b7445e3bcce0be20effc4890b8a9fc/pkg/tracing/tracing.go#L30C11-L30C51

I opened https://github.com/kubernetes-sigs/descheduler/pull/1429 to bump this to 1.24, which should help. What's annoying is that we will have to keep these in sync, so I'd like to see if there's a way we can drop the direct semconv import. I think it's also used by some of the other otel imports, and the conflict is probably coming from one of those being at a higher version.
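For anyone wondering where the message itself comes from: it matches what the otel Go SDK reports when it merges resources that carry different semconv schema URLs. Below is a minimal sketch of that conflict, not the descheduler's actual code, and the exact error text/wrapping depends on the SDK release:

```go
package main

import (
	"fmt"

	"go.opentelemetry.io/otel/sdk/resource"
	semconv112 "go.opentelemetry.io/otel/semconv/v1.12.0"
	semconv124 "go.opentelemetry.io/otel/semconv/v1.24.0"
)

func main() {
	// Resource built against the 1.12.0 schema, like a direct semconv v1.12.0 import would produce.
	older := resource.NewWithAttributes(semconv112.SchemaURL,
		semconv112.ServiceNameKey.String("descheduler"))

	// Resource built against the 1.24.0 schema, e.g. by a newer otel dependency.
	newer := resource.NewWithAttributes(semconv124.SchemaURL,
		semconv124.ServiceNameKey.String("descheduler"))

	// Merging resources with two different, non-empty schema URLs reports a
	// "conflicting Schema URL" error, which is what shows up in the log above.
	if _, err := resource.Merge(older, newer); err != nil {
		fmt.Println(err)
	}
}
```

Keeping the direct semconv import at the same schema version as the rest of the otel dependencies avoids that merge conflict, which is what the PR above does.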

Also, I'm not sure why this didn't come up in tests. @morremeyer is this a fatal error, or is the descheduler still running with that error? Our otel implementation should be silent if it's not enabled, so an error in this case is bad.

morremeyer commented 3 weeks ago

@damemi it is a fatal error, the descheduler exits directly after the last log line. Once I disable tracing, it works fine.

I also have the problem that if I set the helm values to

```yaml
deschedulerPolicy:
  tracing:
    collectorEndpoint: monitoring-alloy.monitoring.svc.cluster.local:4317
    sampleRate: 1.0
```

It does not even recognize the config. I did not mention this in the initial comment since it also looks like the policy (see initial comment) does not do anything - it's running but no pods are evicted, even ones that look like they match the policy to me. I didn't comment about this because so far I think this is a user error and I'm still looking into why this is happening.

damemi commented 3 weeks ago

@morremeyer thanks. To help debug why your pods aren't being evicted, you can check the docs, which list all the reasons a pod is skipped by default. You can also set --v=4 to increase the logging verbosity and print a message explaining why each pod is skipped.

I'll also update my PR, because I don't think we should be causing fatal errors if tracing fails to start up.

morremeyer commented 3 weeks ago

I agree: if tracing fails to start up, that should be a warning or error log message, and execution should continue.
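Roughly what I have in mind (just a sketch; `setupTracing` is a hypothetical stand-in for the actual tracing initialization in pkg/tracing, not its real signature):

```go
package main

import (
	"context"
	"errors"

	"k8s.io/klog/v2"
)

// setupTracing is a hypothetical stand-in for the descheduler's tracing
// initialization; the real code lives in pkg/tracing and looks different.
func setupTracing(ctx context.Context) error {
	return errors.New("conflicting Schema URL: 1.12.0 and 1.24.0")
}

func main() {
	ctx := context.Background()

	// Suggested behavior: log the tracing failure and keep running,
	// instead of treating it as a fatal startup error.
	if err := setupTracing(ctx); err != nil {
		klog.ErrorS(err, "failed to set up tracing, continuing without traces")
	}

	// ... the rest of the descheduler startup would continue here ...
	klog.InfoS("descheduler starting without tracing")
}
```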

Thanks for looking into this so quickly, I appreciate it a lot!