open-telemetry / opentelemetry-helm-charts

OpenTelemetry Helm Charts
https://opentelemetry.io
Apache License 2.0

Hostmetrics causes "invalid configuration" error despite being correctly configured #222

Closed: shinkle-procore closed this issue 2 years ago

shinkle-procore commented 2 years ago

I'm using opentelemetry-collector-contrib version 0.51.0 with opentelemetry-helm-charts v0.18.0, running as a daemonset.

I have the following block set up in values.yaml under agentCollector -> configOverride -> receivers:

        # Host Metrics Receiver
        hostmetrics:
          scrapers:
            cpu:
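
For reference, the full values.yaml nesting looks roughly like this (a sketch of the agentCollector override path described above; treat the exact keys as chart-version-dependent):

agentCollector:
  configOverride:
    receivers:
      hostmetrics:
        scrapers:
          cpu: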

Upon deploying it, I get the following error message:

2022/06/06 20:20:09 collector server run finished with error: failed to get config: invalid configuration: 
receiver "hostmetrics" has invalid configuration: must specify at least one scraper when using hostmetrics 
receiver

If I comment out the hostmetrics block, the agent runs with no issues.

I've tried many combinations of that config block, adding and removing various scrapers, but nothing works. The rest of my receivers, processors, and exporters work fine; this is the only one giving me trouble right now.

TylerHelmuth commented 2 years ago

Can you run a template command and post the resulting configmap-agent.yaml?

shinkle-procore commented 2 years ago

# Source: host-collector/charts/host-collector/templates/configmap-agent.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: host-collector-agent
  labels:
    helm.sh/chart: host-collector-0.18.0
    app.kubernetes.io/name: host-collector
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/version: "0.51.0"
    app.kubernetes.io/managed-by: Helm
data:
  relay: |
    exporters:
      logging: {}
      otlp:
        endpoint: processor.observability-pipeline.svc.cluster.local:4317
        retry_on_failure:
          enabled: true
          initial_interval: 30s
          max_elapsed_time: 120s
          max_interval: 60s
        sending_queue:
          enabled: true
          num_consumers: 10
          queue_size: 10000
        timeout: 120s
        tls:
          insecure: true
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
      host_observer:
        refresh_interval: 10s
      memory_ballast:
        size_in_percentage: 35
        size_mib: "1638"
      pprof:
        block_profile_fraction: 0
        endpoint: localhost:1777
        mutex_profile_fraction: 0
      zpages:
        endpoint: localhost:55679
    processors:
      batch:
        send_batch_max_size: 490
        send_batch_size: 190
        timeout: 200ms
      memory_limiter:
        check_interval: 5s
        limit_mib: 3276
        limit_percentage: 75
        spike_limit_mib: 1024
        spike_limit_percentage: 25
    receivers:
      filelog:
        attributes: {}
        exclude: []
        fingerprint_size: 1kb
        include:
        - /var/log/pods/**/*.log
        - /var/log/syslog
        - /var/log/postgresql/*.*
        - /var/lib/docker/containers/**/*.log
        - /var/ossec/logs/**/*.log
        - /var/www/.pry_history/**/*.*
        include_file_name: false
        include_file_path: true
        max_concurrent_files: 1024
        max_log_size: 1MiB
        operators: []
        poll_interval: 200ms
        resource: {}
        start_at: end
      hostmetrics:
        scrapers:
          cpu: null
      jaeger:
        protocols:
          grpc:
            endpoint: 0.0.0.0:14250
          thrift_compact:
            endpoint: 0.0.0.0:6831
          thrift_http:
            endpoint: 0.0.0.0:14268
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:55681
      prometheus:
        config:
          scrape_configs:
          - job_name: opentelemetry-collector
            scrape_interval: 10s
            static_configs:
            - targets:
              - ${MY_POD_IP}:8888
      statsd:
        aggregation_interval: 60s
        enable_metric_type: false
        endpoint: 0.0.0.0:8126
        timer_histogram_mapping:
        - observer_type: gauge
          statsd_type: histogram
        - observer_type: gauge
          statsd_type: timer
      zipkin:
        endpoint: 0.0.0.0:9411
    service:
      extensions:
      - host_observer
      - health_check
      - pprof
      - zpages
      - memory_ballast
      pipelines:
        logs:
          exporters:
          - otlp
          processors:
          - batch
          - memory_limiter
          receivers:
          - otlp
          - filelog
        metrics:
          exporters:
          - otlp
          processors:
          - batch
          - memory_limiter
          receivers:
          - otlp
          - statsd
        traces:
          exporters:
          - otlp
          processors:
          - batch
          - memory_limiter
          receivers:
          - otlp
          - zipkin
          - jaeger
      telemetry:
        metrics:
          address: 0.0.0.0:8888

shinkle-procore commented 2 years ago

Interestingly, this is applying defaults I wasn't even aware of... for example it imposes limit_mib: 3276 in the memory limiter when I only defined a limit_percentage in my config.

TylerHelmuth commented 2 years ago

Interestingly, this is applying defaults I wasn't even aware of... for example it imposes limit_mib: 3276 in the memory limiter when I only defined a limit_percentage in my config.

The charts don't know how to handle the limit_percentage field. It will always try to set a mib limit: https://github.com/open-telemetry/opentelemetry-helm-charts/issues/208
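
For reference, a percentage-only memory_limiter block on the collector side looks like this (a sketch using the percentages from your config; the chart may still inject limit_mib as described above):

processors:
  memory_limiter:
    check_interval: 5s
    limit_percentage: 75
    spike_limit_percentage: 25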

TylerHelmuth commented 2 years ago

Please convert to using "mode" and then try again. Also, take a look at the hostmetrics example: https://github.com/open-telemetry/opentelemetry-helm-charts/tree/main/charts/opentelemetry-collector/examples/daemonset-hostmetrics
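
For example, the mode-based values look roughly like this (a sketch based on the linked example; exact keys and defaults can vary by chart version, so check the example for the full setup, including any host path mounts some scrapers need):

mode: daemonset
config:
  receivers:
    hostmetrics:
      scrapers:
        cpu:
        disk:
        filesystem:
  service:
    pipelines:
      metrics:
        receivers:
          - otlp
          - hostmetrics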

shinkle-procore commented 2 years ago

Hi Tyler, thanks for the suggestion. I did follow the instructions and convert to "mode". I also referenced the example receiver block you linked, but unfortunately I'm still getting the same error after re-deploying. Below is what the helm template output looks like after my updates. Note that, for some reason, the generated hostmetrics block contains lines like cpu: null instead of the way I entered it in the config, which is just cpu:. I'm not sure if this makes any difference in how the config is validated at runtime.

# Source: host-collector/charts/host-collector/templates/configmap-agent.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: host-collector-agent
  labels:
    helm.sh/chart: host-collector-0.18.0
    app.kubernetes.io/name: host-collector
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/version: "0.51.0"
    app.kubernetes.io/managed-by: Helm
data:
  relay: |
    exporters:
      logging: {}
      otlp:
        endpoint: processor.observability-pipeline.svc.cluster.local:4317
        retry_on_failure:
          enabled: true
          initial_interval: 30s
          max_elapsed_time: 120s
          max_interval: 60s
        sending_queue:
          enabled: true
          num_consumers: 10
          queue_size: 10000
        timeout: 120s
        tls:
          insecure: true
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
      host_observer:
        refresh_interval: 10s
      memory_ballast:
        size_in_percentage: 35
        size_mib: "1638"
      pprof:
        block_profile_fraction: 0
        endpoint: localhost:1777
        mutex_profile_fraction: 0
      zpages:
        endpoint: localhost:55679
    processors:
      batch:
        send_batch_max_size: 490
        send_batch_size: 190
        timeout: 200ms
      memory_limiter:
        check_interval: 5s
        limit_percentage: 75
        spike_limit_percentage: 25
    receivers:
      filelog:
        attributes: {}
        exclude: []
        fingerprint_size: 1kb
        include:
        - /var/log/pods/**/*.log
        - /var/log/syslog
        - /var/log/postgresql/*.*
        - /var/lib/docker/containers/**/*.log
        - /var/ossec/logs/**/*.log
        - /var/www/.pry_history/**/*.*
        include_file_name: false
        include_file_path: true
        max_concurrent_files: 1024
        max_log_size: 1MiB
        operators: []
        poll_interval: 200ms
        resource: {}
        start_at: end
      hostmetrics:
        scrapers:
          cpu: null
          disk: null
          filesystem: null
      jaeger:
        protocols:
          grpc:
            endpoint: 0.0.0.0:14250
          thrift_compact:
            endpoint: 0.0.0.0:6831
          thrift_http:
            endpoint: 0.0.0.0:14268
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:55681
      prometheus:
        config:
          scrape_configs:
          - job_name: opentelemetry-collector
            scrape_interval: 10s
            static_configs:
            - targets:
              - ${MY_POD_IP}:8888
      statsd:
        aggregation_interval: 60s
        enable_metric_type: false
        endpoint: 0.0.0.0:8126
        timer_histogram_mapping:
        - observer_type: gauge
          statsd_type: histogram
        - observer_type: gauge
          statsd_type: timer
      zipkin:
        endpoint: 0.0.0.0:9411
    service:
      extensions:
      - host_observer
      - health_check
      - pprof
      - zpages
      - memory_ballast
      pipelines:
        logs:
          exporters:
          - otlp
          processors:
          - batch
          - memory_limiter
          receivers:
          - otlp
          - filelog
        metrics:
          exporters:
          - otlp
          processors:
          - batch
          - memory_limiter
          receivers:
          - otlp
          - statsd
          - hostmetrics
        traces:
          exporters:
          - otlp
          processors:
          - batch
          - memory_limiter
          receivers:
          - otlp
          - zipkin
          - jaeger
      telemetry:
        metrics:
          address: 0.0.0.0:8888

TylerHelmuth commented 2 years ago

I suspect that is the issue, but I'm not sure why the command is producing that output for your install, since it doesn't happen in our GitHub workflow. What version of Helm are you using?

shinkle-procore commented 2 years ago

So it looks like it's behaving differently in different environments. The environment that's having trouble is our continuous deployment product, which is using Helm v3.7.1+g1d11fcb. When I run it locally on v3.8.0, it seems to work fine. I don't know if that's the cause, or if there's something else going on and the version difference is incidental. Is there anything you're aware of that could impact behavior between those two versions, by chance?

TylerHelmuth commented 2 years ago

My guess is that it is the Helm version, but I haven't read through any patch notes to confirm. 3.7.1 is from October 2021, and our CI and development are done with >= 3.8. There is an open issue to make the CI test with older Helm versions, but it has not been addressed yet: https://github.com/open-telemetry/opentelemetry-helm-charts/issues/144

shinkle-procore commented 2 years ago

I see. Well, we just tried updating our deployment product to use Helm v3.8.0+gd141386, and we're still getting the same error message I got originally. Are there any other possible causes you can think of?

shinkle-procore commented 2 years ago

When I look at the ConfigMap in our live cluster, I see:

hostmetrics:
  scrapers: {}

Meaning it's somehow not reading the config I set up.

shinkle-procore commented 2 years ago

It seems that ArgoCD (the continuous delivery platform we're using) doesn't understand the structure:

hostmetrics:
  scrapers:
    cpu:

It's basically treating that cpu: block as null and removing it entirely rather than keeping it in. Is there some other way we can write the OTel configuration so that it's not "empty" like this?

We've tried cpu: null, cpu: [], cpu: {}, and cpu: "" and they all throw errors.

TylerHelmuth commented 2 years ago

Good to hear it's not a Helm version issue.

I am not familiar with ArgoCD. It is strange that it is removing the scrapers, as "cpu:" is valid YAML syntax.

You might try checking the hostmetrics scraper configuration to see what settings each scraper uses by default, and explicitly setting one of those so ArgoCD doesn't think the value is null.

Better yet, it would be good to figure out why ArgoCD is mistreating that field; lots of OpenTelemetry configs use empty objects in their YAML.
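
For example, a sketch of that idea using settings documented for the hostmetrics receiver (verify the exact options against the receiver README for your collector version):

hostmetrics:
  collection_interval: 10s
  scrapers:
    filesystem:
      exclude_fs_types:
        fs_types:
          - tmpfs
        match_type: strict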

shinkle-procore commented 2 years ago

Yeah, some of the scrapers have parameters that can be set, but I don't think all of them do (at least, none are documented for the rest).

Agreed that there's an Argo issue going on here, which we've also brought to their attention. Thank you so much for your help!

lorelei-rupp-imprivata commented 1 month ago

I am also hitting something very similar here and am wondering if anyone ever figured out a workaround. The ConfigMap that gets templated out is missing:

processors:
  cumulativetodelta:

and then that causes the exact same error message as this issue, saying it's "not configured".

We use ArgoCD as well.

@shinkle-procore was there an Argo issue filed for this? Did you find any workarounds? My ConfigMap is definitely missing this section entirely after Helm templates it.

UPDATE

processors:
  cumulativetodelta: {}

WORKS
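
In the chart's config override, that pattern looks roughly like this (a sketch; the pipeline wiring here is an assumption on my part, not part of the report above):

config:
  processors:
    cumulativetodelta: {}
  service:
    pipelines:
      metrics:
        processors:
          - cumulativetodelta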