Can you run a `helm template` command and post the resulting configmap-agent.yaml?
```yaml
# Source: host-collector/charts/host-collector/templates/configmap-agent.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: host-collector-agent
  labels:
    helm.sh/chart: host-collector-0.18.0
    app.kubernetes.io/name: host-collector
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/version: "0.51.0"
    app.kubernetes.io/managed-by: Helm
data:
  relay: |
    exporters:
      logging: {}
      otlp:
        endpoint: processor.observability-pipeline.svc.cluster.local:4317
        retry_on_failure:
          enabled: true
          initial_interval: 30s
          max_elapsed_time: 120s
          max_interval: 60s
        sending_queue:
          enabled: true
          num_consumers: 10
          queue_size: 10000
        timeout: 120s
        tls:
          insecure: true
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
      host_observer:
        refresh_interval: 10s
      memory_ballast:
        size_in_percentage: 35
        size_mib: "1638"
      pprof:
        block_profile_fraction: 0
        endpoint: localhost:1777
        mutex_profile_fraction: 0
      zpages:
        endpoint: localhost:55679
    processors:
      batch:
        send_batch_max_size: 490
        send_batch_size: 190
        timeout: 200ms
      memory_limiter:
        check_interval: 5s
        limit_mib: 3276
        limit_percentage: 75
        spike_limit_mib: 1024
        spike_limit_percentage: 25
    receivers:
      filelog:
        attributes: {}
        exclude: []
        fingerprint_size: 1kb
        include:
        - /var/log/pods/**/*.log
        - /var/log/syslog
        - /var/log/postgresql/*.*
        - /var/lib/docker/containers/**/*.log
        - /var/ossec/logs/**/*.log
        - /var/www/.pry_history/**/*.*
        include_file_name: false
        include_file_path: true
        max_concurrent_files: 1024
        max_log_size: 1MiB
        operators: []
        poll_interval: 200ms
        resource: {}
        start_at: end
      hostmetrics:
        scrapers:
          cpu: null
      jaeger:
        protocols:
          grpc:
            endpoint: 0.0.0.0:14250
          thrift_compact:
            endpoint: 0.0.0.0:6831
          thrift_http:
            endpoint: 0.0.0.0:14268
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:55681
      prometheus:
        config:
          scrape_configs:
          - job_name: opentelemetry-collector
            scrape_interval: 10s
            static_configs:
            - targets:
              - ${MY_POD_IP}:8888
      statsd:
        aggregation_interval: 60s
        enable_metric_type: false
        endpoint: 0.0.0.0:8126
        timer_histogram_mapping:
        - observer_type: gauge
          statsd_type: histogram
        - observer_type: gauge
          statsd_type: timer
      zipkin:
        endpoint: 0.0.0.0:9411
    service:
      extensions:
      - host_observer
      - health_check
      - pprof
      - zpages
      - memory_ballast
      pipelines:
        logs:
          exporters:
          - otlp
          processors:
          - batch
          - memory_limiter
          receivers:
          - otlp
          - filelog
        metrics:
          exporters:
          - otlp
          processors:
          - batch
          - memory_limiter
          receivers:
          - otlp
          - statsd
        traces:
          exporters:
          - otlp
          processors:
          - batch
          - memory_limiter
          receivers:
          - otlp
          - zipkin
          - jaeger
      telemetry:
        metrics:
          address: 0.0.0.0:8888
```
Interestingly, this is applying defaults I wasn't even aware of... for example, it imposes `limit_mib: 3276` in the memory limiter when I only defined a `limit_percentage` in my config.
The charts don't know how to handle the `limit_percentage` field. They will always try to set a MiB limit: https://github.com/open-telemetry/opentelemetry-helm-charts/issues/208
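To make that concrete, here is the memory_limiter block from your output above, annotated with a guess at where each value comes from (the exact percentages the chart uses internally are an assumption on my part):

```yaml
# Annotated copy of the rendered memory_limiter above; the "computed by the chart"
# notes are an educated guess, assuming a roughly 4096 MiB container memory limit.
memory_limiter:
  check_interval: 5s
  limit_mib: 3276              # computed by the chart from resources.limits.memory
  limit_percentage: 75         # taken from the user-supplied config
  spike_limit_mib: 1024        # computed by the chart from resources.limits.memory
  spike_limit_percentage: 25   # taken from the user-supplied config
```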
Please convert to using `mode` and then try again. Also, take a look at the hostmetrics example: https://github.com/open-telemetry/opentelemetry-helm-charts/tree/main/charts/opentelemetry-collector/examples/daemonset-hostmetrics
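Paraphrasing from memory, that example boils down to values of roughly this shape (check the linked file for the authoritative version, since details like mounted host paths and the scraper list may differ):

```yaml
# Rough paraphrase of the daemonset-hostmetrics example values; treat this as a
# sketch of the shape, not a copy of the file in the repo.
mode: daemonset
config:
  receivers:
    hostmetrics:
      collection_interval: 30s
      scrapers:
        cpu:
        load:
        memory:
        disk:
        filesystem:
  service:
    pipelines:
      metrics:
        receivers:
          - otlp
          - hostmetrics
```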
Hi Tyler, thanks for the suggestion. I followed the instructions and converted to `mode`. I also referenced the example receiver block you linked, but unfortunately I'm still getting the same error after re-deploying. Below is what the helm template output looks like after my updates. Note that for some reason the generated hostmetrics block contains lines like `cpu: null` instead of the way I entered it in the config, which is just `cpu:`. Not sure if this makes any difference in how the config is validated at runtime?
```yaml
# Source: host-collector/charts/host-collector/templates/configmap-agent.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: host-collector-agent
  labels:
    helm.sh/chart: host-collector-0.18.0
    app.kubernetes.io/name: host-collector
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/version: "0.51.0"
    app.kubernetes.io/managed-by: Helm
data:
  relay: |
    exporters:
      logging: {}
      otlp:
        endpoint: processor.observability-pipeline.svc.cluster.local:4317
        retry_on_failure:
          enabled: true
          initial_interval: 30s
          max_elapsed_time: 120s
          max_interval: 60s
        sending_queue:
          enabled: true
          num_consumers: 10
          queue_size: 10000
        timeout: 120s
        tls:
          insecure: true
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
      host_observer:
        refresh_interval: 10s
      memory_ballast:
        size_in_percentage: 35
        size_mib: "1638"
      pprof:
        block_profile_fraction: 0
        endpoint: localhost:1777
        mutex_profile_fraction: 0
      zpages:
        endpoint: localhost:55679
    processors:
      batch:
        send_batch_max_size: 490
        send_batch_size: 190
        timeout: 200ms
      memory_limiter:
        check_interval: 5s
        limit_percentage: 75
        spike_limit_percentage: 25
    receivers:
      filelog:
        attributes: {}
        exclude: []
        fingerprint_size: 1kb
        include:
        - /var/log/pods/**/*.log
        - /var/log/syslog
        - /var/log/postgresql/*.*
        - /var/lib/docker/containers/**/*.log
        - /var/ossec/logs/**/*.log
        - /var/www/.pry_history/**/*.*
        include_file_name: false
        include_file_path: true
        max_concurrent_files: 1024
        max_log_size: 1MiB
        operators: []
        poll_interval: 200ms
        resource: {}
        start_at: end
      hostmetrics:
        scrapers:
          cpu: null
          disk: null
          filesystem: null
      jaeger:
        protocols:
          grpc:
            endpoint: 0.0.0.0:14250
          thrift_compact:
            endpoint: 0.0.0.0:6831
          thrift_http:
            endpoint: 0.0.0.0:14268
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:55681
      prometheus:
        config:
          scrape_configs:
          - job_name: opentelemetry-collector
            scrape_interval: 10s
            static_configs:
            - targets:
              - ${MY_POD_IP}:8888
      statsd:
        aggregation_interval: 60s
        enable_metric_type: false
        endpoint: 0.0.0.0:8126
        timer_histogram_mapping:
        - observer_type: gauge
          statsd_type: histogram
        - observer_type: gauge
          statsd_type: timer
      zipkin:
        endpoint: 0.0.0.0:9411
    service:
      extensions:
      - host_observer
      - health_check
      - pprof
      - zpages
      - memory_ballast
      pipelines:
        logs:
          exporters:
          - otlp
          processors:
          - batch
          - memory_limiter
          receivers:
          - otlp
          - filelog
        metrics:
          exporters:
          - otlp
          processors:
          - batch
          - memory_limiter
          receivers:
          - otlp
          - statsd
          - hostmetrics
        traces:
          exporters:
          - otlp
          processors:
          - batch
          - memory_limiter
          receivers:
          - otlp
          - zipkin
          - jaeger
      telemetry:
        metrics:
          address: 0.0.0.0:8888
```
I suspect that is the issue, but I'm not sure why the command is outputting like that for your install, as that doesn't happen in our GitHub workflow. What version of Helm are you using?
So it looks like it's behaving differently in different environments. The environment that's giving it trouble is our continuous deployment product, which is using Helm v3.7.1+g1d11fcb. When I run it locally on v3.8.0, it seems to work fine. I don't know if that's the cause, or if there's something else going on and the version difference is incidental. Is there anything you're aware of that could impact behavior between those two versions, by chance?
My guess is that it is the Helm version, but I haven't read through any patch notes to confirm. 3.7.1 is from October 2021, and our CI and development are done with >= 3.8. There is an open issue to make the CI test against older Helm versions, but it has not been addressed yet: https://github.com/open-telemetry/opentelemetry-helm-charts/issues/144
I see. Well, we just tried updating our deployment product to use Helm v3.8.0+gd141386 and we're still getting the same error message as I got originally. Any other possible causes that you can think of?

When I look at the ConfigMap in our live cluster, I see:

```yaml
hostmetrics:
  scrapers: {}
```

Meaning it's somehow not reading the config I set up.
It seems that ArgoCD (the continuous delivery platform we're using) doesn't understand the structure:

```yaml
hostmetrics:
  scrapers:
    cpu:
```

It's basically treating that `cpu:` block as null and removing it entirely rather than keeping it in. Is there some other way we can write the OTel configuration so that it's not "empty" like this? We've tried `cpu: null`, `cpu: []`, `cpu: {}`, and `cpu: ""`, and they all throw errors.
Good to hear it's not a Helm version issue.

I am not familiar with ArgoCD. It is strange that it is removing the scrapers, as `cpu:` is valid YAML syntax.

You might try checking the hostmetrics scraper documentation to see what settings each scraper supports by default, and explicitly set one of those just so ArgoCD doesn't think the block is null. Better yet, figuring out why ArgoCD is mistreating that field would be good; lots of OpenTelemetry configuration uses empty objects in YAML.
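For example, something along these lines might keep ArgoCD from pruning the block; the per-scraper keys shown here are an assumption and depend on the scraper and collector version, so double-check them against the hostmetrics receiver docs:

```yaml
# Sketch of a "non-empty" hostmetrics receiver. The filesystem scraper settings are
# illustrative only and should be verified against the docs for your collector version.
receivers:
  hostmetrics:
    collection_interval: 30s       # receiver-level setting, keeps the receiver itself non-empty
    scrapers:
      filesystem:
        exclude_fs_types:          # hypothetical per-scraper setting so the mapping is not empty
          match_type: strict
          fs_types:
            - tmpfs
```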
Yeah, some of the scrapers have parameters that can be set, but I think not all of them (at least, none are documented).
Agreed that there's an Argo issue going on here, which we've also brought to their attention. Thank you so much for your help!
I am also hitting something very similar here and am wondering if anyone ever figured out a workaround. The ConfigMap that gets templated out is missing

```yaml
processors:
  cumulativetodelta:
```

and that then causes the exact same error message as this issue, saying that it's "not configured". We use ArgoCD as well.

@shinkle-procore, was there an Argo issue for this? Did you find any workarounds here? My ConfigMap is definitely missing this section entirely after Helm templates it.
UPDATE:

```yaml
processors:
  cumulativetodelta: {}
```

WORKS.
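In values.yaml terms that ends up looking roughly like the following; only the explicit `{}` is the actual fix, and the pipeline wiring below is just an illustration of where the processor gets referenced, not my exact config:

```yaml
# Illustrative placement of the workaround; the explicit empty mapping on
# cumulativetodelta is what survives templating, the rest is an example.
config:
  processors:
    cumulativetodelta: {}
  service:
    pipelines:
      metrics:
        processors:
          - memory_limiter
          - cumulativetodelta
          - batch
```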
I'm using opentelemetry-collector-contrib version 0.51.0 with opentelemetry-helm-charts v0.18.0, running as a daemonset.

I have the following block set up in `values.yaml` under agentCollector -> configOverride -> receivers. Upon deploying it, I get the following error message:

If I comment out the hostmetrics block, the agent runs with no issues. I've tried many combinations of that config block, adding and removing various scrapers, but nothing works. The rest of my receivers, processors, and exporters work fine; this is the only one giving me trouble right now.
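Roughly, the override has this general shape (a simplified sketch rather than my literal block; the scraper list and pipeline wiring here are illustrative):

```yaml
# Illustrative sketch only, not the literal block from this report. The
# agentCollector.configOverride path is the deprecated pre-"mode" layout of the
# opentelemetry-collector chart; the scraper list and pipeline wiring are examples.
agentCollector:
  configOverride:
    receivers:
      hostmetrics:
        scrapers:
          cpu:
          disk:
          filesystem:
    service:
      pipelines:
        metrics:
          receivers:
            - otlp
            - hostmetrics
```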