grafana / agent

Vendor-neutral programmable observability pipelines.
https://grafana.com/docs/agent/
Apache License 2.0

Flow: grafana-agent deployed with helm unable to collect node telemetry (prometheus.exporter.unix) #4207

Closed pcecot closed 1 year ago

pcecot commented 1 year ago

Discussed in https://github.com/grafana/agent/discussions/3470

Originally posted by **pcecot** April 6, 2023

Agent version: v0.32.1
Helm chart: grafana-agent-0.10.0

Collecting node telemetry using `prometheus.exporter.unix` does not work. I see the following in the container logs:

```
ts=2023-04-06T08:12:50.628942484Z level=error component=prometheus.exporter.unix msg="collector failed" name=systemd duration_seconds=0.000108774 err="couldn't get dbus connection: dial unix /run/systemd/private: connect: no such file or directory"
ts=2023-04-06T08:12:50.634454633Z level=error component=prometheus.exporter.unix collector=ethtool msg="ethtool link info error" err="no such device" device=docker0 errno=19
ts=2023-04-06T08:12:50.63446389Z level=error component=prometheus.exporter.unix collector=ethtool msg="ethtool driver info error" err="no such device" device=docker0 errno=19
ts=2023-04-06T08:12:50.634470465Z level=error component=prometheus.exporter.unix collector=ethtool msg="ethtool stats error" err="no such device" device=docker0 errno=19
ts=2023-04-06T08:12:50.634484609Z level=error component=prometheus.exporter.unix collector=ethtool msg="ethtool link info error" err="no such device" device=ens192 errno=19
ts=2023-04-06T08:12:50.634489309Z level=error component=prometheus.exporter.unix collector=ethtool msg="ethtool driver info error" err="no such device" device=ens192 errno=19
ts=2023-04-06T08:12:50.634494074Z level=error component=prometheus.exporter.unix collector=ethtool msg="ethtool stats error" err="no such device" device=ens192 errno=19
```

Configuration used:

```terraform
prometheus.exporter.unix {
  set_collectors = ["cpu", "disk", "ethtool", "systemd"]
  procfs_path    = "/host/proc"
  sysfs_path     = "/host/sys"
  rootfs_path    = "/host/root"
}
```

Helm chart values:

```yaml
controller:
  securityContext:
    privileged: true
    runAsUser: 0
  volumes:
    extra:
      - name: rootfs
        hostPath:
          path: /
      - name: sysfs
        hostPath:
          path: /sys
      - name: procfs
        hostPath:
          path: /proc
agent:
  mounts:
    dockercontainers: true
    extra:
      - name: rootfs
        mountPath: /host/root
        readOnly: true
      - name: sysfs
        mountPath: /host/sys
        readOnly: true
      - name: procfs
        mountPath: /host/proc
        readOnly: true
```

Any suggestions on what I'm missing?

I went back and tested this again on the latest agent/Helm chart versions and the problem is still there. When running grafana-agent in Flow mode, the unix exporter is unable to scrape host metrics, while a standalone node_exporter on the same Kubernetes cluster scrapes them just fine.

tpaschalis commented 1 year ago

I tried reproducing with the most recent Helm chart, your `values.yaml`, and your configuration, but I got no error messages; when I port-forwarded the pod and hit `/api/v0/component/prometheus.exporter.unix/metrics`, I could see the node_exporter metrics as expected.

When you `kubectl exec` into the pods, can you see `/host/{sys,proc,root}` mounted correctly? Is it possible that you're, e.g., running Kubernetes on Windows nodes?
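
A quick sketch of that check; the namespace and pod name here are placeholders for whatever your Helm release created:

```shell
# Find the agent pod created by the Helm release (namespace is an assumption).
kubectl get pods -n default | grep grafana-agent

# Check that the host filesystems are visible inside the container;
# <agent-pod-name> is a placeholder.
kubectl exec -n default <agent-pod-name> -- ls /host/proc /host/sys /host/root
```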

pcecot commented 1 year ago

@tpaschalis thank you for checking this. It is a Kubernetes cluster running on Ubuntu.

The mounts look fine:

```
root@grafana-agent-flow-bcnvb:/# ls /host/
proc  root  sys
```

When I port-forward, I do see some metrics, but not the ones from the underlying host (Ubuntu). In fact, focusing on ethtool, they are empty:

```
...
# TYPE node_scrape_collector_duration_seconds gauge
node_scrape_collector_duration_seconds{collector="cpu"} 0.000476708
node_scrape_collector_duration_seconds{collector="diskstats"} 0.000235677
node_scrape_collector_duration_seconds{collector="ethtool"} 0.005354028
node_scrape_collector_duration_seconds{collector="mountstats"} 0.001216449
node_scrape_collector_duration_seconds{collector="systemd"} 8.7709e-05
# HELP node_scrape_collector_success node_exporter: Whether a collector succeeded.
# TYPE node_scrape_collector_success gauge
node_scrape_collector_success{collector="cpu"} 1
node_scrape_collector_success{collector="diskstats"} 1
node_scrape_collector_success{collector="ethtool"} 1
node_scrape_collector_success{collector="mountstats"} 1
node_scrape_collector_success{collector="systemd"} 0
# HELP promhttp_metric_handler_errors_total Total number of internal errors encountered by the promhttp metric handler.
# TYPE promhttp_metric_handler_errors_total counter
promhttp_metric_handler_errors_total{cause="encoding"} 0
promhttp_metric_handler_errors_total{cause="gathering"} 0
```

For example, with ethtool it seems that the agent detects which interfaces it should scrape, but when it does, it ends up with:

```
collector=ethtool msg="ethtool stats error" err="no such device" device=ens192 errno=19
```

Can you please confirm that you can see metrics from the underlying OS the Kubernetes node is running on? For example, `node_ethtool_ucast_bytes_transmitted`:


```
# HELP node_ethtool_ucast_bytes_transmitted Network interface ucast bytes tx
# TYPE node_ethtool_ucast_bytes_transmitted untyped
node_ethtool_ucast_bytes_transmitted{device="ens192"} 3.3931913632e+10
```
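
A sketch of how to run the same check on another cluster, assuming the agent's default HTTP listen port of 12345 and the component path mentioned earlier in this thread:

```shell
# Forward the agent's HTTP port (12345 is the assumed default listen port).
kubectl port-forward pod/grafana-agent-flow-bcnvb 12345:12345 &

# Look for host-level ethtool metrics in the exporter's output.
curl -s http://localhost:12345/api/v0/component/prometheus.exporter.unix/metrics \
  | grep node_ethtool_ucast_bytes_transmitted
```
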
rfratto commented 1 year ago

```
ts=2023-04-06T08:12:50.628942484Z level=error component=prometheus.exporter.unix msg="collector failed" name=systemd duration_seconds=0.000108774 err="couldn't get dbus connection: dial unix /run/systemd/private: connect: no such file or directory"
```

Based on the error message here, it looks like `/run/systemd` needs to be mounted from the host into the container. That would be a requirement we weren't aware of, since the node_exporter documentation doesn't mention it either 🤔
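
A minimal sketch of what that extra mount could look like, following the same Helm values layout as in the original report (untested; the in-container mount path is an assumption):

```yaml
controller:
  volumes:
    extra:
      # Expose the host's systemd runtime directory (where /run/systemd/private lives).
      - name: run-systemd
        hostPath:
          path: /run/systemd
agent:
  mounts:
    extra:
      # Mount it at the same path the collector dials; readOnly is omitted here
      # because the systemd collector connects to a socket under this directory.
      - name: run-systemd
        mountPath: /run/systemd
```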

pcecot commented 1 year ago

I have checked this further, and it could be related to the following node_exporter flag, which in Flow mode is started with its default value:

```
--path.udev.data="/run/udev/data"    udev data path.
```

I think we should be starting it with the following:

```
--path.udev.data=/host/root/run/udev/data
```

Can we add a new argument `udev_path` so that we can override the default?
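
For illustration, this is roughly how that could look in the Flow configuration; the `udev_path` argument name is just the one proposed here and did not exist at the time of writing:

```terraform
prometheus.exporter.unix {
  set_collectors = ["cpu", "disk", "ethtool", "systemd"]
  procfs_path    = "/host/proc"
  sysfs_path     = "/host/sys"
  rootfs_path    = "/host/root"

  // Hypothetical argument proposed in this issue; it would map to node_exporter's
  // --path.udev.data flag so the host's udev database can be read.
  udev_path      = "/host/root/run/udev/data"
}
```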

rfratto commented 1 year ago

Ah, thanks for investigating.

> Can we add a new argument `udev_path` so that we can override the default?

Yeah, this sounds like a reasonable addition to work around the issue here 👍