fix(agent): Explicitly naming the prometheus metrics port in the agent daemonset

yoderme commented 3 months ago

In order for the prometheus operator to scrape agent health metrics
from the agent, the port exposing these metrics must be named in the
daemonset's podspec. This commit will provide that port name when
agent.sysdig.settings.prometheus_exporter.enabled is true. It can
also account for non-default ports.

Testing: added a unit test; manually edited a daemonset in a live
cluster to ensure that the prometheus operator really does pick up the
metrics using a PodMonitor.

github-actions[bot] commented 3 months ago

Hi @yoderme. Thanks for your PR.

After inspecting your changes someone with write access to this repo needs to approve and run the workflow.

yoderme commented 3 months ago

Since I can't create a github issue I'll put all my info and rationale here.

I have enabled the agent prometheus exporter as per the instructions in https://docs.sysdig.com/en/docs/installation/configuration/sysdig-agent/agent-health/#collect-agent-health-metrics. This meant setting

agent:
  sysdig:
    settings:
      prometheus_exporter:
        enabled: true
        export_health_metrics: true

I know for a fact that port 9544 is open and exposing metrics:

❯ kubectl -n sysdig exec sysdig-agent-5dz5q -- curl -s http://localhost:9544/metrics | grep ^sysdig                                                                   ⎈ c048-rks
sysdig_agent_analyzer_dropped_evts 0
sysdig_agent_analyzer_num_evts 8673.2
sysdig_agent_connected 1
…

Our k8s cluster uses the prometheus operator to collect metrics. https://github.com/prometheus-operator/prometheus-operator . The way one would use this to scrape metrics is to create a ServiceMonitor or PodMonitor object - the prometheus operator will see those and magically start scraping metrics. Since a ServiceMonitor will look at a service, and a PodMonitor looks at a pod, and since there’s no k8s Service for this port, I’m trying to use PodMonitor. Check out the docs: https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#monitoring.coreos.com/v1.PodMonitor

Note especially the PodMetricsEndpoint https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#monitoring.coreos.com/v1.PodMetricsEndpoint. The “port” description is “Name of the Pod port which this endpoint refers to.” This requires an explicit name. It turns out that “targetPort” whose description is “Name or number of the target port of the Pod object behind the Service, the port must be specified with container port property. Deprecated: use ‘port’ instead.” Also wants the port to be explicitly named.

But I tried anyway using targetPort:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: sysdig-agent
  namespace: sysdig
spec:
  selector:
    matchLabels:
      app: sysdig-agent
  namespaceSelector:
    matchNames:
      - sysdig
  podMetricsEndpoints:
  - targetPort: 9544
    path: /metrics
    interval: 30s

When I look in the prometheus pod’s UI I see “podMonitor/sysdig/sysdig-agent/0 (0 / 73 active targets)” so I know for a fact that prometheus sees the thing and is failing to scrape it.

So … taking matters into my own hands, I edited the agent daemonset to add

>           ports:
>             - containerPort: 9544
>               name: metrics

To the sysdig-agent container. Then I created

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: sysdig-agent
  namespace: sysdig
spec:
  selector:
    matchLabels:
      app: sysdig-agent
  namespaceSelector:
    matchNames:
      - sysdig
  podMetricsEndpoints:
  - port: metrics
    path: /metrics
    interval: 30s

Once this was done, data started to flow into prometheus and then into our Grafana UI. Success!

aroberts87 commented 2 months ago

Hello, @yoderme! Thank you for the PR and very thorough write-up! We're taking a look at this now. 🙂

AlbertoBarba commented 2 months ago

@yoderme we identified an issue related to GHA and access to secrets from external contributors. In order to speed up the resolution of this, i'm going to open a PR myself to include these changes.

yoderme commented 2 months ago

@yoderme we identified an issue related to GHA and access to secrets from external contributors. In order to speed up the resolution of this, i'm going to open a PR myself to include these changes.

Hey that's great - thanks for looking at this so quickly. If you don't mind, I'd love to get a notification when this is released and available for me to officially use. Thanks!

AlbertoBarba commented 2 months ago

@yoderme Sure. I did already open the PR and it's going through the GHA checks. Once the pipeline complete we'll merge and immediately release a new version of the chart

AlbertoBarba commented 2 months ago

@yoderme should be fixed in https://github.com/sysdiglabs/charts/pull/1869 (chart available soon), thx for your contribution!

AlbertoBarba commented 2 months ago

@yoderme The fix is now available in sysdig-deploy version 1.61.5. Thanks again for your contribution!

sysdiglabs / charts

fix(agent): Explicitly naming the prometheus metrics port in the agent daemonset #1865