grafana/agent

Vendor-neutral programmable observability pipelines.
https://grafana.com/docs/agent/
Apache License 2.0

When using -config.expand-env flag, the agent stops sending metrics to Grafana Cloud #691

Closed · brpaz closed this issue 3 years ago

brpaz commented 3 years ago

Hello. I have deployed the Grafana Agent into a Kubernetes cluster and I am pushing metrics to Grafana Cloud. I was trying to remove the credentials stored in the config file (mounted as a ConfigMap), and found out that I could use environment variables, which the agent would replace at runtime with the correct values.

I followed the instructions to set the -config.expand-env=true flag, but after setting it the agent just stops sending metrics to Grafana Cloud. I haven't changed anything else. If I remove the flag, the metrics start appearing again:

[graph]

I don't see any errors in the agent pod logs.

I am probably making a very basic mistake, but I can't figure out what.

Here is my deployment manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana-agent
  namespace: monitoring
spec:
  minReadySeconds: 10
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: grafana-agent
  template:
    metadata:
      labels:
        name: grafana-agent
    spec:
      containers:
        - args:
            - -config.file=/etc/agent/agent.yaml
            - -config.expand-env=true
          command:
            - /bin/agent
          image: grafana/agent:v0.16.1
          imagePullPolicy: IfNotPresent
          name: agent
          ports:
            - containerPort: 12345
              name: http-metrics
          volumeMounts:
            - mountPath: /etc/agent
              name: grafana-agent
          resources:
            limits:
              cpu: 200m
              memory: 256Mi
      serviceAccount: grafana-agent
      volumes:
        - configMap:
            name: grafana-agent
          name: grafana-agent
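
The idea, once expand-env works, is to inject the credentials into the container from a Secret, roughly like this (a sketch; the Secret name grafana-cloud-credentials and its keys are placeholders):

env:
  - name: GRAFANA_CLOUD_USERNAME
    valueFrom:
      secretKeyRef:
        name: grafana-cloud-credentials # placeholder Secret name
        key: username
  - name: GRAFANA_CLOUD_PASSWORD
    valueFrom:
      secretKeyRef:
        name: grafana-cloud-credentials
        key: password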

And the agent config (without any environment variable interpolation yet):

server:
  http_listen_port: 12345
prometheus:
  wal_directory: /tmp/grafana-agent-wal
  global:
    scrape_interval: 15s
    external_labels:
      cluster: cloud
  configs:
    - name: integrations
      remote_write:
        - url: https://prometheus-us-central1.grafana.net/api/prom/push
          basic_auth:
            username: <redacted>
            password: <redacted>
      scrape_configs:
        - job_name: integrations/go
          static_configs:
            - targets: ["localhost:8080"]
        - job_name: integrations/kubernetes/cadvisor
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          kubernetes_sd_configs:
            - role: node
          metric_relabel_configs:
            - action: drop
              regex: container_([a-z_]+);
              source_labels:
                - __name__
                - image
            - action: drop
              regex: container_(network_tcp_usage_total|network_udp_usage_total|tasks_state|cpu_load_average_10s)
              source_labels:
                - __name__
          relabel_configs:
            - replacement: kubernetes.default.svc.cluster.local:443
              target_label: __address__
            - regex: (.+)
              replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
              source_labels:
                - __meta_kubernetes_node_name
              target_label: __metrics_path__
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: false
            server_name: kubernetes
        - job_name: integrations/kubernetes/kubelet
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          kubernetes_sd_configs:
            - role: node
          relabel_configs:
            - replacement: kubernetes.default.svc.cluster.local:443
              target_label: __address__
            - regex: (.+)
              replacement: /api/v1/nodes/${1}/proxy/metrics
              source_labels:
                - __meta_kubernetes_node_name
              target_label: __metrics_path__
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: false
            server_name: kubernetes
          metric_relabel_configs:
            - source_labels: ["__name__"]
              regex: "apiserver_request_total|kubelet_node_config_error|kubelet_runtime_operations_errors_total|container_cpu_usage_seconds_total|kube_statefulset_status_replicas|kube_statefulset_status_replicas_ready|node_namespace_pod_container:container_memory_swap|kubelet_runtime_operations_total|kube_statefulset_metadata_generation|node_cpu_seconds_total|kube_pod_container_resource_limits_cpu_cores|node_namespace_pod_container:container_memory_cache|kubelet_pleg_relist_duration_seconds_bucket|scheduler_binding_duration_seconds_bucket|container_network_transmit_bytes_total|kube_pod_container_resource_requests_memory_bytes|namespace_workload_pod:kube_pod_owner:relabel|kube_statefulset_status_observed_generation|process_resident_memory_bytes|container_network_receive_packets_dropped_total|kubelet_running_containers|kubelet_pod_worker_duration_seconds_bucket|scheduler_binding_duration_seconds_count|scheduler_volume_scheduling_duration_seconds_bucket|container_network_transmit_packets_total|rest_client_request_duration_seconds_bucket|node_namespace_pod_container:container_memory_rss|container_cpu_cfs_throttled_periods_total|kubelet_volume_stats_capacity_bytes|kubelet_volume_stats_inodes_used|cluster_quantile:apiserver_request_duration_seconds:histogram_quantile|kube_node_status_allocatable_memory_bytes|container_memory_cache|go_goroutines|kubelet_runtime_operations_duration_seconds_bucket|kube_statefulset_replicas|kube_pod_owner|rest_client_requests_total|container_memory_swap|node_namespace_pod_container:container_memory_working_set_bytes|storage_operation_errors_total|scheduler_e2e_scheduling_duration_seconds_bucket|container_network_transmit_packets_dropped_total|kube_pod_container_resource_limits_memory_bytes|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate|storage_operation_duration_seconds_count|node_netstat_TcpExt_TCPSynRetrans|node_netstat_Tcp_OutSegs|container_cpu_cfs_periods_total|kubelet_pod_start_duration_seconds_count|kubeproxy_network_programming_duration_seconds_count|container_network_receive_bytes_total|node_netstat_Tcp_RetransSegs|up|storage_operation_duration_seconds_bucket|kubelet_cgroup_manager_duration_seconds_count|kubelet_volume_stats_available_bytes|scheduler_scheduling_algorithm_duration_seconds_bucket|kube_statefulset_status_replicas_current|code_resource:apiserver_request_total:rate5m|kube_statefulset_status_replicas_updated|process_cpu_seconds_total|kube_pod_container_resource_requests_cpu_cores|kubelet_pod_worker_duration_seconds_count|kubelet_cgroup_manager_duration_seconds_bucket|kubelet_pleg_relist_duration_seconds_count|kubeproxy_sync_proxy_rules_duration_seconds_bucket|container_memory_usage_bytes|workqueue_adds_total|container_network_receive_packets_total|container_memory_working_set_bytes|kube_resourcequota|kubelet_running_pods|kubelet_volume_stats_inodes|kubeproxy_sync_proxy_rules_duration_seconds_count|scheduler_scheduling_algorithm_duration_seconds_count|apiserver_request:availability30d|container_memory_rss|kubelet_pleg_relist_interval_seconds_bucket|scheduler_e2e_scheduling_duration_seconds_count|scheduler_volume_scheduling_duration_seconds_count|workqueue_depth|:node_memory_MemAvailable_bytes:sum|volume_manager_total_volumes|kube_node_status_allocatable_cpu_cores"
              action: "keep"

integrations:
  prometheus_remote_write:
    - url: https://prometheus-us-central1.grafana.net/api/prom/push
      basic_auth:
        username: <redacted>
        password: <redacted>

  node_exporter:
    enabled: true

loki:
  configs:
    - name: integrations
      clients:
        - url: https://logs-prod-us-central1.grafana.net/api/prom/push
          basic_auth:
            username: <redacted>
            password: <redacted>
          external_labels:
            cluster: cloud
      positions:
        filename: /tmp/positions.yaml
      target_config:
        sync_period: 10s
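
For reference, the interpolated config I was going for would swap the credentials for environment variables, something like this sketch (GRAFANA_CLOUD_USERNAME and GRAFANA_CLOUD_PASSWORD being whatever variables get injected into the pod):

remote_write:
  - url: https://prometheus-us-central1.grafana.net/api/prom/push
    basic_auth:
      username: ${GRAFANA_CLOUD_USERNAME} # expanded at load time by -config.expand-env
      password: ${GRAFANA_CLOUD_PASSWORD}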

Agent logs:

ts=2021-06-27T15:37:04.5936372Z level=info agent=prometheus component=cluster msg="applying config"
ts=2021-06-27T15:37:04.594078399Z level=info agent=prometheus component=cluster msg="not watching the KV, none set"
ts=2021-06-27T15:37:04Z level=info msg="Tempo Logger Initialized" component=tempo
ts=2021-06-27T15:37:04.779508389Z level=info integration=node_exporter msg="Enabled node_exporter collectors"
ts=2021-06-27T15:37:04.779572771Z level=info integration=node_exporter collector=arp
ts=2021-06-27T15:37:04.779578944Z level=info integration=node_exporter collector=bcache
ts=2021-06-27T15:37:04.779583251Z level=info integration=node_exporter collector=bonding
ts=2021-06-27T15:37:04.77958749Z level=info integration=node_exporter collector=btrfs
ts=2021-06-27T15:37:04.779591646Z level=info integration=node_exporter collector=conntrack
ts=2021-06-27T15:37:04.779595771Z level=info integration=node_exporter collector=cpu
ts=2021-06-27T15:37:04.779599578Z level=info integration=node_exporter collector=cpufreq
ts=2021-06-27T15:37:04.779606148Z level=info integration=node_exporter collector=diskstats
ts=2021-06-27T15:37:04.779610056Z level=info integration=node_exporter collector=edac
ts=2021-06-27T15:37:04.779614024Z level=info integration=node_exporter collector=entropy
ts=2021-06-27T15:37:04.77961777Z level=info integration=node_exporter collector=filefd
ts=2021-06-27T15:37:04.779621676Z level=info integration=node_exporter collector=filesystem
ts=2021-06-27T15:37:04.77962562Z level=info integration=node_exporter collector=hwmon
ts=2021-06-27T15:37:04.779629592Z level=info integration=node_exporter collector=infiniband
ts=2021-06-27T15:37:04.779633424Z level=info integration=node_exporter collector=ipvs
ts=2021-06-27T15:37:04.779637435Z level=info integration=node_exporter collector=loadavg
ts=2021-06-27T15:37:04.779643439Z level=info integration=node_exporter collector=mdadm
ts=2021-06-27T15:37:04.779652438Z level=info integration=node_exporter collector=meminfo
ts=2021-06-27T15:37:04.779656001Z level=info integration=node_exporter collector=netclass
ts=2021-06-27T15:37:04.779659761Z level=info integration=node_exporter collector=netdev
ts=2021-06-27T15:37:04.779671235Z level=info integration=node_exporter collector=netstat
ts=2021-06-27T15:37:04.779675542Z level=info integration=node_exporter collector=nfs
ts=2021-06-27T15:37:04.779679262Z level=info integration=node_exporter collector=nfsd
ts=2021-06-27T15:37:04.77968321Z level=info integration=node_exporter collector=powersupplyclass
ts=2021-06-27T15:37:04.779686926Z level=info integration=node_exporter collector=pressure
ts=2021-06-27T15:37:04.77969277Z level=info integration=node_exporter collector=rapl
ts=2021-06-27T15:37:04.77969643Z level=info integration=node_exporter collector=schedstat
ts=2021-06-27T15:37:04.779700267Z level=info integration=node_exporter collector=sockstat
ts=2021-06-27T15:37:04.779703982Z level=info integration=node_exporter collector=softnet
ts=2021-06-27T15:37:04.779707832Z level=info integration=node_exporter collector=stat
ts=2021-06-27T15:37:04.77971153Z level=info integration=node_exporter collector=textfile
ts=2021-06-27T15:37:04.779715391Z level=info integration=node_exporter collector=thermal_zone
ts=2021-06-27T15:37:04.779719031Z level=info integration=node_exporter collector=time
ts=2021-06-27T15:37:04.779722889Z level=info integration=node_exporter collector=timex
ts=2021-06-27T15:37:04.779728556Z level=info integration=node_exporter collector=udp_queues
ts=2021-06-27T15:37:04.779732698Z level=info integration=node_exporter collector=uname
ts=2021-06-27T15:37:04.779736167Z level=info integration=node_exporter collector=vmstat
ts=2021-06-27T15:37:04.77973982Z level=info integration=node_exporter collector=xfs
ts=2021-06-27T15:37:04.779743327Z level=info integration=node_exporter collector=zfs
ts=2021-06-27T15:37:04.787651358Z level=info agent=prometheus msg="could not dynamically update instance, will manually restart" instance=a35f27c720395f2d3c98bca2d0a70437 reason="cannot dynamically update because instance is not running"
ts=2021-06-27T15:37:04.800077475Z level=info agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 msg="replaying WAL, this may take a while" dir=/tmp/grafana-agent-wal/a35f27c720395f2d3c98bca2d0a70437/wal
ts=2021-06-27T15:37:04.800356778Z level=info agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 msg="WAL segment loaded" segment=0 maxSegment=0
ts=2021-06-27T15:37:04.800749477Z level=info agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 component="discovery manager" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2021-06-27T15:37:04.802707292Z agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 component=remote level=info remote_name=a35f27-4a6185 url=https://prometheus-us-central1.grafana.net/api/prom/push msg="Starting WAL watcher" queue=a35f27-4a6185
ts=2021-06-27T15:37:04.802738993Z agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 component=remote level=info remote_name=a35f27-4a6185 url=https://prometheus-us-central1.grafana.net/api/prom/push msg="Starting scraped metadata watcher"
ts=2021-06-27T15:37:04.802905841Z agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 component=remote level=info remote_name=a35f27-4a6185 url=https://prometheus-us-central1.grafana.net/api/prom/push msg="Replaying WAL" queue=a35f27-4a6185
ts=2021-06-27T15:37:04.803048221Z level=info agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 msg="stopping truncation loop..."
ts=2021-06-27T15:37:04.803058528Z level=info agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 msg="stopping scrape manager..."
ts=2021-06-27T15:37:04.80306393Z level=info agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 msg="closing storage..."
ts=2021-06-27T15:37:04.803386856Z level=info agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 msg="truncation loop stopped"
ts=2021-06-27T15:37:04.803433028Z level=info agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 msg="discovery manager stopped"
ts=2021-06-27T15:37:04.803469138Z level=info agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 msg="stopping discovery manager..."
ts=2021-06-27T15:37:04.803483147Z level=info agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 msg="scrape manager stopped"
ts=2021-06-27T15:37:04.876514031Z agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 component=remote level=info remote_name=a35f27-4a6185 url=https://prometheus-us-central1.grafana.net/api/prom/push msg="Stopping remote storage..."
ts=2021-06-27T15:37:04.876676454Z agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 component=remote level=info remote_name=a35f27-4a6185 url=https://prometheus-us-central1.grafana.net/api/prom/push msg="WAL watcher stopped" queue=a35f27-4a6185
ts=2021-06-27T15:37:04.876711712Z agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 component=remote level=info remote_name=a35f27-4a6185 url=https://prometheus-us-central1.grafana.net/api/prom/push msg="Stopping metadata watcher..."
ts=2021-06-27T15:37:04.876749604Z agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 component=remote level=info remote_name=a35f27-4a6185 url=https://prometheus-us-central1.grafana.net/api/prom/push msg="Scraped metadata watcher stopped"
ts=2021-06-27T15:37:04.877369485Z agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 component=remote level=info remote_name=a35f27-4a6185 url=https://prometheus-us-central1.grafana.net/api/prom/push msg="Remote storage stopped."
ts=2021-06-27T15:37:04.877679917Z level=info agent=prometheus msg="stopped instance" instance=a35f27c720395f2d3c98bca2d0a70437
ts=2021-06-27T15:37:04.878540233Z level=info msg="server configuration changed, restarting server"
ts=2021-06-27T15:37:04.879022511Z level=info caller=server.go:245 http=[::]:12345 grpc=[::]:9095 msg="server listening on addresses"
ts=2021-06-27T15:37:04.982377894Z level=info agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 msg="replaying WAL, this may take a while" dir=/tmp/grafana-agent-wal/a35f27c720395f2d3c98bca2d0a70437/wal
ts=2021-06-27T15:37:04.98305391Z level=info agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 msg="WAL segment loaded" segment=0 maxSegment=1
ts=2021-06-27T15:37:04.983189578Z level=info agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 msg="WAL segment loaded" segment=1 maxSegment=1
ts=2021-06-27T15:37:04.983537439Z level=info agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 component="discovery manager" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2021-06-27T15:37:04.985583407Z agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 component=remote level=info remote_name=a35f27-4a6185 url=https://prometheus-us-central1.grafana.net/api/prom/push msg="Starting WAL watcher" queue=a35f27-4a6185
ts=2021-06-27T15:37:04.985610331Z agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 component=remote level=info remote_name=a35f27-4a6185 url=https://prometheus-us-central1.grafana.net/api/prom/push msg="Starting scraped metadata watcher"
ts=2021-06-27T15:37:04.986067766Z agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 component=remote level=info remote_name=a35f27-4a6185 url=https://prometheus-us-central1.grafana.net/api/prom/push msg="Replaying WAL" queue=a35f27-4a6185
ts=2021-06-27T15:37:13.618436363Z agent=prometheus instance=a35f27c720395f2d3c98bca2d0a70437 component=remote level=info remote_name=a35f27-4a6185 url=https://prometheus-us-central1.grafana.net/api/prom/push msg="Done replaying WAL" duration=8.632425248s
rfratto commented 3 years ago

Hi, the issue is likely in the default configs for collecting metrics from Kubernetes. -config.expand-env will replace anything of the form ${<something>}, which includes the ${1} references in the relabel configs used to collect metrics from nodes:

              replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
              replacement: /api/v1/nodes/${1}/proxy/metrics

Each ${1} will be replaced with the empty string instead of the node name, which breaks those scrape jobs. Change the instances of ${1} to $1, restart the Agent (or call /-/reload if you're running >0.14.0), and you should start to see metrics again.
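
For example, the cadvisor rule from your config would become:

relabel_configs:
  - regex: (.+)
    replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor # $1 instead of ${1}
    source_labels:
      - __meta_kubernetes_node_name
    target_label: __metrics_path__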

This has confused me a couple of times too :)

rfratto commented 3 years ago

(You can also use $${1} in the config to escape it from being expanded, but that would stop working if you turned off -config.expand-env. $1 works for both, as long as the character after the 1 isn't alphanumeric, like the slash in this example.)
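
Side by side, with -config.expand-env=true both forms end up producing the same replacement string:

replacement: /api/v1/nodes/$1/proxy/metrics    # "/" after the 1 isn't alphanumeric, so $1 is left alone
replacement: /api/v1/nodes/$${1}/proxy/metrics # $${1} is unescaped to ${1} by expand-env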

brpaz commented 3 years ago

@rfratto Ah, that makes sense. I will test that over the weekend and close this issue if it works.

Thank you.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.