grafana / agent

Vendor-neutral programmable observability pipelines.
https://grafana.com/docs/agent/
Apache License 2.0

Using `otelcol.receiver.prometheus` in flow mode leads to "Permanent error: rpc error: code = Unimplemented" #5730

Closed · aerfio closed this issue 10 months ago

aerfio commented 10 months ago

What's wrong?

Following the documentation for the otelcol.receiver.prometheus component leads to errors in Grafana Agent deployed on a K8s cluster using the newest chart (0.27.2 as of writing this issue):

ts=2023-11-07T17:18:03.068655246Z level=error msg="Exporting failed. The error is not retryable. Dropping data." component=otelcol.exporter.otlp.default error="Permanent error: rpc error: code = Unimplemented desc = unknown service opentelemetry.proto.collector.metrics.v1.MetricsService" dropped_items=243

Steps to reproduce

Install Grafana Agent on a k8s cluster with the provided values.yaml (in which I've set the chart to use a manually created ConfigMap), together with the following ConfigMap:

apiVersion: v1
data:
  config.river: |
    prometheus.scrape "default" {
        // Collect metrics from Grafana Agent's default HTTP listen address.
        targets = [{"__address__"   = "127.0.0.1:80"}]

        forward_to = [otelcol.receiver.prometheus.default.receiver]
    }

    otelcol.receiver.prometheus "default" {
      output {
        metrics = [otelcol.exporter.otlp.default.input]
      }
    }

    otelcol.exporter.otlp "default" {
      client {
        endpoint = "REDACTED:4317"
      }
    }
    logging {
      level = "debug"
    }
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: grafana-agent-flow-custom-config
  namespace: grafana-agent

Compared to https://grafana.com/docs/agent/latest/flow/reference/components/otelcol.receiver.prometheus/#example I've only changed the port in prometheus.scrape from 12345 to 80 (because the metrics are exposed there; I just checked), redacted the endpoint, and added logging.level = debug. This leads to:

ts=2023-11-07T17:22:14.579136625Z level=info "boringcrypto enabled"=false
ts=2023-11-07T17:22:14.581198873Z level=info msg="starting complete graph evaluation" controller_id="" trace_id=e1374db0e1c1c1ed1efb135bf4d9669e
ts=2023-11-07T17:22:14.581263174Z level=info msg="finished node evaluation" controller_id="" trace_id=e1374db0e1c1c1ed1efb135bf4d9669e node_id=otel duration=5.887µs
ts=2023-11-07T17:22:14.582303293Z level=info msg="finished node evaluation" controller_id="" trace_id=e1374db0e1c1c1ed1efb135bf4d9669e node_id=otelcol.exporter.otlp.default duration=1.010948ms
ts=2023-11-07T17:22:14.582484722Z level=info msg="finished node evaluation" controller_id="" trace_id=e1374db0e1c1c1ed1efb135bf4d9669e node_id=otelcol.receiver.prometheus.default duration=140.479µs
ts=2023-11-07T17:22:14.582535625Z level=info msg="applying non-TLS config to HTTP server" service=http
ts=2023-11-07T17:22:14.582551168Z level=info msg="finished node evaluation" controller_id="" trace_id=e1374db0e1c1c1ed1efb135bf4d9669e node_id=http duration=34.725µs
ts=2023-11-07T17:22:14.582580789Z level=info msg="finished node evaluation" controller_id="" trace_id=e1374db0e1c1c1ed1efb135bf4d9669e node_id=ui duration=9.577µs
ts=2023-11-07T17:22:14.58261835Z level=info msg="finished node evaluation" controller_id="" trace_id=e1374db0e1c1c1ed1efb135bf4d9669e node_id=cluster duration=14.208µs
ts=2023-11-07T17:22:14.583548212Z level=info msg="finished node evaluation" controller_id="" trace_id=e1374db0e1c1c1ed1efb135bf4d9669e node_id=prometheus.scrape.default duration=901.652µs
ts=2023-11-07T17:22:14.583611058Z level=info msg="finished node evaluation" controller_id="" trace_id=e1374db0e1c1c1ed1efb135bf4d9669e node_id=tracing duration=27.244µs
ts=2023-11-07T17:22:14.583705853Z level=info msg="finished node evaluation" controller_id="" trace_id=e1374db0e1c1c1ed1efb135bf4d9669e node_id=logging duration=66.921µs
ts=2023-11-07T17:22:14.583724099Z level=info msg="finished complete graph evaluation" controller_id="" trace_id=e1374db0e1c1c1ed1efb135bf4d9669e duration=2.804427ms
ts=2023-11-07T17:22:14.583747887Z level=debug msg="changing node state" from=viewer to=participant
ts=2023-11-07T17:22:14.583767861Z level=debug msg="grafana-agent-flow-87nv8 @1: participant"
ts=2023-11-07T17:22:14.58387949Z level=info msg="scheduling loaded components and services"
ts=2023-11-07T17:22:14.584327712Z level=info msg="finished node evaluation" controller_id="" node_id=prometheus.scrape.default duration=274.951µs
ts=2023-11-07T17:22:14.584376857Z level=info msg="finished node evaluation" controller_id="" node_id=otelcol.receiver.prometheus.default duration=416.025µs
ts=2023-11-07T17:22:14.584267203Z level=debug msg="scheduling components" component=otelcol.exporter.otlp.default count=3
ts=2023-11-07T17:22:14.584321171Z level=info msg="starting cluster node" peers="" advertise_addr=127.0.0.1:80
ts=2023-11-07T17:22:14.584524942Z level=debug msg="passed new targets to scrape manager" component=prometheus.scrape.default
ts=2023-11-07T17:22:14.584532304Z level=debug msg="grafana-agent-flow-87nv8 @3: participant"
ts=2023-11-07T17:22:14.584751801Z level=info msg="peers changed" new_peers=grafana-agent-flow-87nv8
ts=2023-11-07T17:22:14.585461387Z level=info msg="now listening for http traffic" service=http addr=0.0.0.0:80
ts=2023-11-07T17:22:42.029286763Z level=error msg="Exporting failed. The error is not retryable. Dropping data." component=otelcol.exporter.otlp.default error="Permanent error: rpc error: code = Unimplemented desc = unknown service opentelemetry.proto.collector.metrics.v1.MetricsService" dropped_items=233

System information

K8s v1.26.8+rke2r1

Software version

Grafana Agent v0.34.2

Configuration

# -- Overrides the chart's name. Used to change the infix in the resource names.
nameOverride: null

# -- Overrides the chart's computed fullname. Used to change the full prefix of
# resource names.
fullnameOverride: null

## Global properties for image pulling override the values defined under `image.registry` and `configReloader.image.registry`.
## If you want to override only one image registry, use the specific fields but if you want to override them all, use `global.image.registry`
global:
  image:
    # -- Global image registry to use if it needs to be overridden for some specific use cases (e.g. local registries, custom images, ...)
    registry: ""

    # -- Optional set of global image pull secrets.
    pullSecrets: []

  # -- Security context to apply to the Grafana Agent pod.
  podSecurityContext: {}

crds:
  # -- Whether to install CRDs for monitoring.
  create: true

# Various agent settings.
agent:
  # -- Mode to run Grafana Agent in. Can be "flow" or "static".
  mode: "flow"
  configMap:
    # -- Create a new ConfigMap for the config file.
    create: false
    # -- Content to assign to the new ConfigMap.  This is passed into `tpl` allowing for templating from values.
    content: ""

    # -- Name of existing ConfigMap to use. Used when create is false.
    name: grafana-agent-flow-custom-config
    # -- Key in ConfigMap to get config from.
    key: config.river

  clustering:
    # -- Deploy agents in a cluster to allow for load distribution. Only
    # applies when agent.mode=flow.
    enabled: false

  # -- Path to where Grafana Agent stores data (for example, the Write-Ahead Log).
  # By default, data is lost between reboots.
  storagePath: /tmp/agent

  # -- Address to listen for traffic on. 0.0.0.0 exposes the UI to other
  # containers.
  listenAddr: 0.0.0.0

  # -- Port to listen for traffic on.
  listenPort: 80

  # --  Base path where the UI is exposed.
  uiPathPrefix: /

  # -- Enables sending Grafana Labs anonymous usage stats to help improve Grafana
  # Agent.
  enableReporting: false

  # -- Extra environment variables to pass to the agent container.
  extraEnv: []

  # -- Maps all the keys on a ConfigMap or Secret as environment variables. https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#envfromsource-v1-core
  envFrom: []

  # -- Extra args to pass to `agent run`: https://grafana.com/docs/agent/latest/flow/reference/cli/run/
  extraArgs: []

  # -- Extra ports to expose on the Agent
  extraPorts:
    - name: grpc-otel
      port: 4317
      targetPort: 4317
      protocol: TCP
    - name: http-otel
      port: 4318
      targetPort: 4318
      protocol: TCP
  # - name: "faro"
  #   port: 12347
  #   targetPort: 12347
  #   protocol: "TCP"

  mounts:
    # -- Mount /var/log from the host into the container for log collection.
    varlog: true
    # -- Mount /var/lib/docker/containers from the host into the container for log
    # collection.
    dockercontainers: true

    # -- Extra volume mounts to add into the Grafana Agent container. Does not
    # affect the watch container.
    extra: []

  # -- Security context to apply to the Grafana Agent container.
  securityContext: {}

  # -- Resource requests and limits to apply to the Grafana Agent container.
  resources: {}

image:
  # -- Grafana Agent image registry (defaults to docker.io)
  registry: "docker.io"
  # -- Grafana Agent image repository.
  repository: grafana/agent
  # -- (string) Grafana Agent image tag. When empty, the Chart's appVersion is
  # used.
  tag: null
  # -- Grafana Agent image's SHA256 digest (either in format "sha256:XYZ" or "XYZ"). When set, will override `image.tag`.
  digest: null
  # -- Grafana Agent image pull policy.
  pullPolicy: IfNotPresent
  # -- Optional set of image pull secrets.
  pullSecrets: []

rbac:
  # -- Whether to create RBAC resources for the agent.
  create: true

serviceAccount:
  # -- Whether to create a service account for the Grafana Agent deployment.
  create: true
  # -- Annotations to add to the created service account.
  annotations: {}
  # -- The name of the existing service account to use when
  # serviceAccount.create is false.
  name: null

# Options for the extra controller used for config reloading.
configReloader:
  # -- Enables automatically reloading when the agent config changes.
  enabled: true
  image:
    # -- Config reloader image registry (defaults to docker.io)
    registry: "docker.io"
    # -- Repository to get config reloader image from.
    repository: jimmidyson/configmap-reload
    # -- Tag of image to use for config reloading.
    tag: v0.8.0
    # -- SHA256 digest of image to use for config reloading (either in format "sha256:XYZ" or "XYZ"). When set, will override `configReloader.image.tag`
    digest: ""
  # -- Override the args passed to the container.
  customArgs: []
  # -- Resource requests and limits to apply to the config reloader container.
  resources:
    requests:
      cpu: "1m"
      memory: "5Mi"
  # -- Security context to apply to the Grafana configReloader container.
  securityContext: {}

controller:
  # -- Type of controller to use for deploying Grafana Agent in the cluster.
  # Must be one of 'daemonset', 'deployment', or 'statefulset'.
  type: "daemonset"

  # -- Number of pods to deploy. Ignored when controller.type is 'daemonset'.
  replicas: 1

  # -- Whether to deploy pods in parallel. Only used when controller.type is
  # 'statefulset'.
  parallelRollout: true

  # -- Configures Pods to use the host network. When set to true, the ports that will be used must be specified.
  hostNetwork: false

  # -- Configures Pods to use the host PID namespace.
  hostPID: false

  # -- Configures the DNS policy for the pod. https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-policy
  dnsPolicy: ClusterFirst

  # -- Update strategy for updating deployed Pods.
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 100%

  # -- nodeSelector to apply to Grafana Agent pods.
  nodeSelector: {}

  # -- Tolerations to apply to Grafana Agent pods.
  tolerations: []

  # -- priorityClassName to apply to Grafana Agent pods.
  priorityClassName: ""

  # -- Extra pod annotations to add.
  podAnnotations: {}

  # -- Extra pod labels to add.
  podLabels: {}

  # -- Whether to enable automatic deletion of stale PVCs due to a scale down operation, when controller.type is 'statefulset'.
  enableStatefulSetAutoDeletePVC: false

  autoscaling:
    # -- Creates a HorizontalPodAutoscaler for controller type deployment.
    enabled: false
    # -- The lower limit for the number of replicas to which the autoscaler can scale down.
    minReplicas: 1
    # -- The upper limit for the number of replicas to which the autoscaler can scale up.
    maxReplicas: 5
    # -- Average CPU utilization across all relevant pods, a percentage of the requested value of the resource for the pods. Setting `targetCPUUtilizationPercentage` to 0 will disable CPU scaling.
    targetCPUUtilizationPercentage: 0
    # -- Average Memory utilization across all relevant pods, a percentage of the requested value of the resource for the pods. Setting `targetMemoryUtilizationPercentage` to 0 will disable Memory scaling.
    targetMemoryUtilizationPercentage: 80

  # -- Affinity configuration for pods.
  affinity: {}

  volumes:
    # -- Extra volumes to add to the Grafana Agent pod.
    extra: []

  # -- volumeClaimTemplates to add when controller.type is 'statefulset'.
  volumeClaimTemplates: []

  ## -- Additional init containers to run.
  ## ref: https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
  ##
  initContainers: []

service:
  # -- Creates a Service for the controller's pods.
  enabled: true
  # -- Service type
  type: ClusterIP
  # -- Cluster IP, can be set to None, empty "" or an IP address
  clusterIP: ""
  annotations:
    {}
    # cloud.google.com/load-balancer-type: Internal

serviceMonitor:
  enabled: false
  # -- Additional labels for the service monitor.
  additionalLabels: {}
  # -- Scrape interval. If not set, the Prometheus default scrape interval is used.
  interval: ""
  # -- MetricRelabelConfigs to apply to samples after scraping, but before ingestion.
  # ref: https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#relabelconfig
  metricRelabelings: []
  # - action: keep
  #   regex: 'kube_(daemonset|deployment|pod|namespace|node|statefulset).+'
  #   sourceLabels: [__name__]

  # -- RelabelConfigs to apply to samples before scraping
  # ref: https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#relabelconfig
  relabelings: []
  # - sourceLabels: [__meta_kubernetes_pod_node_name]
  #   separator: ;
  #   regex: ^(.*)$
  #   targetLabel: nodename
  #   replacement: $1
  #   action: replace

ingress:
  # -- Enables ingress for the agent (faro port)
  enabled: false
  # For Kubernetes >= 1.18 you should specify the ingress-controller via the field ingressClassName
  # See https://kubernetes.io/blog/2020/04/02/improvements-to-the-ingress-api-in-kubernetes-1.18/#specifying-the-class-of-an-ingress
  # ingressClassName: nginx
  # Values can be templated
  annotations:
    {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  labels: {}
  path: /
  faroPort: 12347

  # pathType is only for k8s >= 1.18
  pathType: Prefix

  hosts:
    - chart-example.local
  ## Extra paths to prepend to every host configuration. This is useful when working with annotation based services.
  extraPaths: []
  # - path: /*
  #   backend:
  #     serviceName: ssl-redirect
  #     servicePort: use-annotation
  ## Or for k8s > 1.19
  # - path: /*
  #   pathType: Prefix
  #   backend:
  #     service:
  #       name: ssl-redirect
  #       port:
  #         name: use-annotation

  tls: []
  #  - secretName: chart-example-tls
  #    hosts:
  #      - chart-example.local

Logs

ts=2023-11-07T16:40:03.069011428Z level=error msg="Exporting failed. The error is not retryable. Dropping data." component=otelcol.exporter.otlp.default error="Permanent error: rpc error: code = Unimplemented desc = unknown service opentelemetry.proto.collector.metrics.v1.MetricsService" dropped_items=243
aerfio commented 10 months ago

Diff between my values.yaml and the values.yaml from chart@0.27.2 (mine in red, the original chart's values in green):

[screenshot of the values.yaml diff]
aerfio commented 10 months ago

Oops, this might be an issue with the endpoint on my side. Still, improving the error message might be beneficial: I didn't see it wrapped with a message like "failed to call external service: XYZ", so I assumed it must be an internal grafana-agent error. I'll try to confirm whether the error is on my side; consider this issue on hold for now.

aerfio commented 10 months ago

OK, ignore that, it was probably an error on my side. Closing this issue then.
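
For anyone who lands on the same error: a gRPC Unimplemented response with "unknown service opentelemetry.proto.collector.metrics.v1.MetricsService" generally means the configured endpoint is reachable over gRPC but does not register the OTLP metrics service, so the problem is the target rather than the agent. A minimal sketch of what the exporter expects, with hypothetical hostnames:

otelcol.exporter.otlp "default" {
  client {
    // Hypothetical address: must be an OTLP/gRPC server that implements
    // opentelemetry.proto.collector.metrics.v1.MetricsService,
    // e.g. an OpenTelemetry Collector listening on 4317.
    endpoint = "otel-collector.observability.svc.cluster.local:4317"
  }
}

// If the target only speaks OTLP over HTTP (typically port 4318),
// otelcol.exporter.otlphttp is the matching exporter instead.
otelcol.exporter.otlphttp "default_http" {
  client {
    endpoint = "http://otel-collector.observability.svc.cluster.local:4318"
  }
}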