Using `otelcol.receiver.prometheus` in flow mode leads to "Permanent error: rpc error: code = Unimplemented" #5730

Closed aerfio closed 10 months ago

aerfio commented 10 months ago

What's wrong?

Following the documentation for the otelcol.receiver.prometheus component leads to errors in grafana-agent deployed on a K8s cluster using the newest chart (0.27.2 as of writing this issue):

ts=2023-11-07T17:18:03.068655246Z level=error msg="Exporting failed. The error is not retryable. Dropping data." component=otelcol.exporter.otlp.default error="Permanent error: rpc error: code = Unimplemented desc = unknown service opentelemetry.proto.collector.metrics.v1.MetricsService" dropped_items=243

Steps to reproduce

Install Grafana Agent on a k8s cluster with the provided values.yaml (in which I've set the chart to use a manually created ConfigMap), together with the following ConfigMap:

apiVersion: v1
data:
  config.river: |
    prometheus.scrape "default" {
      // Collect metrics from Grafana Agent's default HTTP listen address.
      targets = [{"__address__" = ""}]

      forward_to = [otelcol.receiver.prometheus.default.receiver]
    }

    otelcol.receiver.prometheus "default" {
      output {
        metrics = [otelcol.exporter.otlp.default.input]
      }
    }

    otelcol.exporter.otlp "default" {
      client {
        endpoint = "REDACTED:4317"
      }
    }

    logging {
      level = "debug"
    }
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: grafana-agent-flow-custom-config
  namespace: grafana-agent

Compared to the documentation example, I've only changed the port in prometheus.scrape from 12345 to 80 (because those metrics are exposed there, I just checked!), redacted the endpoint, and added logging.level=debug. This leads to:

ts=2023-11-07T17:22:14.579136625Z level=info "boringcrypto enabled"=false
ts=2023-11-07T17:22:14.581198873Z level=info msg="starting complete graph evaluation" controller_id="" trace_id=e1374db0e1c1c1ed1efb135bf4d9669e
ts=2023-11-07T17:22:14.581263174Z level=info msg="finished node evaluation" controller_id="" trace_id=e1374db0e1c1c1ed1efb135bf4d9669e node_id=otel duration=5.887µs
ts=2023-11-07T17:22:14.582303293Z level=info msg="finished node evaluation" controller_id="" trace_id=e1374db0e1c1c1ed1efb135bf4d9669e node_id=otelcol.exporter.otlp.default duration=1.010948ms
ts=2023-11-07T17:22:14.582484722Z level=info msg="finished node evaluation" controller_id="" trace_id=e1374db0e1c1c1ed1efb135bf4d9669e node_id=otelcol.receiver.prometheus.default duration=140.479µs
ts=2023-11-07T17:22:14.582535625Z level=info msg="applying non-TLS config to HTTP server" service=http
ts=2023-11-07T17:22:14.582551168Z level=info msg="finished node evaluation" controller_id="" trace_id=e1374db0e1c1c1ed1efb135bf4d9669e node_id=http duration=34.725µs
ts=2023-11-07T17:22:14.582580789Z level=info msg="finished node evaluation" controller_id="" trace_id=e1374db0e1c1c1ed1efb135bf4d9669e node_id=ui duration=9.577µs
ts=2023-11-07T17:22:14.58261835Z level=info msg="finished node evaluation" controller_id="" trace_id=e1374db0e1c1c1ed1efb135bf4d9669e node_id=cluster duration=14.208µs
ts=2023-11-07T17:22:14.583548212Z level=info msg="finished node evaluation" controller_id="" trace_id=e1374db0e1c1c1ed1efb135bf4d9669e node_id=prometheus.scrape.default duration=901.652µs
ts=2023-11-07T17:22:14.583611058Z level=info msg="finished node evaluation" controller_id="" trace_id=e1374db0e1c1c1ed1efb135bf4d9669e node_id=tracing duration=27.244µs
ts=2023-11-07T17:22:14.583705853Z level=info msg="finished node evaluation" controller_id="" trace_id=e1374db0e1c1c1ed1efb135bf4d9669e node_id=logging duration=66.921µs
ts=2023-11-07T17:22:14.583724099Z level=info msg="finished complete graph evaluation" controller_id="" trace_id=e1374db0e1c1c1ed1efb135bf4d9669e duration=2.804427ms
ts=2023-11-07T17:22:14.583747887Z level=debug msg="changing node state" from=viewer to=participant
ts=2023-11-07T17:22:14.583767861Z level=debug msg="grafana-agent-flow-87nv8 @1: participant"
ts=2023-11-07T17:22:14.58387949Z level=info msg="scheduling loaded components and services"
ts=2023-11-07T17:22:14.584327712Z level=info msg="finished node evaluation" controller_id="" node_id=prometheus.scrape.default duration=274.951µs
ts=2023-11-07T17:22:14.584376857Z level=info msg="finished node evaluation" controller_id="" node_id=otelcol.receiver.prometheus.default duration=416.025µs
ts=2023-11-07T17:22:14.584267203Z level=debug msg="scheduling components" component=otelcol.exporter.otlp.default count=3
ts=2023-11-07T17:22:14.584321171Z level=info msg="starting cluster node" peers="" advertise_addr=
ts=2023-11-07T17:22:14.584524942Z level=debug msg="passed new targets to scrape manager" component=prometheus.scrape.default
ts=2023-11-07T17:22:14.584532304Z level=debug msg="grafana-agent-flow-87nv8 @3: participant"
ts=2023-11-07T17:22:14.584751801Z level=info msg="peers changed" new_peers=grafana-agent-flow-87nv8
ts=2023-11-07T17:22:14.585461387Z level=info msg="now listening for http traffic" service=http addr=
ts=2023-11-07T17:22:42.029286763Z level=error msg="Exporting failed. The error is not retryable. Dropping data." component=otelcol.exporter.otlp.default error="Permanent error: rpc error: code = Unimplemented desc = unknown service opentelemetry.proto.collector.metrics.v1.MetricsService" dropped_items=233

System information

K8s v1.26.8+rke2r1

Software version

Grafana Agent v0.34.2


ts=2023-11-07T16:40:03.069011428Z level=error msg="Exporting failed. The error is not retryable. Dropping data." component=otelcol.exporter.otlp.default error="Permanent error: rpc error: code = Unimplemented desc = unknown service opentelemetry.proto.collector.metrics.v1.MetricsService" dropped_items=243
aerfio commented 10 months ago

Oops, this might be an issue with the endpoint on my side. Still, improving the error message might be beneficial: I didn't see it wrapped with a message like "failed to call external service: XYZ", so I assumed it must be an internal grafana-agent error. I'll try to confirm whether the error is on my side; consider this issue on hold for now.
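To illustrate the kind of wrapping I mean, here is a minimal sketch in Python. It is not how grafana-agent builds its messages. The function name `explain_export_error` and its `endpoint` parameter are hypothetical, and the only input it relies on is the error text exactly as it appears in the log line above. The assumption is that "Unimplemented ... unknown service" is reported by the server the exporter is talking to, so the message should point at the endpoint.

```python
import re

# gRPC status text as it appears in the exporter's `error=` field.
_STATUS = re.compile(r"rpc error: code = (?P<code>\w+) desc = (?P<desc>.*)")
# Description used when the requested gRPC service is not known to the server.
_UNKNOWN_SERVICE = re.compile(r"unknown service (?P<service>[\w.]+)")


def explain_export_error(error_text, endpoint):
    """Wrap a raw exporter error so it is clear the remote call failed.

    Hypothetical helper: `endpoint` is whatever the exporter was configured
    to send to and is only used to build the message.
    """
    status = _STATUS.search(error_text)
    if status is None:
        # Not a gRPC status error; return it unchanged.
        return error_text

    prefix = f"failed to call external service at {endpoint}"
    unknown = _UNKNOWN_SERVICE.search(status.group("desc"))
    if status.group("code") == "Unimplemented" and unknown:
        # Assumption: the configured endpoint does not serve this service.
        return (
            f"{prefix}: endpoint does not appear to serve "
            f"{unknown.group('service')}: {error_text}"
        )
    return f"{prefix}: {error_text}"
```

With a message shaped like this, the log would have pointed me at the configured endpoint instead of at the agent itself.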

aerfio commented 10 months ago

OK, ignore that, it was probably an error on my side. Closing this issue then.