open-telemetry / opentelemetry-collector

OpenTelemetry Collector

Partial success responses from Collector endpoints. #9408

Closed: cobar79 closed this 9 months ago

cobar79 commented 9 months ago

Describe the bug
I am sure it is some kind of configuration issue with the Collector, but I can't get any diagnostics out of the Collector. The best I have been able to do is scrape the Java Agent logs and post the captured payloads to the Collector's HTTP port.

Posting to the logs, traces, and metrics endpoints (http://localhost:4318/v1/logs, http://localhost:4318/v1/traces, http://localhost:4318/v1/metrics) all return 200 OK with this payload:

{
    "partialSuccess": {}
}

There is no feedback from the Java Agent, just the OTLP payload logs.

Steps to reproduce
Using the defaults for the Java Agent:

OTEL_SERVICE_NAME=ingest
OTEL_RESOURCE_ATTRIBUTES=service.namespace\=osint,deployment.environment\=local
OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_LOGS_EXPORTER=logging-otlp
OTEL_METRICS_EXPORTER=logging-otlp
OTEL_TRACES_EXPORTER=logging-otlp
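
For reference (as I understand the agent's exporter selection, so treat this as an assumption): with OTEL_LOGS_EXPORTER, OTEL_METRICS_EXPORTER, and OTEL_TRACES_EXPORTER all set to logging-otlp, the agent only writes OTLP JSON to the application log and does not send anything to OTEL_EXPORTER_OTLP_ENDPOINT. Selecting the default otlp exporters is roughly what would ship data to the Collector over gRPC:

OTEL_LOGS_EXPORTER=otlp
OTEL_METRICS_EXPORTER=otlp
OTEL_TRACES_EXPORTER=otlp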

Start the application and capture the payloads (OtlpJsonLoggingMetricExporter, OtlpJsonLoggingSpanExporter, OtlpJsonLoggingLogRecordExporter):

{
  "resource": {
    "attributes": [
      {
        "key": "deployment.environment",
        "value": {
          "stringValue": "local"
        }
      },
      {
        "key": "host.arch",
        "value": {
          "stringValue": "amd64"
        }
      },
      {
        "key": "os.description",
        "value": {
          "stringValue": "Windows 10 10.0"
        }
      },
      {
        "key": "process.runtime.description",
        "value": {
          "stringValue": "Amazon.com Inc. OpenJDK 64-Bit Server VM 17.0.7+7-LTS"
        }
      },
      {
        "key": "service.name",
        "value": {
          "stringValue": "ingest"
        }
      },
      {
        "key": "service.namespace",
        "value": {
          "stringValue": "osint"
        }
      },
      {
        "key": "telemetry.auto.version",
        "value": {
          "stringValue": "1.32.0"
        }
      },
      {
        "key": "telemetry.sdk.language",
        "value": {
          "stringValue": "java"
        }
      },
      {
        "key": "telemetry.sdk.name",
        "value": {
          "stringValue": "opentelemetry"
        }
      },
      {
        "key": "telemetry.sdk.version",
        "value": {
          "stringValue": "1.32.0"
        }
      }
    ]
  },
  "scopeMetrics": [
    {
      "scope": {
        "name": "io.opentelemetry.sdk.logs",
        "attributes": []
      },
      "metrics": [
        {
          "name": "processedLogs",
          "description": "The number of logs processed by the BatchLogRecordProcessor. [dropped=true if they were dropped due to high throughput]",
          "unit": "1",
          "sum": {
            "dataPoints": [
              {
                "startTimeUnixNano": "1706308587457273500",
                "timeUnixNano": "1706308647477301900",
                "asInt": "121",
                "exemplars": [],
                "attributes": [
                  {
                    "key": "dropped",
                    "value": {
                      "boolValue": false
                    }
                  },
                  {
                    "key": "processorType",
                    "value": {
                      "stringValue": "BatchLogRecordProcessor"
                    }
                  }
                ]
              }
            ]
          }
        }
      ]
    },
    {
      "scope": {
        "name": "io.opentelemetry.micrometer-1.5",
        "attributes": []
      },
      "metrics": [
        {
          "name": "jvm.gc.live.data.size",
          "description": "Size of long-lived heap memory pool after reclamation",
          "unit": "bytes",
          "gauge": {
            "dataPoints": [
              {
                "startTimeUnixNano": "1706308587457273500",
                "timeUnixNano": "1706308647477301900",
                "asDouble": 0.0,
                "exemplars": [],
                "attributes": []
              }
            ]
          }
        },
        {
          "name": "jvm.threads.states",
          "description": "The current number of threads",
          "unit": "threads",
          "gauge": {
            "dataPoints": [
            ]
          }
        }
      ]
    }
  ],
  "schemaUrl": "https://opentelemetry.io/schemas/1.21.0"
}

Ran in Postman. This is the curl for the endpoints, with payloads taken from the client logs:

curl --location 'http://localhost:4318/v1/logs' \
  --header 'Content-Type: application/json' \
  --data-raw '{ "resource": {"attributes": [.....]} }'

Result: HTTP status 200 OK with this body:

{  "partialSuccess": {} }

What did you expect to see? Some type of response from the Collector when using the Java Agent over the gRPC protocol, some type of error log in the Collector, or some type of error from the curl request.

What did you see instead? Nothing in Grafana for Tempo, Prometheus, or the Loki logs.

Collector logs. Not sure these are related, since they are at INFO level and come from zapgrpc:

lient: \"POST /api/v2/spans HTTP/\""    {"grpc_log": true}
2024-01-26 11:32:09 2024-01-26T18:32:09.739Z    info    zapgrpc/zapgrpc.go:178  [core] [Server #1] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams received bogus greeting from client: \"POST /api/v2/spans HTTP/\"" {"grpc_log": true}
2024-01-26 14:36:44 2024-01-26T21:36:44.322Z    info    zapgrpc/zapgrpc.go:178  [transport] [server-transport 0xc000e4d040] Closing: connection error: desc = "transport: http2Server.HandleStreams received bogus greeting from client: \"POST /api/v2/logs HTTP/1\""    {"grpc_log": true}
2024-01-26 14:36:44 2024-01-26T21:36:44.322Z    info    zapgrpc/zapgrpc.go:178  [core] [Server #1] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams received bogus greeting from client: \"POST /api/v2/logs HTTP/1\"" {"grpc_log": true}
2024-01-26 14:39:03 2024-01-26T21:39:03.649Z    info    zapgrpc/zapgrpc.go:178  [transport] [server-transport 0xc000e4d040] Closing: connection error: desc = "transport: http2Server.HandleStreams received bogus greeting from client: \"POST /api/v2/logs HTTP/1\""    {"grpc_log": true}
2024-01-26 14:39:03 2024-01-26T21:39:03.649Z    info    zapgrpc/zapgrpc.go:178  [core] [Server #1] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams received bogus greeting from client: \"POST /api/v2/logs HTTP/1\"" {"grpc_log": true}
2024-01-26 14:48:45 2024-01-26T21:48:45.878Z    info    zapgrpc/zapgrpc.go:178  [transport] [server-transport 0xc000e4d040] Closing: connection error: desc = "transport: http2Server.HandleStreams received bogus greeting from client: \"POST /api/v2/spans HTTP/\""    {"grpc_log": true}
2024-01-26 14:48:45 2024-01-26T21:48:45.878Z    info    zapgrpc/zapgrpc.go:178  [core] [Server #1] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams received bogus greeting from client: \"POST /api/v2/spans HTTP/\"" {"grpc_log": true}
2024-01-26 14:56:21 2024-01-26T21:56:21.866Z    info    zapgrpc/zapgrpc.go:178  [transport] [server-transport 0xc000e4d040] Closing: connection error: desc = "transport: http2Server.HandleStreams received bogus greeting from client: \"POST /api/v2/metrics HTT\""    {"grpc_log": true}
2024-01-26 14:56:21 2024-01-26T21:56:21.866Z    info    zapgrpc/zapgrpc.go:178  [core] [Server #1] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams received bogus greeting from client: \"POST /api/v2/metrics HTT\"" {"grpc_log": true}
2024-01-27 09:07:24 2024-01-27T16:07:24.351Z    info    zapgrpc/zapgrpc.go:178  [transport] [client-transport 0xc001000d80] Closing: connection error: desc = "error reading from server: read tcp 192.168.192.8:40624->192.168.192.4:4317: read: connection reset by peer"       {"grpc_log": true}

What version did you use? Spring Boot 3.2.0; opentelemetry-javaagent version 1.32.0.

What config did you use? Collector:

extensions:
  health_check:
  pprof:
  zpages:

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-prom-col'
          scrape_interval: 10s
          static_configs:
            - targets: [ '0.0.0.0:8889' ]
          metric_relabel_configs:
            - source_labels: [ __name__ ]
              regex: '.*grpc_io.*'
              action: drop

processors:
  batch:
    send_batch_max_size: 1000
    send_batch_size: 100
    timeout: 10s

######### From processor, export to Logging, tracing and/or metrics servers
exporters:
  logging:
    verbosity: detailed
  #Metrics to Prometheus
  prometheus:
    endpoint: "0.0.0.0:8889"
    const_labels:
      label1: osint
  # tracing to tempo
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
  # logging to loki
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

# The Collector pipeline.
service:
  telemetry:
    logs:
      level: debug
      development: true
      sampling:
        initial: 10
        thereafter: 5
      output_paths:
        - stdout
      error_output_paths:
        - stderr
    metrics:
      level: detailed

  extensions: [health_check, pprof, zpages]
  pipelines:
    # for now we are only interested in metrics...
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    traces:
      receivers: [ otlp ]
      processors: [ batch ]
      exporters: [otlp]  # name here should match the exporter name for tempo which is otlp
    logs:
      receivers: [otlp]
      exporters: [loki, logging]

Prometheus:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']  #docker-compose prometheus port

Tempo:

#https://github.com/LordMaduz/spring-boot-3-open-telemetey-series/blob/main/tempo/tempo.yml
#https://github.com/ahmadalammar/Spring-Observability/blob/master/docker/tempo/tempo.yml
server:
  http_listen_port: 3200

distributor:
  receivers:                           # this configuration will listen on all ports and protocols that tempo is capable of.
    zipkin:
    jaeger:                            # the receivers all come from the OpenTelemetry collector.  more configuration information can
      protocols:                       # be found there: https://github.com/open-telemetry/opentelemetry-collector/tree/main/receiver
        thrift_http:                   #
        grpc:                          # for a production deployment you should only enable the receivers you need!
        thrift_binary:
        thrift_compact:
    otlp:
      protocols:
        http:
        grpc:

ingester:
  max_block_duration: 5m               # cut the headblock when this much time passes. this is being set for demo purposes and should probably be left alone normally

compactor:
  compaction:
    block_retention: 1h                # overall Tempo trace retention. set for demo purposes

metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: docker-compose
  storage:
    path: /tmp/tempo/generator/wal

storage:
  trace:
    backend: local                     # backend configuration to use
    wal:
      path: /tmp/tempo/wal             # where to store the wal locally
    local:
      path: /tmp/tempo/blocks

overrides:
  metrics_generator_processors: [service-graphs, span-metrics] # enables metrics generator

Environment
Windows 10, JDK 17, Go 1.21.6

docker compose:

version: '3'

services:

  loki:
    container_name: loki
    image: grafana/loki:latest
    command: [ "-config.file=/etc/loki/local-config.yaml" ]
    ports:
      - "3100:3100"

  prometheus:
    container_name: prometheus
    image: prom/prometheus:latest
    volumes:
      - ${OSI_DOCKER_ROOT}\prometheus\prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    ports:
      - "9090:9090"

  tempo:
    container_name: tempo
    image: grafana/tempo:latest
    command: [ "-config.file=/etc/tempo.yml" ]
    volumes:
      - ${OSI_DOCKER_ROOT}\tempo\tempo.yml:/etc/tempo.yml:ro
      - ${OSI_DOCKER_ROOT}\tempo\tempo-data:/tmp/tempo
    ports:
      - "3200:3200"  # Tempo see tempo.yml http_listen_port
      - "4317"  # otlp grpc

  otel-collector:
    container_name: opentelemetry-collector
    image: otel/opentelemetry-collector-contrib:latest
    restart: always
    command:
      - "--config=/etc/otel/config.yml"
    volumes:
      - ${OSI_DOCKER_ROOT}\collector\otel-collector-config.yml:/etc/otel/config.yml
    ports:
      - 1888:1888 # pprof extension
      - 8889:8889 # Prometheus metrics exposed by the Collector
      - 8890:8890 # Prometheus exporter metrics
      - 13133:13133 # health_check extension
      - 4317:4317 # OTLP gRPC receiver
      - 4318:4318 # OTLP http receiver
      - 55679:55679 # zpages extension
    depends_on:
      - prometheus
      - tempo
      - loki

  grafana:
    container_name: grafana
    image: grafana/grafana:latest
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
      - GF_AUTH_DISABLE_LOGIN_FORM=true
    volumes:
      - ./grafana:/etc/grafana/provisioning/datasources:ro
    ports:
      - 3000:3000
    depends_on:
      - prometheus
      - tempo
      - loki
      - otel-collector

Additional context

TylerHelmuth commented 9 months ago

@cobar79 I'm confused about what specific behavior you're expecting. For OTLP over HTTP, a full success should return a 200 status code with the body being an empty ExportLogsServiceResponse object: https://github.com/open-telemetry/opentelemetry-proto/blob/b5c1a7882180a26bb7794594e8546798ecb68103/opentelemetry/proto/collector/logs/v1/logs_service.proto#L45-L62.

What error were you expecting?

cobar79 commented 9 months ago

@TylerHelmuth None of the OTLP requests from the application ever made it to the backend, and there were no logs indicating success or failure. So I thought I would capture the payload, try the HTTP endpoints, and see what kind of failure happened (BAD_REQUEST, NOT_FOUND, INTERNAL_SERVER_ERROR). However, all the HTTP requests with the same payload as the OTLP requests returned OK with "Partial Success", which I read as meaning something failed. Was it received (probably yes, since OK)? Was it a bad payload (probably not, since OK)? Did it fail in the pipeline, did it fail to connect to the backend, or did the backend reject it? I also turned up the debugging level on the Collector but didn't see any debug messages, just an unrelated "zapgrpc" failure at the INFO level.

TylerHelmuth commented 9 months ago

In the JSON response you shared:

{
    "partialSuccess": {}
}

partialSuccess is empty, which means there was no partial success: "The server MUST leave the partial_success field unset in case of a successful response."
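
For contrast, a response that actually reported a partial success would come back with the object populated, along these lines (the count and message here are purely illustrative, following the OTLP/JSON field naming):

{
    "partialSuccess": {
        "rejectedLogRecords": "2",
        "errorMessage": "reason the records were rejected"
    }
}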

Can you test whether the data is making it to the Collector by adding a debug exporter to the exporters in your pipelines?
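
A minimal sketch of what that could look like against the config posted above (using the debug exporter, which supersedes the logging exporter and takes the same verbosity setting; the rest of the pipelines stay as they are):

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp, debug]
    logs:
      receivers: [otlp]
      exporters: [loki, debug]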

cobar79 commented 9 months ago

I did, unless I configured it wrong. See the otel-collector config above:

exporters:
  logging:
    verbosity: detailed
service:
  telemetry:
    logs:
      level: debug
      development: true
      sampling:
        initial: 10
        thereafter: 5
      output_paths:
        - stdout
      error_output_paths:
        - stderr

I didn't see any additional logs in the Collector Docker container when I pushed via HTTP or OTLP. Does it need to be an environment variable or a system property?

TylerHelmuth commented 9 months ago

You configured it, but you need to add it to the trace pipeline's exporter list.

cobar79 commented 9 months ago

Added it, but still no logs:

  pipelines:
    # for now we are only interested in metrics...
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus, logging]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp, logging]  # name here should match the exporter name for tempo which is otlp
    logs:
      receivers: [otlp]
      exporters: [loki, logging]

cobar79 commented 9 months ago

Abandoned the Java Agent and went with manual instrumentation using the Spring Boot 3 framework.

I assume these logs mean the payload made it through to Prometheus:

2024-02-14 11:21:42 2024-02-14T18:21:42.207Z    debug   prometheusexporter@v0.91.0/accumulator.go:79    accumulating metric: executor.queued    {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2024-02-14 11:21:42 2024-02-14T18:21:42.207Z    debug   prometheusexporter@v0.91.0/accumulator.go:79    accumulating metric: logback.events     {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2024-02-14 11:21:42 2024-02-14T18:21:42.207Z    debug   prometheusexporter@v0.91.0/accumulator.go:79    accumulating metric: process.cpu.usage  {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2024-02-14 11:21:42 2024-02-14T18:21:42.207Z    debug   prometheusexporter@v0.91.0/accumulator.go:79    accumulating metric: rabbitmq.acknowledged_published   {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2024-02-14 11:21:42 2024-02-14T18:21:42.207Z    info    MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 1, "metrics": 109, "data points": 109}
2024-02-14 11:21:42 2024-02-14T18:21:42.207Z    info    ResourceMetrics #0
2024-02-14 11:21:42 Resource SchemaURL: 
2024-02-14 11:21:42 Resource attributes:
2024-02-14 11:21:42      -> telemetry.sdk.name: Str(io.micrometer)
2024-02-14 11:21:42      -> telemetry.sdk.language: Str(java)
2024-02-14 11:21:42      -> telemetry.sdk.version: Str(1.12.0)
2024-02-14 11:21:42      -> service.name: Str(host-validation)