open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[exporter/clickhouse] exported data is missing, the amount does not match #33923

Open Lincyaw opened 2 weeks ago

Lincyaw commented 2 weeks ago

Component(s)

exporter/clickhouse

What happened?

Description

I am using opentelemetry-collector-contrib version 0.104.0 with the ClickHouse exporter, but the amount of data stored in ClickHouse does not match the amount that was sent.

Steps to Reproduce

Use the following Docker Compose file to start an instance:

services:
  opentelemetry-collector:
    image: otel/opentelemetry-collector-contrib:0.104.0
    container_name: opentelemetry-collector
    ports:
      - "4317:4317"
    volumes:
      - ./otel-config.yml:/etc/otel-config.yml
    command: ["--config=/etc/otel-config.yml"]
    depends_on:
      clickhouse:
        condition: service_healthy

  clickhouse:
    image: clickhouse/clickhouse-server:latest
    container_name: clickhouse
    ports:
      - "8123:8123"
      - "9000:9000"
    environment:
      - CLICKHOUSE_DB=db
      - CLICKHOUSE_USER=default
      - CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT=1
      - CLICKHOUSE_PASSWORD=password
    volumes:
      - clickhouse_data:/var/lib/clickhouse
    healthcheck:
      test:
        [
          "CMD",
          "wget",
          "--spider",
          "-q",
          "0.0.0.0:8123/ping"
        ]
      interval: 30s
      timeout: 5s
      retries: 3
    ulimits:
      nproc: 65535
      nofile:
        soft: 262144
        hard: 262144

volumes:
  clickhouse_data:

The following otel config is used. Place these two files in the same directory, then run docker compose up:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 3s
    send_batch_size: 100000

exporters:
  debug:
    verbosity: normal
  clickhouse:
    endpoint: tcp://clickhouse:9000?dial_timeout=10s&compress=lz4&username=default&password=password
    database: default
    ttl: 0
    logs_table_name: otel_logs
    traces_table_name: otel_traces
    metrics_table_name: otel_metrics
    timeout: 5s
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s

extensions:
  health_check:
  pprof:
  zpages:

service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, clickhouse]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, clickhouse]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, clickhouse]

Then use the following Go program to send the data:

package main

import (
    "context"
    "fmt"
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    "google.golang.org/protobuf/proto"
    "log"
    "os"
    "time"

    metricpb "go.opentelemetry.io/proto/otlp/collector/metrics/v1"
    pb "go.opentelemetry.io/proto/otlp/metrics/v1"
)

// sendRequest opens a gRPC connection, exports a single ResourceMetrics via
// OTLP, and returns the number of gauge data points it contained.
func sendRequest(ctx context.Context, metricData *pb.ResourceMetrics) int {
    client, err := grpc.NewClient("10.10.10.29:4317", grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        log.Fatalf("Failed to connect: %v", err)
    }
    defer client.Close()

    // Count the gauge data points being sent. GetDataPoints() is nil-safe,
    // so metrics of any other type contribute zero to the count.
    count := 0
    for _, v := range metricData.ScopeMetrics {
        for _, vv := range v.Metrics {
            count += len(vv.GetGauge().GetDataPoints())
        }
    }

    metricClient := metricpb.NewMetricsServiceClient(client)
    resp, err := metricClient.Export(ctx, &metricpb.ExportMetricsServiceRequest{
        ResourceMetrics: []*pb.ResourceMetrics{
            metricData,
        },
    })
    if err != nil {
        log.Fatalf("Failed to send metrics: %v", err)
    }
    fmt.Println(resp)
    return count
}

// main reads a serialized MetricsData message from data.pb and exports each
// ResourceMetrics in its own request, pausing one second between requests.
func main() {
    data, err := os.ReadFile("data.pb")
    if err != nil {
        log.Fatalf("Failed to read file: %v", err)
    }
    var metricData pb.MetricsData
    if err := proto.Unmarshal(data, &metricData); err != nil {
        log.Fatalf("Failed to unmarshal data: %v", err)
    }
    total := 0
    for _, resource := range metricData.ResourceMetrics {
        cnt := sendRequest(context.Background(), resource)
        fmt.Println("send ", cnt, " data points")
        total += cnt
        time.Sleep(1 * time.Second)
    }
    fmt.Println("send total ", total, " data points")
}
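
For reference, a nil-safe tally that covers every OTLP metric type (not only gauges) could look like the sketch below. The countDataPoints helper is hypothetical (not part of the original program) and uses the same pb package as above; the generated getters are nil-safe, so metric types that do not match simply add zero:

func countDataPoints(rm *pb.ResourceMetrics) int {
    count := 0
    for _, sm := range rm.GetScopeMetrics() {
        for _, m := range sm.GetMetrics() {
            count += len(m.GetGauge().GetDataPoints())
            count += len(m.GetSum().GetDataPoints())
            count += len(m.GetHistogram().GetDataPoints())
            count += len(m.GetExponentialHistogram().GetDataPoints())
            count += len(m.GetSummary().GetDataPoints())
        }
    }
    return count
}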

The data.pb file is shared via Dropbox; please download it (link in the comment below).

go.mod is:

module awesomeProject

go 1.22

require (
    go.opentelemetry.io/proto/otlp v1.3.1
    google.golang.org/grpc v1.65.0
    google.golang.org/protobuf v1.34.2
)

require (
    github.com/grpc-ecosystem/grpc-gateway/v2 v2.20.0 // indirect
    golang.org/x/net v0.25.0 // indirect
    golang.org/x/sys v0.20.0 // indirect
    golang.org/x/text v0.15.0 // indirect
    google.golang.org/genproto/googleapis/api v0.0.0-20240528184218-531527333157 // indirect
    google.golang.org/genproto/googleapis/rpc v0.0.0-20240528184218-531527333157 // indirect
)

Then run the Go program. The data is sent to the OpenTelemetry Collector, which exports it to ClickHouse.
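
To compare against what actually landed in ClickHouse, the row counts can be tallied across the exporter's per-type metrics tables. A minimal sketch using the clickhouse-go v2 client (an extra dependency, github.com/ClickHouse/clickhouse-go/v2), assuming the exporter created its default per-type tables derived from metrics_table_name: otel_metrics, and that it is run from the host against the published port 9000:

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/ClickHouse/clickhouse-go/v2"
)

func main() {
    conn, err := clickhouse.Open(&clickhouse.Options{
        Addr: []string{"localhost:9000"},
        Auth: clickhouse.Auth{Database: "default", Username: "default", Password: "password"},
    })
    if err != nil {
        log.Fatalf("Failed to connect: %v", err)
    }
    defer conn.Close()

    // The exporter writes each metric type to its own table.
    tables := []string{
        "otel_metrics_gauge",
        "otel_metrics_sum",
        "otel_metrics_histogram",
        "otel_metrics_exponential_histogram",
        "otel_metrics_summary",
    }
    ctx := context.Background()
    var total uint64
    for _, table := range tables {
        var count uint64
        if err := conn.QueryRow(ctx, "SELECT count() FROM "+table).Scan(&count); err != nil {
            log.Fatalf("Failed to count %s: %v", table, err)
        }
        fmt.Printf("%s: %d rows\n", table, count)
        total += count
    }
    fmt.Println("total rows:", total)
}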

Expected Result

The number of data points stored in ClickHouse should be exactly 1380344.

Actual Result

The count is unstable: sometimes it is lower than expected, sometimes it matches.

(screenshot of the observed row counts)

Collector version

0.104.0

Environment information

Environment

OS: debian sid

OpenTelemetry Collector configuration

(Same as the otel-config.yml shown in the reproduction steps above.)

Log output

No response

Additional context

No response

github-actions[bot] commented 2 weeks ago

Pinging code owners:

Lincyaw commented 2 weeks ago

The file is at: https://www.dropbox.com/scl/fi/5qwundoh49u1q8eoa886g/data.pb?rlkey=31us9b9pg9y9iw0sfjcrwodmh&st=71k15527&dl=0

SpencerTorres commented 1 week ago

Hello! Thanks for the detailed issue and sample data.

Are the results correct with a different exporter? If you wrote the lines to a file, would they match up in that case? I want to make sure this isn't an issue with ClickHouse.

Also, it looks like you're counting from the metrics payload. You should confirm that the metrics are not being summed or grouped at any point along the pipeline.

Also, check the exporter logs to confirm everything is being exported correctly (no dropped or failed batches). To narrow down where this is happening, run another test that writes to both ClickHouse and another exporter, then compare the counts.
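
One way to run that cross-check is to add the file exporter to the metrics pipeline alongside clickhouse and then tally the data points it wrote. A rough sketch, assuming the file exporter is configured with a hypothetical path of /tmp/metrics.json and writes its default OTLP/JSON line format (one MetricsData-shaped JSON object per line), using the same pb package as the sender program:

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"

    "google.golang.org/protobuf/encoding/protojson"

    pb "go.opentelemetry.io/proto/otlp/metrics/v1"
)

func main() {
    f, err := os.Open("/tmp/metrics.json") // hypothetical path from the file exporter config
    if err != nil {
        log.Fatalf("Failed to open file: %v", err)
    }
    defer f.Close()

    total := 0
    sc := bufio.NewScanner(f)
    sc.Buffer(make([]byte, 0, 1024*1024), 64*1024*1024) // exported lines can be large
    for sc.Scan() {
        var md pb.MetricsData
        if err := protojson.Unmarshal(sc.Bytes(), &md); err != nil {
            log.Fatalf("Failed to unmarshal line: %v", err)
        }
        // Nil-safe getters: metric types that do not match contribute zero.
        for _, rm := range md.GetResourceMetrics() {
            for _, sm := range rm.GetScopeMetrics() {
                for _, m := range sm.GetMetrics() {
                    total += len(m.GetGauge().GetDataPoints()) +
                        len(m.GetSum().GetDataPoints()) +
                        len(m.GetHistogram().GetDataPoints()) +
                        len(m.GetExponentialHistogram().GetDataPoints()) +
                        len(m.GetSummary().GetDataPoints())
                }
            }
        }
    }
    if err := sc.Err(); err != nil {
        log.Fatalf("Scan error: %v", err)
    }
    fmt.Println("data points in file:", total)
}

If this file-based count matches the sender's 1380344 but the ClickHouse row count does not, that would point at the ClickHouse exporter rather than the receiver or batch processor.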