open-telemetry / opentelemetry-go

OpenTelemetry Go API and SDK
https://opentelemetry.io/docs/languages/go
Apache License 2.0

randomly lose span in production #4727

Open oliverdding opened 11 months ago

oliverdding commented 11 months ago

Description

I'm upgrading my production module from Jaeger 1.9 (yes, you heard that right, 1.9) to OpenTelemetry with the Collector.

The module consumes a Kafka message, handles it, then writes it back to Kafka. Each message goes through 11 sub-steps under a parent span named Consume. Its correct tracing diagram is as follows:

(screenshot: expected tracing diagram)

The module uploads Jaeger spans and otel spans at the same time. Take one sub-step as an example:

jaegerSpan, _ := opentracing.StartSpanFromContext(ctx, "UnpackKafkaItem")
defer jaegerSpan.Finish()

_, span := otel.Tracer.Start(ctx, "UnpackKafkaItem", trace.WithSpanKind(trace.SpanKindConsumer))
defer span.End()

I'm sure this module has AlwaysSample configured for both Jaeger and OpenTelemetry.

But in production, OpenTelemetry loses many traces and spans. For example, over the past weekend we processed 2,588,620 messages and got 2,336,808 traces in Jaeger, but only 775,448 traces in OpenTelemetry.

(screenshots: trace counts in Jaeger and in OpenTelemetry)

I'm sure the collector works as expected:

(screenshot: collector metrics)

The collector's metrics showed everything working well: no data was refused or dropped, and no error logs were found in the collector.

After analysis, I have some on-site evidence:

  1. The amount of data at each stage is almost equal.

    (screenshot)

  2. I have 8 pods in production; every pod reports spans, and the reported volume is considerable.

    (screenshot)
  3. Here is the strange part: many traces are missing the parent span Consume (every trace should have at least that one). In fact, plenty of traces contain only a single span, and that lone span is a random sub-span from one of the 11 steps.

    (screenshots: traces containing only a single random sub-span)

As you can see, otel loses far more traces than Jaeger, and the traces that do arrive may be missing many spans.


I have tried to reproduce this in a development environment with docker compose and the demo, and no loss happened there. :(

(screenshot)

Environment

Steps To Reproduce

  1. Initialize otel with this function (a sketch of wiring it into main follows after these steps):
// Package otel contains functions for initializing otel
var (
    // Meter is the global meter used for the different kinds of metrics we upload
    Meter metric.Meter = noop.NewMeterProvider().Meter("DefaultMeter")
    // Tracer is the global tracer used for the different kinds of traces we upload
    Tracer trace.Tracer = trace.NewNoopTracerProvider().Tracer("DefaultTracer")
)

// preferDeltaTemporalitySelector prefers `delta` metrics temporality over `cumulative`, since `delta` data is smaller and easier to present.
func preferDeltaTemporalitySelector(kind sdkmetric.InstrumentKind) metricdata.Temporality {
    switch kind {
    case sdkmetric.InstrumentKindCounter,
        sdkmetric.InstrumentKindObservableCounter,
        sdkmetric.InstrumentKindHistogram:
        return metricdata.DeltaTemporality
    default:
        return metricdata.CumulativeTemporality
    }
}

// InitProvider would initialize otel provider and register them globally
func InitProvider(ctx context.Context) func() {
    zaplog.Logger.Info("initialize otel")

    res, err := resource.New(ctx,
        resource.WithFromEnv(),
        resource.WithAttributes(
            semconv.ServiceName(env.ServiceName),
            semconv.ServiceNamespace(env.CurrentCluster),
        ),
    )
    if err != nil {
        zaplog.Logger.Error("init resource of otel failed", zaplog.Error(err))
    }

    otelAgentAddr, ok := os.LookupEnv("OTEL_EXPORTER_OTLP_ENDPOINT")
    if !ok {
        otelAgentAddr = "127.0.0.1:4317"
    }

    var metricProvider *sdkmetric.MeterProvider
    if metricExp, err := otlpmetricgrpc.New(
        ctx,
        otlpmetricgrpc.WithInsecure(),
        otlpmetricgrpc.WithEndpoint(otelAgentAddr),
        otlpmetricgrpc.WithCompressor("gzip"),
        otlpmetricgrpc.WithTemporalitySelector(preferDeltaTemporalitySelector),
    ); err != nil {
        zaplog.Logger.Error("failed to create metric exporter", zap.Error(err))
    } else {
        metricProvider = sdkmetric.NewMeterProvider(
            sdkmetric.WithResource(res),
            sdkmetric.WithReader(
                sdkmetric.NewPeriodicReader(
                    metricExp,
                ),
            ),
        )
        otel.SetMeterProvider(metricProvider)
    }

    var tracerProvider *sdktrace.TracerProvider
    if traceExp, err := otlptracegrpc.New(
        ctx,
        otlptracegrpc.WithInsecure(),
        otlptracegrpc.WithEndpoint(otelAgentAddr),
        otlptracegrpc.WithCompressor("gzip"),
    ); err != nil {
        zaplog.Logger.Error("failed to create trace exporter", zap.Error(err))
    } else {
        tracerProvider = sdktrace.NewTracerProvider(
            sdktrace.WithResource(res),
            sdktrace.WithSpanProcessor(sdktrace.NewBatchSpanProcessor(traceExp)),
            sdktrace.WithSampler(sdktrace.AlwaysSample()),
        )
        otel.SetTracerProvider(tracerProvider)
    }

    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))

    Meter = otel.Meter(env.ServiceName)
    Tracer = otel.Tracer(env.ServiceName)

    // shutdown hook to gracefully push all remaining telemetry data to the collector before shutdown
    return func() {
        cxt, cancel := context.WithTimeout(ctx, 30*time.Second)
        defer cancel()
        if err := tracerProvider.Shutdown(cxt); err != nil {
            otel.Handle(err)
        }
        if err := metricProvider.Shutdown(cxt); err != nil {
            otel.Handle(err)
        }
    }
}
  2. Create the parent span when a message arrives:
jaegerSpan, subCtx := opentracing.StartSpanFromContext(context.Ctx, "Consume")
defer jaegerSpan.Finish()

subCtx, span := otel.Tracer.Start(subCtx, "Consume")
defer span.End()
  3. Then pass subCtx to each sub-step and create the sub-spans there:
jaegerSpan, _ := opentracing.StartSpanFromContext(ctx, "UnpackKafkaItem")
defer jaegerSpan.Finish()

_, span := otel.Tracer.Start(ctx, "UnpackKafkaItem", trace.WithSpanKind(trace.SpanKindConsumer))
defer span.End()
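
As mentioned in step 1, here is a minimal sketch of how InitProvider and its shutdown hook might be wired into main; the import path is hypothetical, and the deferred shutdown call is what flushes any telemetry still buffered in the batch processor on exit:

package main

import (
    "context"

    appotel "example.com/translator/internal/otel" // hypothetical path to the package above
)

func main() {
    ctx := context.Background()

    // InitProvider registers the global tracer/meter providers and returns a
    // shutdown hook that flushes buffered telemetry before the process exits.
    shutdown := appotel.InitProvider(ctx)
    defer shutdown()

    // ... consume Kafka messages, creating the Consume parent span per message ...
}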

Expected behavior

The trace count should be equal (or at least almost equal) to the number of messages processed, just as it is for Jaeger.


I'm not sure whether this is a bug in the SDK or a misuse of the SDK, but I have run out of ideas. Thank you for any advice.

oliverdding commented 11 months ago

After setting the SDK logging level to -8, I found nothing special in the logs: no errors, not even a warning.

Below is a small part of the logs:

2023-11-22T11:30:43+08:00  LEVEL(-4)  metric/provider.go:61  MeterProvider created  {"Resource": {"Attributes":{},"SchemaURL":""}, "Readers": [{}], "Views": 0}
2023-11-22T11:30:43+08:00  LEVEL(-4)  trace/provider.go:125  TracerProvider created  {"config": {"SpanProcessors":[{}],"SamplerType":"trace.alwaysOnSampler","IDGeneratorType":"*trace.randomIDGenerator","SpanLimits":{"AttributeValueLengthLimit":-1,"AttributeCountLimit":128,"EventCountLimit":128,"LinkCountLimit":128,"AttributePerEventCountLimit":128,"AttributePerLinkCountLimit":128},"Resource":[{"Key":"service.name","Value":{"Type":"STRING","Value":"translator"}},{"Key":"service.namespace","Value":{"Type":"STRING","Value":"tencent_public"}}]}}
2023-11-22T11:30:43+08:00  LEVEL(-4)  metric/provider.go:95  Meter created  {"Name": "translator", "Version": "", "SchemaURL": ""}
2023-11-22T11:30:43+08:00  LEVEL(-4)  trace/provider.go:182  Tracer created  {"name": "translator", "version": "", "schemaURL": ""}
2023-11-22T11:31:11+08:00  LEVEL(-8)  trace/batch_span_processor.go:287  exporting spans  {"count": 512, "total_dropped": 0}
2023-11-22T11:31:11+08:00  LEVEL(-8)  trace/batch_span_processor.go:287  exporting spans  {"count": 512, "total_dropped": 0}
2023-11-22T11:31:12+08:00  LEVEL(-8)  trace/batch_span_processor.go:287  exporting spans  {"count": 512, "total_dropped": 0}
2023-11-22T11:31:14+08:00  LEVEL(-8)  trace/batch_span_processor.go:287  exporting spans  {"count": 512, "total_dropped": 0}
2023-11-22T11:31:19+08:00  LEVEL(-8)  trace/batch_span_processor.go:287  exporting spans  {"count": 370, "total_dropped": 0}
2023-11-22T11:31:24+08:00  LEVEL(-8)  trace/batch_span_processor.go:287  exporting spans  {"count": 408, "total_dropped": 0}
2023-11-22T11:31:29+08:00  LEVEL(-8)  trace/batch_span_processor.go:287  exporting spans  {"count": 372, "total_dropped": 0}
2023-11-22T11:31:34+08:00  LEVEL(-8)  trace/batch_span_processor.go:287  exporting spans  {"count": 240, "total_dropped": 0}
2023-11-22T11:31:39+08:00  LEVEL(-8)  trace/batch_span_processor.go:287  exporting spans  {"count": 342, "total_dropped": 0}
2023-11-22T11:31:43+08:00  LEVEL(-8)  metric/periodic_reader.go:274  PeriodicReader collection  {"Data": {"Resource":[{"Key":"service.name","Value":{"Type":"STRING","Value":"translator"}},{"Key":"service.namespace","Value":{"Type":"STRING","Value":"tencent_public"}}],"ScopeMetrics":null}}
2023-11-22T11:31:43+08:00  LEVEL(-8)  otlpmetricgrpc/exporter.go:95  OTLP/gRPC exporter export  {"Data": {"Resource":[{"Key":"service.name","Value":{"Type":"STRING","Value":"translator"}},{"Key":"service.namespace","Value":{"Type":"STRING","Value":"tencent_public"}}],"ScopeMetrics":null}}
2023-11-22T11:31:44+08:00  LEVEL(-8)  trace/batch_span_processor.go:287  exporting spans  {"count": 192, "total_dropped": 0}
2023-11-22T11:31:49+08:00  LEVEL(-8)  trace/batch_span_processor.go:287  exporting spans  {"count": 396, "total_dropped": 0}
2023-11-22T11:31:54+08:00  LEVEL(-8)  trace/batch_span_processor.go:287  exporting spans  {"count": 204, "total_dropped": 0}
2023-11-22T11:31:59+08:00  LEVEL(-8)  trace/batch_span_processor.go:287  exporting spans  {"count": 276, "total_dropped": 0}
2023-11-22T11:32:04+08:00  LEVEL(-8)  trace/batch_span_processor.go:287  exporting spans  {"count": 300, "total_dropped": 0}
2023-11-22T11:32:09+08:00  LEVEL(-8)  trace/batch_span_processor.go:287  exporting spans  {"count": 228, "total_dropped": 0}
2023-11-22T11:32:14+08:00  LEVEL(-8)  trace/batch_span_processor.go:287  exporting spans  {"count": 314, "total_dropped": 0}
2023-11-22T11:32:19+08:00  LEVEL(-8)  trace/batch_span_processor.go:287  exporting spans  {"count": 310, "total_dropped": 0}
2023-11-22T11:32:24+08:00  LEVEL(-8)  trace/batch_span_processor.go:287  exporting spans  {"count": 252, "total_dropped": 0}
2023-11-22T11:32:29+08:00  LEVEL(-8)  trace/batch_span_processor.go:287  exporting spans  {"count": 312, "total_dropped": 0}
2023-11-22T11:32:34+08:00  LEVEL(-8)  trace/batch_span_processor.go:287  exporting spans  {"count": 252, "total_dropped": 0}
2023-11-22T11:32:39+08:00  LEVEL(-8)  trace/batch_span_processor.go:287  exporting spans  {"count": 408, "total_dropped": 0}
dmathieu commented 11 months ago

The batch span processor may drop events if there are too many enqueued. See https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/trace/batch_span_processor.go#L46

Is it possible this is what is happening within your application?
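
For reference, here is a sketch of how the batch span processor's queue and batch limits could be raised (or the processor switched to blocking) to test this hypothesis; the option names come from go.opentelemetry.io/otel/sdk/trace, and the values are only illustrative:

import (
    "time"

    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newTracerProvider is a hypothetical helper: it raises the batch span
// processor's limits above the defaults to check whether queue overflow is
// what causes the missing spans.
func newTracerProvider(exp sdktrace.SpanExporter, res *resource.Resource) *sdktrace.TracerProvider {
    bsp := sdktrace.NewBatchSpanProcessor(
        exp,
        sdktrace.WithMaxQueueSize(8192),          // default is 2048
        sdktrace.WithMaxExportBatchSize(1024),    // default is 512
        sdktrace.WithBatchTimeout(2*time.Second), // default is 5s
        // sdktrace.WithBlocking(), // alternatively, block producers instead of dropping spans
    )
    return sdktrace.NewTracerProvider(
        sdktrace.WithResource(res),
        sdktrace.WithSpanProcessor(bsp),
        sdktrace.WithSampler(sdktrace.AlwaysSample()),
    )
}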

oliverdding commented 11 months ago

Hi @dmathieu, thanks for your reply! I'll take a look tomorrow and report the result here.

oliverdding commented 11 months ago

Hi @dmathieu, if the queue is full, the SDK drops the span and counts it in bps.dropped (see https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/trace/batch_span_processor.go#L404), and the SDK logs should print that out (see https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/trace/batch_span_processor.go#L287).

But as shown in my second comment, I redirected the SDK logger to a separate file, and it shows no dropped spans and no errors at all.

oliverdding commented 11 months ago

Hi @dmathieu, I can't see your reply on GitHub, but I did see it in email.

How have you done it? Are you sure that debug messages are not ignored?

See: https://github.com/open-telemetry/opentelemetry-go/blob/47ba653e6962948dab3a481231c469387f11f184/internal/global/internal_logging.go#L62

You may have to change the verbosity.

We use zap as the logging implementation behind logr, and we set the log level like this:

func logLevelSetter() map[string]zapcore.Level {
    return map[string]zapcore.Level{
        "DEBUG": -8,
        "INFO":  -4,
        "WARN":  -1,
        "ERROR": 0,
    }
}

func logInitBalabala() {
    ...
    if logLevel, ok = logLevelSetter()[strings.ToUpper(os.Getenv("OTEL_LOG_LEVEL"))]; !ok {
        // If OTEL_LOG_LEVEL is unset or unrecognized, fall back to -1 (WARN in this mapping).
        logLevel = -1
    }

    core := zapcore.NewCore(
        zapcore.NewConsoleEncoder(cfg.EncoderConfig),
        zapcore.NewMultiWriteSyncer(otelLogFileWriter),
        zap.NewAtomicLevelAt(logLevel),
    )
    logger := zap.New(core, zap.AddCaller(), zap.AddCallerSkip(1))
    Logger = zapr.NewLogger(logger)
    ...
}

I'm sure the DEBUG-level logs are being printed, since the line at https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/trace/batch_span_processor.go#L287 prints a DEBUG message, and that message does appear in my log file.
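
For completeness, here is a sketch of how a zap-backed logr logger can be handed to the SDK via otel.SetLogger; zapr maps a logr V(n) call to zap level -n, so a core enabled at level -8 is what lets the SDK's most verbose debug messages (like the exporting spans lines above) through:

import (
    "os"

    "github.com/go-logr/zapr"
    "go.opentelemetry.io/otel"
    "go.uber.org/zap"
    "go.uber.org/zap/zapcore"
)

// installOtelLogger is a hypothetical, simplified version of the setup above:
// it writes to stderr instead of a dedicated file, but uses the same level
// scheme (-8 enables the SDK's V(8) debug output, -4 its V(4) info output).
func installOtelLogger() {
    core := zapcore.NewCore(
        zapcore.NewConsoleEncoder(zap.NewProductionEncoderConfig()),
        zapcore.AddSync(os.Stderr),
        zap.NewAtomicLevelAt(zapcore.Level(-8)),
    )
    logger := zap.New(core, zap.AddCaller(), zap.AddCallerSkip(1))
    otel.SetLogger(zapr.NewLogger(logger))
}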

oliverdding commented 11 months ago

Hi @dmathieu, here is an extraordinarily strange thing:

(screenshot)

We manually counted the spans that the SDK logged as exported, and the total doesn't match the amount received by the collector!?

Well, it's pretty funny...

oliverdding commented 11 months ago

To rule out interference from Jaeger, we commented out all Jaeger SDK calls and kept only the otel SDK. The problem is still there:

  1. Data processed:
     (screenshot)
  2. otel traces:
     (screenshot)

753,980 traces are lost, almost 27%.

pellared commented 11 months ago

Hi @dmathieu, I can't see your reply on GitHub, but I did see it in email.

It was my message; I removed it about a minute later, after I saw your log output.

oliverdding commented 11 months ago

Hi @pellared, copy that. By the way, do you have any advice on this problem? I've been stuck here for almost a week. XP

pellared commented 11 months ago

From what you wrote, it sounds like an issue with the Collector rather than the SDK. As far as I remember, you can get logs of how many spans the collector has processed. Maybe it is a problem with https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/clickhouseexporter (bad configuration?), which I assume you are using?

oliverdding commented 11 months ago

Hi @pellared, but the collector's metrics show nothing wrong: it processed and wrote as many spans as it received.

In fact I was using my own self-written ClickHouse exporter at first; to rule that out, I switched to opentelemetry-collector-contrib's clickhouseexporter, but the problem still exists.

I can take another step to prove it's not a collector problem: I will deploy a new opentelemetry-collector-contrib instance configured with https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/connector/countconnector to count how many spans have been received, and I will append the result tomorrow.

oliverdding commented 11 months ago

Hi @pellared, the problem still exists with the count connector:

(screenshots: count connector metrics)

Almost 900,000 spans were lost in the past 12 hours.


Below is the logging from the SDK, nothing special at all (total_dropped is 1032 because I restarted the collector yesterday, so some spans were dropped).

otel.log


Below is the config of the collector:

connectors:
  count: null
exporters:
  clickhouse:
    xxx: xxx 
extensions:
  health_check: {}
  pprof:
    endpoint: :1777
processors:
  batch:
    send_batch_max_size: 1000000
    send_batch_size: 100000
    timeout: 60s
  cumulativetodelta: {}
  k8sattributes:
    extract:
      metadata:
      - k8s.namespace.name
      - k8s.deployment.name
      - k8s.statefulset.name
      - k8s.daemonset.name
      - k8s.cronjob.name
      - k8s.job.name
      - k8s.node.name
      - k8s.pod.name
      - k8s.pod.uid
      - k8s.pod.start_time
    passthrough: false
    pod_association:
    - sources:
      - from: resource_attribute
        name: k8s.pod.ip
    - sources:
      - from: resource_attribute
        name: k8s.pod.uid
    - sources:
      - from: connection
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 25
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: ${env:MY_POD_IP}:4317
      http:
        endpoint: ${env:MY_POD_IP}:4318
service:
  extensions:
  - health_check
  - pprof
  pipelines:
    logs:
      exporters:
      - clickhouse
      processors:
      - k8sattributes
      - batch
      - memory_limiter
      receivers:
      - otlp
    metrics:
      exporters:
      - clickhouse
      processors:
      - k8sattributes
      - batch
      - memory_limiter
      - cumulativetodelta
      receivers:
      - count
    traces:
      exporters:
      - count
      processors:
      - k8sattributes
      - batch
      - memory_limiter
      receivers:
      - otlp
  telemetry:
    logs:
      development: false
      level: INFO
    metrics:
      address: ${env:MY_POD_IP}:8888
pellared commented 11 months ago

How about trying to diagnose it e.g. with Wireshark to double-check the communication between the application (OTLP exporter) and the OTel Collector?

oliverdding commented 11 months ago

Hi @pellared, this weekend we simply changed the BatchSpanProcessor to a SimpleSpanProcessor on the client side:

(screenshot)

The result: no data is lost now!

(screenshot)

This is pretty strange, as we can't trigger any data loss with the BatchSpanProcessor locally, no matter how much concurrency we apply...
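
For reference, the change is presumably just swapping the span processor when building the provider, along the lines of the sketch below, which reuses res and traceExp from the InitProvider snippet earlier in this issue (note that SimpleSpanProcessor exports every span synchronously, which the SDK docs discourage for production throughput):

// Synchronous export: every ended span is handed to the exporter immediately,
// so nothing is queued and nothing can be dropped on the client side, at the
// cost of a blocking export call on the hot path.
tracerProvider = sdktrace.NewTracerProvider(
    sdktrace.WithResource(res),
    sdktrace.WithSpanProcessor(sdktrace.NewSimpleSpanProcessor(traceExp)),
    sdktrace.WithSampler(sdktrace.AlwaysSample()),
)
otel.SetTracerProvider(tracerProvider)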

dmathieu commented 11 months ago

This still seems to point towards the batch processor's queue overflowing. Have you tried setting a higher max queue size (the default is 2048) or a lower batch export timeout (the default is 5s) to see if there is any change in behavior?

oliverdding commented 11 months ago

Hi @dmathieu, after that we first changed the client from otlptracegrpc to otlptracehttp, and at the same time raised the batch span processor's limits:

tracerProvider = sdktrace.NewTracerProvider(
    sdktrace.WithResource(res),
    sdktrace.WithSpanProcessor(sdktrace.NewBatchSpanProcessor(
        traceExp,
        sdktrace.WithMaxExportBatchSize(1024),
        sdktrace.WithBatchTimeout(10*time.Second),
        sdktrace.WithMaxQueueSize(4096),
    )),
    sdktrace.WithSampler(sdktrace.AlwaysSample()),
)

Then everything worked as expected. To figure out whether the problem was related to gRPC or to the options, we removed the higher limits and used the batch span processor's default values with the HTTP exporter, and everything still worked great...

(screenshots)

I believe this problem is related to the gRPC exporter.
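
For reference, a sketch of the exporter swap that was presumably made, mirroring the earlier otlptracegrpc setup (note that the collector's OTLP/HTTP receiver conventionally listens on 4318 rather than 4317):

import (
    "context"

    "go.opentelemetry.io/otel/exporters/otlp/otlptrace"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
)

// newHTTPTraceExporter is a hypothetical helper: it builds the same kind of
// trace exporter as before, but exports OTLP over HTTP instead of gRPC.
func newHTTPTraceExporter(ctx context.Context, endpoint string) (*otlptrace.Exporter, error) {
    return otlptracehttp.New(
        ctx,
        otlptracehttp.WithInsecure(),
        otlptracehttp.WithEndpoint(endpoint), // e.g. "127.0.0.1:4318"
        otlptracehttp.WithCompression(otlptracehttp.GzipCompression),
    )
}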

VinozzZ commented 2 months ago

We conducted a test sending 400K spans to the collector over both gRPC and HTTP via telemetrygen. To handle this load, we increased MaxExportBatchSize and MaxQueueSize on the BatchSpanProcessor. During the test, we found we also needed to increase the maximum receive message size for the gRPC receiver and the maximum request body size for the HTTP receiver in the collector. After applying these configuration changes, we successfully sent 400K spans without any data loss.

buger commented 2 months ago

Do you think it's possible to change the batching implementation to somehow detect that the current trace is "incomplete" and either wait a bit for the rest of its spans, or just "drop" all spans with that trace ID?

Basically, never allow an incomplete trace.

okhowang commented 2 weeks ago

Could this issue be related to #4266?