oliverdding opened 1 year ago
After setting the SDK log level to -8, I found nothing special in the logs: no errors, not even warnings.
Below is a small excerpt of the logs:
2023-11-22T11:30:43+08:00 LEVEL(-4) metric/provider.go:61 MeterProvider created {"Resource": {"Attributes":{},"SchemaURL":""}, "Readers": [{}], "Views": 0}
2023-11-22T11:30:43+08:00 LEVEL(-4) trace/provider.go:125 TracerProvider created {"config": {"SpanProcessors":[{}],"SamplerType":"trace.alwaysOnSampler","IDGeneratorType":"*trace.randomIDGenerator","SpanLimits":{"AttributeValueLengthLimit":-1,"AttributeCountLimit":128,"EventCountLimit":128,"LinkCountLimit":128,"AttributePerEventCountLimit":128,"AttributePerLinkCountLimit":128},"Resource":[{"Key":"service.name","Value":{"Type":"STRING","Value":"translator"}},{"Key":"service.namespace","Value":{"Type":"STRING","Value":"tencent_public"}}]}}
2023-11-22T11:30:43+08:00 LEVEL(-4) metric/provider.go:95 Meter created {"Name": "translator", "Version": "", "SchemaURL": ""}
2023-11-22T11:30:43+08:00 LEVEL(-4) trace/provider.go:182 Tracer created {"name": "translator", "version": "", "schemaURL": ""}
2023-11-22T11:31:11+08:00 LEVEL(-8) trace/batch_span_processor.go:287 exporting spans {"count": 512, "total_dropped": 0}
2023-11-22T11:31:11+08:00 LEVEL(-8) trace/batch_span_processor.go:287 exporting spans {"count": 512, "total_dropped": 0}
2023-11-22T11:31:12+08:00 LEVEL(-8) trace/batch_span_processor.go:287 exporting spans {"count": 512, "total_dropped": 0}
2023-11-22T11:31:14+08:00 LEVEL(-8) trace/batch_span_processor.go:287 exporting spans {"count": 512, "total_dropped": 0}
2023-11-22T11:31:19+08:00 LEVEL(-8) trace/batch_span_processor.go:287 exporting spans {"count": 370, "total_dropped": 0}
2023-11-22T11:31:24+08:00 LEVEL(-8) trace/batch_span_processor.go:287 exporting spans {"count": 408, "total_dropped": 0}
2023-11-22T11:31:29+08:00 LEVEL(-8) trace/batch_span_processor.go:287 exporting spans {"count": 372, "total_dropped": 0}
2023-11-22T11:31:34+08:00 LEVEL(-8) trace/batch_span_processor.go:287 exporting spans {"count": 240, "total_dropped": 0}
2023-11-22T11:31:39+08:00 LEVEL(-8) trace/batch_span_processor.go:287 exporting spans {"count": 342, "total_dropped": 0}
2023-11-22T11:31:43+08:00 LEVEL(-8) metric/periodic_reader.go:274 PeriodicReader collection {"Data": {"Resource":[{"Key":"service.name","Value":{"Type":"STRING","Value":"translator"}},{"Key":"service.namespace","Value":{"Type":"STRING","Value":"tencent_public"}}],"ScopeMetrics":null}}
2023-11-22T11:31:43+08:00 LEVEL(-8) otlpmetricgrpc/exporter.go:95 OTLP/gRPC exporter export {"Data": {"Resource":[{"Key":"service.name","Value":{"Type":"STRING","Value":"translator"}},{"Key":"service.namespace","Value":{"Type":"STRING","Value":"tencent_public"}}],"ScopeMetrics":null}}
2023-11-22T11:31:44+08:00 LEVEL(-8) trace/batch_span_processor.go:287 exporting spans {"count": 192, "total_dropped": 0}
2023-11-22T11:31:49+08:00 LEVEL(-8) trace/batch_span_processor.go:287 exporting spans {"count": 396, "total_dropped": 0}
2023-11-22T11:31:54+08:00 LEVEL(-8) trace/batch_span_processor.go:287 exporting spans {"count": 204, "total_dropped": 0}
2023-11-22T11:31:59+08:00 LEVEL(-8) trace/batch_span_processor.go:287 exporting spans {"count": 276, "total_dropped": 0}
2023-11-22T11:32:04+08:00 LEVEL(-8) trace/batch_span_processor.go:287 exporting spans {"count": 300, "total_dropped": 0}
2023-11-22T11:32:09+08:00 LEVEL(-8) trace/batch_span_processor.go:287 exporting spans {"count": 228, "total_dropped": 0}
2023-11-22T11:32:14+08:00 LEVEL(-8) trace/batch_span_processor.go:287 exporting spans {"count": 314, "total_dropped": 0}
2023-11-22T11:32:19+08:00 LEVEL(-8) trace/batch_span_processor.go:287 exporting spans {"count": 310, "total_dropped": 0}
2023-11-22T11:32:24+08:00 LEVEL(-8) trace/batch_span_processor.go:287 exporting spans {"count": 252, "total_dropped": 0}
2023-11-22T11:32:29+08:00 LEVEL(-8) trace/batch_span_processor.go:287 exporting spans {"count": 312, "total_dropped": 0}
2023-11-22T11:32:34+08:00 LEVEL(-8) trace/batch_span_processor.go:287 exporting spans {"count": 252, "total_dropped": 0}
2023-11-22T11:32:39+08:00 LEVEL(-8) trace/batch_span_processor.go:287 exporting spans {"count": 408, "total_dropped": 0}
The batch span processor may drop events if there are too many enqueued. See https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/trace/batch_span_processor.go#L46
Is it possible this is what is happening within your application?
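For reference, a minimal sketch (not taken from this issue) of how those limits can be raised, or enqueueing made blocking, to rule out silent queue overflow; traceExp stands for whatever SpanExporter is already configured:

import (
    "time"

    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func newTracerProvider(traceExp sdktrace.SpanExporter, res *resource.Resource) *sdktrace.TracerProvider {
    bsp := sdktrace.NewBatchSpanProcessor(
        traceExp,
        sdktrace.WithMaxQueueSize(8192),          // default 2048; spans are dropped once the queue is full
        sdktrace.WithMaxExportBatchSize(512),     // default 512
        sdktrace.WithBatchTimeout(5*time.Second), // default 5s
        // sdktrace.WithBlocking(),               // alternative: block producers instead of dropping
    )
    return sdktrace.NewTracerProvider(
        sdktrace.WithResource(res),
        sdktrace.WithSpanProcessor(bsp),
    )
}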
Hi @dmathieu, thanks for your reply! I will take a look tomorrow and post the results here.
Hi @dmathieu, if the queue is full, the SDK drops the span and increments bps.dropped (see https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/trace/batch_span_processor.go#L404), and the SDK log should report it (see https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/trace/batch_span_processor.go#L287).
But as shown in my second comment, I directed the SDK's logger to a separate file, and it shows no dropped spans and no errors at all.
Hi @dmathieu, I can't see your reply on GitHub, but I did see it in email.
How have you done it? Are you sure that debug messages are not ignored?
You may have to change the verbosity.
We use zap (through zapr) as the logr implementation, and we set the log level like this:
func logLevelSetter() map[string]zapcore.Level {
    return map[string]zapcore.Level{
        "DEBUG": -8,
        "INFO":  -4,
        "WARN":  -1,
        "ERROR": 0,
    }
}
func logInitBalabala() {
    ...
    if logLevel, ok = logLevelSetter()[strings.ToUpper(os.Getenv("OTEL_LOG_LEVEL"))]; !ok {
        // If OTEL_LOG_LEVEL is not set or does not match, fall back to the default level.
        logLevel = -1
    }
    core := zapcore.NewCore(
        zapcore.NewConsoleEncoder(cfg.EncoderConfig),
        zapcore.NewMultiWriteSyncer(otelLogFileWriter),
        zap.NewAtomicLevelAt(logLevel),
    )
    logger := zap.New(core, zap.AddCaller(), zap.AddCallerSkip(1))
    Logger = zapr.NewLogger(logger)
    ...
}
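For context, here is roughly how that Logger gets wired into the SDK (a sketch, assuming the Logger built above). The -8/-4 zap levels matter because the SDK logs its debug messages at V(8) and info messages at V(4). Note that export failures are not written through this logger at all but go to the global error handler, so it is worth registering one too:

import "go.opentelemetry.io/otel"

func setupOtelLogging() {
    // Hand the zapr-backed logr.Logger to the SDK.
    otel.SetLogger(Logger)

    // Export errors (e.g. a failing OTLP/gRPC export) are reported here,
    // not through the logger above.
    otel.SetErrorHandler(otel.ErrorHandlerFunc(func(err error) {
        Logger.Error(err, "otel sdk error")
    }))
}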
I'm sure the DEBUG-level logs are printed out, since the line (https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/trace/batch_span_processor.go#L287) prints a DEBUG message, and that message does appear in my log file.
Hi @dmathieu, here comes an extraordinarily strange thing:
We manually counted the spans the SDK reports as exported in the log, and the number doesn't match the amount received by the collector!?
Well, it's pretty funny...
To rule out interference from Jaeger, we commented out all the Jaeger SDK code and kept only the otel SDK. The problem is still there:
753,980 traces are lost, almost 27%.
Hi @dmathieu, I can't see your reply on GitHub, but I did see it in email.
It was my message, and I removed it about a minute after I saw your log output.
Hi @pellared, copy that. By the way, do you have any advice on this problem? I've been stuck here for almost a week. XP
From what you wrote, it sounds more like an issue with the Collector than with the SDK. As far as I remember, it is possible to get logs of how many spans are processed by the collector. Maybe the problem is in https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/clickhouseexporter (bad configuration?), which I assume you are using?
Hi @pellared, but the collector's metrics show nothing wrong: it processed and wrote as many spans as it received.
In fact, I was using my self-written ClickHouse exporter at first, and to rule that out I switched to opentelemetry-collector-contrib's clickhouse exporter, but the problem still exists.
To further prove it is not a collector problem, I will deploy a new opentelemetry-collector-contrib instance configured with https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/connector/countconnector to count how many spans have been received, and will append the result tomorrow.
Hi @pellared, the problem still exists with the count connector.
Almost 900,000 spans were lost over the past 12 hours.
Below is the logging from the SDK, nothing special at all (total_dropped is 1032 because I restarted the collector yesterday, so some spans were dropped).
Below is the config of the collector:
connectors:
  count: null
exporters:
  clickhouse:
    xxx: xxx
extensions:
  health_check: {}
  pprof:
    endpoint: :1777
processors:
  batch:
    send_batch_max_size: 1000000
    send_batch_size: 100000
    timeout: 60s
  cumulativetodelta: {}
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.statefulset.name
        - k8s.daemonset.name
        - k8s.cronjob.name
        - k8s.job.name
        - k8s.node.name
        - k8s.pod.name
        - k8s.pod.uid
        - k8s.pod.start_time
    passthrough: false
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip
      - sources:
          - from: resource_attribute
            name: k8s.pod.uid
      - sources:
          - from: connection
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 25
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: ${env:MY_POD_IP}:4317
      http:
        endpoint: ${env:MY_POD_IP}:4318
service:
  extensions:
    - health_check
    - pprof
  pipelines:
    logs:
      exporters:
        - clickhouse
      processors:
        - k8sattributes
        - batch
        - memory_limiter
      receivers:
        - otlp
    metrics:
      exporters:
        - clickhouse
      processors:
        - k8sattributes
        - batch
        - memory_limiter
        - cumulativetodelta
      receivers:
        - count
    traces:
      exporters:
        - count
      processors:
        - k8sattributes
        - batch
        - memory_limiter
      receivers:
        - otlp
  telemetry:
    logs:
      development: false
      level: INFO
    metrics:
      address: ${env:MY_POD_IP}:8888
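Since the config above exposes the collector's self-telemetry on ${env:MY_POD_IP}:8888, the span counters there can be compared directly against what the SDK log claims to have exported. A hypothetical helper for doing that (the otelcol_receiver_accepted_spans / otelcol_receiver_refused_spans / otelcol_exporter_sent_spans names are the collector's usual ones; adjust the address and names to your deployment):

package main

import (
    "bufio"
    "fmt"
    "net/http"
    "strings"
)

func main() {
    // Placeholder address: point this at the collector pod's telemetry port.
    resp, err := http.Get("http://COLLECTOR_POD_IP:8888/metrics")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    sc := bufio.NewScanner(resp.Body)
    for sc.Scan() {
        line := sc.Text()
        // Print only the span-related counters, skipping comment lines.
        if strings.Contains(line, "_spans") && !strings.HasPrefix(line, "#") {
            fmt.Println(line)
        }
    }
}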
How about trying to diagnose it, e.g. with Wireshark, to double-check the communication between the application (OTLP exporter) and the OTel Collector?
Hi @pellared, we simply changed BatchSpanProcessor to SimpleSpanProcessor on the client side this weekend:
The result is that no data is lost now!
This is quite strange, as we can't trigger any data loss with BatchSpanProcessor locally, no matter how much concurrency we throw at it...
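For reference, the change amounts to this (a sketch using the same traceExp and res variables as the other snippets in this thread); SimpleSpanProcessor exports each span synchronously on End, so there is no queue to overflow, at the cost of one export call per span:

tracerProvider = sdktrace.NewTracerProvider(
    sdktrace.WithResource(res),
    // Synchronous export on every span end: no queue, nothing to drop,
    // but every End() blocks on the exporter.
    sdktrace.WithSpanProcessor(sdktrace.NewSimpleSpanProcessor(traceExp)),
    sdktrace.WithSampler(sdktrace.AlwaysSample()),
)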
This definitely still seems to point towards the batch processor's queue filling up. Have you tried setting a higher max queue size (the default is 2048), or a lower batch export timeout (the default is 5s), to see if there's any behavior change?
Hi @dmathieu, after that, we first changed the client from otlptracegrpc to otlptracehttp, and at the same time raised the batch span processor options:
tracerProvider = sdktrace.NewTracerProvider(
    sdktrace.WithResource(res),
    sdktrace.WithSpanProcessor(sdktrace.NewBatchSpanProcessor(
        traceExp,
        sdktrace.WithMaxExportBatchSize(1024),
        sdktrace.WithBatchTimeout(10*time.Second),
        sdktrace.WithMaxQueueSize(4096),
    )),
    sdktrace.WithSampler(sdktrace.AlwaysSample()),
)
Then we found everything worked as expected. To figure out whether the problem was related to gRPC or to the options, we removed the raised options and used the batch span processor defaults with the HTTP exporter, and everything still worked great...
I believe this problem is related to the gRPC exporter.
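For completeness, the exporter swap itself is just the constructor (a sketch; the endpoint value is a placeholder). otlptracehttp sends OTLP over HTTP/protobuf, by default to port 4318, instead of gRPC on 4317:

import (
    "context"

    "go.opentelemetry.io/otel/exporters/otlp/otlptrace"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
)

func newHTTPExporter(ctx context.Context) (*otlptrace.Exporter, error) {
    return otlptracehttp.New(ctx,
        otlptracehttp.WithEndpoint("COLLECTOR_HOST:4318"), // placeholder host
        otlptracehttp.WithInsecure(),                      // assuming a plaintext collector endpoint
    )
}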
We conducted a test by sending 400K spans to the collector over both gRPC and HTTP via telemetrygen. To handle this load, we increased MaxExportBatchSize and MaxQueueSize on the BatchSpanProcessor. During the test, we discovered the need to also increase the maximum receive message size for the gRPC receiver and the maximum request body size for the HTTP receiver in the collector. After applying these configuration changes, we successfully sent 400K spans without any data loss.
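For anyone hitting the same limits, the receiver-side settings referred to above look roughly like this (option names per the collector's gRPC/HTTP server config; the values are examples, so check them against your collector version):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: ${env:MY_POD_IP}:4317
        max_recv_msg_size_mib: 32        # example; default is 4 MiB
      http:
        endpoint: ${env:MY_POD_IP}:4318
        max_request_body_size: 33554432  # example, in bytes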
Do you think it's possible to change the batching implementation to somehow detect that the current trace is "incomplete" and either wait a bit for the rest of its spans, or just "drop" all spans with the given trace ID?
Basically, never allow an incomplete trace.
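Something like this could probably be prototyped outside the SDK as a custom SpanProcessor. A hypothetical, deliberately unbounded sketch (no timeout, no memory cap, names made up) that only releases a trace's spans once its locally-rooted span ends:

package tracebuf

import (
    "context"
    "sync"

    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    "go.opentelemetry.io/otel/trace"
)

type traceCompleteProcessor struct {
    next sdktrace.SpanProcessor

    mu      sync.Mutex
    pending map[trace.TraceID][]sdktrace.ReadOnlySpan
}

func NewTraceCompleteProcessor(next sdktrace.SpanProcessor) sdktrace.SpanProcessor {
    return &traceCompleteProcessor{next: next, pending: map[trace.TraceID][]sdktrace.ReadOnlySpan{}}
}

func (p *traceCompleteProcessor) OnStart(parent context.Context, s sdktrace.ReadWriteSpan) {
    p.next.OnStart(parent, s)
}

func (p *traceCompleteProcessor) OnEnd(s sdktrace.ReadOnlySpan) {
    tid := s.SpanContext().TraceID()

    p.mu.Lock()
    p.pending[tid] = append(p.pending[tid], s)
    // Treat a span without a local parent as this process's root for the trace.
    root := !s.Parent().IsValid() || s.Parent().IsRemote()
    var batch []sdktrace.ReadOnlySpan
    if root {
        batch = p.pending[tid]
        delete(p.pending, tid)
    }
    p.mu.Unlock()

    // Forward the buffered spans to the wrapped processor (e.g. the batch processor).
    for _, span := range batch {
        p.next.OnEnd(span)
    }
}

func (p *traceCompleteProcessor) Shutdown(ctx context.Context) error   { return p.next.Shutdown(ctx) }
func (p *traceCompleteProcessor) ForceFlush(ctx context.Context) error { return p.next.ForceFlush(ctx) }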
Could this issue be related to #4266?
Description
I'm upgrading my production module from Jaeger 1.9 (yes, you read that correctly, 1.9) to OpenTelemetry with a collector.
This module consumes a Kafka message, handles it, then writes it back to Kafka, with 11 sub-steps under a parent span named Consume. Its correct tracing diagram is as follows:
This module uploads Jaeger spans and otel spans at the same time; take one sub-step for example:
And I'm sure this module has AlwaysSample configured in both Jaeger and OpenTelemetry.
But in production, OpenTelemetry loses many traces and spans. For example, over the past weekend we processed 2,588,620 messages and got 2,336,808 traces in Jaeger, but only 775,448 traces in OpenTelemetry.
I'm sure the collector works as expected:
the collector's metrics show everything working well, no data was refused or dropped, and no error logs were found in the collector.
After analysis, I have some on-site evidence:
The amount of data at each stage is almost equal.
I have 8 pods in production, every pod has reported spans, and the reported volume is considerable.
Here comes a strange thing: many traces have no parent span Consume (every trace should have it at the very least). In fact, plenty of traces contain only a single span, and that span is a random sub-span from one of the 11 steps.
As you can see, otel seems to lose many more traces than Jaeger, and the traces that do exist may be missing many spans.
I have tried this in a development environment with docker compose and the demo, and no loss happens there. :(
Environment
Steps To Reproduce
Pass subCtx to every sub-step and create sub-spans there:
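A simplified sketch of that structure (the step names here are made up; the real module has 11 sub-steps under Consume):

package consumer

import (
    "context"

    "go.opentelemetry.io/otel"
)

var tracer = otel.Tracer("translator")

func handleMessage(ctx context.Context) {
    // Parent span for the whole message.
    subCtx, parent := tracer.Start(ctx, "Consume")
    defer parent.End()

    step(subCtx, "decode")
    step(subCtx, "translate")
    // ...the remaining sub-steps, all children of Consume via subCtx.
}

func step(ctx context.Context, name string) {
    _, span := tracer.Start(ctx, name)
    defer span.End()
    // Do the actual work for this sub-step.
}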
Expected behavior
The trace count should be equal (or at least almost equal) to the number of messages processed, just like with Jaeger.
I'm not sure whether it's a bug in the SDK or a misuse of the SDK, but I have run out of ideas. Thank you for any advice.