Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Also we seem to observe high memory usage in the OTEL collector when these exceptions occur (we had many exceptions at ~08:30):
Mmm, that is interesting regarding the memory spike. It would be awesome if you're able to grab some pprof samples to help paint a bigger picture of what happened. My guess would be that a network blip caused internal back pressure.
Interesting to note that the exporter isn't respecting the byte size limit; let me see if I can quickly chase that one up for you.
It looks like the issue is upstream from what I can tell, specifically this line.
The issue is: if you have a message size of 1_000_000 bytes (1MB) and your topic limit is also 1MB, the local check passes, because the 1MB > 1MB comparison returns false, but the broker then checks the size as 1MB >= 1MB and denies the message.
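To make the suspected off-by-one concrete, here is a minimal Go sketch of the two checks described above (illustrative only; the function names and the exact comparisons are assumptions based on this thread, not the actual sarama or broker code):

```go
package main

import "fmt"

const limit = 1_000_000 // a hypothetical 1MB limit configured on both sides

// producerAccepts mimics a local check that only rejects messages
// strictly larger than the limit (i.e. "size > limit" rejects).
func producerAccepts(size int) bool {
	return size <= limit
}

// brokerAccepts mimics the suspected broker-side behaviour, which also
// rejects a message that is exactly at the limit.
func brokerAccepts(size int) bool {
	return size < limit
}

func main() {
	size := 1_000_000 // a message exactly at the limit
	fmt.Println("producer accepts:", producerAccepts(size)) // true
	fmt.Println("broker accepts:  ", brokerAccepts(size))   // false -> "Message was too large"
}
```

If the comparison really is the culprit, setting the producer limit one byte lower than the broker limit (the 999999 experiment suggested below) should make the local check reject the boundary case before it ever reaches the broker.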
Let me see if I can raise an issue on the project and link it back here.
The Kafka docs aren't really forthcoming on what the comparison should look like: https://kafka.apache.org/documentation/#topicconfigs_max.message.bytes.
If you don't mind doing an experiment on my behalf, could you set:
producer:
  max_message_bytes: 999999
If the errors disappear this confirms the comparison and I can report it upstream to the library.
Note that we have set producer.max_message_bytes: 1000000 in the Kafka exporter, while in the Kafka broker it is 1048588 bytes, roughly 1 MiB (https://kafka.apache.org/30/documentation.html#brokerconfigs_message.max.bytes), but I'll ask if we can test with 999999.
Hi, we've been running it over the weekend with the max message size set to 999999 and we're still seeing the exact same issue.
otel-agent-746f58cfc4-t8tmg otel-agent 2023-05-22T09:34:45.141Z info exporterhelper/queued_retry.go:434 Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "traces", "name": "kafka/spans", "error": "Failed to deliver 1 messages due to kafka server: Message was too large, server rejected it to avoid allocation error", "interval": "15.387055906s"}
otel-agent-746f58cfc4-t8tmg otel-agent 2023-05-22T09:35:00.530Z error exporterhelper/queued_retry.go:176 Exporting failed. No more retries left. Dropping data. {"kind": "exporter", "data_type": "traces", "name": "kafka/spans", "error": "max elapsed time expired Failed to deliver 1 messages due to kafka server: Message was too large, server rejected it to avoid allocation error", "dropped_items": 1208}
otel-agent-746f58cfc4-t8tmg otel-agent go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).onTemporaryFailure
otel-agent-746f58cfc4-t8tmg otel-agent go.opentelemetry.io/collector/exporter@v0.77.0/exporterhelper/queued_retry.go:176
otel-agent-746f58cfc4-t8tmg otel-agent go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
otel-agent-746f58cfc4-t8tmg otel-agent go.opentelemetry.io/collector/exporter@v0.77.0/exporterhelper/queued_retry.go:418
otel-agent-746f58cfc4-t8tmg otel-agent go.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send
otel-agent-746f58cfc4-t8tmg otel-agent go.opentelemetry.io/collector/exporter@v0.77.0/exporterhelper/traces.go:137
otel-agent-746f58cfc4-t8tmg otel-agent go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
otel-agent-746f58cfc4-t8tmg otel-agent go.opentelemetry.io/collector/exporter@v0.77.0/exporterhelper/queued_retry.go:206
otel-agent-746f58cfc4-t8tmg otel-agent go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func1
otel-agent-746f58cfc4-t8tmg otel-agent go.opentelemetry.io/collector/exporter@v0.77.0/exporterhelper/internal/bounded_memory_queue.go:58
Hi, could someone have another look at this issue? Since we noticed that reducing the internal batch size seems to reduce the frequency of this error (but does not eliminate it), we were wondering whether the internal batch is split in the Kafka exporter to make sure the message sent to Kafka stays below the max message size?
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
Do not use the batch processor; I resolved it by connecting the receiver to the kafka exporter directly.
We will test that. But it should work fine with the batch processor and not drop batches because they are too large (a batch could be split into smaller batches)
The kafkaexporter's sending_queue setting can be configured.
Unfortunately, removing the batch processor did not help. Moreover, without a batch processor the memory usage increases a lot, and on the other hand the batch processor is recommended precisely to control memory usage.
When many of these big messages come into the pipeline, the queue on that pod fills up and then we get these errors:
2023-08-11T08:31:44.512Z warn batchprocessor@v0.77.0/batch_processor.go:190 Sender failed {"kind": "processor", "name": "batch", "pipeline": "traces", "error": "sending_queue is full"}
2023-08-11T08:31:44.548Z warn batchprocessor@v0.77.0/batch_processor.go:190 Sender failed {"kind": "processor", "name": "batch", "pipeline": "traces", "error": "sending_queue is full"}
And increasing sending_queue did not help much either. When the rate of these big messages is too high, it's just a matter of time before the queue gets full on the pod.
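To illustrate why increasing the queue size only delays the failure, here is a minimal Go sketch of a bounded queue that rejects new batches once it is full (illustrative only; the names and structure are made up and this is not the collector's actual sending_queue implementation):

```go
package main

import (
	"errors"
	"fmt"
)

var errQueueFull = errors.New("sending_queue is full")

// boundedQueue is a simplified stand-in for the exporter's sending_queue:
// a fixed-capacity buffer that rejects new items once it is full.
type boundedQueue struct {
	items chan []byte
}

func newBoundedQueue(size int) *boundedQueue {
	return &boundedQueue{items: make(chan []byte, size)}
}

// enqueue adds a batch without blocking; if the buffer is full the batch
// is rejected, which is what produces the "sending_queue is full" warning.
func (q *boundedQueue) enqueue(batch []byte) error {
	select {
	case q.items <- batch:
		return nil
	default:
		return errQueueFull
	}
}

func main() {
	q := newBoundedQueue(2)
	// If the consumers are stuck retrying batches that can never be delivered
	// (e.g. "Message was too large"), nothing drains the queue, so once it
	// fills up every new batch is dropped, no matter how large the queue is.
	for i := 0; i < 3; i++ {
		if err := q.enqueue(make([]byte, 1_000_000)); err != nil {
			fmt.Println("batch", i, "dropped:", err)
		}
	}
}
```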
We decided to temporarily disable retry_on_failure until this issue gets resolved. Disabling retry_on_failure at least keeps the pipeline up and running, and the queue does not fill up because the big messages are dropped immediately.
I see this PR which looks promising. Looking forward to it being released.
@MovieStoreGuy sorry for the ping, but would you mind having another look at the issue and PR #25144? We're actually a bit surprised that more people aren't running into this issue. The kafka exporter module should break up (split / cut) the batch it receives so it fits into max_message_bytes. If you have peak traffic with higher volumes, it is very likely that the batch the OTEL collector hands over to the kafka exporter module will exceed max_message_bytes. The same issue will happen at some point whether you have the batch processor enabled or not; the likelihood mainly depends on traffic volume (spikes), configured resources, and batching/queueing settings.
Disabling retry_on_failure is just a temporary workaround, and it also disables retries for temporary issues (like network connection drops).
@pavolloffay Sorry for the ping, but would you mind having a look?
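To make the splitting idea above concrete, here is a rough Go sketch of chunking a batch so that each produced Kafka message stays under max_message_bytes (the helper names are hypothetical; this is not the actual kafkaexporter code or the linked PR):

```go
package main

import "fmt"

// span is a stand-in for one serialized span; in the real exporter the batch
// would be a traces payload serialized by the configured encoding.
type span []byte

// splitBatch groups spans into chunks whose combined size stays within
// maxMessageBytes, so each chunk can be sent as its own Kafka message.
// A single span larger than the limit still ends up alone in an oversized
// chunk; handling that case would require further splitting or dropping.
func splitBatch(spans []span, maxMessageBytes int) [][]span {
	var chunks [][]span
	var current []span
	currentSize := 0
	for _, s := range spans {
		if currentSize+len(s) > maxMessageBytes && len(current) > 0 {
			chunks = append(chunks, current)
			current = nil
			currentSize = 0
		}
		current = append(current, s)
		currentSize += len(s)
	}
	if len(current) > 0 {
		chunks = append(chunks, current)
	}
	return chunks
}

func main() {
	// Three ~400KB spans against a 1MB limit: instead of one ~1.2MB message
	// (which the broker rejects), we get a 2-span message and a 1-span message.
	spans := []span{make(span, 400_000), make(span, 400_000), make(span, 400_000)}
	for i, chunk := range splitBatch(spans, 1_000_000) {
		fmt.Printf("message %d: %d span(s)\n", i, len(chunk))
	}
}
```

A real implementation would also have to account for encoding and per-message protocol overhead rather than raw byte lengths.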
I have the same issue:
2023-09-11T12:13:44.130Z info exporterhelper/queued_retry.go:423 Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "traces", "name": "kafka", "error": "Failed to deliver 1 messages due to kafka server: Message was too large, server rejected it to avoid allocation error", "interval": "17.602136766s"}
opentelemetry-collector: v0.79.0, config:
exporters:
  kafka:
    brokers:
      - kafka.default:9092
    encoding: jaeger_proto
    producer:
      max_message_bytes: 900000
    protocol_version: 2.0.0
    sending_queue:
      queue_size: 1000000
      storage: file_storage
    topic: live
I reduced the max_message_bytes from the default to 900000 to confirm the theory.
Kafka version: 3.3.2
max.request.size=1048576
message.max.bytes=1000012
After some time, this error occurs too:
2023-09-11T12:51:33.586Z error exporterhelper/queued_retry.go:174 Exporting failed. Putting back to the end of the queue. {"kind": "exporter", "data_type": "traces", "name": "kafka", "error": "max elapsed time expired Failed to deliver 1 messages due to kafka server: Message was too large, server rejected it to avoid allocation error"}
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).onTemporaryFailure
go.opentelemetry.io/collector/exporter@v0.79.0/exporterhelper/queued_retry.go:174
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
go.opentelemetry.io/collector/exporter@v0.79.0/exporterhelper/queued_retry.go:407
go.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send
go.opentelemetry.io/collector/exporter@v0.79.0/exporterhelper/traces.go:126
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
go.opentelemetry.io/collector/exporter@v0.79.0/exporterhelper/queued_retry.go:195
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*persistentQueue).StartConsumers.func1
go.opentelemetry.io/collector/exporter@v0.79.0/exporterhelper/internal/persistent_queue.go:55
I think my issue occurs because of a poison message which will be retried forever... That's why I removed and recreated the underlying PV and pod to check the result. The pod then no longer logged exporting failures, while other replicas still did.
It seems that in my case it normally works. Nevertheless, at some point messages too big for Kafka came through and the collector can't handle them.
Any updates on this topic? The PR is closed but not merged; the problem still persists.
I'm also experiencing this issue. Increasing the number of threads and instances helped me reduce its occurrence, but it keeps happening.
We also have the same issue
I am assuming this hasn't been merged due to the construction of the exporter helpers, but could you clarify what the status of this is, @MovieStoreGuy?
Hi @MovieStoreGuy, sorry for the ping. Just wondering, is there any update on this issue? We also have the same problem.
Hi, we are also facing the same issue
We also have the same issue
Enabling compression on the producer side has helped the situation for us. Config sample below:
exporters:
  kafka:
    timeout: "5s"
    protocol_version: 2.0.0
    topic: otlp_spans
    encoding: otlp_proto
    brokers:
      - kafka.aarvee.svc:9092
    client_id: "controller_broker_client"
    auth:
      sasl:
        username: "kafkatest"
        password: "supersecret"
        mechanism: "PLAIN"
    metadata:
      full: true
      retry:
        max: 1
        backoff: "250ms"
    retry_on_failure:
      enabled: true
      initial_interval: "1s"
      max_interval: "3s"
      max_elapsed_time: "30s"
    sending_queue:
      enabled: true
      num_consumers: 20
      queue_size: 5000000
    producer:
      max_message_bytes: 109657600
      # flush_max_messages: 5
      compression: gzip
      required_acks: 1
zstd is a lot more efficient, but compute heavy, and it requires Kafka protocol v2.1.
We use snappy compression and also have the same issue.
same issue with snappy
This issue has been closed as inactive because it has been stale for 120 days with no activity.
Component(s)
exporter/kafka
What happened?
Description
We had an initial setup of an OTLP receiver receiving spans and writing them to a Kafka topic via the Kafka exporter, which returned many errors like:
We have a simple configuration with many default settings (see below). Then we reduced the batch size with:
which greatly reduced the amount of errors we saw, because the internal batch size got smaller. However, we still see them occur now and then. This leads us to believe that the internal batches are not split/processed to accommodate the producer.max_message_bytes setting.
Steps to Reproduce
Use default settings for the batch processor and the kafka exporter, and the default 1MB max message size in the Kafka cluster. Then generate many spans to make sure the internal collector batches exceed 1MB.
Expected Result
We expect the internal batch to be split if it exceeds producer.max_message_bytes, to prevent errors and spans eventually being dropped.
Actual Result
producer.max_message_bytes does not seem to have any effect, and the Kafka exporter tries to send larger batches to Kafka.
Collector version
0.77.0
Environment information
Environment
OS: GKE v1.23.16-gke.1400
OpenTelemetry Collector configuration
Log output
Additional context
No response