This is the full output log: otel_log.txt
Pinging code owners for connector/routing: @jpkrohling @mwear. See Adding Labels via Comments if you do not have permissions to add labels yourself.
Hello @yanamg7, from your logs:
2023-11-06T11:09:53.092Z error exporterhelper/queue_sender.go:184 Dropping data because sending_queue is full. Try increasing queue_size. {"kind": "exporter", "data_type": "logs", "name": "otlp/2c27f38b-3399-4531-b66c-f557f646ebb0", "dropped_items": 100}
It looks like the default queue size is 1000 for the otlp exporter, but you've set it to 5 in your configuration. Have you tested using the default?
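For illustration, restoring the default on one of your otlp exporters would look roughly like this (a sketch only; the endpoint is a placeholder, and since 1000 is the exporterhelper default you could also simply omit queue_size):

exporters:
  otlp/2c27f38b-3399-4531-b66c-f557f646ebb0:
    endpoint: <your-backend>:4317   # placeholder
    sending_queue:
      enabled: true
      queue_size: 1000              # exporterhelper default; bursts get buffered instead of dropped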
Also, another error in your logs:
2023-11-06T11:09:53.719Z error consumerretry/logs.go:39 ConsumeLogs() failed. Enable retry_on_failure to slow down reading logs and avoid dropping. {"kind": "receiver", "name": "filelog", "data_type": "logs", "error": "sending_queue is full"}
Have you tried this with the filelog receiver's retry_on_failure option enabled?
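For reference, enabling it would look something like this (the interval values here are illustrative, not a recommendation):

receivers:
  filelog:
    # ... your existing include/operators config ...
    retry_on_failure:
      enabled: true
      initial_interval: 1s     # illustrative; tune to your environment
      max_interval: 30s        # illustrative
      max_elapsed_time: 5m     # illustrative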
Hi @crobert-1, I'm using exactly the same configuration as I used for the otel setup without the connector. Here it is:
extensions:
  zpages: null
receivers:
  filelog:
    exclude: []
    include:
      - /TODA/log/*.log
    include_file_name: true
    include_file_path: true
    operators:
      - field: attributes.log
        type: add
        value: unspecified
      - field: attributes.ct
        type: add
        value: unspecified
      - field: attributes.scid
        type: add
        value: unspecified
      - field: attributes.sysid
        type: add
        value: unspecified
      - field: attributes.sev
        type: add
        value: unspecified
      - field: attributes.ts
        type: add
        value: 1970-01-01 00:00:00.000000 UTC
      - field: attributes.mid
        type: add
        value: unspecified
      - type: json_parser
      - mapping:
          fatal: emergency
          info4: notice
        parse_from: attributes.sev
        type: severity_parser
      - from: attributes.log
        id: move1
        to: resource["Log Message"]
        type: move
      - from: attributes.ct
        id: move2
        to: resource["Container Name"]
        type: move
      - from: attributes.scid
        id: move3
        to: resource["Unique Container ID"]
        type: move
      - from: attributes.sysid
        id: move4
        to: resource["System ID"]
        type: move
      - from: attributes.sev
        id: move5
        to: resource["Severity"]
        type: move
      - from: attributes.ts
        id: move6
        to: resource["Timestamp"]
        type: move
      - from: attributes.mid
        id: move7
        to: resource["Message ID"]
        type: move
    resource:
      - dataType: bigipn-pods-log
    start_at: beginning
  otlp/mtls:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4517
        tls:
          cert_file: ${OTEL_TLS_CERT_FILE}
          client_ca_file: ${OTEL_TLS_CA_FILE}
          key_file: ${OTEL_TLS_KEY_FILE}
          min_version: "1.2"
          reload_interval: 100000
processors:
  batch: null
exporters:
  file:
    path: /var/log/otlp.json
  logging:
    sampling_initial: 0
    sampling_thereafter: 0
    verbosity: basic
  otlp/2c27f38b-3399-4531-b66c-f557f646ebb0:
    endpoint: 10.144.74.39:4417
    sending_queue:
      enabled: true
      queue_size: 5
    retry_on_failure:
      enabled: true
      initial_interval: 2s
      max_elapsed_time: 6s
    tls:
      insecure: true
  otlp/ccb8b939-5dab-465f-bcf4-921385d8001d:
    endpoint: 10.144.74.39:4517
    sending_queue:
      enabled: true
      queue_size: 5
    retry_on_failure:
      enabled: true
      initial_interval: 2s
      max_elapsed_time: 6s
    tls:
      insecure: true
  otlp/fd71bd0c-de38-4512-8a78-f07918f9a0ea:
    endpoint: 10.144.74.39:4317
    sending_queue:
      enabled: true
      queue_size: 5
    retry_on_failure:
      enabled: true
      initial_interval: 2s
      max_elapsed_time: 6s
    tls:
      insecure: true
service:
  pipelines:
    logs:
      exporters:
        - logging
        - otlp/fd71bd0c-de38-4512-8a78-f07918f9a0ea
        - otlp/2c27f38b-3399-4531-b66c-f557f646ebb0
        - otlp/ccb8b939-5dab-465f-bcf4-921385d8001d
      processors:
        - batch
      receivers:
        - otlp/mtls
        - filelog
    metrics:
      exporters:
        - logging
        - otlp/fd71bd0c-de38-4512-8a78-f07918f9a0ea
        - otlp/2c27f38b-3399-4531-b66c-f557f646ebb0
        - otlp/ccb8b939-5dab-465f-bcf4-921385d8001d
      processors:
        - batch
      receivers:
        - otlp/mtls
  extensions:
    - zpages
It works OK, with no errors and no dropped resources. We need the routing connector for our logic; when I try to add it, I get these problems. I'll try the recommended default queue size and retry_on_failure for the receiver, but the question is why this configuration works without the routing connector and drops so many resources with it.
Hi @crobert-1. I can't use the default queue size: the pod's memory is limited, and in this case I prefer to lose some resources rather than crash with a memory issue. I still have the same question: with multiple receivers it works OK, but with the routing connector it works slowly.
Hello @yanamg7, I'm going to defer to the code owners for an answer here, as they are better suited to answer this.
I expect that this is simply a result of the queue size being so small. The routing connector may introduce a "small" delay in the data pipeline (checking data values and figuring out where to route them), and this "small" delay may be enough to overload your receiver's queue. This is just a hunch though; it may be something more, which is why I'd prefer to let others who are more informed take a look.
Hi @yanamg7. I suspect that @crobert-1's hunch is probably correct. If you wanted to compare the performance with the routing processor, you could. If you try the processor, you'll have to update your routing rules to route to exporters (instead of pipelines), and you'll have to prefix your OTTL statements with "resource.". See the config here for an example. If you do this and you find a notable difference, please let us know.
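Since your routing rules aren't posted in this thread, here's a hypothetical sketch of what the processor form could look like; the attribute name, statement, and exporter choices are placeholders, and the important parts are routing to exporters and the "resource." prefix:

processors:
  routing:
    default_exporters:
      - otlp/fd71bd0c-de38-4512-8a78-f07918f9a0ea
    table:
      - statement: route() where resource.attributes["dataType"] == "bigipn-pods-log"   # note the resource. prefix
        exporters:
          - otlp/2c27f38b-3399-4531-b66c-f557f646ebb0

With the connector, the equivalent table entry would route to pipelines instead of exporters and would drop the "resource." prefix from the statement.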
I'm not sure if you have experimented with the memory limiter processor, but it might be another way to help you limit resources while using a larger queue size.
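If you do try it, a minimal sketch could look like this, merged into your existing pipelines (the limits below are placeholders chosen to leave headroom under your 3Gi pod limit; memory_limiter should be the first processor in the chain):

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2500        # placeholder; leave headroom below the 3Gi pod limit
    spike_limit_mib: 500   # placeholder
  batch: null
service:
  pipelines:
    logs:
      # keep your existing receivers/exporters
      processors:
        - memory_limiter
        - batch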
I'll try to run a similar setup locally, but without access to the metrics, it's hard to tell which component is the bottleneck here.
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
This issue has been closed as inactive because it has been stale for 120 days with no activity.
Component(s)
No response
What happened?
Description
We conducted a POC of the routing methods. When we used routing connectors, we saw resources being dropped and memory usage increase. We sent 15K resources per second, logs and metrics, through the otlp and filelog receivers. When we ran the same traffic through multiple receivers instead (the otlp receiver on several ports, the filelog receiver matching different file names), the traffic was processed without errors. I first tried it without profiling, but after seeing the errors I added profiling.
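The routing connector configuration itself is not included in this report; for illustration only, a connector-based layout of this kind generally looks like the following (pipeline names and the routing rule are placeholders, not the actual rules used in the POC):

connectors:
  routing:
    default_pipelines:
      - logs/default
    table:
      - statement: route() where attributes["dataType"] == "bigipn-pods-log"   # placeholder rule
        pipelines:
          - logs/pods
service:
  pipelines:
    logs/in:
      receivers: [otlp/mtls, filelog]
      processors: [batch]
      exporters: [routing]
    logs/pods:
      receivers: [routing]
      exporters: [otlp/2c27f38b-3399-4531-b66c-f557f646ebb0]
    logs/default:
      receivers: [routing]
      exporters: [otlp/fd71bd0c-de38-4512-8a78-f07918f9a0ea]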
Steps to Reproduce
Run traffic at 15K resources per second.
Expected Result
No dropped resources
Actual Result
Dropped resources and errors, increased memory and CPU usage
Collector version
otelcol-contrib version 0.88.0
Environment information
Environment
Ubuntu 22.04.2 LTS
Pod resources:
  limits:
    memory: 3Gi
  requests:
    memory: 500Mi
OpenTelemetry Collector configuration
Log output
Additional context
No response