open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Performance and dropped resources #28900

Closed (yanamg7 closed this issue 6 months ago)

yanamg7 commented 11 months ago

Component(s)

No response

What happened?

Description

We conducted a POC for routing methods. When we used routing connectors, we saw resources being dropped and memory usage increasing. We sent 15K resources per second (logs and metrics) through the otlp and filelog receivers. We then ran the same traffic using multiple receivers instead (the otlp receiver on several ports, the filelog receiver with different file name matching patterns), and that traffic was processed without errors. I initially ran without profiling, but after seeing errors I enabled profiling.

Steps to Reproduce

Run traffic at 15K resources per second.

Expected Result

No dropped resources

Actual Result

Dropped resources and errors, with increased memory and CPU usage.

Collector version

otelcol-contrib version 0.88.0

Environment information

Environment

Ubuntu 22.04.2 LTS
Pod resources: limits: memory: 3Gi; requests: memory: 500Mi

OpenTelemetry Collector configuration

---
extensions:
  zpages: null
  pprof/1:
        endpoint: ":1777"
        block_profile_fraction: 3
        mutex_profile_fraction: 5
        save_to_file: "/prof_file.ppf"
receivers:
  filelog:
    exclude: []
    include:
      - /TODA/bench_log/*.log
    include_file_name: true
    include_file_path: true
    operators:
      - field: attributes.log
        type: add
        value: unspecified
      - field: attributes.ct
        type: add
        value: unspecified
      - field: attributes.scid
        type: add
        value: unspecified
      - field: attributes.sysid
        type: add
        value: unspecified
      - field: attributes.sev
        type: add
        value: unspecified
      - field: attributes.ts
        type: add
        value: 1970-01-01 00:00:00.000000 UTC
      - field: attributes.mid
        type: add
        value: unspecified
      - type: json_parser
      - mapping:
          fatal: emergency
          info4: notice
        parse_from: attributes.sev
        type: severity_parser
      - from: attributes.log
        id: move1
        to: resource["Log Message"]
        type: move
      - from: attributes.ct
        id: move2
        to: resource["Container Name"]
        type: move
      - from: attributes.scid
        id: move3
        to: resource["Unique Container ID"]
        type: move
      - from: attributes.sysid
        id: move4
        to: resource["System ID"]
        type: move
      - from: attributes.sev
        id: move5
        to: resource["Severity"]
        type: move
      - from: attributes.ts
        id: move6
        to: resource["Timestamp"]
        type: move
      - from: attributes.mid
        id: move7
        to: resource["Message ID"]
        type: move  

    resource:
      - dataType: bigipn-pods-log
    start_at: beginning

  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4617
        tls:
          cert_file: ${OTEL_TLS_CERT_FILE}
          client_ca_file: ${OTEL_TLS_CA_FILE}
          key_file: ${OTEL_TLS_KEY_FILE}
          min_version: "1.2"
          reload_interval: 100000

processors:
  batch: null

exporters:
    file:
      path: /var/log/otlp.json
    logging:
      sampling_initial: 0
      sampling_thereafter: 0
      verbosity: basic
    otlp/2c27f38b-3399-4531-b66c-f557f646ebb0:
        endpoint: 10.144.74.39:4417
        sending_queue:
            enabled: true
            queue_size: 5
        retry_on_failure:
            enabled: true
            initial_interval: 2s
            max_elapsed_time: 6s
        tls:
            insecure: true
    otlp/ccb8b939-5dab-465f-bcf4-921385d8001d:
        endpoint: 10.144.74.39:4517
        sending_queue:
            enabled: true
            queue_size: 5
        retry_on_failure:
            enabled: true
            initial_interval: 2s
            max_elapsed_time: 6s
        tls:
            insecure: true
    otlp/fd71bd0c-de38-4512-8a78-f07918f9a0ea:
        endpoint: 10.144.74.39:4317
        sending_queue:
            enabled: true
            queue_size: 5
        retry_on_failure:
            enabled: true
            initial_interval: 2s
            max_elapsed_time: 6s
        tls:
            insecure: true
connectors:
  routing/logs:
      default_pipelines: [logs/otlp-all]
      error_mode: propagate
      table:
        - statement: route() where IsMatch(attributes["FileName"], ".*access.*log")
          pipelines: [logs/access]
        - statement: route() where IsMatch(attributes["FileName"], ".*sec.*log")
          pipelines: [logs/sec]
        - statement: route() where not IsMatch(attributes["FileName"], ".*sec.*log") and not IsMatch(attributes["FileName"], ".*access.*log") 
          pipelines: [logs/container]
  routing/metrics:
      default_pipelines: [metrics/1]
      error_mode: propagate
      table:
        - statement: route() where attributes["dataType"] == "type1"
          pipelines: [metrics/1]
        - statement: route() where attributes["dataType"] == "type2"
          pipelines: [metrics/2]
        - statement: route() where attributes["dataType"] == "type3"
          pipelines: [metrics/3]
service:
  pipelines:
    logs/in:
      receivers: [filelog]
      exporters: [routing/logs]
    metrics/in:
      receivers: [otlp]
      exporters: [routing/metrics] 
    logs/otlp-all:
      exporters:
        - logging
        - otlp/fd71bd0c-de38-4512-8a78-f07918f9a0ea
      receivers:
        - routing/logs
    logs/access:
      exporters:
        - logging
        - otlp/ccb8b939-5dab-465f-bcf4-921385d8001d
      receivers:
        - routing/logs
    logs/container:
      exporters:
        - logging
        - otlp/2c27f38b-3399-4531-b66c-f557f646ebb0
      receivers:
        - routing/logs
    logs/sec:
      exporters:
        - logging
        - otlp/fd71bd0c-de38-4512-8a78-f07918f9a0ea
      receivers:
        - routing/logs
    metrics/1:
      exporters:
        - logging
        - otlp/2c27f38b-3399-4531-b66c-f557f646ebb0
      receivers:
        - routing/metrics
    metrics/2:
      exporters:
        - logging
        - otlp/ccb8b939-5dab-465f-bcf4-921385d8001d
      receivers:
        - routing/metrics
    metrics/3:
      exporters:
        - logging
        - otlp/fd71bd0c-de38-4512-8a78-f07918f9a0ea
      receivers:
        - routing/metrics 
  extensions:
    - zpages
    - pprof/1

Log output

2023-11-06T11:09:53.092Z    error   exporterhelper/queue_sender.go:184  Dropping data because sending_queue is full. Try increasing queue_size. {"kind": "exporter", "data_type": "logs", "name": "otlp/2c27f38b-3399-4531-b66c-f557f646ebb0", "dropped_items": 100}
go.opentelemetry.io/collector/exporter/exporterhelper.(*queueSender).send
    go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/queue_sender.go:184
go.opentelemetry.io/collector/exporter/exporterhelper.(*baseExporter).send
    go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/common.go:196
go.opentelemetry.io/collector/exporter/exporterhelper.NewLogsExporter.func1
    go.opentelemetry.io/collector/exporter@v0.88.0/exporterhelper/logs.go:100
go.opentelemetry.io/collector/consumer.ConsumeLogsFunc.ConsumeLogs
    go.opentelemetry.io/collector/consumer@v0.88.0/logs.go:25
go.opentelemetry.io/collector/internal/fanoutconsumer.(*logsConsumer).ConsumeLogs
    go.opentelemetry.io/collector@v0.88.0/internal/fanoutconsumer/logs.go:75
go.opentelemetry.io/collector/consumer.ConsumeLogsFunc.ConsumeLogs
    go.opentelemetry.io/collector/consumer@v0.88.0/logs.go:25
github.com/open-telemetry/opentelemetry-collector-contrib/connector/routingconnector.(*logsConnector).ConsumeLogs
    github.com/open-telemetry/opentelemetry-collector-contrib/connector/routingconnector@v0.88.0/logs.go:97
go.opentelemetry.io/collector/consumer.ConsumeLogsFunc.ConsumeLogs
    go.opentelemetry.io/collector/consumer@v0.88.0/logs.go:25
github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal/consumerretry.(*logsConsumer).ConsumeLogs
    github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal@v0.88.0/consumerretry/logs.go:37
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/adapter.(*receiver).consumerLoop
    github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.88.0/adapter/receiver.go:124
2023-11-06T11:09:53.719Z    error   consumerretry/logs.go:39    ConsumeLogs() failed. Enable retry_on_failure to slow down reading logs and avoid dropping. {"kind": "receiver", "name": "filelog", "data_type": "logs", "error": "sending_queue is full"}
github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal/consumerretry.(*logsConsumer).ConsumeLogs
    github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal@v0.88.0/consumerretry/logs.go:39
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/adapter.(*receiver).consumerLoop
    github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.88.0/adapter/receiver.go:124
2023-11-06T11:09:53.720Z    error   adapter/receiver.go:126 ConsumeLogs() failed    {"kind": "receiver", "name": "filelog", "data_type": "logs", "error": "sending_queue is full"}
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/adapter.(*receiver).consumerLoop
    github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.88.0/adapter/receiver.go:126

Additional context

No response

yanamg7 commented 11 months ago

This is the full output log: otel_log.txt

github-actions[bot] commented 11 months ago

Pinging code owners for connector/routing: @jpkrohling @mwear. See Adding Labels via Comments if you do not have permissions to add labels yourself.

crobert-1 commented 11 months ago

Hello @yanamg7, from your logs:

2023-11-06T11:09:53.092Z    error   exporterhelper/queue_sender.go:184  Dropping data because sending_queue is full. Try increasing queue_size. {"kind": "exporter", "data_type": "logs", "name": "otlp/2c27f38b-3399-4531-b66c-f557f646ebb0", "dropped_items": 100}

It looks like the default queue size is 1000 for the otlp exporter, but you've set it to 5 in your configuration. Have you tested using this default?
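For example, keeping your existing exporter block and only raising the queue size back toward that default (a sketch; the right value depends on your memory budget):

exporters:
    otlp/2c27f38b-3399-4531-b66c-f557f646ebb0:
        endpoint: 10.144.74.39:4417
        sending_queue:
            enabled: true
            queue_size: 1000  # exporter default; a queue of 5 fills almost immediately at 15K resources/s
        retry_on_failure:
            enabled: true
            initial_interval: 2s
            max_elapsed_time: 6s
        tls:
            insecure: true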

Also, another error in your logs:

2023-11-06T11:09:53.719Z    error   consumerretry/logs.go:39    ConsumeLogs() failed. Enable retry_on_failure to slow down reading logs and avoid dropping. {"kind": "receiver", "name": "filelog", "data_type": "logs", "error": "sending_queue is full"}

Have you tried this with the filelog receiver's retry_on_failure option enabled?
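That option sits on the receiver itself; a sketch, assuming the standard filelog retry_on_failure fields:

receivers:
    filelog:
        retry_on_failure:
            enabled: true        # slows down reading instead of dropping when the downstream queue is full
            initial_interval: 1s
            max_interval: 30s
            max_elapsed_time: 5m
        # ... keep the rest of your existing filelog configuration ...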

yanamg7 commented 11 months ago

Hi @crobert-1, I'm using exactly the same configuration as I used for the collector without the connector. Here it is:

extensions:
    zpages: null
receivers:
    filelog:
        exclude: []
        include:
            - /TODA/log/*.log
        include_file_name: true
        include_file_path: true
        operators:
            - field: attributes.log
              type: add
              value: unspecified
            - field: attributes.ct
              type: add
              value: unspecified
            - field: attributes.scid
              type: add
              value: unspecified
            - field: attributes.sysid
              type: add
              value: unspecified
            - field: attributes.sev
              type: add
              value: unspecified
            - field: attributes.ts
              type: add
              value: 1970-01-01 00:00:00.000000 UTC
            - field: attributes.mid
              type: add
              value: unspecified
            - type: json_parser
            - mapping:
                fatal: emergency
                info4: notice
              parse_from: attributes.sev
              type: severity_parser
            - from: attributes.log
              id: move1
              to: resource["Log Message"]
              type: move
            - from: attributes.ct
              id: move2
              to: resource["Container Name"]
              type: move
            - from: attributes.scid
              id: move3
              to: resource["Unique Container ID"]
              type: move
            - from: attributes.sysid
              id: move4
              to: resource["System ID"]
              type: move
            - from: attributes.sev
              id: move5
              to: resource["Severity"]
              type: move
            - from: attributes.ts
              id: move6
              to: resource["Timestamp"]
              type: move
            - from: attributes.mid
              id: move7
              to: resource["Message ID"]
              type: move
        resource:
            - dataType: bigipn-pods-log
        start_at: beginning
    otlp/mtls:
        protocols:
            grpc:
                endpoint: 0.0.0.0:4517
                tls:
                    cert_file: ${OTEL_TLS_CERT_FILE}
                    client_ca_file: ${OTEL_TLS_CA_FILE}
                    key_file: ${OTEL_TLS_KEY_FILE}
                    min_version: "1.2"
                    reload_interval: 100000
processors:
    batch: null
exporters:
    file:
        path: /var/log/otlp.json
    logging:
        sampling_initial: 0
        sampling_thereafter: 0
        verbosity: basic
    otlp/2c27f38b-3399-4531-b66c-f557f646ebb0:
        endpoint: 10.144.74.39:4417
        sending_queue:
            enabled: true
            queue_size: 5
        retry_on_failure:
            enabled: true
            initial_interval: 2s
            max_elapsed_time: 6s
        tls:
            insecure: true
    otlp/ccb8b939-5dab-465f-bcf4-921385d8001d:
        endpoint: 10.144.74.39:4517
        sending_queue:
            enabled: true
            queue_size: 5
        retry_on_failure:
            enabled: true
            initial_interval: 2s
            max_elapsed_time: 6s
        tls:
            insecure: true
    otlp/fd71bd0c-de38-4512-8a78-f07918f9a0ea:
        endpoint: 10.144.74.39:4317
        sending_queue:
            enabled: true
            queue_size: 5
        retry_on_failure:
            enabled: true
            initial_interval: 2s
            max_elapsed_time: 6s
        tls:
            insecure: true
service:
    pipelines:
        logs:
            exporters:
                - logging
                - otlp/fd71bd0c-de38-4512-8a78-f07918f9a0ea
                - otlp/2c27f38b-3399-4531-b66c-f557f646ebb0
                - otlp/ccb8b939-5dab-465f-bcf4-921385d8001d
            processors:
                - batch
            receivers:
                - otlp/mtls
                - filelog
        metrics:
            exporters:
                - logging
                - otlp/fd71bd0c-de38-4512-8a78-f07918f9a0ea
                - otlp/2c27f38b-3399-4531-b66c-f557f646ebb0
                - otlp/ccb8b939-5dab-465f-bcf4-921385d8001d
            processors:
                - batch
            receivers:
                - otlp/mtls
    extensions:
        - zpages

It works OK, with no errors and no dropped resources. We need the routing connector for our logic; when I try to add it, I run into these problems. I'll try the recommended default queue size and retry_on_failure for the receivers. But the question is why this configuration works without the routing connector and drops so many resources with it.

yanamg7 commented 11 months ago

Hi @crobert-1. I can't use the default queue size; the pod is resource-limited, and in this case I prefer to lose some resources rather than crash with a memory issue. I still have the same question: with multiple receivers it works OK, but with the routing connector it is slow.

crobert-1 commented 11 months ago

Hello @yanamg7, I'm going to defer to the code owners for an answer here, as they are better suited to answer this.

I expect that this is simply a result of the queue size being so small. The routing connector may introduce a "small" delay in the data pipeline (checking data values and figuring out where to route them), and this "small" delay may be enough to overload your receiver's queue. This is just a hunch though; it may be something more, which is why I prefer to let others who are more informed take a look.

mwear commented 11 months ago

Hi @yanamg7. I suspect that @crobert-1's hunch is correct. If you want to compare performance against the routing processor, you could. If you try the processor, you'll have to update your routing rules to route to exporters (instead of pipelines) and prefix your OTTL statements with "resource.". See the config here for an example. If you do this and find a notable difference, please let us know.
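For the metrics side, a rough sketch of what that could look like, reusing the exporter names from your configuration above (exporters replace pipelines, and the OTTL statements gain the resource. prefix; the pipeline lists every exporter the table can route to):

processors:
    routing:
        default_exporters: [otlp/2c27f38b-3399-4531-b66c-f557f646ebb0]
        error_mode: propagate
        table:
            - statement: route() where resource.attributes["dataType"] == "type1"
              exporters: [otlp/2c27f38b-3399-4531-b66c-f557f646ebb0]
            - statement: route() where resource.attributes["dataType"] == "type2"
              exporters: [otlp/ccb8b939-5dab-465f-bcf4-921385d8001d]
            - statement: route() where resource.attributes["dataType"] == "type3"
              exporters: [otlp/fd71bd0c-de38-4512-8a78-f07918f9a0ea]
service:
    pipelines:
        metrics:
            receivers: [otlp]
            processors: [routing]
            exporters:
                - otlp/2c27f38b-3399-4531-b66c-f557f646ebb0
                - otlp/ccb8b939-5dab-465f-bcf4-921385d8001d
                - otlp/fd71bd0c-de38-4512-8a78-f07918f9a0ea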

I'm not sure if you have experimented with the memory limiter processor, but it might be another way to help you limit resources while using a larger queue size.
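A minimal memory_limiter sketch, with limits assumed from the 3Gi pod memory limit mentioned in the environment section (the exact numbers are illustrative):

processors:
    memory_limiter:
        check_interval: 1s
        limit_mib: 2500        # soft limit, kept below the 3Gi pod limit
        spike_limit_mib: 512   # extra headroom for short bursts

It is typically placed first in each pipeline's processors list so it can apply back-pressure before data reaches the exporter queues.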

jpkrohling commented 10 months ago

I'll try to run a similar setup locally, but without access to the metrics, it's hard to tell which component is the bottleneck here.

github-actions[bot] commented 8 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 6 months ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.