metrico / otel-collector

OpenTelemetry Collector for qryn with preconfigured ingestors for Loki, Prometheus, Influx, OTLP and many more
https://qryn.dev
Apache License 2.0

Reduced performance of sending logs #92

Open KhafRuslan opened 2 months ago

KhafRuslan commented 2 months ago

At a certain point, when we reached a heavy load, we ran into the problem of low throughput when sending logs via promtail (screenshot). The difference is against the speed at which promtail reads logs from a file with the same configuration. In the second screenshot, promtail sent all messages to Loki.

The client section of the promtail configuration:

clients:
  - url: http://127.0.0.1:3111/loki/api/v1/push
    batchwait: 1s
    batchsize: 100
    backoff_config:
      min_period: 100ms
      max_period: 5s
    external_labels:
      job: ${HOSTNAME}

The workaround was simple: we brought up a second Loki log receiver. After that we can observe a decrease in the graph above, and the result is the same (screenshot). The average resource utilization of an instance was no higher than 30 percent.
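A note on the batching settings above: in promtail, batchsize is specified in bytes (the default is 1048576, about 1 MiB), so a value of 100 flushes a batch after roughly every entry and can itself cap send throughput. A minimal sketch of the same client section with a larger batch, keeping the endpoint and labels from above:

clients:
  - url: http://127.0.0.1:3111/loki/api/v1/push
    batchwait: 1s
    # batchsize is in bytes; ~1 MiB matches promtail's default and avoids
    # flushing a near-empty batch on every entry
    batchsize: 1048576
    backoff_config:
      min_period: 100ms
      max_period: 5s
    external_labels:
      job: ${HOSTNAME}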

lmangani commented 2 months ago

The qryn process is single threaded, so you either need to scale out multiple writers/readers and distribute traffic across them to reach your desired capacity, or use the qryn otel-collector and write directly into ClickHouse at full speed. Remember that most of the performance depends on the ClickHouse side.
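To make the first option concrete, here is a hypothetical sketch of running two qryn writer instances against the same ClickHouse, with traffic split between ports 3111 and 3112 by a load balancer or by pointing different log sources at each endpoint. The image name and environment variable follow qryn's Docker documentation, but the service names, ports, and ClickHouse address are illustrative assumptions, not part of this thread:

services:
  qryn-writer-1:
    image: qxip/qryn:latest                      # verify the image/tag against your registry
    environment:
      - CLICKHOUSE_SERVER=clickhouse.internal    # assumed address of the existing ClickHouse
    ports:
      - "3111:3100"                              # qryn listens on 3100 by default

  qryn-writer-2:
    image: qxip/qryn:latest
    environment:
      - CLICKHOUSE_SERVER=clickhouse.internal
    ports:
      - "3112:3100"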

KhafRuslan commented 2 months ago

Rather, the description of the panels was confusing. I use the qryn otel-collector; it was there that I ran into the problem. Single-receiver configuration:

receivers:
  loki:
    protocols:
      grpc:
        endpoint: 0.0.0.0:3200
      http:
        endpoint: 0.0.0.0:3100

processors:
  batch/logs:
    send_batch_size: 8600
    timeout: 400ms
  memory_limiter/logs:
    limit_percentage: 100
    check_interval: 2s

exporters:
  qryn:
    dsn: http://qryn-chp1...
    logs:
      format: raw
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_elapsed_time: 300s
      max_interval: 30s
    sending_queue:
      queue_size: 1200
    timeout: 10s

service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    logs:
      exporters: [qryn]
      processors: [batch/logs]
      receivers: [loki]
  telemetry:
    logs:
      level: "debug"
    metrics:
      address: 0.0.0.0:8888
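One thing worth noting in the config above: memory_limiter/logs is declared under processors but never referenced in the logs pipeline, so it currently has no effect. If it is meant to be active, the Collector documentation recommends putting the memory limiter first in the processor chain, roughly as in this sketch of the service section (everything else unchanged):

service:
  pipelines:
    logs:
      receivers: [loki]
      # memory_limiter runs first so it can refuse data before it is batched
      processors: [memory_limiter/logs, batch/logs]
      exporters: [qryn]
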
lmangani commented 2 months ago

If you are using the otel-collector to ingest, then I would assume the bottleneck is either in the collector or in ClickHouse rather than in qryn itself. Did you observe any resource bottlenecks while operating the setup?

KhafRuslan commented 2 months ago

I ran into the problem not in qryn but in qryn-otel-collector; perhaps I misunderstood your comment. I'm not sure if it's a resource problem, because it works correctly when I bring up another receiver.
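For readers following along, a hypothetical sketch of what "bringing up another receiver" can look like inside the same collector: a second loki receiver instance on separate ports, added alongside the first in the logs pipeline. The name loki/2 and port 3101 are illustrative assumptions, not the reporter's actual configuration (which is requested below):

receivers:
  loki:
    protocols:
      grpc:
        endpoint: 0.0.0.0:3200
      http:
        endpoint: 0.0.0.0:3100
  loki/2:                          # second, hypothetical receiver instance
    protocols:
      http:
        endpoint: 0.0.0.0:3101     # assumed port

service:
  pipelines:
    logs:
      receivers: [loki, loki/2]    # both receivers feed the same pipeline
      processors: [batch/logs]
      exporters: [qryn]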

lmangani commented 1 month ago

I'm not sure if it's a resource problem, because it works correctly when I bring up another receiver

We definitely need to investigate this further to understand what the root cause is. Could you show the multi-receiver config too?