The qryn process is single-threaded, so you either need to scale out multiple writers/readers and distribute traffic across them to reach your desired capacity, or use the qryn otel-collector and write directly into ClickHouse at full speed. Remember that most of the performance cost is on the ClickHouse side.
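As a rough illustration of the first option, a horizontally scaled deployment could run several qryn instances against the same ClickHouse backend with a load balancer in front of them. The sketch below is an assumption rather than a reference deployment: the image tag, ports, and environment variable names should be checked against the qryn documentation.

```yaml
# Hypothetical docker-compose sketch: two qryn instances sharing one ClickHouse backend.
# A load balancer (not shown) would distribute Loki/Prometheus traffic across them.
version: "3.8"
services:
  clickhouse:
    image: clickhouse/clickhouse-server:latest
  qryn-1:
    image: qxip/qryn:latest
    environment:
      CLICKHOUSE_SERVER: clickhouse   # assumed variable names, verify against the qryn docs
      CLICKHOUSE_PORT: "8123"
      CLICKHOUSE_DB: qryn
    ports:
      - "3101:3100"                   # Loki-compatible HTTP API
    depends_on: [clickhouse]
  qryn-2:
    image: qxip/qryn:latest
    environment:
      CLICKHOUSE_SERVER: clickhouse
      CLICKHOUSE_PORT: "8123"
      CLICKHOUSE_DB: qryn
    ports:
      - "3102:3100"
    depends_on: [clickhouse]
```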
Rather, the description of the panels was confusing. I use the qryn otel-collector, and that is where I ran into the problem. Single-receiver configuration:
receivers:
  loki:
    protocols:
      grpc:
        endpoint: 0.0.0.0:3200
      http:
        endpoint: 0.0.0.0:3100
processors:
  batch/logs:
    send_batch_size: 8600
    timeout: 400ms
  memory_limiter/logs:
    limit_percentage: 100
    check_interval: 2s
exporters:
  qryn:
    dsn: http://qryn-chp1...
    logs:
      format: raw
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_elapsed_time: 300s
      max_interval: 30s
    sending_queue:
      queue_size: 1200
    timeout: 10s
service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    logs:
      exporters: [qryn]
      processors: [batch/logs]
      receivers: [loki]
  telemetry:
    logs:
      level: "debug"
    metrics:
      address: 0.0.0.0:8888
If you are using the otel-collector to ingest, then I would assume the bottleneck is either in the collector or in ClickHouse rather than in qryn itself. Did you observe any resource bottlenecks while operating the setup?
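One way to check for backpressure inside the collector itself is to scrape its own telemetry endpoint, which the config above already exposes on 0.0.0.0:8888, and watch the exporter queue and failure counters (for example otelcol_exporter_queue_size and otelcol_exporter_send_failed_log_records; exact metric names can vary by collector version). A minimal Prometheus scrape job for that, assuming the collector is reachable as qryn-otel-collector, might look like:

```yaml
# Sketch: scrape the collector's self-telemetry exposed on :8888 (see the config above).
# "qryn-otel-collector" is a placeholder hostname.
scrape_configs:
  - job_name: otel-collector
    scrape_interval: 15s
    static_configs:
      - targets: ["qryn-otel-collector:8888"]
```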
I ran into the problem not in qryn but in the qryn-otel-collector. Perhaps I misunderstood your comment. I'm not sure it's a resource problem, because it works correctly when I bring up another receiver.
We definitely need to investigate this further to understand what the root cause is. Could you show the multi-receiver config too?
At a certain point, when we reached heavy load, we ran into slow log delivery via Promtail. The difference shows up in the speed at which Promtail reads logs from the file, even with the same configuration. In the screenshot, Promtail sent all messages to Loki.
Configuring the client side of Promtail:
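The actual client settings are not reproduced in this thread, so the snippet below is only a sketch of what a Promtail clients section pointed at the collector's Loki HTTP receiver (port 3100 from the config above) typically looks like; the host name and tuning values are placeholders, not the values that were used.

```yaml
# Sketch of a Promtail client section targeting the collector's Loki receiver.
# Host and tuning values are placeholders.
clients:
  - url: http://qryn-otel-collector:3100/loki/api/v1/push
    batchwait: 1s        # wait up to this long before flushing a batch
    batchsize: 1048576   # flush once the batch reaches this many bytes
```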
The solution was simple: we brought up a second Loki log receiver. After that we can observe a decrease in the graph above, and the result is the same. The average resource utilization of an instance was no higher than 30 percent.
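It is not entirely clear from the thread whether the second receiver was a second named Loki receiver inside one collector or a second collector instance. For the single-collector interpretation, a sketch of such a config (ports are placeholders) would be:

```yaml
# Hypothetical two-receiver variant of the earlier config; ports are placeholders.
receivers:
  loki/1:
    protocols:
      http:
        endpoint: 0.0.0.0:3100
  loki/2:
    protocols:
      http:
        endpoint: 0.0.0.0:3101
service:
  pipelines:
    logs:
      receivers: [loki/1, loki/2]
      processors: [batch/logs]
      exporters: [qryn]
```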