Dreamacro opened this issue 1 year ago
Hi! Thanks a lot for the report!
Could you please create a performance profile under 100% load for the case with histogram and prometheus_remote_write? It can be done with perf as described here: https://github.com/vectordotdev/vector/issues/13919#issuecomment-1210941371 . With such information it would be much easier for the Vector dev team to analyze the root cause of the problem.
Thank you so much for your attention and participation.
P.S. @jszwedko I guess we need to add a guide to the Vector documentation about Vector performance troubleshooting, describing possible steps like checking backpressure conditions via vector top, perf'ing Vector and viewing flamegraphs, and maybe something else.
I believe I can reproduce with this config on master:
sources:
  in:
    type: demo_logs
    format: apache_common
    interval: 0
transforms:
  workit:
    type: remap
    inputs:
      - in
    source: |
      . = parse_apache_log!(.message, "common")
  zork:
    type: log_to_metric
    inputs:
      - workit
    metrics:
      - type: histogram
        field: size
        name: request_size
        namespace: app
        tags:
          hostname: "{{ host }}"
          status: "{{ status }}"
sinks:
  # out:
  #   type: console
  #   inputs:
  #     - zork
  #   target: stdout
  #   encoding:
  #     codec: json
  prom:
    type: prometheus_remote_write
    inputs:
      - zork
    compression: snappy
    endpoint: http://localhost:8480/insert/0/prometheus
Running Victoria Metrics as per the wiki.
Running Vector with a single thread hits 100% CPU; 8 threads hits 70-90%. The console sink uses way less. I don't see any difference between histogram and counter, though.
Not seeing the CPU creep up though. What sort of timescale are we talking here?
@zamazan4ik Thanks, I will do perf next week.
@StephenWakely At 40~60 QPS it takes about two days. Would this be related to Victoria Metrics instead of the original Prometheus?
Ah ok, I was just running for an hour or so.
Would this be related to Victoria Metrics instead of the original Prometheus?
No, as far as Vector is concerned there is no difference between Prometheus and Victoria.
Another phenomenon is that Vector will not write anything at all to the VM at 100% CPU.
@zamazan4ik Here's the perf data as the CPU climbs: at 40%+, perf-40%.data.zip; at 100%, perf.data.zip
I also found an interesting phenomenon: when the CPU climbs to 100%, if you change the hostname, the CPU drops completely and then starts climbing again.
I think I misused prometheus_remote_write: I used Vector to filter the raw logs, compute p90 and p99, and report them to Victoria Metrics. prometheus_remote_write doesn't aggregate data, it can only buffer metrics, so when the QPS is high enough it can cause backpressure and 100% CPU.
I'm now using a workaround that allows me to "aggregate" the metrics data and continue to report it using prometheus_remote_write. I don't feel like it's really misuse, but I'm not sure what I should do in Vector to achieve my goal.
# little hack to make metrics aggregation
[sinks.prometheus_exporter]
type = "prometheus_exporter"
inputs = [
"prometheus_http_request_duration_openapi",
"prometheus_http_request_duration_api"
]
address = "127.0.0.1:9598"
[sources.prometheus_scrape]
type = "prometheus_scrape"
endpoints = [ "http://127.0.0.1:9598/metrics" ]
[sinks.prometheus_remote_write]
type = "prometheus_remote_write"
inputs = [ "prometheus_scrape" ]
endpoint = "https://vm:8480/insert/0/prometheus"
healthcheck.enabled = false
Would the aggregate transform do what you need?
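Something like this might do it, wired between the log_to_metric transform and the sink. A minimal, untested sketch reusing the component names from my repro config above; the aggregate_metrics name and the interval_ms value are just placeholders:

transforms:
  aggregate_metrics:
    # collapse incremental metric events into one update per series per window
    type: aggregate
    inputs:
      - zork
    interval_ms: 10000
sinks:
  prom:
    type: prometheus_remote_write
    inputs:
      - aggregate_metrics
    compression: snappy
    endpoint: http://localhost:8480/insert/0/prometheus

The idea is that the sink then flushes one aggregated update per series per window instead of one metric event per log line, which should take the pressure off prometheus_remote_write.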
@StephenWakely I did try it, but it doesn't reduce the number of metrics.
A note for the community
Problem
CPU usage slowly climbs to 100% when using the prometheus_remote_write sink. When I try the console sink the problem goes away.
- counter and prometheus_remote_write is fine
- histogram and prometheus_remote_write slowly climbs to 100% CPU
- histogram and console is fine
I'm not using the original Prometheus; I use the API-compatible VictoriaMetrics.
I noticed that the recent version rewrote prometheus_remote_write. I used v0.33.1 and the nightly versions and the result was the same.
Configuration
Version
v0.33.1 and nightly
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response