vectordotdev / vector

High CPU usage on `prometheus_remote_write` sink #19041

Open Dreamacro opened 1 year ago

Dreamacro commented 1 year ago

Problem

CPU usage slowly climbs to 100% when using the `prometheus_remote_write` sink. When I switch to the `console` sink instead, the problem goes away.

I'm not using upstream Prometheus; I'm writing to VictoriaMetrics, which provides an API-compatible remote write endpoint.

I noticed that a recent release rewrote the `prometheus_remote_write` sink. I tried both v0.33.1 and a nightly build, and the result was the same.

Configuration

data_dir = "/data/vector-data"

[sources.filelog]
type = "file"
read_from = "end"
include = [ "/var/log/app/**/*.log" ]

[transforms.app]
type = "remap"
inputs = ["filelog"]
source = """
. = parse_json!(.message)
.hostname = get_hostname!()
"""

[transforms.split_app]
type = "route"
inputs = [ "app" ]
  [transforms.split_app.route]
  openapi = 'exists(.latency) && exists(.method) && !exists(.websocket) && starts_with!(.path, "/openapi/")'
  api = 'exists(.latency) && exists(.method) && !exists(.websocket) && starts_with!(.path, "/api/")'

[transforms.prometheus_http_request_duration_openapi]
type = "log_to_metric"
inputs = [ "split_app.openapi" ]
  [[transforms.prometheus_http_request_duration_openapi.metrics]]
  type = "histogram"
  field = "latency"
  name = "openapi_http_request_duration"
  namespace = "app"
  tags.hostname = "{{ hostname }}"
  tags.status = "{{ status }}"

[transforms.prometheus_http_request_duration_api]
type = "log_to_metric"
inputs = [ "split_app.api" ]
  [[transforms.prometheus_http_request_duration_api.metrics]]
  type = "histogram"
  field = "latency"
  name = "api_http_request_duration"
  namespace = "app"
  tags.hostname = "{{ hostname }}"
  tags.status = "{{ status }}"

[sinks.prometheus_remote_write]
type = "prometheus_remote_write"
inputs = [ "prometheus_http_request_duration_openapi", "prometheus_http_request_duration_api" ]
endpoint = "https://vmendpoint/insert/0/prometheus"
request.concurrency = 'adaptive'
healthcheck.enabled = false

# [sinks.console]
# type = "console"
# inputs = [ "prometheus_http_request_duration_openapi", "prometheus_http_request_duration_api" ]
# encoding.codec = "json"

Version

v0.33.1 and nightly

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

zamazan4ik commented 1 year ago

Hi! Thanks a lot for the report!

Could you please capture a performance profile while Vector is at 100% CPU in the histogram + `prometheus_remote_write` case? It can be done with `perf` as described here: https://github.com/vectordotdev/vector/issues/13919#issuecomment-1210941371 . With that information it will be much easier for the Vector dev team to analyze the root cause of the problem.

Thank you so much for your attention and participation.

P.S. @jszwedko I think we should add a performance-troubleshooting guide to the Vector documentation, describing steps such as checking for backpressure with `vector top`, profiling Vector with `perf` and viewing flamegraphs, and so on.
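
For anyone following along: `vector top` talks to Vector's GraphQL API, which has to be enabled in the config first. A minimal sketch (the address shown is the documented default):

[api]
enabled = true
address = "127.0.0.1:8686"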

StephenWakely commented 1 year ago

I believe I can reproduce with this config on master:

sources:
  in:
    type: demo_logs
    format: apache_common
    interval: 0

transforms:
  workit:
    type: remap
    inputs:
      - in
    source: |
      . = parse_apache_log!(.message, "common")

  zork:
    type: log_to_metric
    inputs: 
      - workit
    metrics:
      - type: histogram
        field: size
        name: request_size
        namespace: app
        tags:
          hostname: "{{ host }}"
          status: "{{ status }}"

sinks:
  # out:
  #   type: console
  #   inputs:
  #     - zork
  #   target: stdout
  #   encoding: 
  #     codec: json

  prom:
    type: prometheus_remote_write
    inputs:
    - zork
    compression: snappy
    endpoint: http://localhost:8480/insert/0/prometheus

Running VictoriaMetrics as per its wiki.

Running Vector with a single thread pins the CPU at 100%; with 8 threads it sits at 70-90%. The console sink uses far less CPU. I don't see any difference between histogram and counter metrics, though.

I'm not seeing the CPU creep up over time, though. What sort of timescale are we talking about here?

Dreamacro commented 1 year ago

@zamazan4ik Thanks, I'll run perf next week.

@StephenWakely Around 40-60 QPS, over two days. Could this be related to VictoriaMetrics rather than upstream Prometheus?

StephenWakely commented 1 year ago

Ah ok, I was just running for an hour or so.

Could this be related to VictoriaMetrics rather than upstream Prometheus?

No. As far as Vector is concerned, there is no difference between Prometheus and VictoriaMetrics.

Dreamacro commented 1 year ago

Another symptom: at 100% CPU, Vector stops writing anything at all to VictoriaMetrics.

Dreamacro commented 1 year ago

@zamazan4ik Here are the perf captures taken as the CPU climbed:

At 40%+ CPU: perf-40%.data.zip
At 100% CPU: perf.data.zip

I also noticed an interesting phenomenon: once the CPU hits 100%, changing the hostname makes CPU usage drop right back down, and then it starts climbing again.

Dreamacro commented 11 months ago

I think I misused `prometheus_remote_write`: I used Vector to filter the raw logs, compute p90/p99 latency metrics, and report them to VictoriaMetrics.

`prometheus_remote_write` doesn't aggregate data; it only buffers metrics, so when the QPS is high enough it causes backpressure and drives the CPU to 100%.
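
To make that concrete, here is a rough sketch of the same sink with its batch settings spelled out (the values are purely illustrative, not a recommendation). Batching only controls how many samples go into each remote write request; it never merges samples, so the outgoing volume stays proportional to the incoming event rate:

[sinks.prometheus_remote_write]
type = "prometheus_remote_write"
inputs = [ "prometheus_http_request_duration_openapi", "prometheus_http_request_duration_api" ]
endpoint = "https://vmendpoint/insert/0/prometheus"
# Groups events into requests, but does not merge or aggregate them.
batch.max_events = 1000
batch.timeout_secs = 1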

I'm now using a workaround that lets me "aggregate" the metric data and keep reporting it through `prometheus_remote_write`. I don't feel this is really a misuse, but I'm not sure what I should be doing in Vector to achieve my goal.

# Little hack to aggregate metrics: expose them via a local prometheus_exporter,
# then scrape that endpoint and forward the aggregated values.
[sinks.prometheus_exporter]
type = "prometheus_exporter"
inputs = [
  "prometheus_http_request_duration_openapi",
  "prometheus_http_request_duration_api"
]
address = "127.0.0.1:9598"

[sources.prometheus_scrape]
type = "prometheus_scrape"
endpoints = [ "http://127.0.0.1:9598/metrics" ]

[sinks.prometheus_remote_write]
type = "prometheus_remote_write"
inputs = [ "prometheus_scrape" ]
endpoint = "https://vm:8480/insert/0/prometheus"
healthcheck.enabled = false

StephenWakely commented 11 months ago

Would the `aggregate` transform do what you need?
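
For reference, a rough sketch of how that could slot into the config from this issue (the transform name is made up here; `interval_ms` controls how often the merged metrics are flushed):

[transforms.aggregate_http_metrics]
type = "aggregate"
inputs = [ "prometheus_http_request_duration_openapi", "prometheus_http_request_duration_api" ]
interval_ms = 10000

[sinks.prometheus_remote_write]
type = "prometheus_remote_write"
inputs = [ "aggregate_http_metrics" ]
endpoint = "https://vmendpoint/insert/0/prometheus"
healthcheck.enabled = false

The idea is that incremental metrics get merged into one event per series within each interval before they reach the sink.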

Dreamacro commented 11 months ago

@StephenWakely I did try it, but it doesn't reduce the number of metrics.