vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev

Increasing scrape durations for `prometheus_exporter` sink #7191

Closed · jszwedko closed this 2 years ago

jszwedko commented 3 years ago

Reported by user in discord: https://discord.com/channels/742820443487993987/746070591097798688/834105139220185168

They are observing increasing scrape times relative to the length of time Vector has been running:

[Grafana screenshot: scrape times increasing over time]

The number of time series being exported seems to vary between 13k and 19k. I thought the scrape time might be increasing relative to the number of time series, but they reported that this wasn't the case.
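
A rough way to check that hypothesis is to time a scrape and count the exported series in the same request (a sketch, assuming the exporter is reachable on localhost:9598 as configured below):

# Time one scrape and count non-comment lines, i.e. exported samples.
time curl -s http://localhost:9598/metrics | grep -vc '^#'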

Vector Version

vector 0.13.0 (g024fbf7 x86_64-unknown-linux-musl 2021-04-15)

Vector Configuration File

---
sources:
  nginx_input_vector:
    type: file
    data_dir: /var/lib/vector
    include:
      - /app/nginx/logs/json-access-sp.log
    read_from: beginning
    fingerprinting:
      strategy: device_and_inode

transforms:
  nginx_parse_json:
    inputs:
      - nginx_input_vector
    type: remap
    source: |
      . = parse_json!(.message)

  nginx_parse_remap:
    inputs:
      - nginx_parse_json
    type: remap
    source: |
      if !match(.remote_user, r'^(ATG|B2C|CRM|FOBO|TS|BTX|RTD|Magnolia)$') {
        .remote_user = "other"
      }
      del(.file)
      del(.host)
      del(.source_type)
      .request_uri = replace(string!(.request_uri), r'\d{16}', "xxx")
      .request_time = to_float!(.request_time)
      .status = to_int!(.status)

  nginx_http_metrics:
    type: log_to_metric
    inputs:
      - nginx_parse_remap
    metrics:
      - type: counter
        field: status
        name: http_response_count_total
        namespace: "${HTTP_METRICS_NAMESPACE}"
        tags:
          host: "${HOSTNAME}"
          remote_user: '{{ remote_user }}'
          request_uri: '{{ request_uri }}'
          status: '{{ status }}'
      - type: histogram
        field: request_time
        name: http_response_duration_seconds
        namespace: "${HTTP_METRICS_NAMESPACE}"
        tags:
          host: "${HOSTNAME}"
          remote_user: '{{ remote_user }}'
          request_uri: '{{ request_uri }}'
          status: '{{ status }}'
      - type: gauge
        field: request_time
        name: http_response_duration_seconds
        namespace: "${HTTP_METRICS_NAMESPACE}"
        tags:
          host: "${HOSTNAME}"
          remote_user: '{{ remote_user }}'
          request_uri: '{{ request_uri }}'
          status: '{{ status }}'

sinks:
  nginx_output_prometheus:
    address: '0.0.0.0:9598'
    inputs:
      - nginx_http_metrics
    type: prometheus_exporter
    default_namespace: vector
    quantiles:
      - 0.5
      - 0.75
      - 0.9
      - 0.95
      - 0.99
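
Because the tags above are templated on values like {{ request_uri }} and {{ status }}, most of the exported cardinality comes from this log_to_metric transform. A rough way to see which metric names dominate a scrape (again assuming the sink at localhost:9598):

# Count exported series per metric name from one scrape.
curl -s http://localhost:9598/metrics \
  | grep -v '^#' \
  | sed 's/{.*//; s/ .*//' \
  | sort | uniq -c | sort -rn | head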

Example Data

https://pastebin.com/xhzNcW0w


jszwedko commented 3 years ago

I'll triage this at least.

banschikovde commented 3 years ago

Hello! I have been watching Vector for the last 24 hours. You were right from the beginning: the scrape time is increasing relative to the number of time series. This can be seen in the screenshot from Grafana:

[Grafana screenshot: vector_debug_screen]

jszwedko commented 3 years ago

I took a look at this today and was able to reproduce increasing scrape times, but not nearly at the magnitude shown here. After processing around 2 GB of data with ~14k time series, I was still seeing scrape times of around 0.2s. They were very gradually increasing as more data was processed, though, so there might be something there to investigate.
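
For anyone else trying to reproduce, the gradual growth is easiest to see by logging the scrape duration over time while the generator runs (a sketch, assuming the exporter on localhost:9598):

# Log the time taken by a full scrape every 60 seconds.
while true; do
  printf '%s ' "$(date -u +%FT%TZ)"
  curl -s -o /dev/null -w '%{time_total}s\n' http://localhost:9598/metrics
  sleep 60
done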

@Denissa89 is it possible to share a sample of the data going into Vector, as well as the output of one of the scrapes? I'm wondering if there is something I'm missing about its profile. Also, I noticed you are using an old nightly version of Vector. Could you try with the recently released 0.13.0?

How I reproduced:

Vector config:

sources:
  nginx_input_vector:
    type: socket
    mode: tcp
    address: "0.0.0.0:8080"
  internal_metrics:
    type: internal_metrics

transforms:
  nginx_parse_json:
    inputs:
      - nginx_input_vector
    type: remap
    source: |
      . = parse_json!(.message)

  nginx_parse_remap:
    inputs:
      - nginx_parse_json
    type: remap
    source: |
      if !match(.remote_user, r'^(ATG|B2C|CRM|FOBO|TS|BTX|RTD|Magnolia)$') {
        .remote_user = "other"
      }
      del(.file)
      del(.host)
      del(.source_type)
      .request_uri = replace(string!(.request_uri), r'\d{16}', "xxx")
      .request_time = to_float!(.request_time)
      .status = to_int!(.status)

  nginx_http_metrics:
    type: log_to_metric
    inputs:
      - nginx_parse_remap
    metrics:
      - type: counter
        field: status
        name: http_response_count_total
        namespace: "${HTTP_METRICS_NAMESPACE}"
        tags:
          host: "${HOSTNAME}"
          remote_user: '{{ remote_user }}'
          request_uri: '{{ request_uri }}'
          status: '{{ status }}'
      - type: histogram
        field: request_time
        name: http_response_duration_seconds
        namespace: "${HTTP_METRICS_NAMESPACE}"
        tags:
          host: "${HOSTNAME}"
          remote_user: '{{ remote_user }}'
          request_uri: '{{ request_uri }}'
          status: '{{ status }}'
      - type: gauge
        field: request_time
        name: http_response_duration_seconds
        namespace: "${HTTP_METRICS_NAMESPACE}"
        tags:
          host: "${HOSTNAME}"
          remote_user: '{{ remote_user }}'
          request_uri: '{{ request_uri }}'
          status: '{{ status }}'

sinks:
  nginx_output_prometheus:
    address: '0.0.0.0:9598'
    inputs:
      - internal_metrics
      - nginx_http_metrics
    type: prometheus_exporter
    default_namespace: vector
    quantiles:
      - 0.5
      - 0.75
      - 0.9
      - 0.95
      - 0.99

I generated fake lines using this script:

#!/bin/bash
# Emit an endless stream of fake nginx JSON access-log lines with random
# remote_user, status, request_time, and request_uri values.

remote_users=("ATG" "B2C" "CRM" "FOBO" "TS" "BTX" "RTD" "Magnolia" "other")
statuses=(200 400 500 404 503)

while : ; do
  remote_user=${remote_users[$RANDOM % ${#remote_users[@]} ]}
  status=${statuses[$RANDOM % ${#statuses[@]} ]}
  request_time=$RANDOM
  request_uri="somepath/$(($RANDOM % 20))"

  echo "{ \"remote_user\": \"$remote_user\", \"status\": \"$status\", \"request_time\": \"$request_time\", \"request_uri\": \"$request_uri\"}"
done

And sent it to Vector through netcat:

/tmp/generate.sh | netcat localhost 8080
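
The config above interpolates ${HTTP_METRICS_NAMESPACE} and ${HOSTNAME}, so those need to be set when starting Vector; roughly (the config file name and namespace value are hypothetical):

export HTTP_METRICS_NAMESPACE=nginx        # hypothetical value
export HOSTNAME=${HOSTNAME:-$(hostname)}
vector --config /tmp/repro.yaml            # the config shown above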

Using version:

vector 0.13.0 (v0.13.0 x86_64-apple-darwin 2021-04-21)

banschikovde commented 3 years ago

I will update Vector to the new version soon.

I'm attaching the scrape output below: vector_scrape.txt

jszwedko commented 3 years ago

Hi @Denissa89. I was just wondering if you had a chance to try out the newer Vector version, and whether you noticed any difference in behavior.

zamazan4ik commented 2 years ago

Does this issue still apply to the newest Vector version?

jszwedko commented 2 years ago

Yeah, good question. I'll close this as stale, but if anyone is still observing this feel free to comment or open a new issue.