vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev

component_received_events_total value is too large for vector_metrics #17952

Closed. e-vasilyev closed this issue 1 year ago.

e-vasilyev commented 1 year ago


Problem

The value of the component_received_events_total metric for the vector_metrics component is constantly growing.

I am using the query: `irate(vector_component_received_events_total{component_kind="source", component_name="vector_metrics", namespace="dtm-dev"}[15s])`

[screenshot: graph of the query above, showing a steadily growing rate]

Other components do not have this problem.

Configuration

sources:
  vector_metrics:
    type: internal_metrics
    scrape_interval_secs: 5

Version

vector 0.31.0 (x86_64-unknown-linux-musl 0f13b22 2023-07-06 13:52:34.591204470)


jszwedko commented 1 year ago

Hi @e-vasilyev!

Could you provide a complete config that demonstrates this issue? Could it be that the number of internal metrics is itself increasing, so that the source receives an ever-increasing number of events? You could look at https://vector.dev/docs/reference/configuration/sources/internal_metrics/#internal_metrics_cardinality_total to determine that.
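
For instance, a Prometheus query along these lines would graph that counter (assuming the default `vector_` metric prefix added by the prometheus_exporter sink, and the same `namespace` label used in the query above):

```
vector_internal_metrics_cardinality_total{namespace="dtm-dev"}
```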

e-vasilyev commented 1 year ago

Hi @jszwedko!

internal_metrics_cardinality has small values:

[screenshot: graph of internal_metrics_cardinality]

Full config:

acknowledgements:
  enabled: false
data_dir: /var/lib/vector
api:
  enabled: true
  address: 0.0.0.0:8686
sources:
  fluent:
    type: fluent
    address: 0.0.0.0:24224
  vector_metrics:
    type: internal_metrics
    scrape_interval_secs: 5
  vector_logs:
    type: internal_logs
transforms:
  logs_json_route:
    type: route
    inputs:
      - fluent
    route:
      json: .tag == "json"
  logs_json_prepare:
    type: remap
    inputs:
      - logs_json_route.json
    source: |-
      del(.@version)
      del(.level_value)
      del(.tag)
      if exists(.@timestamp) {
        .timestamp = to_timestamp!(del(.@timestamp), unit: "milliseconds")
      }
  message_type_route:
    type: route
    inputs:
      - logs_json_prepare
    route:
      rq_rs: exists(.messageType) && includes(["request", "response"], .messageType)
      scl: exists(.messageType) && includes(["scl"], .messageType)
  audit_prepare:
    type: remap
    inputs:
      - message_type_route.rq_rs
      - message_type_route.scl
    source: |-
      allowFields = {
        "logger": .logger,
        "timestamp": .timestamp,
        "level": .level,
        "message": .message,
        "requestId": .requestId,
        "subRequestId": .subRequestId,
        "messageType": .messageType,
        "customerId": .customerId,
        "customerOgrn": .customerOgrn,
        "queryMnemonic": .queryMnemonic,
        "subRequestId": .subRequestId
      }
      . = compact(allowFields, string: false)
  audit_kafka_filter:
    type: filter
    inputs:
      - audit_prepare
    condition:
      type: "vrl"
      source: |-
        exists(.messageType) && includes(["request", "response"], .messageType)
  audit_kafka:
    type: remap
    inputs:
      - audit_kafka_filter
    source: |-
      .message = parse_json(.message) ?? .message
  audit_clickhouse:
    type: remap
    inputs:
      - audit_prepare
    source: |-
      .timestamp = to_unix_timestamp!(.timestamp, unit: "seconds")
  scl_message:
    type: remap
    inputs:
      - message_type_route.scl
    source: |-
      . = parse_json!(.message)
  vector_logs_prepare:
    type: remap
    inputs:
      - vector_logs
    source: |-
      .level = del(.metadata.level)
      .serviceName = "vector"
      .podName, _ = get_env_var("POD_NAME")
      del(.pid)
sinks:
  loki_sink:
    type: loki
    inputs:
      - message_type_route._unmatched
      - vector_logs_prepare
    endpoint: http://loki.dtm-infra-dev:3100
    remove_label_fields: true
    labels:
      environment: "test"
      serviceName: "{{ serviceName }}"
      host: "{{ host }}"
      level: "{{ level }}"
      source_type: "{{ source_type }}"
    compression: gzip
    encoding:
      codec: json
    out_of_order_action: accept
    batch:
      max_events: 100
      timeout_secs: 3
    buffer:
      max_size: 536870912
      type: disk
      when_full: block
  elasticsearch_sink:
    type: elasticsearch
    inputs:
      - message_type_route._unmatched
    api_version: v7
    bulk:
      action: index
      index: "dtm-test-%Y.%m.%d"
    batch:
      max_events: 100
      timeout_secs: 3
    buffer:
      max_size: 536870912
      type: disk
      when_full: block
    compression: gzip
    endpoints:
      - http://elastic.podd-ts:9200
  audit_kafka_sink:
    type: kafka
    inputs:
      - audit_kafka
    bootstrap_servers: "kafka-0.kafka-headless:9092"
    librdkafka_options:
      message.max.bytes: "10000000"
    topic: "audit.logs"
    compression: "gzip"
    encoding:
      codec: json
    buffer:
      max_size: 536870912
      type: disk
      when_full: drop_newest
  clickhouse_sink:
    type: clickhouse
    inputs:
      - audit_clickhouse
    database: "test"
    endpoint: "http://clickhouse.dtm-infra-dev:8123"
    table: logs
    compression: gzip
    batch:
      max_events: 50
      timeout_secs: 2
    buffer:
      max_size: 536870912
      type: disk
      when_full: drop_newest
    skip_unknown_fields: true
  podd_agent_sink:
    type: kafka
    inputs:
      - scl_message
    bootstrap_servers: "kafka-0.kafka-headless:9092"
    topic: "demo_view.scl.signal"
    acknowledgements: true
    compression: gzip
    encoding:
      codec: json
    buffer:
      max_size: 536870912
      type: disk
      when_full: block
  prometheus_sink:
    type: prometheus_exporter
    address: "0.0.0.0:9598"
    inputs:
      - vector_metrics

All metrics for component_sent_events_total:

[screenshot: table of component_sent_events_total values]

jszwedko commented 1 year ago

Hi @e-vasilyev!

Can you show a graph of the total internal metric cardinality (vector_internal_metrics_cardinality_total) rather than the rate? Your graph of the rate actually makes it look like it might be growing; graphing the total will show that more clearly.

e-vasilyev commented 1 year ago

Hi @jszwedko! vector_internal_metrics_cardinality_total:

[screenshot: graph of vector_internal_metrics_cardinality_total]

jszwedko commented 1 year ago

Thanks @e-vasilyev! This looks likely to be caused by the lack of metric expiry, tracked in #15426. As a workaround you could try configuring metric expiry: https://vector.dev/docs/reference/configuration/global-options/#expire_metrics_secs
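
A minimal sketch of that workaround, added at the top level of the config (the 60-second window is illustrative; tune it to your scrape interval and retention needs):

```yaml
# Global option: internal metrics that have not been updated for this
# many seconds are dropped, which bounds cardinality growth over time.
expire_metrics_secs: 60
```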

e-vasilyev commented 1 year ago

Hi @jszwedko. Thank you! The workaround works.

jszwedko commented 1 year ago

👍 I'll close this issue as a duplicate of https://github.com/vectordotdev/vector/issues/15426. You can follow along on that issue for any updates. Thanks for the discussion!