vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev

Random Vector OOMKills #21131

Open ShahroZafar opened 1 month ago

ShahroZafar commented 1 month ago


Problem

We see an issue where some of the Vector instances are getting OOMKilled. The cluster in which Vector is running can have more than 3000 nodes. A pod that got OOMKilled is, at any point in time, reading about 10 to 15 files. Log rotation is 10 MB x 5 files for each pod; out of these, 3 are .gz files, which are excluded from reading by Vector. We are using the kubernetes_logs source, dedotting the keys with a remap transform, and pushing to Kafka.

The memory request and limit are set to 750Mi each. After increasing the memory limit we were able to make it work, but memory usage seems higher on one of the pods, where the rate of incoming logs is only about 150 messages/sec. Running Vector on a node to read the logs of a single pod only, we get much higher throughput, about 6000 messages/sec, with a memory usage of around 100Mi.
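
For reference, a minimal sketch of how the 750Mi request and limit described above would typically be expressed on the Vector container (standard Kubernetes pod spec fields; the reporter's actual manifests are not shown in this issue):

    resources:
      requests:
        memory: 750Mi
      limits:
        memory: 750Mi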

The maximum value of vector_open_files in the cluster is 20.

Configuration

    acknowledgements:
      enabled: true
    api:
      address: 0.0.0.0:8686
      enabled: true
      playground: false
    data_dir: /vector-data-dir
    expire_metrics_secs: 300
    sinks:
      kafka:
        batch:
          max_bytes: 1000000
          max_events: 4000
          timeout_secs: 2
        bootstrap_servers: kafka:9092
        buffer:
          max_events: 1500
          type: memory
          when_full: block
        compression: zstd
        encoding:
          codec: json
        inputs:
        - dedot_keys
        librdkafka_options:
          client.id: vector
          request.required.acks: "1"
        message_timeout_ms: 300000
        topic: vector
        type: kafka
      prometheus_exporter:
        address: 0.0.0.0:9090
        buffer:
          max_events: 500
          type: memory
          when_full: block
        flush_period_secs: 60
        inputs:
        - internal_metrics
        type: prometheus_exporter
    sources:
      internal_metrics:
        type: internal_metrics
      kubernetes_logs:
        extra_namespace_label_selector: vector-control-plane=true
        glob_minimum_cooldown_ms: 3000
        ingestion_timestamp_field: ingest_timestamp
        namespace_annotation_fields:
          namespace_labels: ""
        pod_annotation_fields:
          container_id: ""
          container_image_id: ""
          pod_annotations: ""
          pod_owner: ""
          pod_uid: ""
        type: kubernetes_logs
        use_apiserver_cache: true
    transforms:
      dedot_keys:
        inputs:
        - kubernetes_logs
        source: ". = map_keys(., recursive: true) -> |key| { replace(key, \".\", \"_\")
          }      \n"
        type: remap
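
For illustration, here is a sketch of what the dedot_keys remap above does to a hypothetical kubernetes_logs event carrying a dotted pod label: map_keys with recursive: true rewrites every key in the event, replacing dots with underscores while leaving values untouched.

    # Hypothetical event, not taken from the issue.
    before:
      kubernetes:
        pod_labels:
          app.kubernetes.io/name: my-app
    after:
      kubernetes:
        pod_labels:
          app_kubernetes_io/name: my-app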

Version

0.39.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

jszwedko commented 1 month ago

Does Vector immediately run into the limit? Or does it look like memory is increasing over time? One thing I can think of is setting https://vector.dev/docs/reference/configuration/global-options/#expire_metrics_secs in case it is the internal telemetry that is causing a runaway increase in memory.
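
For reference, a sketch of that global option as it would appear at the top level of the Vector config (note that the configuration posted above already sets it to 300):

    # Global option; without an expiration, internal metrics are kept for the
    # lifetime of the process, which can add up when file/component cardinality is high.
    expire_metrics_secs: 300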