vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Unbounded memory growth for high throughput kubernetes_logs #17313

Open aashery-square opened 1 year ago

aashery-square commented 1 year ago

Problem

We're seeing this behavior for Vector 0.29.1 on EKS 1.24. We are load testing Vector by running it as a daemonset against pods writing files via the json-file log driver for containerd. There are 32 pods writing 100 logs/s each, so we're pushing 3200 logs/s through the Vector daemon on the node. Log sizes vary from 1.25K to 128K, with mostly smaller writes. We see unbounded memory growth consistently, even when setting a blackhole sink. Based on log throughput and EKS' rotation size, I'd imagine that we're rotating through a ton of files, so we believe that Vector is probably hoarding open file handles.

Setup

Chart.yaml

apiVersion: v2
name: service
type: application
version: 1.0.0
dependencies:
  - name: vector
    version: 0.21.1
    repository: "https://helm.vector.dev"

Values.yaml

podLabels:
  tags.datadoghq.com/version: 0.29.1
resources:
  limits:
    cpu: 3000m
    memory: 1Gi
  requests:
    cpu: 3000m
    memory: 1Gi

Vector Config

The redacted remaps below just add additional static tags.

data_dir: /vector-data-dir
expire_metrics_secs: 60

api:
  enabled: true
  address: 127.0.0.1:8686
  playground: false

sources:
  internal_logs:
    type: internal_logs
  internal_metrics:
    type: internal_metrics
    scrape_interval_secs: 30
    tags:
      host_key: host
  k8s_logs:
    type: kubernetes_logs
    glob_minimum_cooldown_ms: 15000

transforms:
  vector_tagging:
    type: remap
    inputs:
      - internal_metrics
    source: |
      .tags.region = get_env_var!("AWS_REGION")
      ...(REDACTED)...
  log_tagging:
    type: remap
    inputs:
      - internal_logs
    source: |
      .service = "vector-daemonset"
      ...(REDACTED)...
sinks:
  datadog_metrics:
    type: datadog_metrics
    inputs:
      - vector_tagging
    default_api_key: ${DATADOG_API_KEY}
  datadog_logs:
    type: datadog_logs
    inputs:
      - log_tagging
    default_api_key: ${DATADOG_API_KEY}
  null_out:
    type: blackhole
    inputs:
      - k8s_logs

Configuration

No response

Version

0.29.1

Debug Output

No response

Example Data

image

Additional Context

No response

References

#14750

spencergilbert commented 1 year ago

It seems likely this could be a combination of file handles and internal metric growth.

To confirm: it's 32 pods whose logs are being collected by the single Vector pod?

aashery-square commented 1 year ago

@spencergilbert ah actually 30, not 32. But yes, all 30 pods are colocated on the same node as the vector daemon, and are logging to file via the json-file log driver. Here's what that looks like:

image

Each deployment is labeled according to the size of the logs it's producing. Each emits 100 logs/s, except for the 128K deployment, which emits 1 log/s. I believe EKS has a default rotation size of 10MB, so the (estimated) rate at which we'd see file rotations should be ((1250 * 4 * 100) + (3000 * 20 * 100) + (10000 * 3 * 100) + (15000 * 2 * 100) + (128000 * 1 * 1)) / 10000000 = 1.2628 rotations/s.
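
Spelling that estimate out (bytes per log · pods · logs/s for each deployment, against the 10MB rotation size):

$$
\frac{(1250 \cdot 4 \cdot 100) + (3000 \cdot 20 \cdot 100) + (10000 \cdot 3 \cdot 100) + (15000 \cdot 2 \cdot 100) + (128000 \cdot 1 \cdot 1)}{10\,000\,000} = \frac{12\,628\,000\ \text{B/s}}{10\,000\,000\ \text{B}} \approx 1.26\ \text{rotations/s}
$$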

legigor commented 1 year ago

I found that the internal metrics event rate is constantly growing across the Vector fleet.

image

aashery-square commented 1 year ago

@spencergilbert any updates here?

jszwedko commented 1 year ago

This sounds like it could be caused, at least in part, by the known issue around runaway internal metric cardinality for the kubernetes_logs source, which tags some of its metrics with a file tag (ref: https://github.com/vectordotdev/vector/issues/15426). I see you are already setting https://vector.dev/docs/reference/configuration/global-options/#expire_metrics_secs to expire the internal metrics, but maybe something isn't working there. You could monitor https://vector.dev/docs/reference/configuration/sources/internal_metrics/#internal_metrics_cardinality to see if that is always increasing or if it resets when metrics are expired.
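
To make that easier to watch, here is a minimal sketch that isolates the cardinality gauge with a filter transform and dumps it to the console; the transform and sink names are illustrative, and it assumes the internal_metrics source from the config above:

transforms:
  cardinality_watch:
    type: filter
    inputs:
      - internal_metrics
    # Keep only the cardinality gauge so it can be graphed or logged on its own.
    condition: '.name == "internal_metrics_cardinality"'
sinks:
  cardinality_console:
    type: console
    inputs:
      - cardinality_watch
    encoding:
      codec: json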

Separately, it may be that Vector is holding a lot of open file handles (have you confirmed this using lsof or something else?). You could try tweaking some of the settings to see if you can improve the performance of that source. Users have seen success increasing https://vector.dev/docs/reference/configuration/sources/kubernetes_logs/#max_read_bytes to a higher value so that Vector reads more from each file before moving on.
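
For reference, a sketch of that tweak applied to the k8s_logs source from the config above; the 16384 value is only an illustration, not a recommendation:

sources:
  k8s_logs:
    type: kubernetes_logs
    glob_minimum_cooldown_ms: 15000
    # Read more from each file per pass before moving on to the next one.
    # Illustrative value; tune against your own throughput.
    max_read_bytes: 16384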

Do you have a rough sense of the volume of logs per node?

aashery-square commented 1 year ago

@jszwedko thanks for the extra info! Re: internal metrics cardinality, I'm including what we captured there:

image

I haven't actually investigated open file handle counts yet. Do you know if there's an easy way to run lsof using the native Vector chart? I will also look into playing with max_read_bytes.

spencergilbert commented 1 year ago

I haven't actually investigated open file handle counts yet. Do you know if there's an easy way to run lsof using the native Vector chart? I will also look into playing with max_read_bytes.

If you're using the distroless image you'd need to use an ephemeral container; otherwise I expect lsof would be available on the other images.
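
For example, kubectl debug -it <vector-pod> --image=<image-with-lsof> --target=vector attaches an ephemeral container sharing the Vector container's process namespace; the resulting addition to the pod spec looks roughly like this (the image and names are illustrative):

ephemeralContainers:
  - name: debugger
    # Any debug image that ships lsof (or lets you install it) works here.
    image: alpine
    targetContainerName: vector
    stdin: true
    tty: true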

scMarkus commented 1 year ago

I am seeing similar behavior with the docker_logs source on Kubernetes. I am still running Vector 0.29.0 (x86_64-unknown-linux-gnu 33b3868 2023-04-11 22:19:54.512366263) at the moment. The image shows two pods whose memory usage grew until they hit the Kubernetes limit of 196Mi and were restarted. The other two are shown for reference and are not growing as fast. This set of 4 is a selection from a cluster of about 30 nodes where Vector is running as a daemonset.

image

vector-config.yaml

    api:
      enabled: true
      playground: false
    data_dir: /opt/vector/data/
    enrichment_tables:
    ... TWO CSV ENRICHMENTS ...
    sources:
      docker:
        type: docker_logs
        multiline:
          mode: continue_through
          start_pattern: "^[^\\s]"
          condition_pattern: "^[\\s]+"
          timeout_ms: 2000
      host_journal:
        type: journald
        current_boot_only: true
        since_now: true
        journal_directory: /var/log/journal/
    transforms:
    ... QUITE A BIT OF TRANSFORMATIONS ...
    sinks:
      prometheus:
        type: prometheus_exporter
        inputs:
          - vector_in
          - vector_out
        address: 0.0.0.0:9100
      graylog:
        type: socket
        inputs:
          - unified
          - unified.dropped
        address: graylog:12201
        mode: tcp
        encoding:
          codec: gelf
        buffer:
          type: disk
          max_size: 536870912 # 512 MiB
        framing:
          method: character_delimited
          character_delimited:
            delimiter: "\0"

At the moment all of them have similar memory consumption and a similar number of open file descriptors. I will monitor memory consumption and report back if a correlation with the number of file descriptors becomes apparent.

leandrojmp commented 1 year ago

Hello @jszwedko, could this also be related to the memory leak issue #11025 that I've been getting since version 0.19?

I'm using the file source and kafka sink; every minute Vector reads hundreds of files, sends them to Kafka, and then deletes the files.

The only solution to avoid an OOM crash is to restart the Vector process periodically.

scMarkus commented 1 year ago

I double-checked my assumption that the vector pods grow memory usage faster when many containers are started on their nodes. The following picture shows an aggregation per node (one Vector instance running on each), counting how many distinct container IDs logged something on that node over the past 7 days.

image

The first two bars, which are significantly higher than the others, correspond to exactly the two nodes where the vector pods got restarted in my earlier post https://github.com/vectordotdev/vector/issues/17313#issuecomment-1559534415. File descriptors checked with lsof are increasing as well. But confusingly, the fd counts are only in the hundreds. I would have assumed it would take multiple thousands before those become an issue?

EDIT:

I gave the file descriptor counts another look and got them from all running pods. This is the list:

POD           FDs
vector-22xfg: 179
vector-2ncfz: 1295
vector-2vhvf: 179
vector-4cbpq: 1250
vector-5qt8v: 179
vector-6dn6f: 179
vector-74dct: 257
vector-8b75v: 183
vector-8l7fc: 179
vector-8n8kv: 218
vector-8xt69: 774
vector-b5rfs: 181
vector-b6b9w: 187
vector-bfvrs: 181
vector-d8fp8: 179
vector-f877w: 179
vector-frd45: 1447
vector-g6ddf: 257
vector-gzq46: 220
vector-h4nws: 257
vector-hbjnf: 1027
vector-jdfwg: 1503
vector-jl6d7: 718
vector-kwjsx: 218
vector-kx9qp: 1515
vector-n2gtt: 218
vector-pgspv: 181
vector-q5ktf: 848
vector-s2p9r: 183
vector-sbdxk: 201
vector-sjlc7: 183
vector-sjqqm: 181
vector-srrlf: 1127
vector-svlqx: 1240
vector-t8f7g: 179
vector-tf5n7: 218
vector-xkgwr: 257
vector-xpbzr: 218
vector-z56jz: 179
vector-zrdp9: 1274
vector-zvv7k: 415 <- this one has only a few open file descriptors

Confusingly, this does not show a clear correlation between the number of open file descriptors and memory used. The correlation with the number of started containers holds true, though.

eddy1o2 commented 9 months ago

Hi folks, I have upgraded to the new Vector version 0.33.0 and it looks like internal_metrics_cardinality_total has decreased to 0, but I still hit the memory OOM issue. Does that mean this issue is not resolved, @jszwedko?

image

image

My vector's spec:

sharonx commented 6 months ago

I upgraded to 0.34.1 and I'm seeing the same issue, though it happens on only 1 pod out of many, and they all have the same configuration. Not sure why.

image

We have pipelines for: internal_metrics -> datadog, and prometheus -> influx & kafka

Update: I removed the internal_metrics -> datadog pipeline (15:28 in the screenshot) and the memory consumption is back to normal / stable.

image

benjaminhuo commented 6 months ago

Update: I removed the internal_metrics -> datadog pipeline (15:28 in the screenshot) and the memory consumption is back to normal / stable.

That's an important clue that should help locate the root cause! @sharonx @jszwedko @spencergilbert

sharonx commented 6 months ago

Re:

I removed the internal_metrics -> datadog pipeline (15:28 in the screenshot) and the memory consumption is back to normal / stable.

Sadly I have to say that my previous conclusion was wrong. The pod's memory was stable for a while (~2h) and then started growing again. After that, the container was OOMKilled every 15 minutes or so.

mzupan commented 4 months ago

I'm seeing the same thing, and probably with an even larger log load per node in Kubernetes. I thought it was due to the s3 sink not sending to S3, or sending files that were too small, but when I switched it to the blackhole sink the growth continued. So it seems like the cause is the kubernetes_logs source. I've even tried removing the simple transform that was there.

benjaminhuo commented 4 months ago

cc @wanjunlei