aashery-square opened 1 year ago
It seems likely this is a combination of open file handles and internal metric growth.
To confirm: it's 32 pods sending logs to the single Vector pod that collects them?
@spencergilbert ah, actually 30, not 32. But yes, all 30 pods are colocated on the same node as the Vector daemon, and are logging to file via the json-file log driver. Here's what that looks like:
Each deployment is labeled according to the size of the logs it produces. Each emits 100 logs/s, except for the 128K deployment, which emits 1 log/s. I believe EKS has a default rotation size of 10MB, so the estimated rate at which we'd see file rotations should be ((1250 * 4 * 100) + (3000 * 20 * 100) + (10000 * 3 * 100) + (15000 * 2 * 100) + (128000 * 1 * 1)) / 10000000 = 1.2628 rotations/s.
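The arithmetic above can be sketched as follows (sizes, deployment counts, and log rates are taken from the description; the 10MB rotation size is the assumed EKS default):

```python
# Estimated file-rotation rate from per-deployment log sizes and rates.
# Tuples are (log_size_bytes, num_deployments, logs_per_sec).
deployments = [
    (1250, 4, 100),
    (3000, 20, 100),
    (10000, 3, 100),
    (15000, 2, 100),
    (128000, 1, 1),
]
ROTATION_SIZE = 10_000_000  # assumed EKS default rotation size (~10 MB)

bytes_per_sec = sum(size * count * rate for size, count, rate in deployments)
rotations_per_sec = bytes_per_sec / ROTATION_SIZE
print(rotations_per_sec)  # → 1.2628
```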
I found that the internal metrics event rate is constantly growing across the Vector fleet.
@spencergilbert any updates here?
This sounds like it could be partially caused by the known issue around runaway internal metric cardinality for the kubernetes_logs source, which tags some of its metrics with file (ref: https://github.com/vectordotdev/vector/issues/15426). I see you are already setting https://vector.dev/docs/reference/configuration/global-options/#expire_metrics_secs to expire the internal metrics, but maybe something isn't working there. You could monitor https://vector.dev/docs/reference/configuration/sources/internal_metrics/#internal_metrics_cardinality to see whether it is always increasing or resets when metrics are expired.
Separately, it may be that Vector has a lot of open file handles (have you confirmed this using lsof or something else?). You could try tweaking some of the source's settings to see if you can improve its performance. Users have seen success increasing https://vector.dev/docs/reference/configuration/sources/kubernetes_logs/#max_read_bytes to a higher value so that Vector reads more from each file before moving on.
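A rough sketch of what that tuning could look like (the values here are illustrative, not recommendations; `expire_metrics_secs` and `max_read_bytes` are the documented option names):

```yaml
expire_metrics_secs: 300     # expire stale internal metrics (e.g. per-file tags) after 5 minutes

sources:
  internal_metrics:
    type: internal_metrics   # exposes internal_metrics_cardinality for monitoring
  k8s_logs:
    type: kubernetes_logs
    max_read_bytes: 1048576  # illustrative: read up to 1 MiB per file before moving on
```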
Do you have a rough sense of the volume of logs per node?
@jszwedko thanks for the extra info! Re: internal metrics cardinality, I'm including what we captured there:
I haven't actually investigated open file handle counts yet; do you know if there's an easy way to run lsof using the native Vector chart? I will also look into playing with max_read_bytes.
> I haven't actually investigated open file handle counts yet; do you know if there's an easy way to run lsof using the native Vector chart? I will also look into playing with max_read_bytes.
If you're using the distroless image you'd need to use an ephemeral container; otherwise I expect lsof would be available on the other images.
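If `lsof` isn't available in the image, the descriptor count can also be read straight from `/proc`; a minimal sketch (Linux only; the PID shown here is the current process, but for Vector you'd use its PID inside the container, often 1):

```python
import os

def count_open_fds(pid: int) -> int:
    """Count open file descriptors of a process by listing /proc/<pid>/fd."""
    return len(os.listdir(f"/proc/{pid}/fd"))

# Example: count this process's own descriptors; substitute Vector's PID.
print(count_open_fds(os.getpid()))
```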
I am seeing a similar behavior with a Docker source on Kubernetes. I am still running vector 0.29.0 (x86_64-unknown-linux-gnu 33b3868 2023-04-11 22:19:54.512366263) at the moment.
The image shows two pods whose memory usage grew until hitting the Kubernetes limit of 196Mi, at which point they were restarted. The other two are for reference and are not growing as fast. This set of 4 is a selection from a cluster of about 30 nodes where Vector runs as a DaemonSet.
vector-config.yaml

```yaml
api:
  enabled: true
  playground: false
data_dir: /opt/vector/data/
enrichment_tables:
  # ... TWO CSV ENRICHMENTS ...
sources:
  docker:
    type: docker_logs
    multiline:
      mode: continue_through
      start_pattern: "^[^\\s]"
      condition_pattern: "^[\\s]+"
      timeout_ms: 2000
  host_journal:
    type: journald
    current_boot_only: true
    since_now: true
    journal_directory: /var/log/journal/
transforms:
  # ... QUITE A FEW TRANSFORMATIONS ...
sinks:
  prometheus:
    type: prometheus_exporter
    inputs:
      - vector_in
      - vector_out
    address: 0.0.0.0:9100
  graylog:
    type: socket
    inputs:
      - unified
      - unified.dropped
    address: graylog:12201
    mode: tcp
    encoding:
      codec: gelf
    buffer:
      type: disk
      max_size: 536870912 # 512 MB
    framing:
      method: character_delimited
      character_delimited:
        delimiter: "\0"
```
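For readers unfamiliar with the multiline settings in the config above: `continue_through` starts a new message on a line matching `start_pattern` and appends lines matching `condition_pattern`. A rough illustration of that grouping logic (this is not Vector's implementation, just a sketch of the behavior):

```python
import re

# start_pattern "^[^\s]": a line beginning with non-whitespace starts a message.
# condition_pattern "^[\s]+": indented lines continue the current message.
cont = re.compile(r"^\s+")

def aggregate(lines):
    """Group indented continuation lines under the preceding start line."""
    groups, current = [], None
    for line in lines:
        if current is not None and cont.match(line):
            current.append(line)   # continuation line: extend current message
        else:
            if current:
                groups.append("\n".join(current))
            current = [line]       # new message starts here
    if current:
        groups.append("\n".join(current))
    return groups

logs = ["Exception in thread main", "  at Foo.bar(Foo.java:12)", "next message"]
print(aggregate(logs))  # → ['Exception in thread main\n  at Foo.bar(Foo.java:12)', 'next message']
```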
At the moment all of them have similar memory consumption and a similar number of file descriptors. I will monitor memory consumption and report back if a correlation with the number of file descriptors becomes apparent.
Hello @jszwedko, could this also be related to memory leak issue #11025, which I've been seeing since version 0.19?
I'm using the file source and kafka sink; every minute Vector reads hundreds of files, sends them to Kafka, and then deletes the files. The only way to avoid an OOM crash is to restart the Vector process periodically.
I double-checked my assumption that the Vector pods grow memory usage faster when many containers are started on their nodes. The following picture shows an aggregation per node (one Vector instance running on each), counting how many distinct container IDs logged something over the past 7 days.
The first two bars, which are significantly higher than the others, are exactly the two nodes where the Vector pods got restarted in my earlier post https://github.com/vectordotdev/vector/issues/17313#issuecomment-1559534415. File descriptors checked with lsof are increasing as well. But confusingly, the fd counts are only in the hundreds; I would have assumed it takes multiple thousands before those become an issue?
EDIT:
I gave the file descriptor counts another look and got them from all running pods. This is the list:
POD FDs
vector-22xfg: 179
vector-2ncfz: 1295
vector-2vhvf: 179
vector-4cbpq: 1250
vector-5qt8v: 179
vector-6dn6f: 179
vector-74dct: 257
vector-8b75v: 183
vector-8l7fc: 179
vector-8n8kv: 218
vector-8xt69: 774
vector-b5rfs: 181
vector-b6b9w: 187
vector-bfvrs: 181
vector-d8fp8: 179
vector-f877w: 179
vector-frd45: 1447
vector-g6ddf: 257
vector-gzq46: 220
vector-h4nws: 257
vector-hbjnf: 1027
vector-jdfwg: 1503
vector-jl6d7: 718
vector-kwjsx: 218
vector-kx9qp: 1515
vector-n2gtt: 218
vector-pgspv: 181
vector-q5ktf: 848
vector-s2p9r: 183
vector-sbdxk: 201
vector-sjlc7: 183
vector-sjqqm: 181
vector-srrlf: 1127
vector-svlqx: 1240
vector-t8f7g: 179
vector-tf5n7: 218
vector-xkgwr: 257
vector-xpbzr: 218
vector-z56jz: 179
vector-zrdp9: 1274
vector-zvv7k: 415 <- this one has only a few open file descriptors
Confusingly, this does not show a clear correlation between the number of open file descriptors and memory used. Still, the correlation with the number of started containers holds true.
Hi folks, I have upgraded to Vector 0.33.0 and it looks like internal_metrics_cardinality_total has decreased to 0, but I still get memory OOM issues. Does that mean this issue is not resolved, @jszwedko?
My Vector spec:

```yaml
vector.yaml: |
  api:
    enabled: true
  sources:
    kubernetes_logs:
      type: kubernetes_logs
      glob_minimum_cooldown_ms: 500
      max_read_bytes: 2048000
      oldest_first: true
  # ...
cpu:
  request: 50m
  limit: 100m
memory:
  request: 1000Mi
  limit: 1000Mi
```
I upgraded to 0.34.1 and I'm seeing the same issue, though it happens on only 1 pod out of many, all of which have the same configuration. Not sure why.
We have pipelines for: internal_metrics -> datadog, and prometheus -> influx & kafka
Update: I removed the internal_metrics -> datadog pipeline (15:28 in the screenshot), the memory consumption is back to normal / stable
> Update: I removed the internal_metrics -> datadog pipeline (15:28 in the screenshot), the memory consumption is back to normal / stable
It's an important clue to help to locate the root cause! @sharonx @jszwedko @spencergilbert
Re:

> I removed the internal_metrics -> datadog pipeline (15:28 in the screenshot), the memory consumption is back to normal / stable
Sadly, I have to say that my previous conclusion was wrong. The pod's memory was stable for a while (~2h), then started growing again. After that, the container was OOMKilled every 15 min or so.
I'm seeing the same thing, and probably have an even larger log load per node in Kubernetes. I thought it was due to the s3 sink not sending to S3, or sending files that were too small, but when I switch it to the blackhole sink the growth continues. So it seems like the cause is the kubernetes source. I've even tried removing the simple transform that was there.
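Swapping the real sink for a blackhole, as described above, is a cheap way to rule the sink out when chasing memory growth; a sketch (the input name is a placeholder for whatever source or transform is under test):

```yaml
sinks:
  debug_blackhole:
    type: blackhole            # discards all events, isolating source/transform memory behavior
    inputs:
      - kubernetes_logs        # placeholder: the source or transform under test
    print_interval_secs: 10    # periodically logs how many events were discarded
```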
cc @wanjunlei
Problem
We're seeing this behavior with Vector 0.29.1 on EKS 1.24. We are load testing Vector by running it as a DaemonSet against pods writing files via the json-file log driver for containerd. There are 32 pods writing 100 logs/s each, so we're pushing 3200 logs/s through the Vector daemon on the node. Log sizes vary from 1.25K to 128K, with mostly smaller writes. We see unbounded memory growth consistently, even with a blackhole sink. Based on log throughput and EKS' rotation size, I'd imagine that we're rotating through a ton of files, so we believe that Vector is probably hoarding open file handles.

Setup
Chart.yaml
Values.yaml
Vector Config
The redacted remaps below are just adding additional static tags.
Configuration
No response
Version
0.29.1
Debug Output
No response
Example Data
Additional Context
No response
References
#14750