adolsalamanca opened this issue 8 months ago (status: Open)
I can't reproduce the issue if I just run `timberio/vector:0.27.X-alpine` in Docker.

By the sounds of it, the source isn't able to access `/sys/fs/cgroup`. It's possible Kubernetes is doing something funky with this. Are you able to run a shell inside the pod and `ls /sys/fs/cgroup`?
Thanks for your response @StephenWakely. Yeah, it's not something super common; it has only happened a few times since we started scraping host metrics. I'll try to run this if the error appears once again, and keep you posted.
I wonder if it is an issue of the cgroups structure changing while we are trying to read the files. That would introduce a race condition that would produce those "file not found" errors. Do you know if this happens more frequently when the contents of the pods are changing, @adolsalamanca?
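The suspected race can be sketched in a few lines of Python. This is an illustrative toy, not Vector's code: the cgroup name `pod-abc` and the use of `cpu.stat` are invented, and a real scan would race against the kernel removing the cgroup rather than the same process deleting it.

```python
import os
import tempfile

# Simulate the suspected race: a cgroup directory disappears after the
# collector has listed the tree but before it reads the stat files.
root = tempfile.mkdtemp()
child = os.path.join(root, "pod-abc")  # hypothetical cgroup name
os.mkdir(child)
with open(os.path.join(child, "cpu.stat"), "w") as f:
    f.write("usage_usec 0\n")

children = os.listdir(root)            # step 1: list the cgroup children...
os.remove(os.path.join(child, "cpu.stat"))
os.rmdir(child)                        # ...meanwhile the container exits

errors = []
for name in children:                  # step 2: read files from a stale list
    try:
        with open(os.path.join(root, name, "cpu.stat")) as f:
            f.read()
    except FileNotFoundError as e:
        errors.append(f"{e.strerror} (os error {e.errno})")

print(errors)  # ["No such file or directory (os error 2)"]
```

The message produced matches the `No such file or directory (os error 2)` seen in the reported error.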
Thanks for your message @bruceg. It sounds like a reasonable assumption, but I don't have much data to assess it right now. All I can do is wait for the problem to appear again, then check and compare the IOPS of that node with others.

It wasn't the first time, but it hasn't happened again since we reported it 😞
To be clear, it's not so much a matter of IOPS as of pods changing, that is, of containers being started or stopped. From comparing your error to what is in the source code, I can definitely see the possibility of this happening; I'm just curious whether that is what is happening to you.
Thanks for your response @bruceg. It doesn't seem to be something that happens that often; below are the results of listing the agent pods in two of our clusters:
kubectl get pods -l app.kubernetes.io/instance=cgroups-metrics
NAME READY STATUS RESTARTS AGE
cgroups-metrics-2dsbl 1/1 Running 0 66d
cgroups-metrics-2kvzl 1/1 Running 0 66d
cgroups-metrics-2mpb5 1/1 Running 0 66d
cgroups-metrics-42r7p 1/1 Running 0 66d
cgroups-metrics-4cpd6 1/1 Running 0 66d
cgroups-metrics-4hrrb 1/1 Running 0 16d
cgroups-metrics-4ktmm 1/1 Running 0 66d
cgroups-metrics-4ns5r 1/1 Running 0 50d
cgroups-metrics-4p88m 1/1 Running 0 66d
cgroups-metrics-4pk8t 1/1 Running 0 66d
cgroups-metrics-5d7vv 1/1 Running 0 66d
cgroups-metrics-5nbg4 1/1 Running 0 66d
cgroups-metrics-64zcj 1/1 Running 0 66d
cgroups-metrics-6756r 1/1 Running 0 66d
cgroups-metrics-6h5bv 1/1 Running 0 66d
cgroups-metrics-6tbb2 1/1 Running 0 66d
cgroups-metrics-6zfv5 1/1 Running 0 66d
cgroups-metrics-78ftx 1/1 Running 0 45d
cgroups-metrics-8vkml 1/1 Running 0 66d
cgroups-metrics-9569m 1/1 Running 0 66d
cgroups-metrics-9kxht 1/1 Running 0 66d
cgroups-metrics-9mjxb 1/1 Running 0 66d
cgroups-metrics-9tnsw 1/1 Running 0 66d
cgroups-metrics-b4m64 1/1 Running 0 66d
cgroups-metrics-bgdxb 1/1 Running 0 66d
cgroups-metrics-bqttv 1/1 Running 0 43d
cgroups-metrics-br9n2 1/1 Running 0 8d
cgroups-metrics-brd98 1/1 Running 0 66d
cgroups-metrics-c25pp 1/1 Running 0 66d
cgroups-metrics-c4mzd 1/1 Running 0 66d
cgroups-metrics-cftfh 1/1 Running 0 66d
cgroups-metrics-cqv4x 1/1 Running 0 66d
cgroups-metrics-crcqg 1/1 Running 0 66d
cgroups-metrics-csrhd 1/1 Running 0 66d
cgroups-metrics-ctft9 1/1 Running 0 66d
cgroups-metrics-djhs2 1/1 Running 0 66d
cgroups-metrics-dkdd8 1/1 Running 0 66d
cgroups-metrics-fr4b2 1/1 Running 0 66d
cgroups-metrics-fz44n 1/1 Running 0 66d
cgroups-metrics-gczlh 1/1 Running 0 66d
cgroups-metrics-ghzvq 1/1 Running 0 66d
cgroups-metrics-gnjv7 1/1 Running 0 66d
cgroups-metrics-gp4zk 1/1 Running 0 14d
cgroups-metrics-h2v4p 1/1 Running 0 43d
cgroups-metrics-hc6tt 1/1 Running 0 26d
cgroups-metrics-hgl6r 1/1 Running 0 66d
cgroups-metrics-hjx2p 1/1 Running 0 66d
cgroups-metrics-hzmh9 1/1 Running 0 42d
cgroups-metrics-j9zqs 1/1 Running 0 66d
cgroups-metrics-jccp4 1/1 Running 0 66d
cgroups-metrics-jxc4x 1/1 Running 0 45d
cgroups-metrics-k4g8r 1/1 Running 0 6d3h
cgroups-metrics-k9d78 1/1 Running 0 66d
cgroups-metrics-kddk6 1/1 Running 0 66d
cgroups-metrics-kvnnm 1/1 Running 0 66d
cgroups-metrics-l2tql 1/1 Running 0 66d
cgroups-metrics-l8clx 1/1 Running 0 66d
cgroups-metrics-lh8hf 1/1 Running 0 6d3h
cgroups-metrics-lk842 1/1 Running 0 24d
cgroups-metrics-ls8qd 1/1 Running 0 66d
cgroups-metrics-lv7f6 1/1 Running 0 66d
cgroups-metrics-m8jzh 1/1 Running 0 66d
cgroups-metrics-mmsh7 1/1 Running 0 66d
cgroups-metrics-mnxtp 1/1 Running 0 43d
cgroups-metrics-mtz52 1/1 Running 0 23d
cgroups-metrics-mvkdj 1/1 Running 0 66d
cgroups-metrics-n9qnt 1/1 Running 0 66d
cgroups-metrics-npfd5 1/1 Running 0 66d
cgroups-metrics-nxwvn 1/1 Running 0 66d
cgroups-metrics-p4zxj 1/1 Running 0 66d
cgroups-metrics-p5grq 1/1 Running 0 66d
cgroups-metrics-p7bjf 1/1 Running 0 66d
cgroups-metrics-p7tk2 1/1 Running 0 66d
cgroups-metrics-ptgkl 1/1 Running 0 66d
cgroups-metrics-pvf4d 1/1 Running 0 66d
cgroups-metrics-q2xgq 1/1 Running 0 66d
cgroups-metrics-q69ws 1/1 Running 0 66d
cgroups-metrics-qdggd 1/1 Running 0 45d
cgroups-metrics-qq2wf 1/1 Running 0 66d
cgroups-metrics-r76c2 1/1 Running 0 66d
cgroups-metrics-rxkll 1/1 Running 0 66d
cgroups-metrics-s28jf 1/1 Running 1 (21d ago) 66d
cgroups-metrics-sb46m 1/1 Running 0 66d
cgroups-metrics-sjpb2 1/1 Running 0 66d
cgroups-metrics-sqnkg 1/1 Running 0 66d
cgroups-metrics-sxzgh 1/1 Running 0 64d
cgroups-metrics-t5c29 1/1 Running 0 66d
cgroups-metrics-v6dqk 1/1 Running 0 42d
cgroups-metrics-v9r45 1/1 Running 0 66d
cgroups-metrics-vdphv 1/1 Running 0 66d
cgroups-metrics-vgt2l 1/1 Running 0 66d
cgroups-metrics-vj2ml 1/1 Running 0 66d
cgroups-metrics-vnqgc 1/1 Running 0 66d
cgroups-metrics-vzxnq 1/1 Running 0 63d
cgroups-metrics-w75rg 1/1 Running 0 66d
cgroups-metrics-w9sj5 1/1 Running 0 36d
cgroups-metrics-wmxpz 1/1 Running 0 66d
cgroups-metrics-wrx4v 1/1 Running 0 66d
cgroups-metrics-wtx5q 1/1 Running 0 66d
cgroups-metrics-wxwch 1/1 Running 0 66d
cgroups-metrics-x2tjl 1/1 Running 0 43d
cgroups-metrics-xk7tm 1/1 Running 0 66d
cgroups-metrics-z7sp2 1/1 Running 0 66d
cgroups-metrics-zccqd 1/1 Running 0 66d
kubectl -n $sav get pods -l app.kubernetes.io/instance=cgroups-metrics
NAME READY STATUS RESTARTS AGE
cgroups-metrics-2kbg2 1/1 Running 0 31d
cgroups-metrics-2zb52 1/1 Running 0 2d11h
cgroups-metrics-444vm 1/1 Running 0 45h
cgroups-metrics-4n4kr 1/1 Running 0 31d
cgroups-metrics-4n4td 1/1 Running 0 31d
cgroups-metrics-4rdwh 1/1 Running 0 31d
cgroups-metrics-5l59j 1/1 Running 0 31d
cgroups-metrics-6bdlv 1/1 Running 0 31d
cgroups-metrics-7vglz 1/1 Running 0 31d
cgroups-metrics-8jgkn 1/1 Running 0 31d
cgroups-metrics-9fcpp 1/1 Running 0 31d
cgroups-metrics-9kl2n 1/1 Running 0 31d
cgroups-metrics-9n6wr 1/1 Running 0 31d
cgroups-metrics-9pps2 1/1 Running 2 (3d5h ago) 31d
cgroups-metrics-9w27x 1/1 Running 0 31d
cgroups-metrics-b6lgq 1/1 Running 0 43h
cgroups-metrics-b7bl5 1/1 Running 0 31d
cgroups-metrics-bf8cl 1/1 Running 0 31d
cgroups-metrics-bnspl 1/1 Running 0 13d
cgroups-metrics-cct2c 1/1 Running 0 2d4h
cgroups-metrics-d7dt4 1/1 Running 0 31d
cgroups-metrics-dbjmj 1/1 Running 0 10d
cgroups-metrics-g4nr5 1/1 Running 0 31d
cgroups-metrics-gbxzl 1/1 Running 0 2d15h
cgroups-metrics-hbhdk 1/1 Running 0 31d
cgroups-metrics-jldfk 1/1 Running 1 (13d ago) 31d
cgroups-metrics-jzlwq 1/1 Running 0 15d
cgroups-metrics-kg29s 1/1 Running 0 31d
cgroups-metrics-kk6x2 1/1 Running 0 31d
cgroups-metrics-kx4tz 1/1 Running 0 31d
cgroups-metrics-lrxbt 1/1 Running 0 31d
cgroups-metrics-mgqwb 1/1 Running 0 31d
cgroups-metrics-nc4mp 1/1 Running 0 31d
cgroups-metrics-pnr2q 1/1 Running 0 31d
cgroups-metrics-pr8vm 1/1 Running 0 31d
cgroups-metrics-q4gxl 1/1 Running 0 31d
cgroups-metrics-qb6dp 1/1 Running 0 31d
cgroups-metrics-qsqcb 1/1 Running 0 31d
cgroups-metrics-rltft 1/1 Running 0 31d
cgroups-metrics-rn68q 1/1 Running 0 31d
cgroups-metrics-rxnmg 1/1 Running 0 31d
cgroups-metrics-srshk 1/1 Running 0 31d
cgroups-metrics-sxr2s 1/1 Running 0 34h
cgroups-metrics-t642x 1/1 Running 0 31d
cgroups-metrics-t8fst 1/1 Running 0 31d
cgroups-metrics-tnpdv 1/1 Running 0 31d
cgroups-metrics-tnsbv 1/1 Running 0 31d
cgroups-metrics-tqh6k 1/1 Running 0 14d
cgroups-metrics-v2r4g 1/1 Running 0 2d17h
cgroups-metrics-wk6nn 1/1 Running 0 31d
cgroups-metrics-x275s 1/1 Running 0 31d
cgroups-metrics-x8jkt 1/1 Running 0 31d
cgroups-metrics-xh2f5 1/1 Running 0 31d
cgroups-metrics-xrxvf 1/1 Running 0 31d
cgroups-metrics-z28t4 1/1 Running 0 31d
cgroups-metrics-z8crf 1/1 Running 0 31d
cgroups-metrics-zk66t 1/1 Running 1 (3d14h ago) 29d
cgroups-metrics-zxln4 1/1 Running 0 10d
Right, I would expect it to be fairly rare. It would happen when the list of cgroups changes between listing the directory and reading the contents of the files in it. It would probably be made more likely if Vector was busy at the same time and internally scheduled away from the host metrics task, widening that window.
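If that is the cause, one plausible mitigation is to treat `ENOENT` during the scan as "this cgroup just went away" and skip the entry instead of surfacing an error. This is a hypothetical helper sketching that idea, not Vector's actual fix; the directory layout and `cpu.stat` file name are stand-ins:

```python
import errno
import os

def read_cgroup_children(root):
    """Read one stat file from each cgroup child directory, silently
    skipping entries that disappear mid-scan. Illustrative only."""
    results = {}
    try:
        children = os.listdir(root)
    except FileNotFoundError:
        return results  # whole tree vanished; nothing to report
    for name in children:
        path = os.path.join(root, name, "cpu.stat")
        try:
            with open(path) as f:
                results[name] = f.read()
        except OSError as e:
            if e.errno == errno.ENOENT:
                continue  # cgroup removed between listing and reading
            raise
    return results
```

Any other `OSError` (permissions, I/O errors) is still raised, so only the transient removal case is swallowed.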
Another occurrence; logs below:
cgroups-metrics-z64mh vector 2024-01-31T13:03:44.710012Z INFO vector::app: Internal log rate limit configured. internal_log_rate_secs=10
cgroups-metrics-z64mh vector 2024-01-31T13:03:44.712114Z INFO vector::app: Log level is enabled. level="info"
cgroups-metrics-z64mh vector 2024-01-31T13:03:44.713223Z INFO vector::app: Loading configs. paths=["/etc/vector"]
cgroups-metrics-z64mh vector 2024-01-31T13:03:44.789217Z WARN vector::config::loading: Transform "cgroup_router._unmatched" has no consumers
cgroups-metrics-z64mh vector 2024-01-31T13:03:44.796030Z INFO source{component_kind="source" component_id=cgroup_metrics component_type=host_metrics component_name=cgroup_metrics}: vector::sources::host_metrics: PROCFS_ROOT is set in envvars. Using custom for procfs. custom="/host/proc"
cgroups-metrics-z64mh vector 2024-01-31T13:03:44.796043Z INFO source{component_kind="source" component_id=cgroup_metrics component_type=host_metrics component_name=cgroup_metrics}: vector::sources::host_metrics: SYSFS_ROOT is set in envvars. Using custom for sysfs. custom="/host/sys"
cgroups-metrics-z64mh vector 2024-01-31T13:03:44.815354Z ERROR vector_buffers::variants::disk_v2::writer: Last written record was unable to be deserialized. Corruption likely. reason="invalid data: check failed for struct member payload: pointer out of bounds: base 0x7fbf8cb39ff8 offset -1377203812 not in range 0x7fbf91ca8000..0x7fbf91cdd000"
cgroups-metrics-z64mh vector 2024-01-31T13:03:46.243927Z INFO vector::topology::running: Running healthchecks.
cgroups-metrics-z64mh vector 2024-01-31T13:03:46.244036Z INFO vector::topology::builder: Healthcheck passed.
cgroups-metrics-z64mh vector 2024-01-31T13:03:46.244222Z INFO vector: Vector has started. debug="false" version="0.27.1" arch="x86_64" revision="19a51f2 2023-02-22"
cgroups-metrics-z64mh vector 2024-01-31T13:03:46.244513Z INFO vector::sinks::prometheus::exporter: Building HTTP server. address=0.0.0.0:9090
cgroups-metrics-z64mh vector 2024-01-31T13:03:46.251556Z INFO vector::internal_events::api: API server running. address=0.0.0.0:8686 playground=off
cgroups-metrics-z64mh vector 2024-01-31T13:03:46.269963Z INFO vector::topology::builder: Healthcheck passed.
cgroups-metrics-z64mh vector 2024-01-31T13:03:46.274133Z INFO vector::topology::builder: Healthcheck passed.
cgroups-metrics-z64mh vector 2024-01-31T13:04:02.327559Z ERROR source{component_kind="source" component_id=cgroup_metrics component_type=host_metrics component_name=cgroup_metrics}: vector::internal_events::host_metrics: Failed to load cgroups children. error=No such file or directory (os error 2) error_type="reader_failed" stage="receiving" internal_log_rate_limit=true
Problem
We're running Vector as a DaemonSet in our k8s cluster. All of a sudden we started getting errors from host_metrics; we're only using the cgroups collector on it.
Version
timberio/vector:0.27.X-alpine