Closed MaxRink closed 1 year ago
Hi @MaxRink !
Vector will hold onto deleted files until it has fully read them. Can you tell if that's what is happening here?
They got rotated (and, after a few iterations, deleted) by logrotate or by the k8s-internal apiserver log rotation.
@MaxRink makes sense, but do you know if Vector is still reading from them? Attaching strace should tell you.
I don't think it is. I'm not seeing any output from `strace -p 1326978 -e trace=file` (where 1326978 is the Vector PID).
Edit: I forgot to trace recursively; all the I/O seems to happen in child processes. But I'm not seeing access to the files above there either: https://gist.github.com/MaxRink/05dd37f6d3daf7b996821eb46b85ee29
We're also seeing this issue -- maybe it's because we're not reaching EOF for some reason? Not sure why that would be happening, though.
Also faced this issue last night. Vector keeps file descriptors for rotated logs; is there any known workaround for that?
https://github.com/vectordotdev/vector/pull/17603 may fix this.
👍 We have this issue reported for several large customers in https://issues.redhat.com/browse/LOG-3949. Fluentd implementations have a similar issue, but they also have a delay config parameter that allows them to avoid reading rotated files in their entirety, rather than always reading to EOF. A similar option may be a useful feature here.
The current implementation of the file source is expected to tail files until they are deleted. Once a file is deleted and EOF has been read, the handle will be released. If users are hitting a condition under which Vector has read to EOF of a deleted file and still hasn't released the file handle, that would be a bug. We haven't been able to reproduce that behavior as of yet, though, so we would love a reproduction case 🙂.
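The lifecycle described above can be sketched in miniature (a Python illustration of the underlying POSIX semantics, not Vector's actual Rust implementation): a deleted file remains fully readable through an already-open handle, and the kernel only reclaims its disk space once that handle is closed.

```python
import os
import tempfile

# Create a file with some content to "tail".
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "app.log")
with open(path, "w") as w:
    w.write("line1\nline2\n")

f = open(path, "r")   # the tailer's long-lived handle
os.unlink(path)       # logrotate "deletes" the file

# The open handle still works: drain to EOF before releasing it,
# mirroring the expected behavior described above.
remaining = f.read()
f.close()             # only now is the disk space actually reclaimed
os.rmdir(tmpdir)
```

This is why pkill-ing Vector frees the space immediately: closing the process's descriptors is what lets the filesystem reclaim the unlinked inodes.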
Having a parameter similar to rotate_wait from Fluentd would cause less pressure, but would risk missing more logs if Vector hadn't finished reading the deleted file.
https://github.com/vectordotdev/vector/pull/17603#issuecomment-1590141533 steps through how this logic of reading deleted files until EOF works in Vector.
@jszwedko we are using the kubernetes_logs source. Does that matter?
It shouldn't. The behavior is the same since they share the same underlying file tailing implementation.
> Having a similar parameter to rotate_wait from Fluentd would cause less pressure but would risk missing more logs if Vector hadn't finished reading the deleted file.
@jszwedko I would like to revisit this, as I realize the ramification of adding such a parameter is log loss; we see it now with Fluentd. It is naive, however, to assume the collector and host resources are infinite and that the collector can always keep up with the log volume. We routinely push the collector on OpenShift clusters to the point where Vector is unable to keep up with the load. As an admin, is it better for me to "never lose a log" or to continue to collect? A configuration point would be an opt-in choice where admins know the trade-offs. I suspect it could be instrumented as well, with a metric identifying when Vector didn't reach EOF, against which to write an alert.
Hi team, I also faced this issue. Has anyone solved it yet? Basically, the actual log size (from `du` on the node) is around 10 GB in our case, but using `lsof -nP +L1` to list open file descriptors for deleted files, we find ~90 GB held by Vector. This is in line with what `df` shows. So those log files are supposed to be deleted (due to log rotation), but they aren't freed and still occupy disk space because Vector as a process is still holding onto them.
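What `lsof -nP +L1` reports can be checked directly: a deleted-but-still-open file has a link count of zero while keeping its size allocated, which is exactly the `du` vs. `df` gap described above. A small Python sketch of that check (assumes a Linux-style filesystem):

```python
import os
import tempfile

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "rotated.log")
with open(path, "w") as w:
    w.write("x" * 4096)   # stand-in for a rotated log file

f = open(path, "r")
os.unlink(path)           # "rotated away" by logrotate

st = os.fstat(f.fileno())
# `lsof +L1` filters on exactly this condition: link count < 1.
# st.st_nlink is 0, yet st.st_size bytes remain allocated on disk
# until the descriptor is closed.
f.close()
os.rmdir(tmpdir)
```

`du` never sees such files (they have no directory entry to walk), while `df` still counts their blocks, hence the discrepancy.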
My Vector specs: 0.28.0-debian

sources:
  kubernetes_logs:
    type: kubernetes_logs
    glob_minimum_cooldown_ms: 500
    max_read_bytes: 2048000
https://github.com/vectordotdev/vector/issues/11742#issuecomment-1675466517 is the expectation. In your case, it means that Vector is still reading from those deleted files. If you believe that not to be the case, we would definitely appreciate a reproduction case for this as we've tried a few times but haven't been able to reproduce behavior where Vector doesn't release the file handle once it gets to EOF for a deleted file.
I see, but in https://github.com/vectordotdev/vector/issues/11742#issuecomment-1675466517 you mentioned the file source, while we are using the kubernetes_logs source. Is it still expected? Or is it related to the fix in https://github.com/vectordotdev/vector/issues/18088 for the kubernetes_logs source?
Ah, yes, they use the same underlying mechanisms for file reading so the behavior, in this respect, should be the same. Thanks for clarifying!
After discussing this one with the team, we came to the following conclusion: add an opt-in rotate_wait parameter, which would mean Vector doesn't read to EOF. cc @wanjunlei
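For illustration only, such an option might look like the following (the parameter name, units, and placement are hypothetical here; check the released Vector documentation for the actual option and its defaults):

```yaml
sources:
  kubernetes_logs:
    type: kubernetes_logs
    # Hypothetical: stop reading a rotated/deleted file this many
    # seconds after deletion, even if EOF has not been reached.
    # Opt-in, since it trades potential log loss for bounded disk usage.
    rotate_wait_secs: 30
```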
Problem
We ran into the issue that Vector keeps old inodes open, thus filling up the disk.
Inodes: https://gist.github.com/MaxRink/ee056e27a4b11a7b710e437e1f892984. After pkill-ing Vector, disk usage returned to normal.
Version
0.20.0
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response