Open jszwedko opened 4 years ago
So the description for the ignore_older option is somewhat misleading/incomplete. Currently it is:
Ignore files with a data modification date that does not exceed this age.
which doesn't say that it will collect newer data from the file. That's why it's tailing them: it only ignores the old data, not the file. So a clearer description would be:
Ignore existing data in files with a data modification date older than this age. Subsequent data will be collected.
This is the expected behavior, so just the documentation should be updated.
Hm, that's not how I thought the option worked. I think the behavior should change. The idea is to ignore older files as if they don't exist, not just older data. This ensures that Vector does not open file handles for the file, etc.
@ktff, we're working on an RFC to improve the file source in #3480. I think we should address this there.
The idea is to ignore older files as if they don't exist, not just older data.
What would you want to happen with a file that initially has not been modified since ignore_older but then starts getting writes? If you want to start reading only the new data, we'd need to keep track of where that starts somehow. Right now we do it with the file cursor.
That's a good point. Ideally, we'd rediscover the file and read the new data only, but in the interim we would not hold a file handle for it.
I think I like that, but it does seem like it would make it trickier to figure out which file contents were new. I guess you could keep track of the sizes of the ignored files and then seek to that point when their mtime changes? That wouldn't handle the case that the file is overwritten, rather than appended to, though.
I guess you could keep track of the sizes of the ignored files and then seek to that point when their mtime changes? That wouldn't handle the case that the file is overwritten, rather than appended to, though.
Yeah, you'd at least need to make sure the fingerprint was the same or weird things could happen.
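The size-plus-fingerprint idea discussed above could look roughly like the following. This is only a sketch, not how Vector's file source is implemented: `IgnoredFile`, `remember`, `resume_offset`, and the 256-byte fingerprint length are all hypothetical names and values chosen for illustration. It also does not cover the restart case described in the next comment, since nothing is persisted.

```python
import hashlib
import os
from dataclasses import dataclass
from typing import Optional

FINGERPRINT_BYTES = 256  # hypothetical length, not a Vector setting


@dataclass
class IgnoredFile:
    fingerprint: str  # checksum of the first bytes of the file
    size: int         # file size at the time the file was skipped
    mtime: float      # modification time at the time the file was skipped


def fingerprint(path: str, nbytes: int = FINGERPRINT_BYTES) -> str:
    """Hash the head of the file so a replaced file can be told apart."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(nbytes)).hexdigest()


def remember(path: str) -> IgnoredFile:
    """Record what an ignored file looked like, without keeping it open."""
    st = os.stat(path)
    return IgnoredFile(fingerprint(path), st.st_size, st.st_mtime)


def resume_offset(path: str, seen: IgnoredFile) -> Optional[int]:
    """Return the offset to start reading from, or None if nothing changed."""
    st = os.stat(path)
    if st.st_mtime == seen.mtime and st.st_size == seen.size:
        return None
    if st.st_size < seen.size or fingerprint(path) != seen.fingerprint:
        return 0  # truncated or overwritten: treat it as a new file
    return seen.size  # appended to: skip the data that was being ignored
```

One caveat of this sketch: a file shorter than the fingerprint length changes its fingerprint when appended to, so it would be re-read from the start.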
Another confusing case would be if Vector is started, ignores a file for an old mtime, then Vector is stopped, then the file is written to, then Vector is started again. At that point, it would (depending on the config) read the whole file from the beginning, including data it had been ignoring to that point. Right now I think we avoid that via the normal checkpointing of open files.
I think the behavior should change
In that case, whatever solution we come up with to avoid holding file handles should be usable for other files as well, so we can reduce total file handle usage further. This way we can avoid some of the special casing and use the normal checkpointing to avoid the issues @lukesteensen mentioned.
Is there a workaround to that issue?
In my use-case, there are many short-lived jobs that generate roughly 100000 files per month. Now due to business requirements, we can't delete the files before that.
I'm looking for a way to limit the number of tailed files somehow (i.e. by looking at ctime). Is there any way to do that?
@vbichov
Is there a workaround to that issue?
One way is to point Vector to a folder with symlinks to the files and then have two services/scripts that will create symlinks and delete older ones.
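A minimal sketch of such a script, assuming a single flat source directory and a cutoff based on mtime. The function name `sync_symlinks` and its parameters are hypothetical; it would be run periodically (for example from cron), with Vector's include pointed at the link directory.

```python
import os
import time


def sync_symlinks(source_dir: str, link_dir: str, max_age_secs: float) -> None:
    """Mirror recently modified files from source_dir into link_dir as symlinks."""
    os.makedirs(link_dir, exist_ok=True)
    cutoff = time.time() - max_age_secs
    fresh = set()

    # Create links for files modified within the allowed age.
    for entry in os.scandir(source_dir):
        if entry.is_file() and entry.stat().st_mtime >= cutoff:
            fresh.add(entry.name)
            link = os.path.join(link_dir, entry.name)
            if not os.path.islink(link):
                os.symlink(os.path.abspath(entry.path), link)

    # Delete links whose targets are too old or no longer exist.
    for entry in os.scandir(link_dir):
        if entry.is_symlink() and entry.name not in fresh:
            os.unlink(entry.path)
```

The original files are never touched, so retention requirements on them are unaffected; only the links come and go.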
Hi, are you planning to make changes so that this parameter forces Vector to ignore old files and not open a handle for them?
Hi, I changed /lib/file-source/src/file_watcher/mod.rs so that the FileWatcher.new() function returns a FileWatcher structure with is_dead: true when too_old is true. With that change the old files are not opened, and Vector works with directories that contain a lot of old files. But there are 2 problems:
It would be very good if such functionality appeared in a future version, since right now Vector cannot be used with directories that contain a lot of old files, due to the large number of open files.
I've been using a workaround where, periodically, I add patterns for old files to my exclude config so that it doesn't consider older files.
exclude = [
# hack to reduce start-up time and file descriptor usage
"**/2022-*",
"**/2023-01-*",
"**/2023-02-*",
"**/2023-03-*",
"**/2023-04-*",
"**/2023-05-*",
"**/2023-06-*",
]
This is a pain to keep up to date and it would be great if I only had to set ignore_older_secs.
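Until then, the list could be generated rather than edited by hand. The helper below is a hypothetical sketch that assumes file names start with a `YYYY-MM-` prefix, as in the patterns above; writing the result into the config and reloading Vector is left out.

```python
from datetime import date
from typing import List


def exclude_patterns(first: date, today: date, keep_months: int = 1) -> List[str]:
    """Build one glob per month from `first` up to the last month to exclude.

    keep_months=1 keeps only the current month; 2 also keeps the previous one.
    """
    patterns = []
    year, month = first.year, first.month
    # Compare months as a single running index (year * 12 + month - 1).
    last_excluded = today.year * 12 + (today.month - 1) - keep_months
    while year * 12 + (month - 1) <= last_excluded:
        patterns.append(f"**/{year:04d}-{month:02d}-*")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return patterns
```

Unlike the hand-written list, this emits one pattern per month rather than collapsing whole years into a single glob.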
It would be nice if this file sink syntax:
[sinks.my_sink_id]
type = "file"
inputs = [ "my-source-or-transform-id" ]
path = "/tmp/vector-%Y-%m-%d.log"
Would work to apply a date string variable to the file source, such as this:
[sources.my_source_id]
type = "file"
include = [ "/var/log/**/%Y-%m-%d*.log" ]
This would let Vector include files with timestamps that are generated in real time by the remote applications, while ignoring the older files. This would look like an enhanced glob match where some variables are included and must be resolved prior to the glob string being applied.
This would solve @ethack's issue, as well as a number of other applications that I've seen where datestamps are included in the active log file and where rotation is frequent, leading to many open file handles and heavy load by Vector to watch these files despite tuning.
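To make the proposal concrete, here is a rough sketch of "resolve the date variables first, then apply the glob". The file source does not support this syntax today; `resolve_pattern` and `matching_files` are hypothetical helpers, and the `days_back` parameter is an assumption added so that files from just before midnight are not dropped immediately.

```python
import glob
from datetime import date, timedelta
from typing import List


def resolve_pattern(template: str, day: date) -> str:
    """Substitute strftime specifiers, leaving glob characters untouched."""
    return day.strftime(template)


def matching_files(template: str, today: date, days_back: int = 0) -> List[str]:
    """Glob for files dated today, plus the previous `days_back` days."""
    found = set()
    for n in range(days_back + 1):
        pattern = resolve_pattern(template, today - timedelta(days=n))
        found.update(glob.glob(pattern, recursive=True))
    return sorted(found)
```

For example, `resolve_pattern("/var/log/**/%Y-%m-%d*.log", date(2023, 7, 4))` yields `/var/log/**/2023-07-04*.log`, which is then an ordinary glob. The pattern would have to be re-resolved whenever the date changes.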
Coming back to this, I think that Vector could avoid keeping an open file handle to ignored files by:
@lukesteensen curious if you have thoughts.
Yeah, I think the best solution is likely to introduce another state to the file watcher where we still checkpoint the EOF but don't hold an active file handle. Right now it's basically all or nothing: we're either actively watching with an open handle or we ignore it entirely via exclude. Some kind of passive watching state for old/idle files would allow us to retain checkpoints in case the file receives future writes, but not take up a file handle and time polling for reads.
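As a rough illustration of that extra state (not Vector's actual file watcher, which is written in Rust), the sketch below keeps only an EOF checkpoint while a file is passive, opens a handle when the file grows, and drops the handle again once the file has been idle. The class name `WatchedFile` and the `idle_secs` parameter are hypothetical, and truncation, rotation, and fingerprint checks are deliberately left out.

```python
import os
import time
from typing import Optional


class WatchedFile:
    """A file that is either 'active' (open handle) or 'passive' (checkpoint only)."""

    def __init__(self, path: str, idle_secs: float) -> None:
        self.path = path
        self.idle_secs = idle_secs
        # Checkpoint at EOF: data already in the file is ignored.
        self.offset = os.path.getsize(path)
        self.handle = None  # no handle is held while passive

    @property
    def state(self) -> str:
        return "active" if self.handle is not None else "passive"

    def poll(self, now: Optional[float] = None) -> bytes:
        """Return newly appended bytes, switching state as needed."""
        now = time.time() if now is None else now
        st = os.stat(self.path)  # stat does not require an open handle
        if self.handle is None:
            if st.st_size <= self.offset:
                return b""  # nothing new: stay passive
            self.handle = open(self.path, "rb")
            self.handle.seek(self.offset)
        data = self.handle.read()
        self.offset += len(data)
        if not data and now - st.st_mtime > self.idle_secs:
            # Idle for too long: keep the checkpoint, release the handle.
            self.handle.close()
            self.handle = None
        return data
```

A passive file still costs one stat call per poll, but no file descriptor.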
Our workflow involves generating numerous files daily, with only a few requiring active updates. The necessity to manually update the "exclude" list to manage resources effectively has become a significant operational burden.
A feature that allows Vector to intelligently ignore files based on their modification date, without maintaining open file handles, would greatly alleviate our current struggles. This would optimize resource usage and reduce manual overhead in our workflow. We strongly believe that such a feature would benefit many users facing similar challenges and hope to see it prioritized in Vector's development roadmap.
Yeah, I think the best solution is likely to introduce another state to the file watcher where we still checkpoint the EOF but don't hold an active file handle. Right now it's basically all or nothing: we're either actively watching with an open handle or we ignore it entirely via exclude. Some kind of passive watching state for old/idle files would allow us to retain checkpoints in case the file receives future writes, but not take up a file handle and time polling for reads.
Something like this would work, you think? https://github.com/notify-rs/notify/blob/main/examples/async_monitor.rs
As @fitz123 noted in the previous comment above:
The necessity to manually update the "exclude" list to manage resources effectively has become a significant operational burden.
It's been nearly 4 months since I commented on this issue and unfortunately haven't received feedback. And we, unfortunately, are facing the same burden and this is becoming more pressing.
Is there near-term interest and/or initiative to improve this, perhaps in a way similar to what has been proposed in comment above by @lukesteensen ?
Some kind of passive watching state for old/idle files would allow us to retain checkpoints in case the file receives future writes, but not take up a file handle and time polling for reads.
What do you think of my proposition in my previous comment (notify-rs async monitor) as an improvement to the current file watcher implementation?
I think we'd be open to a pull request that implemented what I proposed in https://github.com/vectordotdev/vector/issues/3567#issuecomment-1686783252
As an enhancement we could also rely on inotify when available, but I don't think we can exclusively move to it due to lack of support on Windows (and other platforms?).
https://github.com/notify-rs/notify/ is a Cross-platform filesystem notification library for Rust, not only for Linux (inotify)
Currently:
Linux / Android: inotify
macOS: FSEvents or kqueue, see features
Windows: ReadDirectoryChangesW
iOS / FreeBSD / NetBSD / OpenBSD / DragonflyBSD: kqueue
All platforms: polling
See: https://github.com/notify-rs/notify/?tab=readme-ov-file#platforms
Ah! Good to know. In that case we may be open to moving to it exclusively. This is likely to be a bigger effort than the suggestion in https://github.com/vectordotdev/vector/issues/3567#issuecomment-1686783252 , but we'd be happy to review a PR introducing either approach.
I'm not sure if this is a bug or expected behavior, but it looks like, when using the ignore_older config for the file source, Vector still maintains an open file handle to the files with a modified time before the cutoff.
From gitter: https://gitter.im/timberio-vector/community?at=5f456c28c3aa024ef99e4907 . The user was trying to limit the number of open file handles by using the ignore_older config.
If it is expected, we should probably call it out in the docs as I was not expecting that. It seemingly limits its usefulness in avoiding resource consumption.
Vector Version
Vector Configuration File
Debug Output
Expected Behavior
Vector does not open the file
Actual Behavior
Vector opens the file
Additional Context