vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Should vector be maintaining open file handles to `ignore_older` files older than the cutoff? #3567

Open jszwedko opened 4 years ago

jszwedko commented 4 years ago

I'm not sure if this is a bug or expected behavior, but it looks like, when using the ignore_older config for the file source, Vector still maintains an open file handle to files with a modification time before the cutoff.

From gitter: https://gitter.im/timberio-vector/community?at=5f456c28c3aa024ef99e4907 . The user was trying to limit the number of open file handles by using the ignore_older config.

If it is expected, we should probably call it out in the docs, as I was not expecting it. It seemingly limits the option's usefulness for avoiding resource consumption.

Vector Version

vector 0.11.0 (g8b4ff32 x86_64-unknown-linux-gnu 2020-08-25)

Vector Configuration File

data_dir = "/tmp/vector"

[sources.in]
  type = "file" # required
  ignore_older = 10  # optional, no default, seconds
  include = ["/tmp/log/*.log"] # required

[sinks.http]
  type = "console"
  inputs = ["in"]
  encoding.codec = "json"

Debug Output

Aug 25 16:57:23.252  INFO vector: Log level "debug" is enabled.
Aug 25 16:57:23.256  INFO vector: Loading configs. path=["/tmp/test.toml"]
Aug 25 16:57:23.276  INFO vector::topology: Running healthchecks.
Aug 25 16:57:23.276  INFO vector::topology: Starting source "in"
Aug 25 16:57:23.277  INFO vector::topology::builder: Healthcheck: Passed.
Aug 25 16:57:23.277  INFO vector::topology: Starting sink "http"
Aug 25 16:57:23.277  INFO vector: Vector has started. version="0.11.0" git_version="v0.9.0-573-g8b4ff32" released="Tue, 25 Aug 2020 20:48:03 +0000" arch="x86_64"
Aug 25 16:57:23.277  INFO source{name=in type=file}: vector::sources::file: Starting file server. include=["/tmp/log/*.log"] exclude=[]
Aug 25 16:57:23.279  INFO source{name=in type=file}:file_server: vector::internal_events::file: found new file to watch. path="/tmp/log/a.log"
Aug 25 16:57:23.280 DEBUG source{name=in type=file}:file_server: vector::internal_events::file: files checkpointed. count=0
Aug 25 16:57:24.315 DEBUG source{name=in type=file}:file_server: vector::internal_events::file: files checkpointed. count=0
Aug 25 16:57:25.340 DEBUG source{name=in type=file}:file_server: vector::internal_events::file: files checkpointed. count=0
Aug 25 16:57:27.391 DEBUG source{name=in type=file}:file_server: vector::internal_events::file: files checkpointed. count=0
^CAug 25 16:57:28.368  INFO vector: Vector has stopped.
Aug 25 16:57:28.370  INFO vector::topology: Shutting down... Waiting on: in, http. 59 seconds left
Aug 25 16:57:28.370 DEBUG source{name=in type=file}: vector::topology::builder: Finished
Aug 25 16:57:28.370 DEBUG sink{name=http type=console}: vector::topology::builder: Finished

Expected Behavior

Vector does not open a file handle for files with a modification time older than the ignore_older cutoff.

Actual Behavior

Vector opens, and keeps holding, a file handle for the file even though it is older than the cutoff.

Additional Context

$ lsof -p 27049 | grep a.log
vector  27049 CORP\jesse   15r      REG              259,3    104581 178538005 /tmp/log/a.log
ktff commented 4 years ago

So the description of the ignore_older option is somewhat misleading/incomplete. Currently it is:

Ignore files with a data modification date that does not exceed this age.

which doesn't say that Vector will still collect newer data from the file. That's why it's tailing these files: it only ignores the old data, not the file itself. A clearer description would be:

Ignore existing data in files with a data modification date older than this age. Subsequent data will be collected.

This is the expected behavior, so just the documentation should be updated.

binarylogic commented 4 years ago

Hm, that's not how I thought the option worked. I think the behavior should change. The idea is to ignore older files as if they don't exist, not just older data. This ensures that Vector does not open file handles for those files, etc.

binarylogic commented 4 years ago

@ktff, we're working on an RFC to improve the file source in #3480. I think we should address this there.

lukesteensen commented 4 years ago

The idea is to ignore older files as if they don't exist, not just older data.

What would you want to happen with a file that initially has not been modified since ignore_older but then starts getting writes? If you want to start reading only the new data, we'd need to keep track of where that starts somehow. Right now we do it with the file cursor.

binarylogic commented 4 years ago

That's a good point. Ideally, we'd rediscover the file and read the new data only, but in the interim we would not hold a file handle for it.

jszwedko commented 4 years ago

That's a good point. Ideally, we'd rediscover the file and read the new data only, but in the interim we would not hold a file handle for it.

I think I like that, but it does seem like it would make it trickier to figure out which file contents were new. I guess you could keep track of the sizes of the ignored files and then seek to that point when their mtime changes? That wouldn't handle the case that the file is overwritten, rather than appended to, though.
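
To make the concern concrete, here is a minimal sketch (a hypothetical helper, not Vector code) of resuming from a stored size once the mtime changes. Note that an overwrite which leaves the file at least as large as the stored offset would still slip through, hence the need for a fingerprint check:

// Hypothetical sketch: resume reading an "ignored" file from a previously
// recorded size once its mtime changes. Truncation is detected by the file
// shrinking; an overwrite that grows the file is not.
use std::fs;
use std::io::{Read, Seek, SeekFrom};

fn read_new_data(path: &str, stored_size: u64) -> std::io::Result<(Vec<u8>, u64)> {
    let current_len = fs::metadata(path)?.len();
    let mut file = fs::File::open(path)?;
    // Appended: resume where we left off. Shrunk: assume the file was
    // rewritten and start over from the beginning.
    let start = if current_len >= stored_size { stored_size } else { 0 };
    file.seek(SeekFrom::Start(start))?;
    let mut buf = Vec::new();
    file.read_to_end(&mut buf)?;
    Ok((buf, start + buf.len() as u64))
}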

lukesteensen commented 4 years ago

I guess you could keep track of the sizes of the ignored files and then seek to that point when their mtime changes? That wouldn't handle the case that the file is overwritten, rather than appended to, though.

Yeah, you'd at least need to make sure the fingerprint was the same or weird things could happen.

Another confusing case would be if Vector is started, ignores a file for an old mtime, then Vector is stopped, then the file is written to, then Vector is started again. At that point, it would (depending on the config) read the whole file from the beginning, including data it had been ignoring to that point. Right now I think we avoid that via the normal checkpointing of open files.

ktff commented 4 years ago

I think the behavior should change

In that case, whatever solution we come up with to avoid holding file handles should be usable for other files as well, so we can reduce total file handle usage further. This way we can avoid some of the special-casing and use the normal checkpointing to avoid the issues @lukesteensen mentioned.

vbichov commented 3 years ago

Is there a workaround to that issue?

In my use case, there are many short-lived jobs that generate roughly 100,000 files per month, and due to business requirements we can't delete the files before then.
I'm looking for a way to limit the number of tailed files somehow (e.g. by looking at ctime). Is there any way to do that?

ktff commented 3 years ago

@vbichov

Is there a workaround to that issue?

One way is to point Vector at a folder of symlinks to the files, and then have two services/scripts: one that creates symlinks for new files and one that deletes the older ones.
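
For illustration, a rough sketch of such a script in Rust, assuming a Unix host and hypothetical directory names (an equivalent shell script or cron job would work just as well). Vector's include would then point at the link directory instead of the real log directory:

// Hypothetical helper: keep a directory of symlinks covering only
// recently modified files, so Vector only ever sees (and opens) those.
use std::fs;
use std::os::unix::fs::symlink;
use std::path::Path;
use std::time::{Duration, SystemTime};

fn refresh_links(src_dir: &Path, link_dir: &Path, max_age: Duration) -> std::io::Result<()> {
    let now = SystemTime::now();
    // Drop symlinks whose targets are now too old or have disappeared.
    for entry in fs::read_dir(link_dir)? {
        let link = entry?.path();
        let stale = fs::metadata(&link) // follows the symlink to the target
            .and_then(|m| m.modified())
            .map(|mtime| now.duration_since(mtime).unwrap_or_default() > max_age)
            .unwrap_or(true);
        if stale {
            fs::remove_file(&link)?; // removes the link, not the target
        }
    }
    // Create symlinks for recent files that aren't linked yet.
    for entry in fs::read_dir(src_dir)? {
        let file = entry?.path();
        let mtime = fs::metadata(&file)?.modified()?;
        let link = link_dir.join(file.file_name().unwrap());
        if now.duration_since(mtime).unwrap_or_default() <= max_age && !link.exists() {
            symlink(&file, &link)?;
        }
    }
    Ok(())
}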

AzimovZaur commented 1 year ago

Hi, are you planning to make changes so that this parameter forces Vector to ignore old files and not open a handle to them?

AzimovZaur commented 1 year ago

Hi, I made a change in /lib/file-source/src/file_watcher/mod.rs so that the FileWatcher::new() function returns a FileWatcher struct with is_dead: true when too_old is true. With this change the old files are not opened, and Vector works fine with directories containing a lot of old files. But there are 2 problems:

  1. If a record happens to be appended to an old file, the file is re-read from the beginning (the read_from = "end" option is ignored).
  2. Once a file has been put on the watcher, it is not removed from the watcher during the ignore_older_secs interval (or after what interval will it be removed from the watcher?)

If such functionality appeared in a future version it would be very good, since right now Vector cannot be used with directories containing a large number of old files because of the number of open file handles.
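
For reference, a heavily simplified sketch of the kind of change described (the real FileWatcher::new in lib/file-source takes more parameters; the field and argument names here are illustrative only): if the file is older than the cutoff, return a watcher that is already marked dead, so no handle is ever opened. The downside is exactly the two problems listed above, since no checkpoint exists to resume from.

use std::fs::File;
use std::path::PathBuf;
use std::time::SystemTime;

struct FileWatcher {
    path: PathBuf,
    handle: Option<File>,
    is_dead: bool,
}

impl FileWatcher {
    fn new(path: PathBuf, ignore_before: Option<SystemTime>) -> std::io::Result<Self> {
        let mtime = std::fs::metadata(&path)?.modified()?;
        let too_old = matches!(ignore_before, Some(cutoff) if mtime < cutoff);
        if too_old {
            // Too old: never open a handle. If the file is written to later,
            // there is no stored position, so it would be re-read from the start.
            return Ok(FileWatcher { path, handle: None, is_dead: true });
        }
        let handle = File::open(&path)?;
        Ok(FileWatcher { path, handle: Some(handle), is_dead: false })
    }
}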

ethack commented 1 year ago

I've been using a workaround where, periodically, I add patterns for old files to my exclude config so that it doesn't consider older files.

exclude = [
    # hack to reduce start-up time and file descriptor usage
    "**/2022-*",
    "**/2023-01-*",
    "**/2023-02-*",
    "**/2023-03-*",
    "**/2023-04-*",
    "**/2023-05-*",
    "**/2023-06-*",
]

This is a pain to keep up to date and it would be great if I only had to set ignore_older_secs.

jesseorr commented 1 year ago

It would be nice if this file sink syntax:

[sinks.my_sink_id]
type = "file"
inputs = [ "my-source-or-transform-id" ]
path = "/tmp/vector-%Y-%m-%d.log"

also worked for applying a date string variable to the file source, like this:

[sources.my_source_id]
type = "file"
include = [ "/var/log/**/%Y-%m-%d*.log" ]

This would let Vector include files with timestamps that are generated in real time by the remote applications, while ignoring the older files. It would look like an enhanced glob match where some variables are included and must be resolved before the glob string is applied.

This would solve @ethack's issue, as well as a number of other cases I've seen where datestamps are included in the active log file name and rotation is frequent, leading to many open file handles and heavy load from Vector watching these files despite tuning.
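
For illustration, a minimal sketch of the kind of expansion this would imply (a hypothetical helper; Vector does not support this today), assuming the strftime tokens are resolved against the current date before the glob is applied:

// Hypothetical: expand strftime-style tokens in an include pattern
// before handing it to the globber. Uses the chrono crate.
use chrono::Local;

fn expand_date_glob(pattern: &str) -> String {
    // chrono's format() understands %Y, %m, %d, etc.; other characters
    // (including the glob metacharacters) pass through untouched.
    Local::now().format(pattern).to_string()
}

fn main() {
    let include = "/var/log/**/%Y-%m-%d*.log";
    // Prints something like "/var/log/**/2024-05-17*.log".
    println!("{}", expand_date_glob(include));
}

In practice the pattern would have to be re-expanded on each glob scan so that the resolved date stays current across midnight.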

jszwedko commented 1 year ago

Coming back to this, I think that Vector could avoid keeping an open file handle to ignored files by:

@lukesteensen curious if you have thoughts.

lukesteensen commented 1 year ago

Yeah, I think the best solution is likely to introduce another state to the file watcher where we still checkpoint the EOF but don't hold an active file handle. Right now it's basically all or nothing: we're either actively watching with an open handle or we ignore it entirely via exclude. Some kind of passive watching state for old/idle files would allow us to retain checkpoints in case the file receives future writes, but not take up a file handle and time polling for reads.
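
For concreteness, a minimal sketch of that idea (type and field names are hypothetical, not Vector's actual file-source types): a watcher is either active, with an open handle, or passive, with only a remembered offset; reactivation reopens the file and seeks to the checkpoint instead of re-reading from the start.

use std::fs::File;
use std::io::{Seek, SeekFrom};
use std::path::PathBuf;

enum WatchState {
    // Actively tailing: open handle, polled for reads.
    Active { file: File, offset: u64 },
    // Passive: file looked old/idle; keep the checkpoint, hold no handle,
    // skip read polling.
    Passive { offset: u64 },
}

struct Watcher {
    path: PathBuf,
    state: WatchState,
}

impl Watcher {
    // Called when a passive file's mtime changes (assuming the fingerprint
    // still matches): reopen and resume from the remembered checkpoint.
    fn reactivate(&mut self) -> std::io::Result<()> {
        if let WatchState::Passive { offset } = self.state {
            let mut file = File::open(&self.path)?;
            file.seek(SeekFrom::Start(offset))?;
            self.state = WatchState::Active { file, offset };
        }
        Ok(())
    }
}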

fitz123 commented 8 months ago

Our workflow involves generating numerous files daily, with only a few requiring active updates. The necessity to manually update the "exclude" list to manage resources effectively has become a significant operational burden.

A feature that allows Vector to intelligently ignore files based on their modification date, without maintaining open file handles, would greatly alleviate our current struggles. This would optimize resource usage and reduce manual overhead in our workflow. We strongly believe that such a feature would benefit many users facing similar challenges and hope to see it prioritized in Vector's development roadmap.

tamer-hassan commented 5 months ago

Yeah, I think the best solution is likely to introduce another state to the file watcher where we still checkpoint the EOF but don't hold an active file handle. Right now it's basically all or nothing: we're either actively watching with an open handle or we ignore it entirely via exclude. Some kind of passive watching state for old/idle files would allow us to retain checkpoints in case the file receives future writes, but not take up a file handle and time polling for reads.

Do you think something like this would work? https://github.com/notify-rs/notify/blob/main/examples/async_monitor.rs
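
For reference, the core of what the linked example does, stripped down to the notify crate's synchronous README-style usage (the async_monitor example adds futures plumbing on top). How the events would feed Vector's checkpointing is left open:

// Event-driven watching with the notify crate: react to changes instead of
// polling every file, so an idle file costs neither an open handle nor a
// read poll.
use notify::{Event, RecursiveMode, Result, Watcher};
use std::path::Path;
use std::sync::mpsc;

fn main() -> Result<()> {
    let (tx, rx) = mpsc::channel::<Result<Event>>();

    // recommended_watcher picks the platform backend
    // (inotify, FSEvents, ReadDirectoryChangesW, kqueue, or polling).
    let mut watcher = notify::recommended_watcher(tx)?;
    watcher.watch(Path::new("/tmp/log"), RecursiveMode::Recursive)?;

    for event in rx {
        println!("change: {:?}", event?);
    }
    Ok(())
}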

tamer-hassan commented 1 month ago

As @fitz123 noted in a previous comment above:

The necessity to manually update the "exclude" list to manage resources effectively has become a significant operational burden.

It's been nearly 4 months since I commented on this issue and unfortunately I haven't received feedback. We are, unfortunately, facing the same burden, and it is becoming more pressing.

Is there near-term interest and/or initiative to improve this, perhaps along the lines of what @lukesteensen proposed in the comment above?

Some kind of passive watching state for old/idle files would allow us to retain checkpoints in case the file receives future writes, but not take up a file handle and time polling for reads.

What do you think of the proposal in my previous comment (the notify-rs async monitor) as an improvement to the current file watcher implementation?

jszwedko commented 1 month ago

I think we'd be open to a pull request that implemented what I proposed in https://github.com/vectordotdev/vector/issues/3567#issuecomment-1686783252

As an enhancement we could also rely on inotify when available, but I don't think we can exclusively move to it due to lack of support on Windows (and other platforms?).

tamer-hassan commented 1 month ago

As an enhancement we could also rely on inotify when available, but I don't think we can exclusively move to it due to lack of support on Windows (and other platforms?).

https://github.com/notify-rs/notify/ is a Cross-platform filesystem notification library for Rust, not only for Linux (inotify)

Currently:

Linux / Android: inotify
macOS: FSEvents or kqueue, see features
Windows: ReadDirectoryChangesW
iOS / FreeBSD / NetBSD / OpenBSD / DragonflyBSD: kqueue
All platforms: polling

See: https://github.com/notify-rs/notify/?tab=readme-ov-file#platforms

jszwedko commented 1 month ago

As an enhancement we could also rely on inotify when available, but I don't think we can exclusively move to it due to lack of support on Windows (and other platforms?).

https://github.com/notify-rs/notify/ is a Cross-platform filesystem notification library for Rust, not only for Linux (inotify)

Currently:

Linux / Android: inotify
macOS: FSEvents or kqueue, see features
Windows: ReadDirectoryChangesW
iOS / FreeBSD / NetBSD / OpenBSD / DragonflyBSD: kqueue
All platforms: polling

See: https://github.com/notify-rs/notify/?tab=readme-ov-file#platforms

Ah! Good to know. In that case we may be open to moving to it exclusively. This is likely to be a bigger effort than the suggestion in https://github.com/vectordotdev/vector/issues/3567#issuecomment-1686783252 , but we'd be happy to review a PR introducing either approach.