vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Vector keeps old Inodes, clogging up disk #11742

Closed MaxRink closed 1 year ago

MaxRink commented 2 years ago


Problem

We ran into an issue where Vector keeps old inodes open, filling up the disk:

```text
root@mdo-1-8vwrt:/home/a92615428# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda4        49G   49G     0 100% /
```

Inodes: https://gist.github.com/MaxRink/ee056e27a4b11a7b710e437e1f892984

After pkill-ing Vector, disk usage returned to normal:

```text
root@mdo-1-8vwrt:/home/a92615428# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda4        49G  8.5G   38G  19% /
```
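
If you hit the same symptom, one quick way to confirm the space is being held by deleted-but-still-open log files is to list Vector's open file descriptors. A minimal sketch; the `pgrep -x vector` PID lookup is an assumption about how Vector runs on the node:

```sh
# Files Vector still holds open even though they were rotated/deleted.
# The "(deleted)" suffix marks handles whose space df counts but du no longer sees.
lsof -p "$(pgrep -x vector)" | grep '(deleted)'

# Equivalent check without lsof, via procfs:
ls -l /proc/"$(pgrep -x vector)"/fd | grep '(deleted)'
```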

Configuration

```text
api:
  address: 0.0.0.0:8686
  enabled: true
  playground: false
data_dir: /var/lib/vector/
sinks:
  prometheus_metrics:
    address: 0.0.0.0:9090
    inputs:
    - internal_metrics
    type: prometheus_exporter
  vector_sink:
    address: vector-aggregator:9000
    inputs:
    - kubernetes_logs
    - annotate_hostname
    type: vector
    version: "2"
sources:
  internal_metrics:
    type: internal_metrics
  json_logfiles_var_log:
    include:
    - /var/log/kube-apiserver/kube-apiserver.log
    max_line_bytes: 1536000
    type: file
  kubernetes_logs:
    type: kubernetes_logs
  logfiles_var_log:
    exclude:
    - /var/log/pods/**/*.log
    - /var/log/containers/**/*.log
    - /var/log/kube-apiserver/kube-apiserver.log
    - /var/log/kube-apiserver/kube-apiserver*.log
    - /var/log/**/*.gz
    - /var/log/journal/**/*
    include:
    - /var/log/**/*.log
    - /var/log/private
    - /var/log/lastlog
    - /var/log/syslog
    - /var/log/btmp
    - /var/log/faillog
    - /var/log/wtmp
    - /var/log/dmesg
    max_line_bytes: 409600
    type: file
transforms:
  annotate_hostname:
    inputs:
    - logfiles*
    source: .host = "${VECTOR_SELF_NODE_NAME}"
    type: remap
  logfiles_json_decode:
    inputs:
    - json_logfiles_var_log
    source: .msg_decoded = parse_json!(.message)
    type: remap
```

Version

0.20.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

jszwedko commented 2 years ago

Hi @MaxRink !

Vector will hold onto deleted files until it has fully read them. Can you tell if that's what is happening here?

MaxRink commented 2 years ago

They got rotated (and after a few iterations deleted) by logrotate, or by the k8s internal apiserver logrotate.

jszwedko commented 2 years ago

@MaxRink makes sense, but do you know if Vector is still reading from them? Attaching strace should tell you.

MaxRink commented 2 years ago

I don't think it is.

I'm not seeing any output from `strace -p 1326978 -e trace=file` (where 1326978 is the Vector PID).

edit: I forgot to trace recursively; it seems all the I/O happens in child processes, but I'm not seeing access to the files above there either: https://gist.github.com/MaxRink/05dd37f6d3daf7b996821eb46b85ee29
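
For reference, a sketch of an strace invocation that follows child threads and processes (`-f`) and traces the read-related syscalls; the PID is the one from the comment above and the syscall selection is an assumption about what is worth watching here:

```sh
# Follow forks/threads (-f); Vector does its file I/O from worker threads,
# so tracing only the main PID shows little activity. Note that -e trace=file
# does not include read(), which is why reads may not have shown up earlier.
strace -f -p 1326978 -e trace=openat,read,close
```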

shomilj commented 1 year ago

We're also seeing this issue -- maybe it's because we're not reaching EOF for some reason? Not sure why that would be happening, though.

punkerpunker commented 1 year ago

Also faced this issue last night. Vector keeps file descriptors for rotated logs open; is there any known workaround for that?

jszwedko commented 1 year ago

https://github.com/vectordotdev/vector/pull/17603 may fix this.

jcantrill commented 1 year ago

👍 We have this issue reported for several large customers in https://issues.redhat.com/browse/LOG-3949. Fluentd implementations have a similar issue, but they also have a delay config parameter which allows them to not necessarily read rotated files in their entirety; that may be a useful feature here instead of always trying to read to EOF.

jszwedko commented 1 year ago

> 👍 We have this issue reported for several large customers in https://issues.redhat.com/browse/LOG-3949. Fluentd implementations have a similar issue, but they also have a delay config parameter which allows them to not necessarily read rotated files in their entirety; that may be a useful feature here instead of always trying to read to EOF.

The current implementation of the file source is expected to tail files until they are deleted. Once they are deleted and EOF has been read, the handle will be released. If users are hitting a condition under which Vector has read to EOF of deleted files and still hasn't released the file handle, that would be a bug. We haven't been able to reproduce that behavior yet, though, so we would love a reproduction case 🙂.
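
One way to check whether Vector has actually reached EOF on a deleted file is to compare a descriptor's read offset with the file's size via procfs. A rough sketch; the PID is the one from earlier in the thread and the fd number is an example you would pick from the `/proc/<pid>/fd` listing:

```sh
PID=1326978; FD=42                       # example values; choose an fd that shows "(deleted)"
grep '^pos:' /proc/$PID/fdinfo/$FD       # current read offset of that descriptor
stat -L -c %s /proc/$PID/fd/$FD          # size of the (deleted) file behind the fd
# If pos == size, Vector has read to EOF and should release the handle; if pos lags
# behind, it is still catching up, which matches the expected behavior described above.
```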

Having a parameter similar to Fluentd's rotate_wait would cause less pressure, but it would risk missing more logs if Vector hadn't finished reading the deleted file.
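
To make the trade-off concrete, this is roughly what such an opt-in option could look like on the file source. The option name and semantics here are hypothetical, mirroring Fluentd's rotate_wait rather than anything Vector ships as of this discussion:

```yaml
sources:
  logfiles_var_log:
    type: file
    include:
      - /var/log/**/*.log
    # Hypothetical knob: give up on a rotated/deleted file this many seconds after
    # rotation even if EOF was not reached, releasing the handle (and its disk space)
    # at the cost of possibly dropping the tail end of that file.
    rotate_wait_secs: 60
```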

jszwedko commented 1 year ago

https://github.com/vectordotdev/vector/pull/17603#issuecomment-1590141533 steps through how this logic of reading deleted files until EOF works in Vector.

jcantrill commented 1 year ago

@jszwedko we are using the kubernetes source. Does that matter?

jszwedko commented 1 year ago

> @jszwedko we are using the kubernetes source. Does that matter?

It shouldn't. The behavior is the same since they share the same underlying file tailing implementation.

akutta commented 1 year ago

https://github.com/vectordotdev/vector/issues/18088#issuecomment-1690491190

jcantrill commented 1 year ago

> Having a parameter similar to Fluentd's rotate_wait would cause less pressure, but it would risk missing more logs if Vector hadn't finished reading the deleted file.

@jszwedko I would like to revisit this, as I realize the ramification of adding such a parameter is log loss; we see it now with Fluentd. It's naive to think, however, that collector and host resources are infinite and that the collector is always able to keep up with the log volume. We routinely push the collector on OpenShift clusters to the point where Vector is unable to keep up with the load. As an admin, is it better for me to "never lose a log" or to continue to collect? A configuration point would be an "opt-in" choice where admins know the trade-offs. I suspect it could be instrumented as well, with a metric to identify when Vector didn't reach EOF (or something similar) against which to write an alert.
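
Since the example configuration in this issue already exposes internal metrics through a prometheus_exporter sink, such an alert could be wired up roughly like this. The metric name `vector_open_files` and the threshold are assumptions to be adapted to whatever gauge your Vector version actually exports:

```yaml
# Prometheus alerting rule sketch; names and values are assumptions, not an
# existing Vector metric contract.
groups:
  - name: vector-file-handles
    rules:
      - alert: VectorHoldingManyOpenFiles
        expr: vector_open_files > 500
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Vector is holding a large number of open file handles (possibly rotated/deleted files)"
```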

eddy1o2 commented 1 year ago

Hi team, I also faced this issue. Has anyone solved it yet? Basically,

My vector specs:

jszwedko commented 1 year ago

https://github.com/vectordotdev/vector/issues/11742#issuecomment-1675466517 is the expectation. In your case, it means that Vector is still reading from those deleted files. If you believe that not to be the case, we would definitely appreciate a reproduction case for this as we've tried a few times but haven't been able to reproduce behavior where Vector doesn't release the file handle once it gets to EOF for a deleted file.

eddy1o2 commented 1 year ago

I see, but in https://github.com/vectordotdev/vector/issues/11742#issuecomment-1675466517 you mentioned the file source, while we are using the kubernetes_logs source. Is this still expected?

eddy1o2 commented 1 year ago

Or is it related to the fix for https://github.com/vectordotdev/vector/issues/18088 for the kubernetes_logs source?

jszwedko commented 1 year ago

> I see, but in #11742 (comment) you mentioned the file source, while we are using the kubernetes_logs source. Is this still expected?

Ah, yes, they use the same underlying mechanisms for file reading so the behavior, in this respect, should be the same. Thanks for clarifying!

neuronull commented 1 year ago

After discussing this one with the team we came to the following conclusion:

benjaminhuo commented 6 months ago

cc @wanjunlei