josephmcasey opened this issue 2 years ago
We were able to resolve this by adding the following lines to the in_tail_container_logs source:

  follow_inodes true
  open_on_every_update true

We did this by projecting our own kubernetes.conf over the template in the reloader image.
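For reference, a sketch of what that override might look like. Everything outside the two added lines (path, pos_file, tag, and the parser) is an assumption about a typical containerd setup, not the exact contents of the image's kubernetes.conf template:

```
<source>
  @type tail
  @id in_tail_container_logs
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  # The two added lines: track files by inode across rotations, and
  # open/close the file on every update instead of holding it open,
  # so a rotated-away file is released rather than pinned on disk.
  follow_inodes true
  open_on_every_update true
  <parse>
    # placeholder; the real template uses a container-runtime log parser
    @type none
  </parse>
</source>
```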
Thanks for opening this issue @josephmcasey. We have actually seen this one ourselves; it's an interesting issue, and changing how it operates today has some compliance and security implications. I'm not sure what the previous behavior was, but "open_on_every_update" was not added until 0.14.12, and prior to 1.15 the fluentd version was 1.12, so that may explain that part.
It would be good to add an "opt-in" flag, but not make it the default. The default log rotation in Kubernetes for containerd is 10Mi per file and 5 files (https://github.com/kubernetes/kubernetes/pull/59898/files#diff-aa85fa10ff2032cc1aeeb608a71cc25ecbc85c357cc1e1d44b07ce2d46ab8555).
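Those defaults come from the kubelet's container log rotation settings; operators who need more headroom can raise them in the KubeletConfiguration (the values below are the upstream defaults):

```
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Per-container rotation: at most 5 files of 10Mi each (~50Mi total),
# which is why a little over 50Mi of rapid logging can outrun a tailer.
containerLogMaxSize: 10Mi
containerLogMaxFiles: 5
```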
If this is enabled by default and you are running a default containerd config, then I only need to generate a little over 50Mi of logs in a pod to cause fluentd to start dropping logs rather than sending them off. An attacker could easily use this to hide their tracks if it defaulted to enabled, so it should not be turned on without caution and an understanding that it will drop logs rather quickly under load.
The ideal solution would be to drop logs only when the volume is almost out of space, with configurable thresholds for when to start leveraging this feature, instead of having it on all the time. If there are short bursts of logs, the local disk should have space to buffer them until they can be sent off, and then this isn't a problem. The issue really only appears when there is an extreme amount of logs for an extended period, which is arguably a different issue entirely (and rather an issue for the application), but KFO should have some safeguards to try to prevent it from causing issues on the host.
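As a rough illustration of the kind of safeguard being suggested (not current KFO behavior), fluentd's file buffer already exposes thresholds that bound on-disk usage; the match pattern, output type, and sizes here are hypothetical:

```
<match kubernetes.**>
  @type forward
  # ... output configuration ...
  <buffer>
    @type file
    path /var/log/fluentd-buffers/kubernetes.buffer
    # Bound how much local disk the buffer may consume; short bursts
    # are absorbed up to this cap and flushed once downstream catches up.
    total_limit_size 2GB
    chunk_limit_size 8MB
    flush_interval 5s
    # Only when the cap is reached, drop the oldest chunk instead of
    # filling the node's disk.
    overflow_action drop_oldest_chunk
  </buffer>
</match>
```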
Describe the bug
When running versions of the operator >v1.15.1 under heavy logging load, deleted files are held open. Previous versions do not have this issue. This results in rising disk consumption until the node is full. Cycling the log-router pods releases all the deleted files and the space is reclaimed.
Reproduction steps
Using a workload that simply outputs a large text statement repeatedly (a stand-in manifest is sketched below) will cause escalating disk pressure.
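The manifest itself did not survive in this report; a minimal stand-in that matches the description (name and image are placeholders) would be something like:

```
apiVersion: v1
kind: Pod
metadata:
  name: log-flood   # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: writer
      image: busybox
      command: ["/bin/sh", "-c"]
      # Emit ~4KiB lines in a tight loop so the container log file
      # rotates quickly and sustains heavy logging load.
      args:
        - |
          while true; do
            head -c 4096 /dev/zero | tr '\0' 'x'
            echo
          done
```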
Running lsof +L1 on a node shows the open files. The logs of the fluentd container:

Expected behavior
The deleted log files should be released upon rotation.
Additional context
No response