We have used https://github.com/solarwinds/fluentd-deployment in the past, but encountered similar issues where logging stopped after some period of time, and switched to https://github.com/fluent/fluentd-kubernetes-daemonset thinking that might solve the issue or at least narrow down the culprit.
I don't think we gathered as much data about the issue at the time, so I don't know if we were running into the same issue in that configuration.
@gws thanks for your interest in this plugin -- we'll take a look.
@cmrust do you have any ideas here?
@cmrust is it possible that what we're missing here is a code path to flush any existing buffer upon reacquisition of a working socket? I'm not familiar enough w/ the Ruby env these run in to know what might be available to the plugin API to force this. I'd assume we need to flush at this point, so we'd want to alter that lazy assignment to do a real nil check and then flush the buffer prior to writing the new line, so that we preserve order.
Hey @gws, thanks for the issue. To restate, and please correct me if I misunderstand: upon killing and recreating the fluentd pod, the aforementioned missing logs (post connection reset error) are finally found and shipped to PT by the new pod.
My experience is that fluentd can become finicky when faced with errors and, as you've seen, just stop making progress altogether. The community response to this, as far as I've seen, is to implement a liveness probe based on the age of the file buffers.
We've been internally using something like this (within the DaemonSet config) for the past few months with success:
livenessProbe:
  initialDelaySeconds: 600
  periodSeconds: 60
  exec:
    command:
    - '/bin/sh'
    - '-c'
    - >
      LIVENESS_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-300};
      if [ ! -d /var/log/fluentd-buffers-papertrail ];
      then
        exit 1;
      fi;
      touch -d "${LIVENESS_THRESHOLD_SECONDS} seconds ago" /tmp/marker-liveness;
      if [[ -z "$(find /var/log/fluentd-buffers-papertrail -type f -newer /tmp/marker-liveness -print -quit)" ]];
      then
        rm -rf /var/log/fluentd-buffers-papertrail;
        exit 1;
      fi;
In this case, the buffers are checked every minute and if they haven't been touched in 5 minutes they are trashed and the pods are recycled. Our match stanza, for reference, looks like this:
# Capture all unmatched tags
<match **>
  @type papertrail
  num_threads 4
  buffer_type file
  buffer_path /var/log/fluentd-buffers-papertrail/fluentd-buffer
  papertrail_host "#{ENV['FLUENT_PAPERTRAIL_HOST']}"
  papertrail_port "#{ENV['FLUENT_PAPERTRAIL_PORT']}"
  discard_unannotated_pod_logs true
</match>
I think this could help resolve your issue. If so, please let us know and I'll open an issue on fluentd-deployment instead so we can share the Liveness Probe as part of the recommended K8s configuration.
@trevrosen, I don't think that's necessary here. In the code, whenever we raise an error, fluentd maintains itself by catching and putting that failed event back on the buffer to be retried.
Then, lazily assigning the socket here should be okay, I believe, because we're manually nil'ing it each time we receive these errors. Let me know if that doesn't jibe or if I'm missing something.
@cmrust You've got it, and that liveness probe looks like a very reasonable workaround for this case - I was just considering a few different ways to handle this and you've saved me some effort!
Thanks for the quick response along with those snippets - I'm satisfied with implementing liveness probes for our use cases, but I'll leave it up to you if you'd like to close this issue or leave it open for further investigation. It doesn't sound like the plugin is necessarily "at fault" here.
Either way, I'll report back in this thread when we've had some time to test the probes.
@gws that would be much appreciated -- thanks!
Our CI system logs very infrequently, so the probe as written kills the fluentd pods on that system due to a lack of buffers (no files at all in the buffer directory). We needed to add a guard that checks whether any buffer files are present before testing whether they are older than the marker, but the probe seems to work well for systems that log regularly.
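Roughly, the adjusted command looks like the following sketch (not our exact probe, but the shape of the guard): the probe passes when the buffer directory has no files at all, and only recycles the pod when buffer files exist and none has been touched within the threshold.
  LIVENESS_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-300};
  if [ ! -d /var/log/fluentd-buffers-papertrail ];
  then
    exit 1;
  fi;
  # Guard: an idle node may have no buffer files at all; treat that as healthy
  # rather than interpreting the silence as a hung fluentd process
  if [ -z "$(find /var/log/fluentd-buffers-papertrail -type f -print -quit)" ];
  then
    exit 0;
  fi;
  touch -d "${LIVENESS_THRESHOLD_SECONDS} seconds ago" /tmp/marker-liveness;
  # Fail (and clear the buffers) only when files exist but none is newer than the marker
  if [ -z "$(find /var/log/fluentd-buffers-papertrail -type f -newer /tmp/marker-liveness -print -quit)" ];
  then
    rm -rf /var/log/fluentd-buffers-papertrail;
    exit 1;
  fi;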
I'm not sure how your system is configured, @gws, but I believe Fluentd can expose a metrics endpoint that Prometheus can scrape. If you're so inclined, this may be another route that lets you programmatically detect that Fluentd has become unresponsive and take action.
I'm hoping to find time to dig into the Fluentd metrics system soon and generate some custom metrics from this plugin (how many concurrent connections we have open with PT, number of lines discarded, number of connection reset errors, SSL bad write errors, etc.).
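To illustrate, here is a minimal sketch of that route, assuming the fluent-plugin-prometheus gem is available in your image (the port and interval shown are just common defaults, not anything specific to this plugin):
  # Expose a /metrics endpoint that Prometheus can scrape
  <source>
    @type prometheus
    bind 0.0.0.0
    port 24231
    metrics_path /metrics
  </source>
  # Periodically export per-output metrics such as buffer queue length and
  # retry counts, which can drive alerts when an output stops making progress
  <source>
    @type prometheus_output_monitor
    interval 10
  </source>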
Hi folks - first off, thanks for maintaining this plugin! :+1:
Occasionally (every few days or weeks), our Papertrail pods, running in Kubernetes as a DaemonSet, stop logging after a connection reset.
It appears that the plugin itself recovers from the error and continues logging, but the exception never reaches Papertrail and neither do any of the subsequent logs. Interestingly, if we kill the misbehaving pod, the graceful shutdown causes it to flush its buffer and we see all of the logs it was keeping to itself.
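For reference, this is roughly how we confirm and recover from the situation by hand today (the namespace and pod name below are placeholders, not our real ones):
  # The pod's own output still shows activity even though nothing reaches PT
  kubectl -n kube-system logs fluentd-papertrail-abcde
  # Deleting the pod triggers a graceful shutdown, which flushes the buffer;
  # the DaemonSet controller then recreates the pod
  kubectl -n kube-system delete pod fluentd-papertrail-abcde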
Here is an example with the stacktrace (PT endpoints redacted to logs.papertrailapp.com:12345):
You can see the reinitialization of the TCP socket, after which logs continue normally. However, we are missing (in the PT interface) a line or two before the stacktrace, the stacktrace itself, and everything after it. The logs above were obtained directly from the pod via kubectl logs after somebody noticed some missing logs and we tracked down the responsible pod.
Our fluent.conf:
All of the other configuration for the plugin is standard, shipped with fluentd-kubernetes-daemonset.
Versions:
Kubernetes: 1.11.6
Fluentd: 1.3.2 (using https://github.com/fluent/fluentd-kubernetes-daemonset)
fluent-plugin-papertrail: 0.2.6
We're open to any suggestions if there is something we can do to work around the issue as well, or something we have misconfigured!