solarwinds / fluent-plugin-papertrail

Fluentd integration with Papertrail.
Apache License 2.0

Plugin stops sending logs to endpoint after Fluent::Papertrail::SocketFailureError #27

Closed gws closed 5 years ago

gws commented 5 years ago

Hi folks - first off, thanks for maintaining this plugin! :+1:

Occasionally (every few days or weeks), our Papertrail pods, running in Kubernetes as a DaemonSet, stop logging after a connection reset.

It appears that the plugin itself recovers from the error and continues logging, but the exception is not logged to Papertrail and all subsequent logs are missing from Papertrail. Interestingly, if we do kill the misbehaving pod, the graceful shutdown causes it to flush its buffer and we see all of the logs it was keeping to itself.

Here is an example with the stacktrace (PT endpoints redacted to logs.papertrailapp.com:12345):

2019-01-25 03:35:59 +0000 [info]: #0 [filter_kube_metadata] stats - namespace_cache_size: 1, pod_cache_size: 4, namespace_cache_api_updates: 1256, pod_cache_api_updates: 1259, id_cache_miss: 1256
2019-01-25 03:36:29 +0000 [info]: #0 [filter_kube_metadata] stats - namespace_cache_size: 1, pod_cache_size: 4, namespace_cache_api_updates: 1256, pod_cache_api_updates: 1259, id_cache_miss: 1256
2019-01-25 03:36:59 +0000 [info]: #0 [filter_kube_metadata] stats - namespace_cache_size: 1, pod_cache_size: 4, namespace_cache_api_updates: 1256, pod_cache_api_updates: 1259, id_cache_miss: 1256
2019-01-25 03:37:27 +0000 [warn]: #0 [out_papertrail] failed to flush the buffer. retry_time=0 next_retry_seconds=2019-01-25 03:37:28 +0000 chunk="580400bbe313472fef6c2f9cc64d6193" error_class=Fluent::Papertrail::SocketFailureError error="Error writing to logs.papertrailapp.com:12345: Connection reset by peer"
  2019-01-25 03:37:27 +0000 [warn]: #0 /usr/lib/ruby/2.3.0/openssl/buffering.rb:322:in `syswrite'
  2019-01-25 03:37:27 +0000 [warn]: #0 /usr/lib/ruby/2.3.0/openssl/buffering.rb:322:in `do_write'
  2019-01-25 03:37:27 +0000 [warn]: #0 /usr/lib/ruby/2.3.0/openssl/buffering.rb:407:in `puts'
  2019-01-25 03:37:27 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluent-plugin-papertrail-0.2.6/lib/fluent/plugin/out_papertrail.rb:119:in `send_to_papertrail'
  2019-01-25 03:37:27 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluent-plugin-papertrail-0.2.6/lib/fluent/plugin/out_papertrail.rb:45:in `block in write'
  2019-01-25 03:37:27 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluentd-1.3.2/lib/fluent/event.rb:323:in `each'
  2019-01-25 03:37:27 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluentd-1.3.2/lib/fluent/event.rb:323:in `block in each'
  2019-01-25 03:37:27 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluentd-1.3.2/lib/fluent/plugin/buffer/file_chunk.rb:172:in `open'
  2019-01-25 03:37:27 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluentd-1.3.2/lib/fluent/event.rb:322:in `each'
  2019-01-25 03:37:27 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluent-plugin-papertrail-0.2.6/lib/fluent/plugin/out_papertrail.rb:39:in `write'
  2019-01-25 03:37:27 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluentd-1.3.2/lib/fluent/compat/output.rb:131:in `write'
  2019-01-25 03:37:27 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluentd-1.3.2/lib/fluent/plugin/output.rb:1123:in `try_flush'
  2019-01-25 03:37:27 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluentd-1.3.2/lib/fluent/plugin/output.rb:1423:in `flush_thread_run'
  2019-01-25 03:37:27 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluentd-1.3.2/lib/fluent/plugin/output.rb:452:in `block (2 levels) in start'
  2019-01-25 03:37:27 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluentd-1.3.2/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2019-01-25 03:37:28 +0000 [info]: #0 [out_papertrail] initializing tcp socket for logs.papertrailapp.com:12345
2019-01-25 03:37:29 +0000 [info]: #0 [filter_kube_metadata] stats - namespace_cache_size: 1, pod_cache_size: 4, namespace_cache_api_updates: 1256, pod_cache_api_updates: 1259, id_cache_miss: 1256
2019-01-25 03:37:59 +0000 [info]: #0 [filter_kube_metadata] stats - namespace_cache_size: 1, pod_cache_size: 4, namespace_cache_api_updates: 1256, pod_cache_api_updates: 1259, id_cache_miss: 1256
2019-01-25 03:38:29 +0000 [info]: #0 [filter_kube_metadata] stats - namespace_cache_size: 1, pod_cache_size: 3, namespace_cache_api_updates: 1256, pod_cache_api_updates: 1259, id_cache_miss: 1256

You can see the reinitialization of the TCP socket, after which logs continue normally. However, in the PT interface we are missing a line or two before the stacktrace, the stacktrace itself, and everything after it. The logs above were obtained directly from the pod via kubectl logs after somebody noticed some missing logs and we tracked down the responsible pod.

Our fluent.conf:

    ### Based on https://github.com/fluent/fluentd-kubernetes-daemonset

    @include systemd.conf
    @include kubernetes.conf
    @include conf.d/*.conf

    ## Capture audit logs
    ##<match kube-apiserver-audit>
    ##  @type papertrail
    ##
    ##  papertrail_host "#{ENV['FLUENT_PAPERTRAIL_AUDIT_HOST']}"
    ##  papertrail_port "#{ENV['FLUENT_PAPERTRAIL_AUDIT_PORT']}"
    ##</match>

    <match **>
      @type papertrail
      @id out_papertrail

      papertrail_host "#{ENV['FLUENT_PAPERTRAIL_HOST']}"
      papertrail_port "#{ENV['FLUENT_PAPERTRAIL_PORT']}"

      buffer_type file
      buffer_path /var/log/fluentd-buffer-papertrail
    </match>

All of the other configuration for the plugin is standard, shipped with fluentd-kubernetes-daemonset.

Versions:

Kubernetes: 1.11.6
Fluentd: 1.3.2 (using https://github.com/fluent/fluentd-kubernetes-daemonset)
fluent-plugin-papertrail: 0.2.6

We're open to any suggestions if there is something we can do to work around the issue as well, or something we have misconfigured!

gws commented 5 years ago

We have used https://github.com/solarwinds/fluentd-deployment in the past, but encountered similar issues where logging stopped after some period of time, and switched to https://github.com/fluent/fluentd-kubernetes-daemonset thinking that might solve the issue or at least narrow down the culprit.

I don't think we gathered as much data about the issue at the time, so I don't know if we were running into the same issue in that configuration.

trevrosen commented 5 years ago

@gws thanks for your interest in this plugin -- we'll take a look.

@cmrust do you have any ideas here?

trevrosen commented 5 years ago

@cmrust is it possible that what we're missing here is a code path to flush any existing buffer upon reacquisition of a working socket? I'm not familiar enough with the Ruby environment these run in to know what the plugin API makes available to force this. I'd assume we need to flush at that point, so we'd want to alter that lazy assignment to do a real nil check and then flush the buffer prior to writing the new line, so that we preserve order.

cmrust commented 5 years ago

Hey @gws, thanks for the issue. To restate, and please correct me if I misunderstand: upon killing and recreating the fluentd pod, the aforementioned missing logs (post connection reset error) are finally found and shipped to PT by the new pod.

My experience is that fluentd can become finicky when faced with errors and, like you've seen, just stop progressing altogether. The community response to this, as far as I've seen, is to implement a liveness probe based on the age of the file buffers.

We've been internally using something like this (within the DaemonSet config) for the past few months with success:

livenessProbe:
  initialDelaySeconds: 600
  periodSeconds: 60
  exec:
    command:
    - '/bin/sh'
    - '-c'
    - >
      LIVENESS_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-300};
      if [ ! -d /var/log/fluentd-buffers-papertrail ];
      then
        exit 1;
      fi;
      touch -d "${LIVENESS_THRESHOLD_SECONDS} seconds ago" /tmp/marker-liveness;
      if [ -z "$(find /var/log/fluentd-buffers-papertrail -type f -newer /tmp/marker-liveness -print -quit)" ];
      then
        rm -rf /var/log/fluentd-buffers-papertrail;
        exit 1;
      fi;

In this case, the buffers are checked every minute and, if they haven't been touched in 5 minutes, they are trashed and the pod is recycled. Our match stanza, for reference, looks like this:

# Capture all unmatched tags
<match **>
  @type papertrail
  num_threads 4
  buffer_type    file
  buffer_path    /var/log/fluentd-buffers-papertrail/fluentd-buffer
  papertrail_host "#{ENV['FLUENT_PAPERTRAIL_HOST']}"
  papertrail_port "#{ENV['FLUENT_PAPERTRAIL_PORT']}"
  discard_unannotated_pod_logs true
</match>

I think this could help resolve your issue. If so, please let us know and I'll open an issue on fluentd-deployment instead so we can share the Liveness Probe as part of the recommended K8s configuration.

cmrust commented 5 years ago

@trevrosen, I don't think that's necessary here. In the code, whenever we raise an error, fluentd maintains itself by catching and putting that failed event back on the buffer to be retried.

Then, lazily assigning the socket here should be okay, I believe, because we're manually nil'ing it each time we receive these errors. Let me know if that doesn't jibe or if I'm missing something.
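
To spell out the pattern, here's a rough sketch of what I mean (illustrative names only, not the plugin's exact source): the socket is created lazily, nil'ed whenever a write fails, and the re-raised error is what tells fluentd to put the chunk back on the buffer and retry it.

# Rough sketch of the error-handling pattern -- names are illustrative.
def send_to_papertrail(record, socket_key)
  # Lazy assignment: only create a socket if we don't already have one
  # (it gets set back to nil whenever a write fails).
  @sockets[socket_key] ||= create_socket(socket_key)

  begin
    @sockets[socket_key].puts record.to_s
  rescue => e
    # Drop the broken socket so the next attempt re-initializes it, then re-raise
    # so fluentd puts the chunk back on the buffer and retries it later.
    @sockets[socket_key] = nil
    raise Fluent::Papertrail::SocketFailureError,
          "Error writing to #{socket_key}: #{e}",
          e.backtrace
  end
end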

gws commented 5 years ago

@cmrust You've got it, and that liveness probe looks like a very reasonable workaround for this case - I was just considering a few different ways to handle this and you've saved me some effort!

Thanks for the quick response along with those snippets - I'm satisfied with implementing liveness probes for our use cases, but I'll leave it up to you if you'd like to close this issue or leave it open for further investigation. It doesn't sound like the plugin is necessarily "at fault" here.

gws commented 5 years ago

Either way, I'll report back in this thread when we've had some time to test the probes.

trevrosen commented 5 years ago

@gws that would be much appreciated -- thanks!

gws commented 5 years ago

Our CI system logs very infrequently, so the probe as written kills the fluentd pods associated with that system: with no files at all in the buffer directory, the age check always fails. We needed to add a guard that first checks whether any buffer files exist before testing whether they are older than the marker. For systems that log regularly, though, the probe seems to work as-is.
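
For anyone else hitting this, the adjusted probe looks roughly like the following (a sketch: paths follow the example earlier in this thread, so adjust them to match your buffer_path; the only real change is the extra "are there any buffer files at all" guard):

livenessProbe:
  initialDelaySeconds: 600
  periodSeconds: 60
  exec:
    command:
    - '/bin/sh'
    - '-c'
    - |
      LIVENESS_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-300}
      if [ ! -d /var/log/fluentd-buffers-papertrail ]; then
        exit 1
      fi
      # Guard: if there are no buffer files at all (e.g. a node that rarely logs),
      # treat the pod as healthy instead of recycling it.
      if [ -z "$(find /var/log/fluentd-buffers-papertrail -type f -print -quit)" ]; then
        exit 0
      fi
      touch -d "${LIVENESS_THRESHOLD_SECONDS} seconds ago" /tmp/marker-liveness
      # Original age check: recycle the pod if nothing has been written recently.
      if [ -z "$(find /var/log/fluentd-buffers-papertrail -type f -newer /tmp/marker-liveness -print -quit)" ]; then
        rm -rf /var/log/fluentd-buffers-papertrail
        exit 1
      fi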

cmrust commented 5 years ago

I'm not sure how your system is configured @gws, but I believe Fluentd is capable of exposing a metrics endpoint that can be consumed by Prometheus. If you're inclined, this may be another route that lets you programmatically detect that Fluentd has become unresponsive and take action.
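
For example, something along these lines should work, assuming the fluent-plugin-prometheus gem is available in your fluentd image (the port and path below are that plugin's defaults):

# Expose a /metrics endpoint for Prometheus to scrape.
<source>
  @type prometheus
  bind 0.0.0.0
  port 24231
  metrics_path /metrics
</source>

# Export fluentd's internal plugin metrics (buffer queue length, retry counts, etc.).
<source>
  @type prometheus_monitor
</source>
<source>
  @type prometheus_output_monitor
</source>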

I'm hoping to find time to dig into the Fluentd metrics system soon and generate some custom metrics from this plugin (how many concurrent connections we have open with PT, number of lines discarded, number of connection reset errors, SSL bad write errors, etc.).