uken / fluent-plugin-elasticsearch

Apache License 2.0

ECONNREFUSED due to connecting to the wrong port after a while #1001

Open PScharrenberg opened 1 year ago

PScharrenberg commented 1 year ago


fluent-plugin-elasticsearch successfully pushes logs to our Elasticsearch server, which sits behind an SSL-offloading nginx proxy listening on port 443. After a while (a few hours), logs stop being transferred and we find this warning message in the fluentd logs (where X.X.X.X is the correct IP address of our ES server):

2023-02-21 11:07:15 +0000 [warn]: #0 [clusterflow:flow] failed to flush the buffer. retry_times=12 next_retry_time=2023-02-21 12:10:46 +0000 chunk="5f52bba4e6c17284274d9814840cea63" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch-fqdn\", :port=>443, :scheme=>\"https\", :user=>\"logging\", :password=>\"obfuscated\"}): Connection refused - connect(2) for X.X.X.X:9200 (Errno::ECONNREFUSED)"

So after a while it tries to connect to the Elasticsearch server directly on port 9200, bypassing the proxy on port 443, which obviously does not work.

After restarting fluentd inside the k8s pod (fluent-ctl restart), the logs are shipped again.

Steps to replicate

The relevant config part in fluentd.conf:

  <match **>
    @type elasticsearch
    @id clusterflow:flow
    exception_backup true
    fail_on_putting_template_retry_exceed true
    host elasticsearch-fqdn
    logstash_dateformat %Y-%m-%d
    logstash_format true
    logstash_prefix logging
    password xxxxxxxxxx
    port 443
    reload_connections true
    scheme https
    ssl_verify true
    user logging
    utc_index true
    verify_es_version_at_startup true
    <buffer tag,time>
      @type file
      chunk_limit_size 8MB
      path /buffers/clusterflow:flow.*.buffer
      retry_forever true
      timekey 10m
      timekey_wait 1m
    </buffer>
  </match>

Expected Behavior or What you need to ask

We expect it to continue connecting to the configured port.

Using Fluentd and ES plugin versions

We're using the rancher-logging "app" provided by Rancher (rancher-logging:100.1.3+up3.17.7). We're seeing this issue after upgrading from an older version.

cosmo0920 commented 1 year ago

This could be caused by the Elasticsearch sniffing feature.

For how to enable this feature, see:

GiZZoR commented 1 year ago

You probably hit this time bomb someone left for you: This causes the activation of the sniffer. Yes, a sniffer that hunts out the nodes in your ES cluster and then bypasses the configuration you explicitly set, thereby voiding any load balancing you may have configured. Bonus feature: it uses the scheme from the config you supplied to hit the host and port it finds in the nodes catalog.

I'd recommend reload_connections false, as the sniffer just shouldn't be needed in any properly configured environment. You'd either correctly configure the hosts it uses, or use a load balancer.
This "feature" should only be enabled if explicitly needed, which should be never.
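For reference, a minimal sketch of what that change could look like, reusing the host, port and scheme from the config posted above. It is untested against the reporter's setup, it omits the credentials and the buffer section for brevity, and it assumes no other option re-enables node discovery:

```conf
<match **>
  @type elasticsearch
  host elasticsearch-fqdn
  port 443
  scheme https
  # Turn off the sniffer so the plugin keeps talking to the configured
  # proxy endpoint instead of the node addresses it discovers (X.X.X.X:9200).
  reload_connections false
</match>
```

With this in place the plugin should only ever use elasticsearch-fqdn:443, so the nginx proxy stays in the path.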

IMHO the sniffer should exist as an optional plugin, and should be promptly removed/disabled.