uken / fluent-plugin-elasticsearch


ES plugin keeps Pod IP instead of FQDN as node information #889

Open kannanvr opened 3 years ago

kannanvr commented 3 years ago

Problem

We have deployed Fluentd with Elasticsearch. Fluentd cannot send logs to Elasticsearch after Elasticsearch is restarted. The following are the logs:

2021-06-03 17:23:46 +0000 [warn]: #0 [elasticsearch] failed to flush the buffer. retry_time=230 next_retry_seconds=2021-06-03 17:23:51 +0000 chunk="5c3da63331dcd9ec72c8bed8230b9804" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"central-es.tcl-logging.svc.cluster.local\", :port=>80, :scheme=>\"http\"}): Connection timed out - connect(2) for 10.233.68.37:9200 (Errno::ETIMEDOUT)"
  2021-06-03 17:23:46 +0000 [warn]: #0 suppressed same stacktrace
2021-06-03 17:23:46 +0000 [warn]: #0 [elasticsearch] failed to flush the buffer. retry_time=230 next_retry_seconds=2021-06-03 17:23:50 +0000 chunk="5c3da63bc726b4093fcecd7b06d17c29" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"central-es.tcl-logging.svc.cluster.local\", :port=>80, :scheme=>\"http\"}): Connection timed out - connect(2) for 10.233.68.37:9200 (Errno::ETIMEDOUT)"

This happens when the Elasticsearch pod is restarted about 12 hours after the Fluentd deployment. Instead of the individual Elasticsearch pod IPs, we have configured the service: we configured the service FQDN for Fluentd to send data to Elasticsearch. But it is trying to send via the pod IP. We don't want Fluentd to send data via the pod IP directly; it should send via the service we have configured. How can we fix this issue?

Steps to replicate

Deploy Fluentd with Elasticsearch as a pod. After 12 hours, restart Elasticsearch. Fluentd can no longer send logs to Elasticsearch because it is still trying to send to the pod IP rather than the service IP.

Configuration

    <match **>
      @id elasticsearch
      @type elasticsearch
      @log_level info
      include_tag_key true
      host central-es.tcl-logging.svc.cluster.local
      port 80
      logstash_prefix process_logs
      logstash_format true
      suppress_type_name true
      request_timeout 2000
      <buffer>
        @type file
        path /var/log/fluentd-buffers/kubernetes.system.buffer
        flush_mode interval
        retry_type exponential_backoff
        flush_thread_count 8
        flush_interval 8s
        retry_forever true
        retry_max_interval 5
        chunk_limit_size 8M
        queue_limit_length 10
        reconnect_on_error true
        reload_on_failure false
        reload_connections false
      </buffer>
    </match>
    <match **>
      @id elasticsearch
      @type elasticsearch
      @log_level info
      include_tag_key true
      host central-es.tcl-logging.svc.cluster.local
      port 80
      logstash_prefix process_logs
      logstash_format true
      suppress_type_name true
      request_timeout 2000
      <buffer>
        @type file
        path /var/log/fluentd-buffers/kubernetes.system.buffer
        flush_mode interval
        retry_type exponential_backoff
        flush_thread_count 8
        flush_interval 8s
        retry_forever true
        retry_max_interval 5
        chunk_limit_size 8M
        queue_limit_length 10
        reconnect_on_error true
        reload_on_failure true
        reload_connections true
      </buffer>
    </match>

Expected Behavior or What you need to ask

Fluentd should send the data to the Elasticsearch service IP rather than the pod IP.

Using Fluentd and ES plugin versions

cosmo0920 commented 3 years ago

How about setting reload_on_failure to true? Your current settings just keep retrying on error and never remove the dead node from the node information.

        reconnect_on_error true
        reload_on_failure true
        reload_connections false

Could you also read the following link? I think it describes what you want.

ref: https://github.com/uken/fluent-plugin-elasticsearch/blob/84e67c94bd88dd384ccf8e070d54bc479f3b2f92/README.ElasticsearchInput.md#reload_on_failure
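
For reference, the plugin README documents reconnect_on_error, reload_on_failure, and reload_connections as parameters of the elasticsearch output itself rather than of the <buffer> section, so they would normally sit at the top level of the <match> block. A minimal sketch of that placement (host and prefix reused from the configuration above, buffer settings trimmed for brevity):

    <match **>
      @type elasticsearch
      host central-es.tcl-logging.svc.cluster.local
      port 80
      logstash_format true
      logstash_prefix process_logs
      # connection handling belongs to the output plugin, not the buffer
      reconnect_on_error true
      reload_on_failure true
      reload_connections false
      <buffer>
        @type file
        path /var/log/fluentd-buffers/kubernetes.system.buffer
        flush_interval 8s
      </buffer>
    </match>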

kannanvr commented 3 years ago

Thanks @cosmo0920. We are testing out these options. Also, just curious to understand: why is Fluentd trying to send data to the Elasticsearch pod IP directly after some 10 to 12 hours? Is there an option to tell Fluentd not to send to the pod IP directly when the service IP is configured?

Thanks, Kannan V

cosmo0920 commented 3 years ago

why is Fluentd trying to send data to the Elasticsearch pod IP directly after some 10 to 12 hours?

This is not Fluentd core functionality. It comes from a dependency of the ES plugin, the elasticsearch gem. By default, this gem manages Elasticsearch nodes as a list of IPs.

Is there an option to tell Fluentd not to send to the pod IP directly when the service IP is configured?

The Elasticsearch plugin doesn't touch this mechanism, which comes from the elasticsearch gem. It depends on your Elasticsearch cluster settings and the elasticsearch gem's functionality.
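
To illustrate the mechanism, here is a minimal sketch using the elasticsearch-ruby client directly, with illustrative values taken from the configuration above (this is not the plugin's own code): when connection reloading is enabled, the transport sniffs the cluster through the nodes API and rebuilds its connection list from the addresses the cluster publishes, which in Kubernetes are the Pod IPs.

    require "elasticsearch"

    # Sketch only: the plugin configures a client roughly like this under the hood.
    client = Elasticsearch::Client.new(
      host: "central-es.tcl-logging.svc.cluster.local",
      port: 80,
      reload_connections: true,  # periodically re-sniff the node list
      reload_on_failure: true    # re-sniff after a failed request
    )

    # After a reload, the connection list is rebuilt from the publish addresses
    # returned by the cluster (e.g. http://10.233.68.37:9200) rather than the
    # configured Service FQDN.
    client.transport.connections.each { |connection| puts connection.host }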

kannanvr commented 3 years ago

@cosmo0920, if there is an Elasticsearch setting for this, please let us know which parameter we need to change. Or, if it is elasticsearch gem functionality, where should we raise this issue?

This feature is really clever for sending data. But in our environment we send data to an in-cluster ES cluster and sometimes to a remote ES cluster. In both cases, Fluentd starts trying to send to an ES pod IP after some time.

The remote ES cluster is also running as pods. When Fluentd tries to send data to the remote ES cluster, it tries to send to the pod IPs of that cluster, which are not reachable from outside it. We have changed the settings as you mentioned above, but it is still trying to send to the pod IP, which is not what we want, and it is not reloading either.

    <match **>
      @id elasticsearch
      @type elasticsearch
      @log_level info
      include_tag_key true
      host remote-es-cluster
      port 80
      logstash_prefix test
      logstash_format true
      suppress_type_name true
      request_timeout 2000
      <buffer>
        @type file
        path /var/log/fluentd-buffers/kubernetes.system.buffer
        flush_mode interval
        retry_type exponential_backoff
        flush_thread_count 8
        flush_interval 8s
        retry_forever true
        retry_max_interval 5
        chunk_limit_size 8M
        queue_limit_length 10
        reconnect_on_error true
        reload_on_failure true
        reload_connections false
      </buffer>
    </match>

We would appreciate your suggestions for our use case.

Thanks, Kannan V

cosmo0920 commented 3 years ago

Or, if it is elasticsearch gem functionality, where should we raise this issue?

https://github.com/elastic/elasticsearch-ruby

cosmo0920 commented 3 years ago

Also, how about using the sniffer_class_name parameter? https://github.com/uken/fluent-plugin-elasticsearch#sniffer-class-name Normally, outside of a k8s environment, the ES plugin works well with the sniffer functionality, which uses the _nodes API: https://github.com/elastic/elasticsearch-ruby/blob/8fd7b0868db8ee06ea33f363c66b2545d037d00e/elasticsearch-transport/lib/elasticsearch/transport/transport/sniffer.rb#L46-L68

But in a k8s environment, this parameter together with the bundled Fluent::Plugin::ElasticsearchSimpleSniffer class is useful to prevent the node information from being fetched as individual Pod IPs: https://github.com/uken/fluent-plugin-elasticsearch/blob/master/lib/fluent/plugin/elasticsearch_simple_sniffer.rb
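
A minimal sketch of that setup, reusing the service FQDN from the configuration above (other values are illustrative). Per the README, the simple sniffer file also has to be loaded when Fluentd starts, for example by adding -r <gem install path>/fluent-plugin-elasticsearch/lib/fluent/plugin/elasticsearch_simple_sniffer.rb to the fluentd command line; the exact path depends on your gem installation.

    <match **>
      @type elasticsearch
      host central-es.tcl-logging.svc.cluster.local
      port 80
      logstash_format true
      # keep the client pinned to the configured FQDN instead of sniffed Pod IPs
      sniffer_class_name Fluent::Plugin::ElasticsearchSimpleSniffer
      reload_connections false
      reload_on_failure true
      reconnect_on_error true
    </match>

With reload_connections false and reload_on_failure true, connections are only re-resolved after a failure, and the simple sniffer returns the configured hosts instead of the Pod IPs reported by the _nodes API.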

kannanvr commented 3 years ago

OK, thanks @cosmo0920. We will try this parameter and report the progress after 12 hours. We are going to use Fluent::Plugin::ElasticsearchSimpleSniffer as the sniffer class.

cosmo0920 commented 3 years ago

@kannanvr How is your testing going?

kannanvr commented 3 years ago

@cosmo0920, we are still facing this issue. As a temporary workaround, we are restarting Fluentd every 3 hours. We need to resolve this issue. Is there any other way to solve it?

cosmo0920 commented 3 years ago

As a temporary workaround, we are restarting Fluentd every 3 hours. We need to resolve this issue. Is there any other way to solve it?

If SimpleSniffer does not solve this issue, there is no way to avoid it on our side. Why don't you open an issue at https://github.com/elastic/elasticsearch-ruby? Is it too hard to describe your issue yourself?

kannanvr commented 3 years ago

No. I will raise the issue on the elasticsearch-ruby project now.

cosmo0920 commented 3 years ago

No. I will raise the issue on the elasticsearch-ruby project now.

Thanks! :muscle:

cosmo0920 commented 3 years ago

FYI: You should write down the steps to reproduce this issue on https://github.com/elastic/elasticsearch-ruby/issues/1353.

With only the information currently written on https://github.com/elastic/elasticsearch-ruby/issues/1353, I can't reproduce the problem either.

kannanvr commented 3 years ago

@cosmo0920, I updated the elasticsearch-ruby issue with detailed info. Thanks for your help.

jambu commented 2 years ago

We are facing this exact same issue. Our Elasticsearch is running on K8s and Fluentd is outside, talking to it via MetalLB. Is there any known fix for this? Does an older version not have this issue?