toyaser opened this issue 4 years ago
A restart of the fluentd service gets rid of the issue, but any logs in the buffer are lost and manual recovery has to be done.
This is probably caused by using the memory buffer. Using the file buffer should not cause the buffer to be lost.
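For reference, a minimal sketch of what a file buffer section could look like (the path below is a placeholder, not taken from this thread):
<buffer>
  # persist chunks on disk so queued data survives a fluentd restart
  @type file
  path /var/log/td-agent/buffer/elasticsearch
</buffer>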
Another point to make is that a much older version of td-agent, v3.1.1 (which uses fluentd v1.0.2 and fluent-plugin-elasticsearch v2.4.0), works with no issues.
fluent-plugin-elasticsearch v2.4.0 is too old to use for investigating this issue.
Could you try fluent-plugin-elasticsearch v4.0.2? That version introduced a newer TLS mechanism, and I suspect it is one of the culprits behind this issue.
Or you might need to specify ssl_max_version TLSv1_3 and ssl_min_version TLSv1_2.
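For reference, a minimal sketch of where those options would sit in the output section (host, port, and scheme are placeholders based on the error messages below):
<match **>
  @type elasticsearch
  host elasticsearch.mydomain.io
  port 9200
  scheme https
  # pin the negotiated TLS version range as suggested above
  ssl_min_version TLSv1_2
  ssl_max_version TLSv1_3
</match>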
@cosmo0920 Thank you very much for your suggestions. I will take a look at testing them.
I had a minor question about the file buffer: the docs specifically say not to use it with remote file systems (https://docs.fluentd.org/buffer/file#limitation). Are you aware of the issues they are talking about? (We are using EFS.)
A remote filesystem could cause throughput issues, but I'm not aware of data loss. That said, I haven't actually run Fluentd's file buffer on such a remote filesystem....
Thanks @cosmo0920. So just to confirm: the plugin's behaviour of putting things in the buffer and updating the pos file immediately, even if the data has not yet been shipped to Elasticsearch, is the correct behaviour?
So just to confirm: the plugin's behaviour of putting things in the buffer and updating the pos file immediately, even if the data has not yet been shipped to Elasticsearch, is the correct behaviour?
As you wrote in your configuration,
<buffer>
  flush_thread_count 8
  flush_interval 5s
</buffer>
the ES plugin should flush every 5 seconds. This is the correct behavior.
The ES plugin does not implement the #process method, so it cannot flush immediately. Every flush goes through the Bulk API:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
We were experiencing the same issue with fluentd v1.10.3 and fluent-plugin-elasticsearch v4.0.9.
As it turned out, the cause for us was the default value of the reload_connections option, which we didn't specify. After 10k requests, elasticsearch-transport sends a request to https://es_host:9200/_nodes/http to reload the host list, and it suddenly started using IPs instead of hostnames. Our certificates have a hostname wildcard, not an IP wildcard.
Setting
reload_connections false
fixed it for us, but this is more of a workaround.
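For context, a trimmed sketch of the output section with that workaround applied (host and scheme are placeholders):
<match **>
  @type elasticsearch
  host elasticsearch.mydomain.io
  port 9200
  scheme https
  # keep using the configured hostname instead of the node IPs returned by _nodes/http
  reload_connections false
</match>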
ES gem versions are:
gem 'elasticsearch', '6.8.2'
gem 'elasticsearch-api', '6.8.2'
gem 'elasticsearch-transport', '6.8.2'
There is a proper fix for this: https://github.com/elastic/elasticsearch/pull/32806
Add the '-Des.http.cname_in_publish_address=true' property to ES, which changes the publish_address field format from "ip:port" to "hostname/ip:port"; that format is already supported by the fluentd ES gems.
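For reference, a minimal sketch of one way to set that property, assuming a package install with jvm.options under /etc/elasticsearch (adjust the path for your setup):
# /etc/elasticsearch/jvm.options
# make publish_address report "hostname/ip:port" so clients can keep validating the certificate hostname
-Des.http.cname_in_publish_address=true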
@i-skrypnyk thank you very much for posting your findings here. I will test this on our end and confirm the results here.
Problem
We are using td-agent v3.7.1, which ships fluentd v1.10.2 and fluent-plugin-elasticsearch v4.0.7.
We have a 3-node local Elasticsearch cluster. After starting td-agent, everything works for around 18-20 hours, after which fluentd starts to fail with the following errors:
2020-05-26 17:37:17 +0000 [warn]: #0 failed to flush the buffer. retry_time=16 next_retry_seconds=2020-05-29 17:18:06 +0000 chunk="5a6904a4700dc751015bf6f7fb2e0bc1" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.mydomain.io\", :port=>9200, :scheme=>\"https\", :user=>\"my_user\", :password=>\"obfuscated\"}): hostname \"10.0.0.1\" does not match the server certificate (OpenSSL::SSL::SSLError)"
2020-05-26 17:37:17 +0000 [warn]: #0 failed to flush the buffer. retry_time=17 next_retry_seconds=2020-05-27 02:35:12 +0000 chunk="5a690483041ba29bda96202b35491072" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.mydomain.io\", :port=>9200, :scheme=>\"https\", :user=>\"my_user\", :password=>\"obfuscated\"}): hostname \"10.0.0.2\" does not match the server certificate (OpenSSL::SSL::SSLError)"
2020-05-26 17:37:17 +0000 [warn]: #0 failed to flush the buffer. retry_time=18 next_retry_seconds=2020-05-27 11:17:37 +0000 chunk="5a69048c8e1d158c8826c73a15f903b0" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.mydomain.io\", :port=>9200, :scheme=>\"https\", :user=>\"my_user\", :password=>\"obfuscated\"}): hostname \"10.0.0.3\" does not match the server certificate (OpenSSL::SSL::SSLError)"
Steps to replicate
Leave td-agent running for long enough.
Expected Behavior
No need to restart fluentd.
Using Fluentd and ES plugin versions
Additional context
What is interesting is that logs are shipped consistently and then suddenly stop. Also of note: we have 3 separate servers, each shipping logs to the same Elasticsearch cluster, and all 3 servers eventually fail (around the same time) with the exact same error.
A restart of the fluentd service gets rid of the issue, but any logs in the buffer are lost and manual recovery has to be done.
Another point to make is that a much older version of td-agent, v3.1.1 (which uses fluentd v1.0.2 and fluent-plugin-elasticsearch v2.4.0), works with no issues.
Using the old version of td-agent, we have been running for over a week with no issues.