toyaser opened this issue 4 years ago
A restart of the fluentd service gets rid of the issue, but any logs in the buffer are lost and manual recovery has to be done.
This is probably caused by using the memory buffer. Using the file buffer should not cause the buffer to be lost.
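For reference, a minimal sketch of what a file buffer section could look like (the path below is a placeholder, not taken from this thread):
<buffer>
  # persist chunks on disk so queued data survives a fluentd restart
  @type file
  path /var/log/td-agent/buffer/elasticsearch
</buffer>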
Another point to make is that a much older version of td-agent, v3.1.1 (which uses fluentd v1.0.2 and fluent-plugin-elasticsearch v2.4.0), works with no issues.
fluent-plugin-elasticsearch v2.4.0 is too old to use for investigating this issue.
Could you try fluent-plugin-elasticsearch v4.0.2? That version introduced a newer TLS mechanism, and I suspect it is one of the culprits behind this issue.
Or you might need to specify ssl_max_version TLSv1_3 and ssl_min_version TLSv1_2.
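For reference, a minimal sketch of where those options would sit in the output section (host, port, and scheme are placeholders based on the error messages below):
<match **>
  @type elasticsearch
  host elasticsearch.mydomain.io
  port 9200
  scheme https
  # pin the negotiated TLS version range as suggested above
  ssl_min_version TLSv1_2
  ssl_max_version TLSv1_3
</match>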
@cosmo0920 Thank you very much for your suggestions. I will take a look at testing them.
I had a minor question about the file buffer: the docs specifically say not to use it with remote file systems (https://docs.fluentd.org/buffer/file#limitation). Are you aware of the issues they are talking about? (We are using EFS.)
A remote filesystem could cause throughput issues, but I'm not aware of data loss. That said, I haven't actually run Fluentd's file buffer on such a remote filesystem....
Thanks @cosmo0920. So just to confirm: the plugin's behaviour of putting things in the buffer and updating the pos file immediately, even if the data has not yet been shipped to Elasticsearch, is the correct behaviour?
So just to confirm: the plugin's behaviour of putting things in the buffer and updating the pos file immediately, even if the data has not yet been shipped to Elasticsearch, is the correct behaviour?
As you wrote in your configuration,
<buffer>
  flush_thread_count 8
  flush_interval 5s
</buffer>
the ES plugin should flush every 5 seconds. This is the correct behavior.
The ES plugin does not implement the #process method, so it cannot flush immediately. Every flush goes through the Bulk API:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
We were experiencing the same issue with fluentd v1.10.3 and fluent-plugin-elasticsearch v4.0.9.
As it turned out, the cause for us was the default value of the reload_connections option, which we didn't specify. After 10k requests, elasticsearch-transport sends a request to https://es_host:9200/_nodes/http to reload the host list, and it suddenly started using IPs instead of hostnames. Our certificates have a hostname wildcard, not an IP wildcard.
Setting
reload_connections false
fixed it for us, but this is more of a workaround.
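For context, a trimmed sketch of the output section with that workaround applied (host and scheme are placeholders):
<match **>
  @type elasticsearch
  host elasticsearch.mydomain.io
  port 9200
  scheme https
  # keep using the configured hostname instead of the node IPs returned by _nodes/http
  reload_connections false
</match>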
ES gem versions are:
gem 'elasticsearch', '6.8.2'
gem 'elasticsearch-api', '6.8.2'
gem 'elasticsearch-transport', '6.8.2'
There is a proper fix for this: https://github.com/elastic/elasticsearch/pull/32806
Add the '-Des.http.cname_in_publish_address=true' property to ES, which changes the publish_address field format from "ip:port" to "hostname/ip:port"; that format is already supported by the fluentd ES gems.
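For reference, a minimal sketch of one way to set that property, assuming a package install with jvm.options under /etc/elasticsearch (adjust the path for your setup):
# /etc/elasticsearch/jvm.options
# make publish_address report "hostname/ip:port" so clients can keep validating the certificate hostname
-Des.http.cname_in_publish_address=true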
@i-skrypnyk thank you very much for posting your findings here. I will test this on our end and confirm the results here.
Problem
We are using td-agent v3.7.1, which ships fluentd v1.10.2 and fluent-plugin-elasticsearch v4.0.7.
We have a 3-node local Elasticsearch cluster. After starting td-agent, everything works for around 18-20 hours, after which fluentd starts to fail with the following errors:
2020-05-26 17:37:17 +0000 [warn]: #0 failed to flush the buffer. retry_time=16 next_retry_seconds=2020-05-29 17:18:06 +0000 chunk="5a6904a4700dc751015bf6f7fb2e0bc1" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.mydomain.io\", :port=>9200, :scheme=>\"https\", :user=>\"my_user\", :password=>\"obfuscated\"}): hostname \"10.0.0.1\" does not match the server certificate (OpenSSL::SSL::SSLError)"
2020-05-26 17:37:17 +0000 [warn]: #0 failed to flush the buffer. retry_time=17 next_retry_seconds=2020-05-27 02:35:12 +0000 chunk="5a690483041ba29bda96202b35491072" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.mydomain.io\", :port=>9200, :scheme=>\"https\", :user=>\"my_user\", :password=>\"obfuscated\"}): hostname \"10.0.0.2\" does not match the server certificate (OpenSSL::SSL::SSLError)"
2020-05-26 17:37:17 +0000 [warn]: #0 failed to flush the buffer. retry_time=18 next_retry_seconds=2020-05-27 11:17:37 +0000 chunk="5a69048c8e1d158c8826c73a15f903b0" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.mydomain.io\", :port=>9200, :scheme=>\"https\", :user=>\"my_user\", :password=>\"obfuscated\"}): hostname \"10.0.0.3\" does not match the server certificate (OpenSSL::SSL::SSLError)"
Steps to replicate
Leave td-agent running for long enough.
Expected Behavior
No need to restart fluentd.
Using Fluentd and ES plugin versions
Additional context
What is interesting is that logs are shipped consistently and then suddenly stop. Also of note: we have 3 separate servers, each shipping logs to the same Elasticsearch cluster, and all 3 servers eventually fail (around the same time) with the exact same error.
A restart of the fluentd service gets rid of the issue, but any logs in the buffer are lost and manual recovery has to be done.
Another point to make is that a much older version of td-agent, v3.1.1 (which uses fluentd v1.0.2 and fluent-plugin-elasticsearch v2.4.0), works with no issues.
Using the old version of td-agent, we have been running for over a week with no issues.