g3kr opened this issue 3 years ago
Problem
We have the following output configuration for our log aggregator
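The configuration block itself did not make it into this copy of the report. A minimal sketch of what it might look like, reconstructed from the connection details and the `indexprefix` buffer variable visible in the logs below; the match pattern, `index_name` placeholder, and all buffer settings other than `retry_max_times 1` are assumptions:

```
<match **>
  @type elasticsearch
  @id es_output
  host test.us-east-1.aws.found.io
  port 9243
  scheme https
  user g3kr
  password xxxxxx
  # hypothetical: index name derived from a per-record field
  index_name ${indexprefix}
  <buffer indexprefix>
    retry_max_times 1
    # other flush/retry settings assumed to be defaults
  </buffer>
</match>
```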
Whenever the buffer fails to flush, it automatically retries. Since we have `retry_max_times` set to 1, we would expect it to retry only once. However, we are seeing the debug logs below for a specific chunk ID:

```
2021-09-03 18:29:41 +0000 [debug]: #0 [es_output] Created new chunk chunk_id="5cb1b7b8f220e3710d84d03d4088fe1a" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=nil, variables={:indexprefix=>"agg-test"}, seq=0>
2021-09-03 18:30:46 +0000 [debug]: #0 [es_output] taking back chunk for errors. chunk="5cb1b7b8f220e3710d84d03d4088fe1a"
2021-09-03 18:30:46 +0000 [warn]: #0 [es_output] failed to flush the buffer. retry_time=0 next_retry_seconds=2021-09-03 18:30:47.779936975 +0000 chunk="5cb1b7b8f220e3710d84d03d4088fe1a" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"test.us-east-1.aws.found.io\", :port=>9243, :scheme=>\"https\", :user=>\"g3kr\", :password=>\"obfuscated\"}): read timeout reached"
2021-09-03 18:31:49 +0000 [debug]: #0 [es_output] taking back chunk for errors. chunk="5cb1b7b8f220e3710d84d03d4088fe1a"
2021-09-03 18:31:49 +0000 [warn]: #0 [es_output] failed to flush the buffer. retry_time=1 next_retry_seconds=2021-09-03 18:31:50 6129049284365171323803/8796093022208000000000 +0000 chunk="5cb1b7b8f220e3710d84d03d4088fe1a" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"test.us-east-1.aws.found.io\", :port=>9243, :scheme=>\"https\", :user=>\"g3kr\", :password=>\"obfuscated\"}): read timeout reached"
2021-09-03 18:32:52 +0000 [debug]: #0 [es_output] taking back chunk for errors. chunk="5cb1b7b8f220e3710d84d03d4088fe1a"
2021-09-03 18:32:52 +0000 [warn]: #0 [es_output] failed to flush the buffer. retry_time=1 next_retry_seconds=2021-09-03 18:32:53 3932328916113770970433/4398046511104000000000 +0000 chunk="5cb1b7b8f220e3710d84d03d4088fe1a" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"test.us-east-1.aws.found.io\", :port=>9243, :scheme=>\"https\", :user=>\"g3kr\", :password=>\"obfuscated\"}): read timeout reached"
2021-09-03 18:34:05 +0000 [debug]: #0 [es_output] taking back chunk for errors. chunk="5cb1b7b8f220e3710d84d03d4088fe1a"
```
The last 2 retries happened with `retry_time=1`, and `next_retry_seconds` keeps incrementing irrespective of `retry_max_times`. Is this a bug?

Steps to replicate
Reproduce this by bombarding the log aggregator with log messages, causing buffer flush failures and chunk retries (e.g. with a synthetic input like the sketch below).
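One hypothetical way to generate that load is Fluentd's built-in dummy input; the tag, rate, and record here are illustrative assumptions, not taken from the original report:

```
<source>
  @type dummy
  tag agg.test
  rate 100000
  dummy {"message": "load-test payload"}
</source>
```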
Expected Behavior

Drop the chunk after 1 retry.
Using Fluentd and ES plugin versions
The retry mechanism relies on the Fluentd core implementation. In the ES plugin, this buffering/retrying mechanism is simply inherited from Fluentd core's `Fluent::Plugin::Output` class. Could you report this issue on the Fluentd repository instead of here?
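For context on that pointer: Fluentd core schedules the next attempt from its documented `retry_wait` and exponential-backoff settings. A toy Ruby model of that bookkeeping, a sketch under assumed defaults rather than the actual `Fluent::Plugin::Output` source, shows why `retry_max_times 1` should drop the chunk on the second failure:

```ruby
# Toy model of Fluentd's documented retry bookkeeping -- NOT the real
# Fluent::Plugin::Output source. Assumes the documented defaults of
# retry_wait 1s and exponential backoff with base 2.
class ToyRetryState
  def initialize(retry_wait: 1.0, backoff_base: 2.0, retry_max_times: 1)
    @retry_wait      = retry_wait
    @backoff_base    = backoff_base
    @retry_max_times = retry_max_times
    @steps           = 0   # number of retries attempted so far
  end

  # Called after each failed flush; returns the next retry time,
  # or raises once the configured retry budget is exhausted.
  def record_failure!
    @steps += 1
    if @steps > @retry_max_times
      raise "retry limit (#{@retry_max_times}) reached, chunk should be dropped"
    end
    Time.now + @retry_wait * (@backoff_base**(@steps - 1))
  end
end

state = ToyRetryState.new(retry_max_times: 1)
puts state.record_failure!  # first failure: schedules the one allowed retry
begin
  state.record_failure!     # second failure: budget exhausted
rescue RuntimeError => e
  puts e.message            # chunk should be dropped here, not retried again
end
```

In the logs above, by contrast, `retry_time` stalls at 1 while `next_retry_seconds` keeps advancing, which is the behavior worth raising upstream.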