uken / fluent-plugin-elasticsearch


Incorrect number of chunk retries #916

Open g3kr opened 3 years ago

g3kr commented 3 years ago


Problem

We have the following output configuration for our log aggregator:

<match kubeMsgs.**>
        @id es_output
        @type elasticsearch
        host "#{ENV['ES_HOSTNAME']}"
        port 9243
        user "#{ENV['ES_USERNAME']}"
        password "#{ENV['ES_PASSWORD']}"
        # prevent duplicate log entries and updates
        id_key _hash
        #remove_keys _hash
        scheme https
        with_transporter_log true
        @log_level debug
        ssl_verify false
        ssl_version TLSv1_2
        index_name ${indexprefix}
        reconnect_on_error true
        reload_connections false
        reload_on_failure true
        suppress_type_name true
        request_timeout 60s
        prefer_oj_serializer true
        type_name _doc
        bulk_message_request_threshold -1
        slow_flush_log_threshold 30.0
        #write_operation create
        <buffer indexprefix>
          @type "file"
          path "#{ENV['BufferPath']}"
          flush_thread_count 4
          #chunk+enqueue
          flush_mode interval
          flush_interval 15s
          chunk_limit_size 8MB
          total_limit_size 8GB
          flush_at_shutdown true
          overflow_action drop_oldest_chunk
          #retry properties
          retry_wait 10
          retry_timeout 2m
          retry_type exponential_backoff             
          retry_max_interval 60
          retry_max_times 1    
          disable_chunk_backup true
        </buffer>
</match>
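
For reference, here is my reading of the retry-related settings above, annotated. The comments reflect how I understand the Fluentd buffer documentation, so treat them as assumptions rather than verified behavior:

    # retry settings extracted from the buffer section above
    retry_wait 10                    # base wait before the first retry (~10 seconds)
    retry_type exponential_backoff   # each subsequent wait grows exponentially
    retry_max_interval 60            # individual waits are capped at 60 seconds
    retry_max_times 1                # a failed chunk should be retried at most once
    retry_timeout 2m                 # stop retrying after 2 minutes in total
    disable_chunk_backup true        # discard given-up chunks instead of backing them up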

Whenever the buffer fails to flush, it automatically retries. Since we have retry_max_times set to 1, we would expect it to retry only once. However, we are seeing the debug logs below for a specific chunk ID.

2021-09-03 18:29:41 +0000 [debug]: #0 [es_output] Created new chunk chunk_id="5cb1b7b8f220e3710d84d03d4088fe1a" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=nil, variables={:indexprefix=>"agg-test"}, seq=0>

2021-09-03 18:30:46 +0000 [debug]: #0 [es_output] taking back chunk for errors. chunk="5cb1b7b8f220e3710d84d03d4088fe1a"

2021-09-03 18:30:46 +0000 [warn]: #0 [es_output] failed to flush the buffer. retry_time=0 next_retry_seconds=2021-09-03 18:30:47.779936975 +0000 chunk="5cb1b7b8f220e3710d84d03d4088fe1a" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"test.us-east-1.aws.found.io\", :port=>9243, :scheme=>\"https\", :user=>\"g3kr\", :password=>\"obfuscated\"}): read timeout reached"

2021-09-03 18:31:49 +0000 [debug]: #0 [es_output] taking back chunk for errors. chunk="5cb1b7b8f220e3710d84d03d4088fe1a"

2021-09-03 18:31:49 +0000 [warn]: #0 [es_output] failed to flush the buffer. retry_time=1 next_retry_seconds=2021-09-03 18:31:50 6129049284365171323803/8796093022208000000000 +0000 chunk="5cb1b7b8f220e3710d84d03d4088fe1a" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"test.us-east-1.aws.found.io\", :port=>9243, :scheme=>\"https\", :user=>\"g3kr\", :password=>\"obfuscated\"}): read timeout reached"

2021-09-03 18:32:52 +0000 [debug]: #0 [es_output] taking back chunk for errors. chunk="5cb1b7b8f220e3710d84d03d4088fe1a"

2021-09-03 18:32:52 +0000 [warn]: #0 [es_output] failed to flush the buffer. retry_time=1 next_retry_seconds=2021-09-03 18:32:53 3932328916113770970433/4398046511104000000000 +0000 chunk="5cb1b7b8f220e3710d84d03d4088fe1a" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"test.us-east-1.aws.found.io\", :port=>9243, :scheme=>\"https\", :user=>\"g3kr\", :password=>\"obfuscated\"}): read timeout reached"

2021-09-03 18:34:05 +0000 [debug]: #0 [es_output] taking back chunk for errors. chunk="5cb1b7b8f220e3710d84d03d4088fe1a"

The last 2 retries happened with retry_time=1, and next_retry_seconds keeps incrementing irrespective of retry_max_times. Is this a bug? ...

Steps to replicate

Reproduce this by bombarding the log aggregator with log messages, causing buffer failures and chunk retries (a smaller reproduction sketch follows).
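
In case it helps, here is a pared-down config I would use to force the failure deterministically instead of relying on load. This is only a sketch: it assumes the bundled dummy input is available and that pointing the plugin at a port with nothing listening is enough to trigger RecoverableRequestFailure; the host, port, and index name are hypothetical.

    <source>
      @type dummy                # built-in generator, just to create steady traffic
      tag kubeMsgs.repro
    </source>

    <match kubeMsgs.**>
      @type elasticsearch
      host 127.0.0.1             # hypothetical: nothing listens here, so every flush fails
      port 9201
      index_name repro-test      # hypothetical literal index, no chunk key needed
      request_timeout 5s
      @log_level debug
      <buffer>
        @type memory
        flush_mode interval
        flush_interval 5s
        retry_type exponential_backoff
        retry_wait 10
        retry_max_interval 60
        retry_max_times 1        # expectation: one retry, then the chunk is dropped
        retry_timeout 2m
      </buffer>
    </match>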

Expected behavior: the chunk should be dropped after 1 retry.

Using Fluentd and ES plugin versions

cosmo0920 commented 3 years ago

The last 2 retries happened with retry_time=1, and next_retry_seconds keeps incrementing irrespective of retry_max_times. Is this a bug?

The retry mechanism relies on the Fluentd core implementation. In the ES plugin, buffering and retrying are simply inherited from Fluentd core's Fluent::Plugin::Output class. Could you report this issue on the Fluentd repository instead of here?
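
In case it is useful when filing that, here is a sketch of how the same retry settings might be exercised without the ES plugin at all, using the bundled http output pointed at a closed port (this assumes out_http is available, i.e. Fluentd v1.7 or later; the endpoint is hypothetical). If retry_time still climbs past retry_max_times here, that would support the counting happening in Fluentd core rather than in this plugin.

    <match kubeMsgs.**>
      @type http
      endpoint http://127.0.0.1:9999/ingest   # hypothetical: nothing listening, so every flush fails
      <buffer>
        @type memory
        flush_mode interval
        flush_interval 5s
        retry_type exponential_backoff
        retry_wait 10
        retry_max_interval 60
        retry_max_times 1     # same expectation: one retry, then drop the chunk
        retry_timeout 2m
      </buffer>
    </match>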