uken / fluent-plugin-elasticsearch


Duplicate records in ES #899

Closed. g3kr closed this issue 3 years ago.

g3kr commented 3 years ago


Problem

We are using the elasticsearch_genid filter to hash each record so duplicate records do not appear in ES.

However, we see that the filter generates a new hash id each time the same log message is ingested. Our configuration:

<filter logs**>
  @type elasticsearch_genid
  hash_id_key generated_hash
  #use_entire_record true
  use_record_as_seed true
  record_keys message,service
  separator _
  hash_type sha256
</filter>

<match logs**>
  @type copy
  <store>
    @type elasticsearch
    host "#{ENV['ES_HOSTNAME']}"
    port 9243
    user "#{ENV['ES_USERNAME']}"
    password "#{ENV['ES_PASSWORD']}"
    scheme https
    with_transporter_log true
    @log_level debug
    ssl_verify false
    ssl_version TLSv1_2
    index_name ${indexprefix}
    reconnect_on_error true
    reload_connections false
    reload_on_failure true
    suppress_type_name true
    request_timeout 30s
    prefer_oj_serializer true
    type_name _doc

    # prevent duplicate log entries and updates
    id_key generated_hash
    #remove_keys generated_hash, date
    #write_operation create
    <buffer indexprefix>
      @type "file"
      path "#{ENV['BufferPath']}"
      flush_thread_count 10
      flush_mode interval
      flush_interval 30s
      flush_at_shutdown true
      overflow_action throw_exception
      compress gzip
      retry_forever true
      retry_type periodic
      retry_wait 30s
      chunk_limit_size 64MB
      total_limit_size 64GB
    </buffer>
  </store>
</match>

...
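For context on what this configuration is trying to do: elasticsearch_genid is expected to write a deterministic hash of the listed record_keys into generated_hash, and id_key generated_hash then makes the output plugin use that value as the Elasticsearch document _id, so identical records collapse into one document. A rough Python sketch of that idea (an illustration only, not the plugin's actual Ruby code; the field names and separator mirror the config above):

import hashlib

def generated_hash(record, record_keys=("message", "service"), separator="_"):
    # Deterministic id: join the selected fields with the separator and hash them,
    # so the same message/service pair always yields the same document _id.
    seed = separator.join(str(record[k]) for k in record_keys)
    return hashlib.sha256(seed.encode("utf-8")).hexdigest()

doc = {"message": "This is test message - 1", "service": "Ltest"}
print(generated_hash(doc))  # re-running on an identical record prints the same digest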

Steps to replicate

The same log message was sent three times:

{"message":"This is test message - 1", "service": "Ltest",  "instant": {"epochSecond": 1624386995,"nanoOfSecond": 7000000}}
{"message":"This is test message - 1", "service": "Ltest",  "instant": {"epochSecond": 1624386995,"nanoOfSecond": 7000000}}
{"message":"This is test message - 1", "service": "Ltest",  "instant": {"epochSecond": 1624386995,"nanoOfSecond": 7000000}}

Expected Behavior or What you need to ask

Instead of these being treated as duplicates, a different hash id is generated for each of these messages and 3 records are inserted into ES. What am I missing?
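For reference, with record_keys limited to message and service, all three sample events above share the same seed, so the expectation is one digest and therefore one _id. A quick self-contained check of that expectation (plain Python, assuming the filter seeds only on those two fields):

import hashlib

samples = [
    {"message": "This is test message - 1", "service": "Ltest",
     "instant": {"epochSecond": 1624386995, "nanoOfSecond": 7000000}},
] * 3

def seed(record, keys=("message", "service"), sep="_"):
    return sep.join(str(record[k]) for k in keys)

digests = {hashlib.sha256(seed(r).encode("utf-8")).hexdigest() for r in samples}
print(len(digests))  # 1 -> all three events should map to a single document _id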

Using Fluentd and ES plugin versions

Fluentd v1.11.4, ES plugin 4.0.7, Elasticsearch 7.6.1

g3kr commented 3 years ago

@cosmo0920 Any thoughts here?

g3kr commented 3 years ago

Was able to resolve this finally. The documentation is not clear: it does not explicitly say that if you use use_entire_record you still need to set use_record_as_seed. The config below works:

<filter **>
  @type elasticsearch_genid
  # storing generated hash id key (default is _hash)
  hash_id_key _hash
  use_record_as_seed true
  record_keys []
  use_entire_record true
  separator _
  hash_type sha256
  include_time_in_seed false
  include_tag_in_seed false
</filter>
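The key point appears to be that use_record_as_seed is the switch between deterministic and random ids: when it is off, the filter falls back to a random per-event id, so identical records still get distinct _ids; when it is on together with use_entire_record, the id is a digest of the whole record. Roughly, as a Python approximation (not the plugin's exact Ruby implementation, and the exact seed layout may differ):

import base64
import hashlib
import uuid

record = {"message": "This is test message - 1", "service": "Ltest"}

# use_record_as_seed false: id is random per event, so duplicates are never collapsed
random_id = base64.b64encode(uuid.uuid4().bytes).decode()

# use_record_as_seed true + use_entire_record true: id is a digest of the whole record,
# so identical records always map to the same _id and overwrite each other in ES
seed = "".join(f"|{k}|{v}" for k, v in record.items())
deterministic_id = hashlib.sha256(seed.encode("utf-8")).hexdigest()

print(random_id, deterministic_id)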