uken / fluent-plugin-elasticsearch

Apache License 2.0
890 stars 310 forks

Retry on bad request (resource_already_exists_exception) should not happen #983

Open applike-ss opened 2 years ago

applike-ss commented 2 years ago


Problem

We're seeing the following error in Fluentd: the Elasticsearch output plugin tried to create a data stream that already existed (likely created by a different Fluentd pod in our Kubernetes cluster).

2022-08-30 09:54:49 +0000 [warn]: #0 Could not communicate to Elasticsearch, resetting connection and trying again. [400] {"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"data_stream [my-data-stream] already exists"}],"type":"resource_already_exists_exception","reason":"data_stream [my-data-stream] already exists"},"status":400}
2022-08-30 09:54:49 +0000 [warn]: #0 Remaining retry: 2. Retry to communicate after 256 second(s).

Steps to replicate

This is a race condition, so might be hard to replicate.

Expected Behavior or What you need to ask

When creating a resource, the plugin should check whether the response is a resource_already_exists_exception (HTTP status 400 is returned, but a 400 is not always a resource_already_exists_exception). If it gets this exception, it should either assume the resource does already exist or, if we are paranoid, query for the resource it was about to create. In any case, no error should be logged for a resource that the plugin tried to create but that already exists. ...
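A minimal sketch of the check being asked for (this is illustrative code, not the plugin's actual implementation; the helper name is hypothetical): inspect a 400 response body and treat it as success only when the error type really is resource_already_exists_exception.

```ruby
require "json"

# Hypothetical helper (name and signature are illustrative, not from
# fluent-plugin-elasticsearch): returns true when an Elasticsearch 400
# response actually means "the resource is already there", so a create
# call can be treated as a no-op instead of an error.
def already_exists_error?(status, body)
  return false unless status == 400
  error = JSON.parse(body)["error"]
  return false unless error.is_a?(Hash)
  # The exception type can appear at the top level and/or in root_cause.
  types = [error["type"]] + Array(error["root_cause"]).map { |c| c["type"] }
  types.include?("resource_already_exists_exception")
rescue JSON::ParserError
  false
end
```

With the error body from the log above, `already_exists_error?(400, body)` would return true, while other 400s (e.g. a mapping error) would still be treated as real failures.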

Using Fluentd and ES plugin versions

LOCAL GEMS

abbrev (default: 0.1.0) addressable (2.8.1) base64 (default: 0.1.1) benchmark (default: 0.2.0) bigdecimal (default: 3.1.1) bundler (default: 2.3.7, 2.2.24, 2.2.15) cgi (default: 0.3.1) concurrent-ruby (1.1.10) cool.io (1.7.1) csv (default: 3.2.2) date (default: 3.2.2) delegate (default: 0.2.0) did_you_mean (default: 1.6.1) digest (default: 3.1.0) domain_name (0.5.20190701) drb (default: 2.1.0) elasticsearch (7.17.1) elasticsearch-api (7.17.1) elasticsearch-transport (7.17.1) elasticsearch-xpack (7.17.1) english (default: 0.7.1) erb (default: 2.2.3) error_highlight (default: 0.3.0) etc (default: 1.3.0) excon (0.92.4) faraday (1.10.2) faraday-em_http (1.0.0) faraday-em_synchrony (1.0.0) faraday-excon (1.1.0) faraday-httpclient (1.0.1) faraday-multipart (1.0.4) faraday-net_http (1.0.1) faraday-net_http_persistent (1.2.0) faraday-patron (1.0.0) faraday-rack (1.0.0) faraday-retry (1.0.3) fcntl (default: 1.0.1) ffi (1.15.5) ffi-compiler (1.0.1) fiddle (default: 1.1.0) fileutils (default: 1.6.0) find (default: 0.1.1) fluent-config-regexp-type (1.0.0) fluent-plugin-concat (2.5.0) fluent-plugin-dedot_filter (1.0.0) fluent-plugin-detect-exceptions (0.0.14) fluent-plugin-elasticsearch (5.2.3) fluent-plugin-grafana-loki (1.2.18) fluent-plugin-grok-parser (2.6.2) fluent-plugin-json-in-json-2 (1.0.2) fluent-plugin-kubernetes_metadata_filter (2.13.0) fluent-plugin-multi-format-parser (1.0.0) fluent-plugin-parser-cri (0.1.1) fluent-plugin-prometheus (2.0.3) fluent-plugin-record-modifier (2.1.0) fluent-plugin-rewrite-tag-filter (2.4.0) fluent-plugin-systemd (1.0.5) fluentd (1.15.2) forwardable (default: 1.3.2) getoptlong (default: 0.1.1) http (4.4.1) http-accept (1.7.0) http-cookie (1.0.5) http-form_data (2.3.0) http-parser (1.2.3) http_parser.rb (0.8.0) io-console (default: 0.5.11) io-nonblock (default: 0.1.0) io-wait (default: 0.2.1) ipaddr (default: 1.2.4) irb (default: 1.4.1) json (default: 2.6.1) jsonpath (1.1.2) kubeclient (4.9.3) logger (default: 1.5.0) lru_redux (1.1.0) 
mime-types (3.4.1) mime-types-data (3.2022.0105) msgpack (1.5.6) multi_json (1.15.0) multipart-post (2.2.3) mutex_m (default: 0.1.1) net-http (default: 0.2.0) net-protocol (default: 0.1.2) netrc (0.11.0) nkf (default: 0.1.1) observer (default: 0.1.1) oj (3.13.21) open-uri (default: 0.2.0) open3 (default: 0.1.1) openssl (default: 3.0.0) optparse (default: 0.2.0) ostruct (default: 0.5.2) pathname (default: 0.2.0) pp (default: 0.3.0) prettyprint (default: 0.1.1) prometheus-client (4.0.0) pstore (default: 0.1.1) psych (default: 4.0.3) public_suffix (5.0.0) racc (default: 1.6.0) rake (13.0.6) rdoc (default: 6.4.0) readline (default: 0.0.3) readline-ext (default: 0.1.4) recursive-open-struct (1.1.3) reline (default: 0.3.0) resolv (default: 0.2.1) resolv-replace (default: 0.1.0) rest-client (2.1.0) rexml (3.2.5) rinda (default: 0.1.1) ruby2_keywords (default: 0.0.5) securerandom (default: 0.1.1) serverengine (2.3.0) set (default: 1.0.2) shellwords (default: 0.1.0) sigdump (0.2.4) singleton (default: 0.1.1) stringio (default: 3.0.1) strptime (0.2.5) strscan (default: 3.0.1) syslog (default: 0.1.0) systemd-journal (1.4.2) tempfile (default: 0.1.2) time (default: 0.2.0) timeout (default: 0.2.0) tmpdir (default: 0.1.2) tsort (default: 0.1.0) tzinfo (2.0.5) tzinfo-data (1.2022.3) un (default: 0.2.0) unf (0.1.4) unf_ext (0.0.8.2) uri (default: 0.11.0) weakref (default: 0.1.1) webrick (1.7.0) yajl-ruby (1.4.3) yaml (default: 0.2.0) zlib (default: 2.1.1)

cosmo0920 commented 1 year ago
  • let multiple Fluentd instances start simultaneously to ingest logs (preferably a fixed input file or dummy data?)

This does not prevent duplicate index names. It cannot prevent this issue from occurring.

cosmo0920 commented 1 year ago

For possible fix for operations:

Please try setting up an aggregator node that collects logs from the multiple Fluentd instances, and send logs to Elasticsearch from the aggregator only.

ref: https://docs.fluentd.org/deployment/high-availability
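A minimal forwarder-side configuration along the lines of the linked high-availability doc might look like this (the host name is a placeholder; adjust match pattern and port to your setup):

```
# Forwarder side: send everything to a single aggregator instead of
# letting every Fluentd pod talk to Elasticsearch directly.
<match **>
  @type forward
  <server>
    host aggregator.example.com   # placeholder aggregator host
    port 24224
  </server>
</match>
```

Only the aggregator would then run the elasticsearch output plugin, so only one writer ever tries to create the data stream.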

applike-ss commented 1 year ago

While this may indeed ensure the problem does not happen, it cannot really be wrong to run multiple Fluentd instances. This seems to happen because we run the same application on multiple nodes, which results in the same data stream being used. These applications send their logs via Fluent Bit to our Fluentd instances, and when they resolve the DNS name to different IPs, this is the result. IMHO this is a bug in the plugin, not in our setup.

cosmo0920 commented 1 year ago

Again, this cannot be prevented by Fluentd plugins.

Fluentd does not know the state of Elasticsearch indices until it sends an HTTP request and receives the response from the Elasticsearch cluster. These requests are not handled synchronously, so there is always the possibility of a race condition between checking whether an index exists and creating it.
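The check-then-create race described here can be sketched as a toy model (no Elasticsearch involved, all names illustrative): two clients both check first, both see the data stream as missing, and the second create collides no matter how careful each client's own check was.

```ruby
# Toy model of the check-then-create race: two clients interleave their
# steps, so the second create fails even though its own check passed.
streams = {}

# Hypothetical create call: succeeds only if the name is not taken yet.
def create(streams, name)
  return "resource_already_exists_exception" if streams.key?(name)
  streams[name] = true
  "created"
end

# Step 1: both clients check before either one creates.
client_a_saw_missing = !streams.key?("my-data-stream")
client_b_saw_missing = !streams.key?("my-data-stream")

# Step 2: client A creates first.
create(streams, "my-data-stream") if client_a_saw_missing

# Step 3: client B's create now collides, despite its earlier check.
result = client_b_saw_missing ? create(streams, "my-data-stream") : "skipped"
puts result # → resource_already_exists_exception
```

This is why a pre-flight existence check alone cannot fix the issue; the 400 response itself has to be interpreted.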

applike-ss commented 1 year ago

The point I am trying to make is: we shouldn't log an error for a resource that already exists. We also shouldn't return an error to our caller when the resource already exists, which is what the exception is telling us even though it comes as a 400 (ES does not expect us to try to create the resource multiple times).

I don't see why you would be against implementing a check for "this error says the resource already exists, so let's assume our create call was successful".

cosmo0920 commented 1 year ago

I don't see why you would be against implementing a check for "this error says the resource already exists, so let's assume our create call was successful".

I'm not very interested in implementing a fix for this corner case myself, but we welcome a patch that fixes it.