uken / fluent-plugin-elasticsearch

Apache License 2.0
890 stars 310 forks

Retry on bad request (resource_already_exists_exception) should not happen #983

Open applike-ss opened 2 years ago

applike-ss commented 2 years ago


Problem

We're seeing the following error in Fluentd: the Elasticsearch output plugin tried to create a data stream that already existed (likely created by a different Fluentd pod in our Kubernetes cluster).

2022-08-30 09:54:49 +0000 [warn]: #0 Could not communicate to Elasticsearch, resetting connection and trying again. [400] {"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"data_stream [my-data-stream] already exists"}],"type":"resource_already_exists_exception","reason":"data_stream [my-data-stream] already exists"},"status":400}
2022-08-30 09:54:49 +0000 [warn]: #0 Remaining retry: 2. Retry to communicate after 256 second(s).

Steps to replicate

This is a race condition, so might be hard to replicate.

Expected Behavior or What you need to ask

When creating a resource, the plugin should check whether the response is a resource_already_exists_exception (HTTP status 400 is returned, but a 400 is not always a resource_already_exists_exception). If it gets this exception, it should either assume the resource does already exist or, if we are paranoid, query for the resource it was about to create. In any case, no error should be logged for a resource that the plugin tried to create but that already exists. ...
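A minimal sketch of the check being asked for (this is illustrative code, not the plugin's actual implementation; the helper name is hypothetical): inspect a 400 response body and treat it as success only when the error type really is resource_already_exists_exception.

```ruby
require "json"

# Hypothetical helper (name and signature are illustrative, not from
# fluent-plugin-elasticsearch): returns true when an Elasticsearch 400
# response actually means "the resource is already there", so a create
# call can be treated as a no-op instead of an error.
def already_exists_error?(status, body)
  return false unless status == 400
  error = JSON.parse(body)["error"]
  return false unless error.is_a?(Hash)
  # The exception type can appear at the top level and/or in root_cause.
  types = [error["type"]] + Array(error["root_cause"]).map { |c| c["type"] }
  types.include?("resource_already_exists_exception")
rescue JSON::ParserError
  false
end
```

With the error body from the log above, `already_exists_error?(400, body)` would return true, while other 400s (e.g. a mapping error) would still be treated as real failures.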

Using Fluentd and ES plugin versions

LOCAL GEMS

abbrev (default: 0.1.0) addressable (2.8.1) base64 (default: 0.1.1) benchmark (default: 0.2.0) bigdecimal (default: 3.1.1) bundler (default: 2.3.7, 2.2.24, 2.2.15) cgi (default: 0.3.1) concurrent-ruby (1.1.10) cool.io (1.7.1) csv (default: 3.2.2) date (default: 3.2.2) delegate (default: 0.2.0) did_you_mean (default: 1.6.1) digest (default: 3.1.0) domain_name (0.5.20190701) drb (default: 2.1.0) elasticsearch (7.17.1) elasticsearch-api (7.17.1) elasticsearch-transport (7.17.1) elasticsearch-xpack (7.17.1) english (default: 0.7.1) erb (default: 2.2.3) error_highlight (default: 0.3.0) etc (default: 1.3.0) excon (0.92.4) faraday (1.10.2) faraday-em_http (1.0.0) faraday-em_synchrony (1.0.0) faraday-excon (1.1.0) faraday-httpclient (1.0.1) faraday-multipart (1.0.4) faraday-net_http (1.0.1) faraday-net_http_persistent (1.2.0) faraday-patron (1.0.0) faraday-rack (1.0.0) faraday-retry (1.0.3) fcntl (default: 1.0.1) ffi (1.15.5) ffi-compiler (1.0.1) fiddle (default: 1.1.0) fileutils (default: 1.6.0) find (default: 0.1.1) fluent-config-regexp-type (1.0.0) fluent-plugin-concat (2.5.0) fluent-plugin-dedot_filter (1.0.0) fluent-plugin-detect-exceptions (0.0.14) fluent-plugin-elasticsearch (5.2.3) fluent-plugin-grafana-loki (1.2.18) fluent-plugin-grok-parser (2.6.2) fluent-plugin-json-in-json-2 (1.0.2) fluent-plugin-kubernetes_metadata_filter (2.13.0) fluent-plugin-multi-format-parser (1.0.0) fluent-plugin-parser-cri (0.1.1) fluent-plugin-prometheus (2.0.3) fluent-plugin-record-modifier (2.1.0) fluent-plugin-rewrite-tag-filter (2.4.0) fluent-plugin-systemd (1.0.5) fluentd (1.15.2) forwardable (default: 1.3.2) getoptlong (default: 0.1.1) http (4.4.1) http-accept (1.7.0) http-cookie (1.0.5) http-form_data (2.3.0) http-parser (1.2.3) http_parser.rb (0.8.0) io-console (default: 0.5.11) io-nonblock (default: 0.1.0) io-wait (default: 0.2.1) ipaddr (default: 1.2.4) irb (default: 1.4.1) json (default: 2.6.1) jsonpath (1.1.2) kubeclient (4.9.3) logger (default: 1.5.0) lru_redux (1.1.0) 
mime-types (3.4.1) mime-types-data (3.2022.0105) msgpack (1.5.6) multi_json (1.15.0) multipart-post (2.2.3) mutex_m (default: 0.1.1) net-http (default: 0.2.0) net-protocol (default: 0.1.2) netrc (0.11.0) nkf (default: 0.1.1) observer (default: 0.1.1) oj (3.13.21) open-uri (default: 0.2.0) open3 (default: 0.1.1) openssl (default: 3.0.0) optparse (default: 0.2.0) ostruct (default: 0.5.2) pathname (default: 0.2.0) pp (default: 0.3.0) prettyprint (default: 0.1.1) prometheus-client (4.0.0) pstore (default: 0.1.1) psych (default: 4.0.3) public_suffix (5.0.0) racc (default: 1.6.0) rake (13.0.6) rdoc (default: 6.4.0) readline (default: 0.0.3) readline-ext (default: 0.1.4) recursive-open-struct (1.1.3) reline (default: 0.3.0) resolv (default: 0.2.1) resolv-replace (default: 0.1.0) rest-client (2.1.0) rexml (3.2.5) rinda (default: 0.1.1) ruby2_keywords (default: 0.0.5) securerandom (default: 0.1.1) serverengine (2.3.0) set (default: 1.0.2) shellwords (default: 0.1.0) sigdump (0.2.4) singleton (default: 0.1.1) stringio (default: 3.0.1) strptime (0.2.5) strscan (default: 3.0.1) syslog (default: 0.1.0) systemd-journal (1.4.2) tempfile (default: 0.1.2) time (default: 0.2.0) timeout (default: 0.2.0) tmpdir (default: 0.1.2) tsort (default: 0.1.0) tzinfo (2.0.5) tzinfo-data (1.2022.3) un (default: 0.2.0) unf (0.1.4) unf_ext (0.0.8.2) uri (default: 0.11.0) weakref (default: 0.1.1) webrick (1.7.0) yajl-ruby (1.4.3) yaml (default: 0.2.0) zlib (default: 2.1.1)

cosmo0920 commented 1 year ago
  • let multiple Fluentd instances start simultaneously to ingest logs (preferably a fixed input file or dummy data?)

This does not prevent duplicate index names. It cannot prevent this issue from occurring.

cosmo0920 commented 1 year ago

For possible fix for operations:

Please try setting up an aggregator node that collects logs from the multiple Fluentd instances, and send logs to Elasticsearch from the aggregator only.

ref: https://docs.fluentd.org/deployment/high-availability
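A minimal forwarder-side configuration along the lines of the linked high-availability doc might look like this (the host name is a placeholder; adjust match pattern and port to your setup):

```
# Forwarder side: send everything to a single aggregator instead of
# letting every Fluentd pod talk to Elasticsearch directly.
<match **>
  @type forward
  <server>
    host aggregator.example.com   # placeholder aggregator host
    port 24224
  </server>
</match>
```

Only the aggregator would then run the elasticsearch output plugin, so only one writer ever tries to create the data stream.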

applike-ss commented 1 year ago

While this may indeed ensure the problem does not happen, it cannot really be wrong to run multiple Fluentd instances. This seems to happen because we run the same application on multiple nodes, which results in the same data stream being used. These applications send their logs via Fluent Bit to our Fluentd instances, and when they resolve the DNS name to different IPs, this is the result. IMHO this is a bug in the plugin, not in our setup.

cosmo0920 commented 1 year ago

Again, this cannot be prevented by Fluentd plugins.

Fluentd does not know the state of Elasticsearch indices until it sends an HTTP request and receives the response from the Elasticsearch cluster. These requests are not handled synchronously, so there is always the possibility of a race condition between checking whether an index exists and creating it.
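The check-then-create race described here can be sketched as a toy model (no Elasticsearch involved, all names illustrative): two clients both check first, both see the data stream as missing, and the second create collides no matter how careful each client's own check was.

```ruby
# Toy model of the check-then-create race: two clients interleave their
# steps, so the second create fails even though its own check passed.
streams = {}

# Hypothetical create call: succeeds only if the name is not taken yet.
def create(streams, name)
  return "resource_already_exists_exception" if streams.key?(name)
  streams[name] = true
  "created"
end

# Step 1: both clients check before either one creates.
client_a_saw_missing = !streams.key?("my-data-stream")
client_b_saw_missing = !streams.key?("my-data-stream")

# Step 2: client A creates first.
create(streams, "my-data-stream") if client_a_saw_missing

# Step 3: client B's create now collides, despite its earlier check.
result = client_b_saw_missing ? create(streams, "my-data-stream") : "skipped"
puts result # → resource_already_exists_exception
```

This is why a pre-flight existence check alone cannot fix the issue; the 400 response itself has to be interpreted.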

applike-ss commented 1 year ago

The point I am trying to make is: we shouldn't log an error for a resource that already exists. We also shouldn't return an error to our caller when the resource already exists, which is what the exception is telling us even though it comes as a 400 (ES does not expect us to try to create the resource multiple times).

I don't see why you would be against implementing a check for "this error says the resource already exists, so let's assume our create call was successful".

cosmo0920 commented 1 year ago

I don't see why you would be against implementing a check for "this error says the resource already exists, so let's assume our create call was successful".

I'm not very interested in implementing a fix for this corner case myself, but we welcome a patch that fixes it.