Messages dropped on 400 from elastic

pstray commented 2 years ago

[x] read the contribution guideline

Problem

When Elastic returns a 400 status, messages are dropped instead of retried.

Steps to replicate

Any config that tries to deliver to elastic with buffering.

Expected Behavior or What you need to ask

I would expect messages to be retried, as many 400 responses from elastic are temporary.

Using Fluentd and ES plugin versions

OS version: Debian 10
Bare Metal
Fluentd: td-agent 4.2.0 fluentd 1.13.3 (12de3b5a260a174fe4a419036d6e2b2e18fe7497)
ES plugin (the one bundled with td-agent from http://packages.treasuredata.com/4/debian/buster/)
- elasticsearch (7.13.3)
- elasticsearch-api (7.13.3)
- elasticsearch-transport (7.13.3)
- fluent-plugin-elasticsearch (5.0.5)

cosmo0920 commented 2 years ago

The ES plugin cannot recover from abnormal statuses such as HTTP 400.

https://github.com/uken/fluent-plugin-elasticsearch#with_transporter_log may help.

pstray commented 2 years ago

I know 400 is defined as 'Client Error', and as such fluentd should just give up... but it seems Elastic returns 400 for some errors that are clearly not client errors (or maybe they are, but if so, they are triggered by fluentd not handling an earlier server error response).

Anyway... In my case, fluentd tried to deliver messages to an Elastic cluster that couldn't create new indexes, and just threw away these messages because of the resulting 400 from Elastic when submitting them. I don't know exactly how this submission flow works, but if fluentd just assumes that the creation of new indexes works and then tries to deliver to the index it assumes has been created, I understand why Elastic returns a 400. In that case, it seems fluentd needs some handling for deliveries to nonexistent indexes.

brianjsw commented 6 months ago

We just lost a day's worth of messages because our Elastic cluster exceeded the maximum shard limit. We fixed that and assumed the messages would be re-delivered, but it seems we were bitten by the same issue. A 400 is not always a permanent problem with the data provided by the client. It can be a transient server-side failure too.
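
To illustrate the distinction, here is a minimal sketch in Python (not the plugin's Ruby, and not how the plugin currently behaves). The function name `should_retry` and the set `PERMANENT_ERROR_TYPES` are hypothetical, and the error type strings are assumptions that should be checked against real Elasticsearch responses. The idea is only that a 400 would be dropped when the error type points at bad data, and retried otherwise.

```python
from typing import Optional

# Hypothetical list of error types that indicate bad client data.
# These strings are assumptions for illustration, not a verified catalogue.
PERMANENT_ERROR_TYPES = {"mapper_parsing_exception"}


def should_retry(status: int, error_type: Optional[str] = None) -> bool:
    """Decide whether a failed delivery should stay in the buffer for retry."""
    if 200 <= status < 300:
        return False  # delivered, nothing to retry
    if status == 429 or status >= 500:
        return True  # overload or server-side failure
    if status == 400:
        # Retry unless the error type clearly blames the submitted data.
        return error_type not in PERMANENT_ERROR_TYPES
    return False  # other 4xx responses are treated as permanent here


# A shard-limit style failure would be retried, a mapping failure would not.
print(should_retry(400, "validation_exception"))
print(should_retry(400, "mapper_parsing_exception"))
```

Defaulting to a retry for unknown 400 error types trades some wasted retries for not losing data, which matches the failure described above.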

uken / fluent-plugin-elasticsearch