Addition of a resend_attempts counter to prevent output worker blocking.

smford commented 8 years ago

Preventing output blocking when the graphite server is offline by introducing a resend_attempts counter.

TLDR: This patch adds a resend_attempts option that ensures that metrics are discarded after a 3 failed attempts, or by whatever number is defined by the user in the logstash configuration.

There is is a problem with the previous logstash-output-graphite plugin that if the resend_on_failure was enabled and the graphite server went offline, logstash would infinitely attempt to resend metrics and end up blocking logstashs single output worker entirely (meaning logstash processed logs could not reach elasticsearch). Given that in some business environments graphites metrics are not considered vital, the option for discarding metrics after a few attempts is often considered acceptable, compared to lost or delayed logstash processed logs.

This could even result in logstash crashing as both logs and metrics get held in memory awaiting to be outputted by the now blocked single output worker. This issue would cause back pressure to the logstash-forwarder clients sending logs to logstash. This problem could compound with logstashs own logs becoming filled with millions of back pressure warnings until all local file storage is consumed.

I would also be argue that since the logstash metrics are not time stamped when sent to graphite, the discarding of stale metrics is far more favourable than having a large backlog of stale metrics being dumped in to graphite in a small window of time once the graphite server does come back online. Because graphite has a metric storage resolution/retention policy, a large surge/backlog of the same metric arriving from logstash within a window smaller than the resolution/retention policy, will simply mean the metrics get discarded anyway.

We have tested this extensively on logstash v1.5.4-1 and found logstash to be considerably more stable.

The resend_attempts option works well with a reduced reconnect_interval setting too, for example: reconnect_interval => 1 resend_attempts => 3 resend_on_failure => true

Note: I will create another PR in the future to deal with the other logstash-output-graphite problem where logstash cannot start without being able to connect to a working graphite server.

jordansissel commented 8 years ago

I haven't reviewed the code yet partly because there's no CLA signed. Can you do https://github.com/elastic/logstash/blob/master/CONTRIBUTING.md#contribution-steps ?

I'll do a review after CLA so I can look at the code :)

One comment on your description:

I would also be argue that since the logstash metrics are not time stamped when sent to graphite

I dont' think this is true. Logstash does use the timestamp from the event when sending it to graphite:

The text sent over tcp includes a timestamp: https://github.com/logstash-plugins/logstash-output-graphite/blob/master/lib/logstash/outputs/graphite.rb#L187
And that timestamp is unix epoch in seconds: https://github.com/logstash-plugins/logstash-output-graphite/blob/master/lib/logstash/outputs/graphite.rb#L173

smford commented 8 years ago

Hello Jordan,

Thanks for the feedback.

Concerning the CLA: I am getting my work to sign the CLA, we'll likely we raising some more contributions by a few other people. It'll take a couple days or so to organise.

Graphite timestamps: Thanks for that clarification.

elasticsearch-release commented 8 years ago

Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run; then say 'jenkins, test it'.

RickardCardell commented 8 years ago

Bump. How is it going with this PR?

smford commented 8 years ago

Just noticed this, I have signed the CLA so should move forward.

smford commented 8 years ago

jenkins, test it

TommySedin commented 7 years ago

Any news on this? We're having a very similar problem..

smford commented 7 years ago

jenkins, test it.

@jordansissel a few people seem interested in this - the CLAs been signed since July 2016 - can this move forward?

karmi commented 7 years ago

Hi @smford, we have found your signature in our records, but it seems like you have signed with a different e-mail than the one used in yout Git commit. Can you please add both of these e-mails into your Github profile (they can be hidden), so we can match your e-mails to your Github profile?

smford commented 7 years ago

Signed this CLA 3 times! Go go go.

smford commented 7 years ago

@karmi Hello Karmi, I have signed the CLA a number of times, and based upon your last comment I signed again with the email address that matches this git accounts commit. How do we progress this?

TommySedin commented 7 years ago

@smford I'm not sure, but I think you can just do another "jenkins, test it".

smford commented 7 years ago

jenkins, test it

jordansissel commented 7 years ago

@smford I do not find your email in the CLA database. I have found some smford email address, but it is not the one you used in Git.

To find the one you used in git, see git log, or more directly for this patch, run this command:

% curl -s https://patch-diff.githubusercontent.com/raw/logstash-plugins/logstash-output-graphite/pull/21.patch | grep '^From:'

The above shows two lines with the same email (two commits, so far so good). When I search our CLA database for this email, it is not found.

smford commented 6 years ago

Hello, I have signed the CLA again. It should be go go go now.

I signed it with the email address from curl -s https://patch-diff.githubusercontent.com/raw/logstash-plugins/logstash-output-graphite/pull/21.patch | grep '^From:'

synFK commented 6 years ago

Any news here? Why can't this PR be merged?

logstash-plugins / logstash-output-graphite

Addition of a resend_attempts counter to prevent output worker blocking. #21