logstash-plugins / logstash-output-http

Apache License 2.0
35 stars 82 forks source link

Fix retry indefinitely in termination process #129

Closed kaisecheng closed 2 years ago

kaisecheng commented 2 years ago

The plugin is blocking the shutdown process when url point to an invalid port, which is a retryable error that retry indefinitely

This commit adds a checking that checks pipeline shutdown state to quit retry loop This feature requires logstash-core that exposes shutdown_requested https://github.com/elastic/logstash/pull/13811

Fixed: #123

kaisecheng commented 2 years ago

@yaauie thanks for the review. Moving the status check to output base class make a lot of sense, especially involving many nil checking.

We should also discuss whether 2 attempts during a shutdown is sufficient for this particular plugin, and whether that should be configurable.

I try to find a balance between a reasonable shutdown time and the number of retries. The first try is 0, so "2" actually means a total of 3 attempts, which take around 15 seconds in the connection refuse case. I take k8s terminationGracePeriodSeconds as a reference, which kill the pod in 30 seconds (default value) if it can't finish the shutdown process. Changing the attempt to "3" would be pretty close to 30 seconds.

Regarding making it configurable, user who has enabled PQ would have less concern as the events will be retired in the next start, so memory queue user may want to have config. This is a config limit to retryable error in the shutdown scenario. One concern is automatic_retries could confuse with shutdown_retries. We need a better doc to explain the difference and this particular scenario. It is a nice to have feature for memory queue user.

yaauie commented 2 years ago

@kaisecheng with the current implementation, it is going to be very difficult to ensure that we bail before some arbitrary cutoff, especially when using the default non-batched behaviour and the cleverly randomized exponential backoff. A batch with the default 125 events, for example, can take 125*timeout to burn through its first attempt of each event, and an output that is already in a retry loop before the shutdown is requested can block on Queue#pop for up to 60s before we even get an event with which to run this new logic.

I think that this PR prevents us from blocking indefinitely, which is a good starting point and worth merging. I don't see a trivial way to change the existing implementation into one that is easily interruptible for shutdowns (e.g., finish or fail within X seconds of a shutdown being requested), without re-implementing a threadsafe queue whose elements retain a quarantine timestamp and whose pop method had new behaviour that is less blocking. But alas, that is much larger in scope and would certainly add risk.

kaisecheng commented 2 years ago

Indeed, the duration can be affected by too many factors, the type of failure, timeout setting, size of events... and this PR is not aiming to define a shutdown process in X seconds. Mostly I am thinking of the connection refuse case when the config point to a wrong URL, how many times it should retry, how many times it has retired before a complete shutdown, and how long does the shutdown take so that user considers it is a "bug" because of too long. I think trying for two times is too short which could finish in 2 seconds, so three attempts seem to be a compromise.