influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf

outputs(wavefront): inline retry logic may log `"Non-retryable error during Wavefront.Write: <nil>"` #14055

Open LukeWinikates opened 1 year ago

LukeWinikates commented 1 year ago

I noticed log lines like `Non-retryable error during Wavefront.Write: <nil>` in Telegraf v1.28.2.

There's a code path in `(*Wavefront).Write` where `err` can be reassigned to `nil` on line 154 if a second call to `w.sender.SendMetric` succeeds. If the second attempt does not return an error, we don't want to log anything about an error.

https://github.com/influxdata/telegraf/blame/f4c56e1597956cd95eb37336e0c08ae361338aba/plugins/outputs/wavefront/wavefront.go#L146
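
Here's a minimal, standalone Go sketch of the failure mode being described; `sendMetric`, `isRetryable`, and `retryableErr` are stand-ins, not the plugin's or SDK's real identifiers. The first send fails with a retryable error, the retry succeeds and resets `err` to `nil`, and the later log statement then prints `<nil>`:

```go
package main

import "fmt"

// retryableErr marks an error as retryable for this sketch.
type retryableErr struct{ error }

func isRetryable(err error) bool {
	_, ok := err.(retryableErr)
	return ok
}

// sendMetric fails with a retryable error while *failuresLeft > 0,
// standing in for w.sender.SendMetric.
func sendMetric(failuresLeft *int) error {
	if *failuresLeft > 0 {
		*failuresLeft--
		return retryableErr{fmt.Errorf("proxy buffer full")}
	}
	return nil
}

func main() {
	failuresLeft := 1 // the first send fails, the retry succeeds

	err := sendMetric(&failuresLeft)
	if err != nil && isRetryable(err) {
		// Retry after a (simulated) flush; a successful retry resets err to nil.
		err = sendMetric(&failuresLeft)
	}
	if !isRetryable(err) {
		// err is nil here, so this prints the confusing "<nil>" message.
		fmt.Printf("Non-retryable error during Wavefront.Write: %v\n", err)
	}
}
```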

The retry/flush logic has gotten complicated in this method. I'm a little uncertain about the flushing and returning on line 150, and about line 157 returning an error mid-batch.

I think the mid-batch error causes Telegraf to retry the entire batch on the next `Write`, and I think that may mean successfully sent metrics, unsendable metrics, and unsent metrics all get retried.

powersj commented 1 year ago

Hey @LukeWinikates,

Looking through this code, I can see why you have a few questions, as I am not entirely clear looking back at this either :)

To start, my suggestion would be to simplify this logic a bit: test whether the error is non-retryable, and if it is, print the existing error and debug info and `continue`. Then we can handle the retryable errors outside the if statement and avoid the extra messages.
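
A rough sketch of what that shape could look like, purely illustrative: the `sender` and `logger` interfaces, the `point` struct, and the `isRetryable` helper here are made-up stand-ins based on this discussion, not the plugin's actual types.

```go
package wavefrontsketch

import "fmt"

// Stand-in types for this sketch; the real plugin uses the Wavefront
// SDK's sender and Telegraf's logger instead.
type point struct {
	Metric    string
	Value     float64
	Timestamp int64
	Source    string
	Tags      map[string]string
}

type sender interface {
	SendMetric(name string, value float64, ts int64, source string, tags map[string]string) error
}

type logger interface {
	Errorf(format string, args ...interface{})
	Debugf(format string, args ...interface{})
}

type wavefront struct {
	sender sender
	log    logger
}

// isRetryable is a placeholder for however the plugin classifies errors.
func isRetryable(err error) bool {
	return false
}

// write shows the suggested shape: non-retryable errors are logged once
// and skipped with continue; retryable errors are returned so Telegraf
// keeps the batch and calls Write again later.
func (w *wavefront) write(points []point) error {
	for _, p := range points {
		err := w.sender.SendMetric(p.Metric, p.Value, p.Timestamp, p.Source, p.Tags)
		if err == nil {
			continue
		}
		if !isRetryable(err) {
			w.log.Errorf("non-retryable error during Wavefront.Write: %v", err)
			w.log.Debugf("metric that could not be sent: %+v", p)
			continue
		}
		return fmt.Errorf("wavefront sending error: %w", err)
	}
	return nil
}
```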

That should resolve the title of this issue at least, and then you can continue looking at the retryable error handling?

Thoughts?

LukeWinikates commented 1 year ago

Hi @powersj !

Yes, I think simplifying the overall logic is the way to go. I think the root of the problem is that the Wavefront SDK's interface was originally designed to transparently batch and flush metrics asynchronously, which is really not what Telegraf wants output plugins to do. After looking at this code a lot this week, I've had a couple of thoughts:

In a perfect world, the `immediate_flush` option should be the default - forget the SDK's periodic flushing, and just flush every batch synchronously, every time.

I don't think the `immediate_flush` option is very discoverable, but I wonder whether a lot of the tweaks to the retrying logic would even be needed if folks were using `immediate_flush` instead. I'm starting to think that all the special cases for retrying when the buffer is full are converging toward 'flush every batch, every time' - the same goal as the `immediate_flush` option.

I've been thinking about what would be the safest, most considerate way to change the default behavior to `immediate_flush = true`.
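
In the meantime, the behavior can already be opted into per output. A minimal config sketch, where the `url` value and everything other than `immediate_flush` are placeholders rather than settings discussed in this thread:

```toml
[[outputs.wavefront]]
  ## Placeholder proxy address; adjust for your environment.
  url = "http://wavefront-proxy:2878"

  ## Flush every batch synchronously instead of relying on the SDK's
  ## internal buffering and periodic flush.
  immediate_flush = true
```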

Maybe the output plugin shouldn't be analyzing the errors for retryability and doing nested retries.

In a perfect world, if you're using the SDK's buffering, you tune the flush interval and the batch size so that it doesn't drop metrics under normal circumstances, and then if backpressure from the proxy causes the SDK to drop metrics, you accept that they're dropped.

The problem with that stance is that it's pretty hard to reason about the interactions of the buffering and retrying that both Telegraf and the Wavefront SDK are doing, and their potentially different flush intervals and batch sizes. If they thought about it for a while, I don't think anybody would actually come to the conclusion that they want two layers of buffering.
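
For reference, the Telegraf-side layer is controlled by the agent's own batching settings (the values below are just illustrative defaults, not recommendations from this thread), and the Wavefront SDK then buffers and flushes again on top of that:

```toml
[agent]
  ## Telegraf's own batching layer: metrics are flushed to each output
  ## in batches of up to this size, on this interval.
  metric_batch_size = 1000
  flush_interval = "10s"
```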

I'm not exactly sure what the refactoring PR will look like, but these are my thoughts right now. I'm hoping to put together the PR sometime next week.

powersj commented 1 year ago

Thanks for taking a look at this and sharing your thoughts!

> I've been thinking about what would be the safest, most considerate way to change the default behavior to `immediate_flush = true`.

While we certainly do not like to break things or change defaults, if the behavior clearly should work a specific way and/or we can ensure that we are not going to break existing users, we could make the change in a new minor release.

> I don't think anybody would actually come to the conclusion that they want two layers of buffering.

Agreed

Look forward to a PR!