wavefrontHQ / wavefront-sdk-go

Wavefront Core Go SDK
Apache License 2.0
4 stars 21 forks source link

SDK to provide canonical retriable given error #60

Open agriffin208 opened 4 years ago

agriffin208 commented 4 years ago

See https://github.com/influxdata/telegraf/pull/8404

The wavefront output plugin for telegraf was continuously treating all errors as retryable.

Recently added in the PR mentioned above was functionality for the wavefront output to not retry for invalid (null) metric names.

However, seeking the wavefront-sdk-go to provide canonical set of which errors are retryable and which are not.

LukeWinikates commented 1 year ago

I'd like to close this issue out in the near future.

Questions on my mind:

  1. Because the wavefront sender buffers errors that it thinks are retryable (always?), maybe the SDK only bubbles up non-retryable errors at this point. Put another way, I think it would be worthwhile for us to take a deeper look at the different error cases from SendMetric and see what we conclusions we can draw about how this should work.

I guess it was just the one question 🤣

LukeWinikates commented 1 year ago

Quick observations:

SendMetric can return an error if:

Case 1: empty metric names

✅ We know that @agriffin208's PR to telegraf is designed to exclude the first case

Case 2: Buffer errors

With the second case, we'll have to dig deeper.

My intuition is that if the WF buffer is full, then telegraf starts filling up its buffer instead. I don't think the SDK expects a second layer of buffering to happen above it, but since Telegraf is doing this, that seems fine - we could pull @agriffin208's solution into our package and try to give it a reasonable name.

We may not ever add additional error scenarios to SendMetric, but we should have an implementation that won't lead to odd behavior in telegraf if we do. Maybe this means that our logic should make buffering errors retryable, but future errors will be non-retryable by default.

Case 3: Flush errors

note: http and auth errors happen in Flush, and in Telegraf usage:

3a: partially flushed batches

It looks to me like in these cases, Telegraf might re-send in full a partially-sent batch:

3b: non-retryable auth errors treated as retryable

Some of these are non-retryable auth errors. If Telegraf re-attempts these, that could lead to an infinite retry loop.

3c: potentially retriable, non-auth flush errors from *RealLineHandler.report - but which the SDK is already retrying

Flush errors can also happen for network timeouts. I think this leads to double-buffering in Telegraf. report will make a best-effort attempt to re-queue the metric, but will move on if the buffer is full. However, the returned error will probably make Telegraf retry the entire batch, resulting in double-reporting of metrics.

LukeWinikates commented 1 year ago

@astp-okta I saw that you made some changes in this area of the Telegraf Wavefront output plugin recently. No pressure, but if you have thoughts on the above, I'd love to hear your perspective.