SDK to provide canonical retriable given error

agriffin208 commented 4 years ago

See https://github.com/influxdata/telegraf/pull/8404

The wavefront output plugin for telegraf was continuously treating all errors as retryable.

Recently added in the PR mentioned above was functionality for the wavefront output to not retry for invalid (null) metric names.

However, seeking the wavefront-sdk-go to provide canonical set of which errors are retryable and which are not.

LukeWinikates commented 1 year ago

I'd like to close this issue out in the near future.

Questions on my mind:

Because the wavefront sender buffers errors that it thinks are retryable (always?), maybe the SDK only bubbles up non-retryable errors at this point. Put another way, I think it would be worthwhile for us to take a deeper look at the different error cases from SendMetric and see what we conclusions we can draw about how this should work.

I guess it was just the one question 🤣

LukeWinikates commented 1 year ago

Quick observations:

SendMetric can return an error if:

the metric name is empty (via metric.Line)
the channel buffer is full (via trySendWith and *RealLineHandler.HandleLine)

Case 1: empty metric names

✅ We know that @agriffin208's PR to telegraf is designed to exclude the first case

Case 2: Buffer errors

With the second case, we'll have to dig deeper.

My intuition is that if the WF buffer is full, then telegraf starts filling up its buffer instead. I don't think the SDK expects a second layer of buffering to happen above it, but since Telegraf is doing this, that seems fine - we could pull @agriffin208's solution into our package and try to give it a reasonable name.

We may not ever add additional error scenarios to SendMetric, but we should have an implementation that won't lead to odd behavior in telegraf if we do. Maybe this means that our logic should make buffering errors retryable, but future errors will be non-retryable by default.

Case 3: Flush errors

note: http and auth errors happen in Flush, and in Telegraf usage:

Flush by default happens quietly in the background
but some proactive flushing can happen in Write, to try to unblock the SDK's buffer
and a blocking flush call happens if ImmediateFlush = true

3a: partially flushed batches

It looks to me like in these cases, Telegraf might re-send in full a partially-sent batch:

Write is called with a batch of size N
Write loops over batch, calling SendMetric, and fails on metric A
if the error is retryable (a buffering error) , Write calls sender.Flush
if Flush succeeds, Write continues with the rest of the batch.
but if Flush returns an error, Telegraf retries the entire batch - which may be large, and which has probably been partially sent already

3b: non-retryable auth errors treated as retryable

Some of these are non-retryable auth errors. If Telegraf re-attempts these, that could lead to an infinite retry loop.

3c: potentially retriable, non-auth flush errors from `*RealLineHandler.report` - but which the SDK is already retrying

Flush errors can also happen for network timeouts. I think this leads to double-buffering in Telegraf. report will make a best-effort attempt to re-queue the metric, but will move on if the buffer is full. However, the returned error will probably make Telegraf retry the entire batch, resulting in double-reporting of metrics.

LukeWinikates commented 1 year ago

@astp-okta I saw that you made some changes in this area of the Telegraf Wavefront output plugin recently. No pressure, but if you have thoughts on the above, I'd love to hear your perspective.

wavefrontHQ / wavefront-sdk-go