Open agriffin208 opened 4 years ago
I'd like to close this issue out in the near future.
Questions on my mind:
- `SendMetric` and see what conclusions we can draw about how this should work.

I guess it was just the one question 🤣
Quick observations:

`SendMetric` can return an error if:
- the metric is invalid (`metric.Line`)
- the buffer is full (`trySendWith` and `*RealLineHandler.HandleLine`)

✅ We know that @agriffin208's PR to telegraf is designed to exclude the first case.
With the second case, we'll have to dig deeper.
My intuition is that if the WF buffer is full, then telegraf starts filling up its buffer instead. I don't think the SDK expects a second layer of buffering to happen above it, but since Telegraf is doing this, that seems fine - we could pull @agriffin208's solution into our package and try to give it a reasonable name.
We may not ever add additional error scenarios to `SendMetric`, but we should have an implementation that won't lead to odd behavior in telegraf if we do. Maybe this means that our logic should make buffering errors retryable, but future errors will be non-retryable by default.
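To make that concrete, here is a rough sketch of the "retryable only if explicitly marked" idea. `RetryableError`, `IsRetryable` and `errBufferFull` are hypothetical names for illustration, not existing wavefront-sdk-go API. The point is just that anything unmarked, including errors we add later, falls through to non-retryable.

```go
package main

import (
	"errors"
	"fmt"
)

// RetryableError is a hypothetical marker type (not part of wavefront-sdk-go).
// Only errors explicitly wrapped in it are treated as retryable.
type RetryableError struct {
	Err error
}

func (e *RetryableError) Error() string { return e.Err.Error() }
func (e *RetryableError) Unwrap() error { return e.Err }

// errBufferFull stands in for the SDK's buffer-full error; the real error
// value and message may differ.
var errBufferFull = errors.New("buffer full")

// IsRetryable reports whether err, or any error it wraps, was marked as
// retryable. Unmarked errors default to non-retryable.
func IsRetryable(err error) bool {
	var r *RetryableError
	return errors.As(err, &r)
}

func main() {
	buffering := &RetryableError{Err: errBufferFull}
	invalid := errors.New("empty metric name")

	fmt.Println(IsRetryable(buffering))                         // true
	fmt.Println(IsRetryable(fmt.Errorf("send: %w", buffering))) // true
	fmt.Println(IsRetryable(invalid))                           // false
}
```

A caller like the Telegraf plugin could then branch on `IsRetryable(err)` instead of inspecting error strings.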
Note: HTTP and auth errors happen in `Flush`, and in Telegraf usage `Flush` is called:
- in `Write`, to try to unblock the SDK's buffer
- when `ImmediateFlush = true`

It looks to me like in these cases, Telegraf might re-send in full a partially-sent batch:
1. `Write` is called with a batch of size N
2. `Write` loops over the batch, calling `SendMetric`, and fails on metric A
3. `Write` calls `sender.Flush`
   - `Flush` succeeds: `Write` continues with the rest of the batch.
   - `Flush` returns an error: Telegraf retries the entire batch, which may be large, and which has probably been partially sent already.
     - Some of these are non-retryable auth errors. If Telegraf re-attempts these, that could lead to an infinite retry loop.
     - Others come from `*RealLineHandler.report`, but the SDK is already retrying those.

`Flush` errors can also happen for network timeouts.
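Here is a toy model of the partial re-send, in case it helps the discussion. It is not the real Telegraf or SDK code: `fakeSender`, `write` and `simulate` are made-up names, the sender is reduced to a bounded counter, and I'm assuming the SDK drains its buffer on its own before Telegraf retries the batch.

```go
package main

import (
	"errors"
	"fmt"
)

// Placeholder errors for this model only.
var (
	errTimeout    = errors.New("network timeout")
	errBufferFull = errors.New("buffer full")
)

// fakeSender is a toy stand-in for the SDK sender: a bounded buffer plus a
// count of how many times each metric name was accepted.
type fakeSender struct {
	capacity int
	buffered int
	flushErr error
	accepted map[string]int
}

func (s *fakeSender) SendMetric(name string) error {
	if s.buffered >= s.capacity {
		return errBufferFull
	}
	s.buffered++
	s.accepted[name]++
	return nil
}

func (s *fakeSender) Flush() error {
	if s.flushErr != nil {
		return s.flushErr
	}
	s.buffered = 0
	return nil
}

// write follows the steps described above. Returning an error means the
// caller retries the whole batch.
func write(s *fakeSender, batch []string) error {
	for _, name := range batch {
		if err := s.SendMetric(name); err != nil {
			if flushErr := s.Flush(); flushErr != nil {
				return flushErr
			}
			// Flush succeeded: continue with the rest of the batch.
		}
	}
	return nil
}

// simulate runs one failed write followed by one full-batch retry.
func simulate() map[string]int {
	s := &fakeSender{capacity: 4, buffered: 2, flushErr: errTimeout, accepted: map[string]int{}}
	batch := []string{"a", "b", "c", "d"}

	if err := write(s, batch); err != nil {
		// Assumption: the SDK's own retry drains the buffer before the caller retries.
		s.buffered, s.flushErr = 0, nil
		_ = write(s, batch)
	}
	return s.accepted
}

func main() {
	fmt.Println(simulate()) // map[a:2 b:2 c:1 d:1]
}
```

In this model `a` and `b` were already accepted before the first failure, so the full-batch retry hands them to the sender a second time.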
I think this leads to double-buffering in Telegraf. `report` will make a best-effort attempt to re-queue the metric, but will move on if the buffer is full. However, the returned error will probably make Telegraf retry the entire batch, resulting in double-reporting of metrics.
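For readers less familiar with the pattern, a best-effort re-queue usually looks something like the sketch below. This is a generic illustration using a buffered channel, with a hypothetical `requeue` helper; the SDK's actual buffer handling may be structured differently.

```go
package main

import "fmt"

// requeue makes a best-effort attempt to put line back on a bounded buffer.
// It never blocks: if the buffer is full, the line is dropped and false is
// returned so the caller can move on.
func requeue(buffer chan string, line string) bool {
	select {
	case buffer <- line:
		return true
	default:
		return false
	}
}

func main() {
	buffer := make(chan string, 1)
	fmt.Println(requeue(buffer, "metric.a 1")) // true
	fmt.Println(requeue(buffer, "metric.b 2")) // false, buffer is already full
}
```

If the caller above this layer also retries on the returned error, the same line can end up queued in both places, which is the double-reporting concern.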
@astp-okta I saw that you made some changes in this area of the Telegraf Wavefront output plugin recently. No pressure, but if you have thoughts on the above, I'd love to hear your perspective.
See https://github.com/influxdata/telegraf/pull/8404
The wavefront output plugin for telegraf was treating all errors as retryable.
The PR mentioned above recently added functionality so that the wavefront output does not retry on invalid (null) metric names.
However, we are looking for wavefront-sdk-go to provide a canonical set of which errors are retryable and which are not.