open-telemetry / opentelemetry-dotnet


OTLP Exporter Retry and Persist Feature #4115


mic-max commented 1 year ago

Feature Request

Is your feature request related to a problem?

When exporting data, a transient server issue that prevents the request from being processed correctly causes that data to be lost. Likewise, when a program shuts down, any data not yet exported before the process terminates is lost.

Describe the solution you'd like:

The export should be attempted again when the error is considered retryable. On program shutdown, data not yet exported should first be saved to disk and then exported, in case the transmission fails or does not have enough time to complete. Upon the next program execution, the telemetry saved to disk will be exported. This will reduce the amount of lost telemetry.

Additional Context

Add the ability for the OTLP exporters to retry exports that fail in a defined way. This includes retrying across program shutdowns by persisting the data to disk upon failure. This will help improve the reliability of OTel on the client's end.

Original GitHub Issue: https://github.com/open-telemetry/opentelemetry-dotnet/issues/1278

The first set of PRs will focus on a single, to-be-decided cell in the following matrix; follow-up PRs will enable the others, reusing as much code as is reasonable.

        Metrics   Logs   Traces
HTTP    x         x      x
gRPC    x         x      x

src/OpenTelemetry.Exporter.OpenTelemetryProtocol

PR Roadmap

  1. OTLP Exporter Options: Opt-in.
    1. FileBlobProvider:
      1. Path: the folder location defined in the "Storage folder" section of https://github.com/open-telemetry/opentelemetry-dotnet/issues/1278
      2. Max size (is new or old data dropped when the limit is reached?)
      3. Maintenance period (the 2-minute default is likely too long to release a lease)
      4. Retention period
      5. Write timeout
    2. RetryThreadPeriodMilliseconds (default of 1000? or combine into one field together with the maintenance period?)
  2. Write
    1. Each ExportClient will be modified.
    2. If a retryable error is returned:
      1. Log the error (info level).
      2. Create a blob: persistentBlobProvider.TryCreateBlob(data, RetryAfter.Milliseconds, out var blob);
    3. If a non-retryable error is returned:
      1. Drop the data.
      2. Log the error (warning level).
      3. Increment a dropped item counter.
  3. Retry
    1. Implemented as a new thread that wakes up every XX milliseconds and performs a retry on all blobs it can get a lease for (see the sketch after this roadmap).
    2. foreach (var blobItem in persistentBlobProvider.GetBlobs()) { ... }
    3. blob.TryLease(1000); blob.TryRead(out var data);
    4. If successfully exported, or the response says to no longer retry this blob:
      1. Delete the blob: blob.TryDelete();
      2. If the blob was dropped rather than exported, log the error (warning level).
      3. If dropped, increment a dropped item counter.
    5. If not successful, but another retryable error is returned:
      1. Log the error (info level).
  4. Cleanup
    1. This part may already be taken care of by the FileBlobProvider and the above Retry scenario.
  5. Shutdown
    1. Write data to disk, then attempt to send that data.
    2. Exit grace period for Console-like apps (point #3 in the original issue).
  6. Add Guards to FileBlobProvider interface. Ref
  7. Optimisations
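
To make the Write and Retry steps concrete, below is a minimal sketch built on the OpenTelemetry.PersistentStorage.FileSystem API quoted above. The ExportOutcome enum and tryExport delegate are hypothetical stand-ins for the real ExportClient plumbing, and the FileBlobProvider constructor arguments are assumed to follow the contrib package's (path, max size, maintenance period, retention period, write timeout) ordering:

```csharp
using System;
using System.Threading;
using OpenTelemetry.PersistentStorage.Abstractions;
using OpenTelemetry.PersistentStorage.FileSystem;

// Hypothetical stand-in for how the ExportClient classifies a response.
internal enum ExportOutcome { Success, RetryableError, NonRetryableError }

internal sealed class PersistentRetrySketch : IDisposable
{
    private readonly PersistentBlobProvider persistentBlobProvider = new FileBlobProvider(
        "/var/tmp/otlp",   // 1.1.1: path (storage folder); placeholder location
        10 * 1024 * 1024,  // 1.1.2: max size in bytes
        10_000,            // 1.1.3: maintenance period in milliseconds
        3_600_000,         // 1.1.4: retention period in milliseconds
        1_000);            // 1.1.5: write timeout in milliseconds
    private readonly Func<byte[], ExportOutcome> tryExport;
    private readonly Timer retryTimer;

    public PersistentRetrySketch(Func<byte[], ExportOutcome> tryExport, int retryThreadPeriodMilliseconds = 1000)
    {
        this.tryExport = tryExport;
        // Step 3.1: wake up every RetryThreadPeriodMilliseconds.
        this.retryTimer = new Timer(
            _ => this.RetryPendingBlobs(), null,
            retryThreadPeriodMilliseconds, retryThreadPeriodMilliseconds);
    }

    // Step 2.2: on a retryable export error, persist the serialized request.
    public void OnRetryableFailure(byte[] data)
        => this.persistentBlobProvider.TryCreateBlob(data, out _);

    // Step 3: retry every blob we can get a lease for.
    private void RetryPendingBlobs()
    {
        foreach (var blob in this.persistentBlobProvider.GetBlobs())
        {
            if (!blob.TryLease(1000) || !blob.TryRead(out var data))
            {
                continue; // lease held elsewhere, or the read failed; try next pass
            }

            switch (this.tryExport(data))
            {
                case ExportOutcome.Success:
                case ExportOutcome.NonRetryableError:
                    blob.TryDelete(); // step 3.4: exported, or told to stop retrying
                    break;
                case ExportOutcome.RetryableError:
                    break; // step 3.5: the lease expires and the blob is retried later
            }
        }
    }

    public void Dispose() => this.retryTimer.Dispose();
}
```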

Retryable Errors

Testing Strategy

Make use of the test/OpenTelemetry.Exporter.OpenTelemetryProtocol.Tests/MockCollectorIntegrationTests.cs class. Some examples can be seen in this closed PR; the changes made to that file in that PR should be reusable.

References

reyang commented 1 year ago

For retry, we need to consider the following cases:

  1. HTTP 503 Server Too Busy - we don't want to retry too frequently; probably exponential backoff, plus respecting the 'Retry-After' header if there is one (sketched below).
  2. When we retry, what's the ordering? E.g. for metrics it might make more sense to always prioritize the newest data; for logs, it might make sense to prioritize the older data.
  3. We might need to introduce some QoS metrics (e.g. SDK internal queue size, error count, retry count). These metrics might need to be treated as high priority, so that if the exporter is too busy to handle all the data, at least the QoS data is prioritized.
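
A minimal sketch of point 1, assuming a raw HttpResponseMessage and .NET 6+ (the helper name is hypothetical):

```csharp
using System;
using System.Net.Http;

internal static class RetryDelayHelper // hypothetical helper
{
    // Honors Retry-After when the server sends it; otherwise falls back to
    // exponential backoff with jitter (1s, 2s, 4s, ... plus up to 1s of noise).
    public static TimeSpan GetRetryDelay(HttpResponseMessage response, int attempt)
    {
        var retryAfter = response.Headers.RetryAfter;
        if (retryAfter?.Delta is TimeSpan delta)
        {
            return delta; // Retry-After given as a delay in seconds
        }

        if (retryAfter?.Date is DateTimeOffset date)
        {
            return date - DateTimeOffset.UtcNow; // Retry-After given as a date
        }

        var backoff = TimeSpan.FromSeconds(Math.Pow(2, attempt));
        var jitter = TimeSpan.FromMilliseconds(Random.Shared.Next(0, 1000));
        return backoff + jitter;
    }
}
```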
pjanotti commented 1 year ago

Confirming the end result here:

  1. OTLP Exporter Options: Opt-in.

All of the work planned for this issue stays as opt-in, correct?

vishweshbankwar commented 1 year ago

Confirming the end result here:

  1. OTLP Exporter Options: Opt-in.

All of the work planned for this issue stays as opt-in, correct?

That is correct. This will be an opt-in feature until we have a spec.

alanwest commented 1 year ago

This will be an opt-in feature until we have a spec.

Is there related spec work for these options in-progress already?

If a retryable error is returned: Log the error (info level). Create a blob.

A retry policy is useful independently of persistent storage. That is, the gRPC client (or Polly, if using HTTP) could be configured with a retry policy that handles transient network errors. This handling would be opaque to the OTLP exporter.

Do you plan to implement retry w/o also requiring the use of persistent storage?
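
For illustration, a channel-level retry policy of the kind described above might look like this with Grpc.Net.Client (the endpoint and backoff values are placeholder choices):

```csharp
using System;
using Grpc.Core;
using Grpc.Net.Client;
using Grpc.Net.Client.Configuration;

// Retry handled by the gRPC channel itself, opaque to the OTLP exporter.
var channel = GrpcChannel.ForAddress("http://localhost:4317", new GrpcChannelOptions
{
    ServiceConfig = new ServiceConfig
    {
        MethodConfigs =
        {
            new MethodConfig
            {
                Names = { MethodName.Default }, // apply to all methods on the channel
                RetryPolicy = new RetryPolicy
                {
                    MaxAttempts = 5,
                    InitialBackoff = TimeSpan.FromSeconds(1),
                    MaxBackoff = TimeSpan.FromSeconds(5),
                    BackoffMultiplier = 1.5,
                    RetryableStatusCodes = { StatusCode.Unavailable },
                },
            },
        },
    },
});
```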

cijothomas commented 1 year ago

Do you plan to implement retry w/o also requiring the use of persistent storage?

We should do this; it is required by the spec: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/exporter.md#retry

Persistent storage is an optional, opt-in feature.

CodeBlanch commented 1 year ago

Just wanted to share a thought. @alanwest kind of scratched at this on the SIG yesterday, I think. What does "persistent storage" mean? In the scope of this work, it seems to me we persist the data for retries. I'm working on an exporter for some client teams, some (maybe all) of which seem to want "persistent storage" of the always-and-up-front variety: the exporter just writes to disk, and then some other thread tries to ship off the disk data on its own cadence.

Only sharing so we can be clear what kind of "persistent storage" we aim to support in OTLP and make sure the documentation is also clear 😄
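
A minimal sketch of that always-and-up-front variant, assuming the PersistentBlobProvider abstraction and a hypothetical per-activity serializer; a separate thread would ship the stored blobs on its own cadence, as in the retry sketch earlier in this thread:

```csharp
using System;
using System.Diagnostics;
using OpenTelemetry;
using OpenTelemetry.PersistentStorage.Abstractions;

// Disk-first exporter: Export only persists; a separate thread ships the blobs.
internal sealed class DiskFirstActivityExporter : BaseExporter<Activity>
{
    private readonly PersistentBlobProvider provider;
    private readonly Func<Activity, byte[]> serialize; // hypothetical OTLP serializer

    public DiskFirstActivityExporter(PersistentBlobProvider provider, Func<Activity, byte[]> serialize)
    {
        this.provider = provider;
        this.serialize = serialize;
    }

    public override ExportResult Export(in Batch<Activity> batch)
    {
        foreach (var activity in batch)
        {
            if (!this.provider.TryCreateBlob(this.serialize(activity), out _))
            {
                return ExportResult.Failure; // e.g. disk full or write timeout
            }
        }

        return ExportResult.Success;
    }
}
```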

Kadajski commented 1 year ago

Are there any plans still in motion regarding some kind of persistence support? Once OpenTelemetry.PersistentStorage is released, is the intention to integrate it into the OTLP exporter or to create some persisted OTLP exporter?

https://opentelemetry.io/docs/specs/otel/library-guidelines/#protocol-exporters

if an application’s telemetry data must be delivered to a remote backend that has no guaranteed availability the end user may choose to use a persistent local queue and an Exporter to retry sending on failures.

This is more or less the scenario I would like to cater for, similar to what @CodeBlanch mentioned, and similar to how the collector supports it: https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/README.md#persistent-queue

Right now the issue that I experience is: if I create my own BatchExportProcessor using a persisted local queue and try to re-use the OtlpTraceExporter.Export function, I then need to serialize/deserialize System.Diagnostics.Activity to store it on some persistent queue somewhere, and serializing the Activity object does not really seem like a viable option. The only other option I am left with, unless I am missing something, is to implement the entire OTLP exporter myself, do the mapping to the proto types, and store the serialization of that format in my local file queue.

smoms commented 9 months ago

Am I correct that there is no retry by default? Would we need to inject a retry policy ourselves in the HttpClientFactory (if using HTTP), e.g. with Polly? If so, would we need to catch the error codes described in the spec? Thanks.

cijothomas commented 9 months ago

Am I correct that there is no retry by default? Would we need to inject a retry policy ourselves in the HttpClientFactory (if using HTTP), e.g. with Polly? If so, would we need to catch the error codes described in the spec? Thanks.

True. See https://github.com/open-telemetry/opentelemetry-dotnet/issues/1779. Some PRs are in flight now to make this happen automatically, so you don't have to deal with it manually.
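
Until then, a minimal sketch of the manual approach, assuming the Polly, Polly.Extensions.Http, and Microsoft.Extensions.Http.Polly packages (the HttpClientFactory option applies to the HTTP/protobuf protocol):

```csharp
using System;
using System.Net;
using System.Net.Http;
using Microsoft.Extensions.Http;
using OpenTelemetry.Exporter;
using Polly;
using Polly.Extensions.Http;

// Retryable OTLP/HTTP status codes per the spec: 429, 502, 503, 504.
var retryPolicy = HttpPolicyExtensions
    .HandleTransientHttpError() // 5xx and 408, which covers 502/503/504
    .OrResult(r => r.StatusCode == (HttpStatusCode)429) // Too Many Requests
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

var options = new OtlpExporterOptions
{
    // Supply an HttpClient whose handler applies the Polly policy.
    HttpClientFactory = () => new HttpClient(
        new PolicyHttpMessageHandler(retryPolicy)
        {
            InnerHandler = new HttpClientHandler(),
        }),
};
```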

smoms commented 9 months ago

@cijothomas thanks. However, I understand that the point here is to implement the HTTP retryable codes as defined by the OTel protocol spec. If that is the case, I may still need to cover other failure situations with a retry mechanism by myself, for example as specified here. Am I correct?

cijothomas commented 9 months ago

@cijothomas thanks. However, I understand that the point here is to implement the HTTP retryable codes as defined by the OTel protocol spec. If that is the case, I may still need to cover other server error codes with a retry mechanism by myself, for example as specified here. Am I correct?

Yes.