open-telemetry / opentelemetry-python

OpenTelemetry Python API and SDK
https://opentelemetry.io
Apache License 2.0

export and shutdown timeouts for all OTLP exporters #3764

Open Arnatious opened 8 months ago

Arnatious commented 8 months ago

Description

This is a solution to several issues related to the current synchronous OTLP exporters.

Currently, OTLP exporters have a couple of pain points

This PR implements a new utility class, opentelemetry.exporter.otlp.proto.common.RetryingExporter, that fixes the above issues. It also significantly refactors the existing OTLP exporters to use it, and extracts retry-related logic from their test suites.

The call signatures of public APIs are preserved where possible, though in several cases **kwargs was added for future-proofing, and positional arguments were renamed to create a consistent interface.

OTLP exporters will create a RetryingExporter, passing in a function performing a single export attempt as well as the OTLPExporter's timeout and export result type.

Example

from opentelemetry.exporter.otlp.proto.common import RetryingExporter, RetryableExportError

class OTLPSpanExporter(SpanExporter):
  def __init__(self, ...):
    self._exporter = RetryingExporter(self._export, SpanExportResult, self._timeout)

  def _export(self, timeout_s: float, serialized_data: bytes) -> SpanExportResult:
    result = ...

    if is_retryable(result):
      raise RetryableExportError(result.delay)
    return result

  def export(self, data, timeout_millis = 10_000, **kwargs) -> SpanExportResult:
    return self._exporter.export_with_retry(timeout_millis * 1e-3, data)

  def shutdown(self, timeout_millis = 10_000, **kwargs):
    ...
    self._exporter.shutdown(timeout_millis)
    self._shutdown = True
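
To make the division of labor above concrete, here is a minimal sketch of how such a helper could drive a single-attempt function. It is not the RetryingExporter from this PR: SketchRetryingExporter, SketchRetryableError, the string results, and the 10 ms starting delay are all hypothetical, chosen only to illustrate deadline-bounded retries that a shutdown can interrupt.

```python
import threading
import time
from typing import Any, Callable, Optional


class SketchRetryableError(Exception):
    """Hypothetical stand-in for RetryableExportError."""

    def __init__(self, delay_s: Optional[float] = None):
        super().__init__("retryable export failure")
        self.delay_s = delay_s


class SketchRetryingExporter:
    """Illustrative helper only; not the class implemented in the PR."""

    def __init__(self, export_fn: Callable[[float, bytes], Any],
                 failure: Any, timeout_s: float):
        self._export_fn = export_fn
        self._failure = failure
        self._timeout_s = timeout_s
        self._shutdown = threading.Event()

    def export_with_retry(self, timeout_s: float, data: bytes) -> Any:
        # The shorter of the per-call and configured timeouts wins.
        deadline = time.monotonic() + min(timeout_s, self._timeout_s)
        backoff_s = 0.01  # assumed starting delay, doubled after each failure
        while not self._shutdown.is_set():
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                # Each attempt only gets the time left before the deadline.
                return self._export_fn(remaining, data)
            except SketchRetryableError as err:
                wait_s = err.delay_s if err.delay_s is not None else backoff_s
                backoff_s *= 2
                wait_s = min(wait_s, max(deadline - time.monotonic(), 0.0))
                # Event.wait returns True if shutdown was requested meanwhile.
                if self._shutdown.wait(wait_s):
                    break
        return self._failure

    def shutdown(self) -> None:
        self._shutdown.set()


attempts = []


def flaky_export(timeout_s: float, data: bytes) -> str:
    """Fails twice with a retryable error, then succeeds."""
    attempts.append(timeout_s)
    if len(attempts) < 3:
        raise SketchRetryableError(delay_s=0.001)
    return "SUCCESS"


exporter = SketchRetryingExporter(flaky_export, "FAILURE", timeout_s=1.0)
result = exporter.export_with_retry(0.5, b"payload")

exporter.shutdown()
result_after_shutdown = exporter.export_with_retry(0.5, b"payload")
```

Waiting on a threading.Event instead of calling time.sleep is what lets shutdown cut a pending backoff short in this sketch.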

Fixes #3309

How Has This Been Tested?

Tests were added for the RetryingExporter in exporter/opentelemetry-exporter-otlp-proto-common/tests/test_retryable_exporter.py, as well as for the backoff generator in exporter/opentelemetry-exporter-otlp-proto-common/tests/test_backoff.py. Tests were updated throughout the HTTP and gRPC OTLP exporters, and retry-related logic was removed in all cases except gRPC metrics, which can be split into batches and therefore needed another layer of deadline checking.
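
The backoff generator itself is not shown in this description. As a rough illustration of the kind of behavior such tests would pin down, here is a sketch that assumes plain exponential doubling with a cap and no jitter. sketch_backoff and its parameters are hypothetical names, not the PR's API.

```python
import itertools
from typing import Iterator


def sketch_backoff(base_s: float = 1.0, factor: float = 2.0,
                   max_s: float = 64.0) -> Iterator[float]:
    """Yield delays that grow by `factor` until they reach `max_s`."""
    delay = base_s
    while True:
        yield min(delay, max_s)
        # Clamp the stored value too, so it cannot grow without bound.
        delay = min(delay * factor, max_s)


# First eight delays: doubling from 1 s, then held at the 64 s cap.
delays = list(itertools.islice(sketch_backoff(), 8))
```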

Arnatious commented 8 months ago

I based the behavior on what is described in https://github.com/open-telemetry/opentelemetry-python/issues/2663#issuecomment-1119218751: namely, the shortest timeout always wins.

Processor timeout logic is unaffected. If the processor has a shorter timeout, or is tracking a deadline for a batch, it passes that to export() and it is respected. If the processor's timeout is longer, the exporter's own timeout attribute (set at creation or from environment variables) can be hit first and cause the export to fail.

I chose to create a helper object rather than splice this into the inheritance hierarchy to avoid having a mixin with __init__: the exporter needs an object-scoped event and lock, and the gRPC exporters already have several mixins with __init__ that it would have to cooperate with. A unified rewrite of the inheritance hierarchy shared between the HTTP and gRPC exporters would probably be better.
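
The "shortest timeout wins" rule can be sketched as a small function. This is only an illustration under assumed names (effective_timeout_s, exporter_timeout_s, call_timeout_millis) and does not reproduce the PR's code.

```python
from typing import Optional


def effective_timeout_s(exporter_timeout_s: float,
                        call_timeout_millis: Optional[float] = None) -> float:
    """Return the time budget, in seconds, for one export() call."""
    if call_timeout_millis is None:
        # The processor passed no timeout: only the exporter's limit applies.
        return exporter_timeout_s
    return min(exporter_timeout_s, call_timeout_millis / 1000)


# A processor with a shorter timeout (2 s) is respected.
shorter = effective_timeout_s(10.0, 2_000)
# A processor with a longer timeout (30 s) is cut off at the exporter's 10 s.
longer = effective_timeout_s(10.0, 30_000)
```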

pmcollins commented 6 months ago

@Arnatious apologies for the delay and thanks for this PR -- improvements to the area you've addressed are super important.

However, perhaps this is too much of a good thing all at once. Do you have availability to break these changes down into smaller PRs? This would make things much easier on reviewers.