open-telemetry / opentelemetry-collector-contrib

New component: DNS Cache Extension for OpenTelemetry #32410

Open philchia opened 2 months ago

philchia commented 2 months ago

The purpose and use-cases of the new component

The DNS Cache Extension is designed to maintain the operability of OpenTelemetry exporters in the event of DNS resolution failures. By caching DNS resolutions, this extension ensures that telemetry data (traces, metrics, logs) can continue to be exported to configured endpoints even when DNS services are temporarily unavailable.

Example configuration for the component

refresh_interval:   "1m"
resolve_timeout:    "5s"
clear_unused:       false
persist_on_failure: true

Telemetry data types supported

Is this a vendor-specific component?

Code Owner(s)

philchia

Sponsor (optional)

philchia

Additional context

The proposed DNS Cache Extension improves resilience against DNS failures by adding a caching mechanism directly to the OpenTelemetry Collector's networking layer. Here is a summary of how it works:

Initialization:

Upon initialization, the extension creates a dnscache Resolver with a configurable timeout for resolving DNS queries. It also configures a custom http.Transport that replaces the standard http.DefaultTransport. This transport uses the resolver to look up and cache IP addresses for the hostnames it connects to.

Operation:

The custom transport first tries to resolve a hostname from the cache. If the cache has no entry for the hostname, or the cached lookup fails, it falls back to a live DNS lookup and updates the cache with the new result. As long as a hostname has been cached, its stored IP addresses can still be used to establish connections during a DNS outage, so telemetry data continues to be exported without interruption.

Refreshing the Cache:

The extension runs a background routine that periodically refreshes the DNS cache at a configurable interval (RefreshInterval). It uses ResolverRefreshOptions to optionally clear unused cache entries and to keep existing entries when a refresh lookup fails, which corresponds to the clear_unused and persist_on_failure settings in the example configuration.

Shutdown:

On shutdown, the extension restores the default network transport so that its changes do not persist beyond the lifetime of the extension, maintaining clean state management.

github-actions[bot] commented 3 weeks ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.