New component: Diagnostic Extension

MovieStoreGuy commented 4 weeks ago

The purpose and use-cases of the new component

To make it easier to understand the issues users are experiencing with the collector, it would help to have a standard means of gathering that information along with other useful telemetry (for example, but not limited to: sanitised configuration, component health, collector profiles, crash reports, installed components, version, distribution).

This would help investigations where issues are hard to reproduce or occur in environments that are not easy to replicate, and would ensure that anyone reporting an issue (either to a vendor or to the project) has all the details needed.

Example configuration for the component

I am not completely tied to this and I welcome feedback here.

extensions:
  diagnostics:
    track: [ runtime, capacity, errors ]
    history: <duration>
    interval: <duration>

I've intentionally opted not to include a reporting endpoint in my initial idea, to accommodate users who may not want to start off by sharing diagnostic data automatically. Instead, sharing can be part of a support/investigation workflow that can include extra context around the problems being observed.

Telemetry data types supported

N/A

Code Owner(s)

@moviestoreguy

Sponsor (optional)

No response

Additional context

There have been enough times when users reported issues where additional context, such as golang pprof, the resolved configuration, and the errors reported for that period, would have been of great benefit.

I also don't believe we should expect users to be knowledgeable about how to capture this data themselves, so providing a consistent way across the project and distributions makes for a repeatable, open process that everyone can adopt and contribute to.

MovieStoreGuy commented 4 weeks ago

For those who have had to investigate issues with the collector, I'd love your feedback on what helped you resolve those issues and what data you had available to help resolve them.

MovieStoreGuy commented 4 weeks ago

I also wanted to highlight that https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/16598 was previously suggested and could be included in this design.

djaglowski commented 4 weeks ago

I'm not strictly against the idea but before creating a novel observability mechanism, I'd rather see us double down on the observability principles we want others to rely on. Which of these needs can be solved by improving the telemetry which the collector emits about itself?

sanitised configuration (logs? assuming sanitization includes masking sensitive info)
collector profiles (profiles?)
crash reports (logs?)
capacity (metrics?)
errors (logs?)
installed components, version, distribution (logs?)

We should also consider the overlap with OpAMP. Configuration & details about the binary should be (and in many cases already are) available to the OpAMP extension as well, so if nothing else I hope we can find a common mechanism for gathering and passing these details along to extensions. Another way to lean on OpAMP would be to implement an OpAMP server that can run locally and just capture the information that the OpAMP extension already provides.

MovieStoreGuy commented 3 weeks ago

Yeah, I agree with you that there are areas that could be improved, but I also don't want to impose this on everyone: if you include a means to track diagnostics, it shouldn't come with the additional cost of monitoring the collector itself.

My ideal scenario is to keep this ad hoc (or on demand), with the collector having the data ready along with some configurable history. OpAMP feels like the correct place for this to exist, but I also want a means to query the collector directly (maybe with a subcommand such as otelcol diagnostics capture?).
