open-telemetry / opamp-spec

OpAMP Specification
Apache License 2.0
108 stars 34 forks source link

Decide if the protocol can be used to manage SDKs #16

Closed tigrannajaryan closed 2 years ago

tigrannajaryan commented 2 years ago

Can we use the same protocol to manage Otel SDKs?

Currently applicable parts:

tigrannajaryan commented 2 years ago

One thing that we will need to change if we want to manage SDK is the following current spec requirements:

The Resource in the reported telemetry SHOULD describe the Agent in the following way:

  • service.instance.id attribute SHOULD be set to Agent's instance_uid that is used in the OpAMP messages.
  • service.name attribute SHOULD be set to Agent's type, matching the value of agent_type field in AgentDescription message.
  • service.version SHOULD be set to Agent's version, matching the value of agent_version field in AgentDescription message.

These are written with the assumption that the Agent is a standalone executable and we can call it a "Service".

However this is not true for embedded agents, such as Otel SDK. Calling an Otel SDK a "Service" and requiring it to output its own metrics with "service.*" attributes would be very confusing. For an Otel SDK a more natural way would be to emit its own metrics for example with the following attributes:

To make such usage possible with OpAMP we need to modify OpAMP spec in the following way (this is just one possible way, there are probably others):

@open-telemetry/opamp-spec-approvers what do you think?

tigrannajaryan commented 2 years ago

@MrAlias @jmacd what do you think about the above comment? Would it make sense for Otel SDK to follow the conventions like that and use OpAMP to receive the config?

carlosalberto commented 2 years ago

@tigrannajaryan Think this should be brought to the maintainers call, just to get an initial vibe from the community? Or too early perhaps?

tigrannajaryan commented 2 years ago

@carlosalberto it is probably early for detailed discussion but it is not a bad idea to create more awareness at least. Let's mention this in the next maintainers call.

tigrannajaryan commented 2 years ago

@open-telemetry/opamp-go-approvers I need your help on this. Please comment and tell what you think.

I submitted a draft PR which shows the relevant changes to allow OpAMP to work with SDKs: https://github.com/open-telemetry/opamp-spec/pull/48

pmm-sumo commented 2 years ago

This is interesting idea. I am somewhat concerned that it will increase the complexity, but perhaps it's worth it. It would make management of common parameters (e.g. sampling rate, enabled/disabled instrumentations, etc.) much more convenient

tigrannajaryan commented 2 years ago

This is interesting idea. I am somewhat concerned that it will increase the complexity, but perhaps it's worth it.

To understand how much complicated this is I modified my private prototype implementation according to this change and found that it increases the implementation complexity only a little bit. I ended up changing less than a 100 lines of code combined in the client and server implementation (compared to the implementation that uses a fixed agenttype/agentversion pair of fields).

Aneurysm9 commented 2 years ago

I think we absolutely need a standard configuration format for SDKs and I think that once we have that it would be nice to be able to manage that configuration through the same remote management system as the collector.

jmacd commented 2 years ago

I support the proposals in https://github.com/open-telemetry/opamp-spec/issues/16#issuecomment-983077725 that, roughly speaking, accepts OpAMP as a way to manage client SDKs by adding support for variable telemetry "schemas". Could we use the schema_url aspect of the OTel client configuration as a way to describe the agent's telemetry schema, too?

Status Updates (unclear)

About status updates, this somehow tickles the idea that we could use one of the existing data (OTLP logs, metrics, or spans) to convey status. I went through an exercise of making a "very observable" agent last year, using OTLP for https://github.com/lightstep/opentelemetry-prometheus-sidecar#diagnostics.

In that package, I used an OTLP span to convey a status report. Could a telemetry schema be developed for status reports, instead of a dedicated field? Maybe this is a good-in-theory, bad-in-practice suggestion, but may I ask -- what benefit is there in having status reports be a new data type?

tigrannajaryan commented 2 years ago

In that package, I used an OTLP span to convey a status report. Could a telemetry schema be developed for status reports, instead of a dedicated field? Maybe this is a good-in-theory, bad-in-practice suggestion, but may I ask -- what benefit is there in having status reports be a new data type?

@jmacd "Status Reports" as they are defined in OpAMP right now are identity reports plus the state of "applying" what the Server offered the Agent to do (e.g. "I applied that config that you gave me successfully"). The Agent self-describes its identity and the identity values can be used by the Server to tailor the rest of its behavior appropriately. Perhaps the "Status Report" name is not conveying this properly (this was raised before, but I wasn't able to find a better name). We could report this same data in OTLP form, but it would require defining lots of semantic conventions and I don't see the benefit of doing that, it would make things more convoluted and less efficient, not to mention that OTLP does not define WebSocket as a transport, which is essential to OpAMP's certain capabilities.

As opposed to "Status Reports" the "telemetry" part of Agent's state is reported via OTLP/HTTP and OpAMP merely tells the Agent where the OTLP destination is.

Could we use the schema_url aspect of the OTel client configuration as a way to describe the agent's telemetry schema, too?

Absolutely. Nothing prevents that and we should encourage the Agents to include schema_url in their own telemetry.

jkowall commented 2 years ago

As per the call today, the downside of a persistent connection is not only a problem with scaling a centralized management system, but can also prevent caching of configurations on things like CDNs. It can also create a single point of failure around configuring dynamic data collection during an outage which affects the monitoring system itself. I still think the UX being instant and responsive outweigh the scaling challenges which this approach could introduce.