open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0

Add agent resource type #396

Open jlegoff opened 1 year ago

jlegoff commented 1 year ago

What are you trying to achieve?

I'd like to define semantic conventions for agent resources.

Additional context.

Agents are a key part of the software stack, and need to be monitored just as any other component. Several vendors already offer self-monitoring capabilities, for instance:

The OpenTelemetry Collector also offers a set of best practices for monitoring.

While agents can be considered services, we might want to add additional attributes to define them in a more specific manner. Possible examples include:

Note that this was first discussed in the context of OpAMP in this issue. However, since agent self-monitoring happens outside of the context of OpAMP, I think it makes sense to define semantic conventions in this repo.

tigrannajaryan commented 1 year ago

Several vendors already offer self-monitoring capabilities, for instance:

@jlegoff I think it would be very valuable to learn what attributes these vendors capture for self-monitoring purposes.

tigrannajaryan commented 1 year ago

@open-telemetry/specs-approvers this issue is about adding semantic conventions for recording Collector attributes.

Some questions:

jlegoff commented 1 year ago

I think it would be very valuable to learn what attributes these vendors capture for self-monitoring purposes

From what I'm seeing, some vendors include the agent version. The agent type seems to be mostly omitted but it can be deduced from the context. For instance, the GCP ops agent includes an uptime metric containing the version. The namespace of the metric (agent.googleapis.com/agent/) gives you a hint that this is the Ops Agent.

A couple of examples:

tigrannajaryan commented 1 year ago

@jlegoff thanks for the examples.

Here is also ECS with its agent.* fields: https://www.elastic.co/guide/en/ecs/current/ecs-agent.html. I think agent.type is similar to what we want.

They also have several other attributes which I think diverge from our recommendation to use service.* attributes (e.g. I see correspondence of agent.version->service.version, agent.id->service.instance.id).

I am not sure introducing a new set of attributes for version/id just for agents is necessary when we can use service attributes. This may be justified if we think that calling the agent a "service" is wrong for some reason (but I don't know why it would be wrong).
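To make the correspondence above concrete, here is a minimal sketch that renames the two ECS fields mentioned (agent.version, agent.id) to their service.* counterparts. The function name and the sample values are hypothetical, and fields with no stated counterpart (such as agent.type) are simply passed through unchanged.

```python
# Only the two pairs named in the comment above are mapped;
# nothing else is assumed about how ECS and OTel attributes line up.
ECS_TO_OTEL = {
    "agent.version": "service.version",
    "agent.id": "service.instance.id",
}


def ecs_agent_to_service(fields: dict) -> dict:
    """Rename ECS agent.* keys to service.* keys where a counterpart is known.

    Keys without a known counterpart (e.g. agent.type) are kept as-is.
    """
    return {ECS_TO_OTEL.get(key, key): value for key, value in fields.items()}


# Hypothetical sample values, for illustration only.
converted = ecs_agent_to_service(
    {"agent.version": "1.2.3", "agent.id": "abc-123", "agent.type": "filebeat"}
)
print(converted)
```

With this mapping only agent.type is left without a service.* equivalent, which is the gap the rest of this thread is about.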

jlegoff commented 1 year ago

Hi @arminru, I was wondering if you had a chance to look at this. Do you think adding an agent resource type makes sense?

arminru commented 1 year ago

Hey @jlegoff!

Do you think the service.name and service.version attributes suggested by @tigrannajaryan above would be suitable or are there reasons not to use them and introduce dedicated attributes instead? How do you imagine this data to be reported? Would the agents themselves be instrumented with OTel?

Note that there are plans to merge ECS into OTel semantic conventions (see https://github.com/open-telemetry/oteps/pull/222), so there would in any case be discussions about whether the agent.* attributes defined in ECS should be included or whether they are redundant in OTel.

jlegoff commented 1 year ago

@arminru regarding service.name I think it makes sense to use it when the agent is a service, as is the case for the collector. I do think it also makes sense to rely on a specific attribute to know which type of agent is sending the data. For instance, we shouldn't rely on the name of the service being io.opentelemetry.collector to know it's a collector, because users can change the name. Or they could have several collector services with different purposes and names - but their type should be the same.
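As a rough illustration of that point, the sketch below identifies collectors by a dedicated type attribute instead of by service.name. The attribute key agent.type, the type value, and the service names are all assumptions made for the example; none of them is an agreed convention.

```python
# Hypothetical attribute key and value; neither is an existing convention.
TYPE_KEY = "agent.type"
COLLECTOR_TYPE = "io.opentelemetry.collector"


def is_collector(resource: dict) -> bool:
    """Decide by the type attribute, never by the user-editable service.name."""
    return resource.get(TYPE_KEY) == COLLECTOR_TYPE


# Two collectors with different purposes and names, but the same type.
gateway = {"service.name": "edge-gateway", TYPE_KEY: COLLECTOR_TYPE}
sidecar = {"service.name": "payments-sidecar", TYPE_KEY: COLLECTOR_TYPE}

# A service that merely happens to carry the collector-like name.
lookalike = {"service.name": "io.opentelemetry.collector"}

for resource in (gateway, sidecar, lookalike):
    print(resource["service.name"], is_collector(resource))
```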

I'm less sure about service.version. I'm thinking there may be reporting agents that are not services, in which case this field would not be set. Though it's true that, for OTel agents, we have the telemetry.sdk attributes.

How do you imagine this data to be reported? Would the agents themselves be instrumented with OTel?

In the case of the collector, we can use the self-monitoring capabilities.

Note that there are plans to merge ECS into OTel semantic conventions (see https://github.com/open-telemetry/oteps/pull/222), so there would in any case be discussions about whether the agent.* attributes defined in ECS should be included or whether they are redundant in OTel.

I'm trying to find the agent attributes in the OTEP but I can't find them. In any case, wouldn't it make sense to prefix ECS attributes with aws.ecs, or something similar, to avoid clashes?

tigrannajaryan commented 1 year ago

This issue probably needs to be generalized a bit beyond just the needs of agents. Many other pieces of technology have a "type", but can also have a more specialized "name" in the particular context they are used in.

For example I may be using PostgreSQL database for the purpose of storing online orders information. In that case the type of the service can be "postgresql" and the name of service may be "ordersdb".

I would like to explore the possibility of introducing service.type as an optional Service attribute. The service.type would describe the service as it is known by its developers, while service.name will continue to be the name of the service as it is known by its operators. This is primarily applicable to third-party services, where the people who develop the service and the people who run it are different. For first-party services the distinction is likely not applicable, and in that case service.type can either be missing or be set equal to service.name.

We would recommend using reverse FQDN for service.type and so for the Collector we would use service.type=io.opentelemetry.collector and for PostgreSQL we would use service.type=org.postgresql.

Similarly we may introduce service.distro. For example PostgreSQL has a bunch of forks and derived databases which this attribute can indicate.
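A minimal sketch of what such resources could look like, using plain dictionaries. Note that service.type and service.distro are only proposed here and are not existing attributes; the helper names, the service names other than "ordersdb", and the distro value are hypothetical.

```python
def make_resource(name, service_type=None, distro=None):
    """Build resource attributes; service.type and service.distro are optional."""
    attrs = {"service.name": name}
    if service_type is not None:
        attrs["service.type"] = service_type  # proposed attribute, reverse FQDN
    if distro is not None:
        attrs["service.distro"] = distro  # proposed attribute
    return attrs


def group_by_type(resources):
    """Group service names by type, falling back to the name when type is missing.

    The fallback treats "type missing" and "type equal to name" the same way,
    matching the first-party case described above.
    """
    groups = {}
    for attrs in resources:
        key = attrs.get("service.type", attrs["service.name"])
        groups.setdefault(key, []).append(attrs["service.name"])
    return groups


resources = [
    make_resource("edge-gateway", "io.opentelemetry.collector"),
    make_resource("ordersdb", "org.postgresql"),
    # Hypothetical fork, identified by an invented distro value.
    make_resource("analyticsdb", "org.postgresql", distro="com.example.pgfork"),
    make_resource("checkout"),  # first-party service, no type set
]
print(group_by_type(resources))
```

Grouping by type lets a backend treat every PostgreSQL instance alike regardless of what operators named it, while service.name still distinguishes the individual deployments.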

jlegoff commented 1 year ago

I think this would work well in the case of standalone agents such as the collector, which was the motivation for this issue.

jaronoff97 commented 1 year ago

Yeah +1 to this, it's important to the operator group that we can distinguish between a collector OpAMP client and an operator OpAMP client. A respective server functions on different configuration (as per the spec). Something like being able to specify agent.type would make this possible. It would also be useful if we had defined constants for supported clients (collector and operator).
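For illustration, defined constants for the two supported clients could look something like the sketch below. The constant values, the agent.type key, and the returned labels are all hypothetical; nothing here is taken from the OpAMP spec.

```python
from enum import Enum


class AgentType(str, Enum):
    # Hypothetical constant values for the two supported clients.
    COLLECTOR = "io.opentelemetry.collector"
    OPERATOR = "io.opentelemetry.operator"


def config_kind(identifying_attributes: dict) -> str:
    """Pick which kind of configuration a server would handle for this client."""
    try:
        agent_type = AgentType(identifying_attributes.get("agent.type"))
    except ValueError:
        # Missing or unknown type: the server cannot tell the clients apart.
        return "unsupported"
    if agent_type is AgentType.COLLECTOR:
        return "collector-config"
    return "operator-config"


print(config_kind({"agent.type": "io.opentelemetry.collector"}))
print(config_kind({"agent.type": "io.opentelemetry.operator"}))
print(config_kind({"service.name": "something-else"}))
```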

jaronoff97 commented 11 months ago

@arminru I was wondering if there had been any discussion or decisions made here? Would love to get this added to the docket if possible :) thank you!

arminru commented 11 months ago

@jaronoff97 I'm not aware of any further discussion. I'll move it over to the semconv repo where this fits better and might get more attention.

jaronoff97 commented 10 months ago

Hey @AlexanderWert I was wondering if there were any updates on this? With the merging of the OpAMP bridge and the OpAMP extension it's become more important to have a semconv to distinguish between these two agent types as part of their identifying attributes.

tigrannajaryan commented 9 months ago

Submitted this issue to discuss in semconv: https://github.com/open-telemetry/semantic-conventions/issues/554

tigrannajaryan commented 6 months ago

All, the PR that adds service.type is created, but I and others have doubts that this is the right way. Please comment on the PR with arguments in favour or against it.