Open jlegoff opened 1 year ago
@jlegoff I think it would be very valuable to learn what attributes these vendors capture for self-monitoring purposes.
@open-telemetry/specs-approvers this issue is about adding semantic conventions for recording Collector attributes.
Some questions:
Is agent. the right namespace? We could use otelcol., but using agent. generalizes it and allows other agents to also use it, which I think is desirable.
Do we need agent.version, or is service.version, as recommended by OpAMP, enough?
I think it would be very valuable to learn what attributes these vendors capture for self-monitoring purposes
From what I'm seeing, some vendors include the agent version. The agent type seems to be mostly omitted but it can be deduced from the context. For instance, the GCP ops agent includes an uptime metric containing the version. The namespace of the metric (agent.googleapis.com/agent/) gives you a hint that this is the Ops Agent.
A couple of examples:
dt.oneagent.agent_type attribute, but I'm not seeing any reference to the version in the public docs
version and branch for the build info metric. The metric prefix (grafana_metrics_enterprise) tells you the agent type.
@jlegoff thanks for the examples.
Here is also ECS with its agent. fields: https://www.elastic.co/guide/en/ecs/current/ecs-agent.html. I think agent.type is similar to what we want.
They also have several other attributes which I think diverge from our recommendation to use service.* attributes (e.g. I see correspondence of agent.version -> service.version, agent.id -> service.instance.id).
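To make the correspondence concrete, here is a minimal sketch of that renaming. It only covers the two pairs mentioned above; the helper name and the sample values are made up for illustration, and this is not an existing OTel or ECS API.

```python
# Correspondence mentioned in this comment: ECS agent.* -> OTel service.*.
# Only the two pairs named in the thread are listed (nothing else is assumed).
ECS_TO_OTEL = {
    "agent.version": "service.version",
    "agent.id": "service.instance.id",
}


def ecs_agent_to_service(fields: dict) -> dict:
    """Rename ECS agent.* keys that have a service.* counterpart.

    Hypothetical helper: keys without a known counterpart (for example
    agent.type) are passed through unchanged.
    """
    return {ECS_TO_OTEL.get(key, key): value for key, value in fields.items()}


# Illustrative values only.
sample = {"agent.version": "1.2.3", "agent.id": "abc-123", "agent.type": "collector"}
print(ecs_agent_to_service(sample))
```

Note that agent.type is the one field left without a service.* counterpart, which is the gap this issue is about.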
I am not sure introducing a new set of attributes for version/id just for agents is necessary when we can use service attributes. This may be justified if we think that calling the agent a "service" is wrong for some reason (but I don't know why it would be wrong).
Hi @arminru, I was wondering if you had a chance to look at this. Do you think adding an agent resource type makes sense?
Hey @jlegoff!
Do you think the service.name and service.version attributes suggested by @tigrannajaryan above would be suitable or are there reasons not to use them and introduce dedicated attributes instead?
How do you imagine this data to be reported? Would the agents themselves be instrumented with OTel?
Note that there are plans to merge ECS into OTel semantic conventions (see https://github.com/open-telemetry/oteps/pull/222), so there would in any case be discussions about whether the agent.* attributes defined in ECS should be included or whether they are redundant in OTel.
@arminru regarding service.name, I think it makes sense to use it when the agent is a service, as is the case for the collector. I do think it also makes sense to rely on a specific attribute to know which type of agent is sending the data. For instance, we shouldn't rely on the name of the service being io.opentelemetry.collector to know it's a collector, because users can change the name. Or they could have several collector services with different purposes and names - but their type should be the same.
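A rough sketch of the difference, assuming resources are plain dicts of attributes. agent.type is the attribute proposed in this issue, not an existing convention, and the service names below are invented.

```python
COLLECTOR_TYPE = "io.opentelemetry.collector"


def is_collector(resource: dict) -> bool:
    """Identify a collector by its type, never by its (user-editable) name.

    agent.type is the attribute proposed in this issue; it is an
    assumption here, not a published semantic convention.
    """
    return resource.get("agent.type") == COLLECTOR_TYPE


# Two collectors with different purposes and names, but the same type.
gateway = {"service.name": "edge-gateway", "agent.type": COLLECTOR_TYPE}
sidecar = {"service.name": "payments-sidecar", "agent.type": COLLECTOR_TYPE}

# A service that merely reuses the collector's default name.
lookalike = {"service.name": "io.opentelemetry.collector"}

print(is_collector(gateway), is_collector(sidecar), is_collector(lookalike))
```

Matching on service.name would miss the first two resources and wrongly accept the third.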
I'm less sure about service.version. I'm thinking there may be reporting agents that are not services, in which case this field would not be set. Though it's true that, for OTEL agent, we have the telemetry.sdk attributes.
How do you imagine this data to be reported? Would the agents themselves be instrumented with OTel?
In the case of the collector, we can use the self-monitoring capabilities.
Note that there are plans to merge ECS into OTel semantic conventions (see https://github.com/open-telemetry/oteps/pull/222), so there would in any case be discussions about whether the agent.* attributes defined in ECS should be included or whether they are redundant in OTel.
I'm trying to find the agent attributes in the OTEP but I can't find them. In any case, wouldn't it make sense to prefix ECS attributes with aws.ecs, or something similar, to avoid clashes?
This issue probably needs to be generalized a bit beyond just the needs of agents. Many other pieces of technology have a "type", but can also have a more specialized "name" in the particular context in which they are used.
For example I may be using PostgreSQL database for the purpose of storing online orders information. In that case the type of the service can be "postgresql" and the name of service may be "ordersdb".
I would like to explore the possibility of introducing service.type as an optional Service attribute. The service.type would describe the service as it is known by its developers, while service.name will continue to be the name of the service as it is known by its operators. This is primarily applicable to third-party services, where the people who develop the service and the people who run it are different. For first-party services the distinction likely is not applicable, and in that case service.type can either be missing or be set equal to service.name.
We would recommend using reverse FQDN for service.type and so for the Collector we would use service.type=io.opentelemetry.collector and for PostgreSQL we would use service.type=org.postgresql.
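A small sketch of how a backend might read these two attributes, using the PostgreSQL example above. The helper name is hypothetical, and treating a missing service.type as equal to service.name is my assumption about how a consumer could handle the first-party case, not something the proposal mandates.

```python
from typing import Optional


def effective_service_type(resource: dict) -> Optional[str]:
    """Return service.type, falling back to service.name when it is absent.

    The fallback is an assumption for the first-party case, where the
    proposal allows service.type to be missing or equal to service.name.
    """
    return resource.get("service.type", resource.get("service.name"))


# Third-party service: developers know it as PostgreSQL, operators as "ordersdb".
orders_db = {"service.name": "ordersdb", "service.type": "org.postgresql"}

# First-party service with no service.type set (the name is invented).
checkout = {"service.name": "checkout"}

print(effective_service_type(orders_db))
print(effective_service_type(checkout))
```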
Similarly we may introduce service.distro. For example PostgreSQL has a bunch of forks and derived databases which this attribute can indicate.
I think this would work well in the case of standalone agents such as the collector, which was the motivation for this issue.
Yeah +1 to this, it's important to the operator group that we can distinguish between a collector OpAMP client and an operator OpAMP client. A respective server functions on different configuration (as per the spec). Something like being able to specify agent.type would make this possible. It would also be useful if we had defined constants for supported clients (collector and operator).
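Roughly what I have in mind on the server side, as a sketch only. It does not use any real OpAMP library; the operator constant and the config labels are hypothetical placeholders, and only the collector value comes from this thread.

```python
# Defined constants for the supported clients.
AGENT_TYPE_COLLECTOR = "io.opentelemetry.collector"  # value used earlier in this thread
AGENT_TYPE_OPERATOR = "io.opentelemetry.operator"    # hypothetical, for illustration only

# Hypothetical labels for the kind of configuration each client type expects.
CONFIG_KIND = {
    AGENT_TYPE_COLLECTOR: "collector-config",
    AGENT_TYPE_OPERATOR: "operator-config",
}


def config_kind_for(identifying_attributes: dict) -> str:
    """Pick the configuration kind from the client's agent.type attribute."""
    agent_type = identifying_attributes.get("agent.type")
    try:
        return CONFIG_KIND[agent_type]
    except KeyError:
        raise ValueError(f"unsupported agent.type: {agent_type!r}") from None


print(config_kind_for({"agent.type": AGENT_TYPE_COLLECTOR}))
print(config_kind_for({"agent.type": AGENT_TYPE_OPERATOR}))
```

With shared constants, the server would not need to guess the client type from its name or from the shape of its configuration.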
@arminru I was wondering if there had been any discussion or decisions made here? Would love to get this added to the docket if possible :) thank you!
@jaronoff97 I'm not aware of any further discussion. I'll move it over to the semconv repo where this fits better and might get more attention.
Hey @AlexanderWert I was wondering if there were any updates on this? With the merging of the OpAMP bridge and the OpAMP extension it's become more important to have a semconv to distinguish between these two agent types as part of their identifying attributes.
Submitted this issue to discuss in semconv: https://github.com/open-telemetry/semantic-conventions/issues/554
All, the PR that adds service.type is created, but I and others have doubts that this is the right way. Please comment on the PR with arguments in favour or against it.
What are you trying to achieve?
I'd like to define semantic conventions for agent resources.
Additional context.
Agents are a key part of the software stack, and need to be monitored just as any other component. Several vendors already offer self-monitoring capabilities, for instance:
The OpenTelemetry collector also offers a set of best practices for monitoring.
While agents can be considered services, we might want to add additional attributes to define them in a more specific manner. Possible examples include:
agent.type: com.dynatrace.one_agent, com.newrelic.infra_agent, io.opentelemetry.collector
agent.version
agent.distro: github.com/signalfx/splunk-otel-collector
Note that this was first discussed in the context of OpAMP in this issue. However, since agent self-monitoring happens outside of the context of OpAMP, I think it makes sense to define semantic conventions in this repo.