open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Add data structures to model entity events as log records. #23565

Closed tigrannajaryan closed 1 year ago

tigrannajaryan commented 1 year ago

From design document: https://docs.google.com/document/d/1Tg18sIck3Nakxtd3TFFcIjrmRO_0GLMdHXylVqBQmJA/edit#heading=h.v4ilwdkncxe

Entity Events

EntityState

Indicates the entity's current state. Note that the state is cumulative, i.e. this event describes the full state of the entity as it is at a certain moment of time.

Field name: Timestamp

Type: Timestamp, uint64 nanoseconds since Unix epoch

Description: The time from which this event describes the entity's state. When the entity's state changes, the source is expected to emit a new EntityState event with a fresh timestamp and the full list of attribute and relationship values. The time is measured by the origin clock. This field is optional; it may be missing if the time when the change happened is unknown.

Field name: Id

Type: key/value pair list

Description: Entity identifier. MUST NOT change during the lifetime of the entity. Can contain one or more key/value pairs. This field is required. If the list is empty, the event is malformed and should be ignored.

All key/value pairs in the Id are also considered to be attributes of the entity. The key/value pairs respect OpenTelemetry semantic conventions for resources.

Note: in this phase 1 design the entity has only one Id (composed of one or more key/value pairs). We have also discussed in the past the ability for entities to have multiple Ids (each Id itself being a list of key/value pairs). The capability for entities to have multiple Ids is out of scope for phase 1 design.

Field name: Type

Type: string

Description: The type of the entity. MUST NOT change during the lifetime of the entity. This field is required. If the field is missing or empty, the event is malformed and should be ignored. Typically set equal to the prefix used by attributes of the semantic conventions for the particular concept in OpenTelemetry (e.g. "service" for Service, "k8s.pod" for Kubernetes Pod, etc).

Field name: Attributes

Type: key/value pair list

Description: Entity attributes. MAY change over the lifetime of the entity. The specified attribute values are effective starting from the time specified in the timestamp field. This field is optional. If it is missing or the list is empty then the entity has no attributes other than the ones contained in the id. The key/value pairs respect OpenTelemetry semantic conventions for resources.
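The cumulative semantics described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory Store and a simplified EntityState in which the Id is flattened to one string and attribute values are strings; none of these names come from the design document. The point it shows is that a newer EntityState replaces the previous one wholesale rather than being merged into it.

```go
package main

import "fmt"

// EntityState is a simplified, hypothetical model of the event described above.
// The Id is flattened to a single string here to keep the sketch short.
type EntityState struct {
	TimestampUnixNano uint64
	ID                string
	Type              string
	Attributes        map[string]string
}

// Store keeps the latest known state per entity Id (hypothetical helper).
type Store struct {
	current map[string]EntityState
}

func NewStore() *Store {
	return &Store{current: make(map[string]EntityState)}
}

// Apply replaces the stored state wholesale. Because EntityState is
// cumulative, attributes absent from the new event are no longer in effect.
func (s *Store) Apply(ev EntityState) {
	s.current[ev.ID] = ev
}

// Attributes returns the attributes currently in effect for the given Id.
func (s *Store) Attributes(id string) map[string]string {
	return s.current[id].Attributes
}

func main() {
	store := NewStore()
	store.Apply(EntityState{
		TimestampUnixNano: 1_000,
		ID:                "k8s.pod.uid=abc-123",
		Type:              "k8s.pod",
		Attributes:        map[string]string{"app": "web", "tier": "frontend"},
	})
	// The second event omits "tier", so that attribute is gone afterwards.
	store.Apply(EntityState{
		TimestampUnixNano: 2_000,
		ID:                "k8s.pod.uid=abc-123",
		Type:              "k8s.pod",
		Attributes:        map[string]string{"app": "web"},
	})
	fmt.Println(store.Attributes("k8s.pod.uid=abc-123")) // prints: map[app:web]
}
```

How out-of-order or timestamp-less events should be handled is not covered by the text above, so this sketch deliberately leaves it out.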

EntityDelete

Indicates that an entity is deleted.

Field name: Timestamp

Type: Timestamp, uint64 nanoseconds since Unix epoch

Description: Time when the entity was deleted, as measured by the origin clock. This field is optional; it may be missing if the timestamp is unknown.

Field name: Id

Type: key/value pair list

Description: entity identifier. Can contain one or more key/value pairs. This field is required. If the list is empty the event is malformed and should be ignored.

Mapping to Log Records

Entity events don't yet have a first-class representation in OpenTelemetry. However, they can be temporarily/experimentally mapped to Log records so that we can work with entity data, run experiments, and research/iterate on the concept of entities. This allows entity events to be passed through the log pipeline and makes them available to processors and exporters. This section defines how Entity events can be represented as Log records.

To improve the processing efficiency of received batches, the following Scope attribute must be set for all log records representing an entity event: otel.entity.entity_event=true

EntityState Log record

| Log Record field | Value |
| --- | --- |
| Timestamp | EntityState.Timestamp |
| Attributes["otel.entity.event.type"] | "entity_state" |
| Attributes["otel.entity.id"] | EntityState.Id |
| Attributes["otel.entity.type"] | EntityState.Type |
| Attributes["otel.entity.attributes"] | EntityState.Attributes |

EntityDelete Log record

| Log Record field | Value |
| --- | --- |
| Timestamp | EntityDelete.Timestamp |
| Attributes["otel.entity.event.type"] | "entity_deleted" |
| Attributes["otel.entity.id"] | EntityDelete.Id |
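The mapping above could be sketched in Go roughly as follows. LogRecord and ScopeLogs here are simplified, hypothetical stand-ins rather than the collector's real log data types, and the function names are invented for this example. Only the attribute keys and values come from the mapping itself.

```go
package main

import "fmt"

// LogRecord and ScopeLogs are simplified, hypothetical stand-ins for the
// collector's log data types; only the fields used by the mapping are modeled.
type LogRecord struct {
	TimestampUnixNano uint64
	Attributes        map[string]any
}

type ScopeLogs struct {
	ScopeAttributes map[string]any
	Records         []LogRecord
}

type EntityState struct {
	TimestampUnixNano uint64
	ID                map[string]any
	Type              string
	Attributes        map[string]any
}

type EntityDelete struct {
	TimestampUnixNano uint64
	ID                map[string]any
}

// StateToLogRecord follows the EntityState mapping.
func StateToLogRecord(s EntityState) LogRecord {
	return LogRecord{
		TimestampUnixNano: s.TimestampUnixNano,
		Attributes: map[string]any{
			"otel.entity.event.type": "entity_state",
			"otel.entity.id":         s.ID,
			"otel.entity.type":       s.Type,
			"otel.entity.attributes": s.Attributes,
		},
	}
}

// DeleteToLogRecord follows the EntityDelete mapping.
func DeleteToLogRecord(d EntityDelete) LogRecord {
	return LogRecord{
		TimestampUnixNano: d.TimestampUnixNano,
		Attributes: map[string]any{
			"otel.entity.event.type": "entity_deleted",
			"otel.entity.id":         d.ID,
		},
	}
}

// NewEntityScope wraps records in a scope carrying the marker attribute.
func NewEntityScope(records ...LogRecord) ScopeLogs {
	return ScopeLogs{
		ScopeAttributes: map[string]any{"otel.entity.entity_event": true},
		Records:         records,
	}
}

// IsEntityEventScope lets a consumer skip non-entity scopes with one lookup.
func IsEntityEventScope(sl ScopeLogs) bool {
	v, ok := sl.ScopeAttributes["otel.entity.entity_event"].(bool)
	return ok && v
}

func main() {
	id := map[string]any{"k8s.pod.uid": "abc-123"}
	scope := NewEntityScope(
		StateToLogRecord(EntityState{TimestampUnixNano: 1_000, ID: id, Type: "k8s.pod"}),
		DeleteToLogRecord(EntityDelete{TimestampUnixNano: 2_000, ID: id}),
	)
	if IsEntityEventScope(scope) {
		for _, rec := range scope.Records {
			fmt.Println(rec.TimestampUnixNano, rec.Attributes["otel.entity.event.type"])
		}
	}
}
```

Checking the scope attribute once per scope, instead of inspecting every record, is the efficiency gain the marker attribute is meant to provide.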
djaglowski commented 1 year ago

Apologies if this has been covered elsewhere, but can you help me understand the difference between a resource vs an entity, and whether or not these concepts are meant to be mutually exclusive? Is an entity a component that can be described by attributes and state, but doesn't qualify as a resource because it doesn't emit telemetry? Conversely, if an entity can emit telemetry, then why not model it as a resource?

These attributes appear to describe a thing that could be a resource:

tigrannajaryan commented 1 year ago

We typically put multiple entities in a Resource. Here are some problems we have with the current Resource and why we want an Entity that is defined differently from the Resource.

Problem 1: Commingling of Entities

The Resource is defined as a representation of the entity.

A Resource is an immutable representation of the entity producing telemetry as Attributes.

Note it speaks about one particular entity. In practice we commingle multiple entities into one Resource. The spec shows a clear example that talks about multiple entities (Process, Container, Pod, etc) in one Resource:

For example, a process producing telemetry that is running in a container on Kubernetes has a Pod name, it is in a namespace and possibly is part of a Deployment which also has a name. All three of these attributes can be included in the Resource.

The problem with such usage is that by looking at the Resource attributes it is impossible to tell which of the represented entities is the entity, i.e. the entity which produced the telemetry.

Problem 2: Lack of Precise Identity

The Resource is one set of attributes, which contains all attributes of all entities that the Resource represents. It is impossible to tell which of these attributes identify the entity (or entities) and which are non-identifying, i.e. purely descriptive.

This lack of precise identity makes it difficult or impossible to identify the same entities reported in different Resources.
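To make the contrast concrete, here is a minimal sketch of what becomes possible once identifying pairs are kept apart from descriptive ones. The Entity type and its Key method are hypothetical, values are restricted to strings, and the key format ignores separator escaping for brevity.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Entity separates identifying key/value pairs (ID) from descriptive ones
// (Attributes). This is a hypothetical type for illustration only.
type Entity struct {
	Type       string
	ID         map[string]string
	Attributes map[string]string
}

// Key builds a canonical string from the Type and the ID pairs only.
// Descriptive attributes are deliberately excluded.
func (e Entity) Key() string {
	keys := make([]string, 0, len(e.ID))
	for k := range e.ID {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random, so sort for stability

	var b strings.Builder
	b.WriteString(e.Type)
	for _, k := range keys {
		fmt.Fprintf(&b, "|%s=%s", k, e.ID[k])
	}
	return b.String()
}

func main() {
	// Two reports of the same Pod: the labels differ, the Id does not.
	first := Entity{
		Type:       "k8s.pod",
		ID:         map[string]string{"k8s.pod.uid": "abc-123"},
		Attributes: map[string]string{"app": "web", "version": "1"},
	}
	second := Entity{
		Type:       "k8s.pod",
		ID:         map[string]string{"k8s.pod.uid": "abc-123"},
		Attributes: map[string]string{"app": "web", "version": "2"},
	}
	fmt.Println(first.Key())                 // prints: k8s.pod|k8s.pod.uid=abc-123
	fmt.Println(first.Key() == second.Key()) // prints: true
}
```

With a flat Resource there is no equivalent of the ID field, so a recipient cannot compute such a key without guessing which attributes are identifying.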

Problem 3: Lack of Mutable Attributes

Resource is defined to be immutable in the OpenTelemetry SDK. This does not align well with the fact that non-identifying attributes of entities may change over time. For example, the OpenTelemetry Collector collects data about Pods and adds Pod labels as Resource attributes. Pod labels are mutable in Kubernetes and can change over time, while the Pod's identity remains immutable. Here is another example where mutable Service attributes are desirable.

With the current definition of the Resource we are forced to either leave out any attributes that may ever change over time or violate the spec definition.

Additionally, OpenTelemetry currently lacks the ability to provide resource attributes that require some kind of delayed lookup that may fail (see this issue). This requires, e.g., passing environment variables for the k8s container name and various downward-API values so that an OpenTelemetry SDK can appropriately report this resource.

In reality, OpenTelemetry SDKs can also easily violate the definition as soon as we consider mutability from the recipient's perspective. SDKs only guarantee immutability during a single process session. As soon as the process is restarted and the SDK is newly initialized, there is no guarantee that the Resource will have the same set of attributes (e.g. because process.pid can be one of the Resource attributes).

It is clear that the strictly "immutable" definition of the Resource is not sufficient for what we are trying to model.

Problem 4: Metric Cardinality Problem

[Copied from Josh Suereth's description] Every attribute in an OpenTelemetry Resource, according to the metric data model, is used to determine the identity of a metric. Given known issues in metric time-series database implementations around cardinality, this can cause major issues if Resources are allowed to leverage high-cardinality attributes.

Given that many Resource attribute semantic conventions today were defined for tracing instrumentation, we do find many high-cardinality definitions, e.g. the Process resource includes pid and parent_pid, which are known to churn between instances of an application and would lead to higher-cardinality streams.
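The effect can be illustrated with a small counting sketch. It assumes that a metric stream's identity is the full set of Resource attributes, as stated above, and uses made-up data for three restarts of one service; the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey joins resource attributes into a canonical identity string,
// skipping any attribute whose key is listed in drop.
func seriesKey(attrs map[string]string, drop map[string]bool) string {
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		if !drop[k] {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+attrs[k])
	}
	return strings.Join(parts, ",")
}

// countSeries returns how many distinct identities the resources produce.
func countSeries(resources []map[string]string, drop map[string]bool) int {
	seen := make(map[string]struct{})
	for _, r := range resources {
		seen[seriesKey(r, drop)] = struct{}{}
	}
	return len(seen)
}

func main() {
	// The same service restarted three times; only the pid changes.
	restarts := []map[string]string{
		{"service.name": "checkout", "process.pid": "101"},
		{"service.name": "checkout", "process.pid": "202"},
		{"service.name": "checkout", "process.pid": "303"},
	}
	// With every attribute contributing to identity: one series per restart.
	fmt.Println(countSeries(restarts, nil)) // prints: 3
	// With the churning attribute excluded: a single series.
	fmt.Println(countSeries(restarts, map[string]bool{"process.pid": true})) // prints: 1
}
```

Each restart adds one more stream per metric when the pid is part of the identity, which is the churn the paragraph above describes.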

Many metric backends simply erase resource attributes from metrics to work around the issue. Here's an example solution for Prometheus, and another proposal for yet another point-fix for Prometheus.

However, these workarounds prevent Metrics users from regaining descriptive attributes (and benefits) of current OTEL Resource detection.

smith commented 4 months ago

In case you've come across this looking for information about EntityState and EntityDelete events in OpenTelemetry, these are now described in the Entities Data Model, Part 1.