Should event fields be part of the global attribute registry?

open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains

Apache License 2.0

271 stars 174 forks source link

Should event fields be part of the global attribute registry? #505

Closed martinkuba closed 7 months ago

martinkuba commented 1 year ago

This issue is intended as a discussion about the approach of defining event fields.

Background

Events are a new concept in OpenTelemetry that is in the process of being defined. An event will consist of a name and a body (payload). The name will be represented as the event.name attribute. The body payload will typically consist of a collection of fields (name/value pairs). An event may also have additional attributes that are separate from the body.

The discussion about field names in this issue refers to fields that are part of the event body, not the additional attributes. Additional attributes will follow the same rules as attributes on spans.

Discussion

Should fields in the event body be defined in the global attributes registry?

Here is an example event representing a browser page load (from this PR). The field names are not namespaced and their definition would be only in the scope of this specific event.

{
    "attributes": {
        "event.name": "browser.page_view" 
    },
    "body": {
        "referrer": "",
        "type": "",
        "title": "",
        "url": "",
    }
}

Pros:

shorter names
simpler to define (because they would be vetted only by experts familiar with the event, no discussion necessary how these fit in the global registry)
global registry will not be polluted with many one-off attributes

Cons:

some fields would be duplicated
definition of event fields is handled differently than attributes

Fields defined in the attribute registry

If the fields for the same event were defined in the registry, they could hypothetically look like this:

{
    "browser.document.referrer": "",
    "browser.page.type": "",
    "browser.page.title": "",
    "url.full": "",
}

Note that both url.full is an existing attribute that is re-used in this event. However, the browser.document.referrer also contains a full URL, but must have a different name here to distinguish from page URL and also to provide the meaning of the field.

Pros:

consistent with how metrics use attributes
field names could be reused without duplicating definition of the format and content

Cons:

there might be many fields in the registry that are used only once
- this might make the registry harder to manage
- might make it harder to define payloads of new events
- larger payload size on the wire

Hybrid

There is a third option where an event payload could contain both fields defined only for the specific event and attributes from the global registry.

{
    "referrer": "",
    "type": "",
    "title": "",
    "url.full": "", 
}

Event fields defined externally

There might be cases when it's better to not define the content of a payload in OpenTelemetry semantic conventions, but instead link to an external definition. An example of this might be the data from the W3C ResourceTiming API. In this case, if the field definitions were duplicated in the attributes registry, they may get out of sync with the external definition.

If fields defined externally are used in the event body, they would not be namespaced.

MSNev commented 12 months ago

Just to document my take on this and this will include some generic definitions on how "events" should be defined.

By Default for the payload (fields)

The event fields SHOULD not be required to be fully namespaced (and MUST not be mandated)
The fields SHOULD not be required to be in the global registry (ie. an event may include whatever fields it needs to describe its payload)
The fields included into the payload (for OpenTelemetry defined events) MUST be defined in semantic conventions, with the event.name and each field MUST be described (required / optional) and any description.
Each field MAY reference out to a definition of the global semantic convention, if the field needs to conform to an already well defined semantic convention, but does not want / need to use the same fully qualified name (eg. due to runtime constraints , there are multiple fields that would confirm to the same definition or because the field names are defined by an external definition but OpenTelemetry has already documented the definition)
Different events MAY include the same field name within their own payload and this DOES NOT infer that they are the same concept as all fields of a payload MUST be considered to be unique to each specific event.name. However, as identified above the field definition above MAY link out to a global semantic convention to provide this weak assertion (if required).
- Even with the assertion fields of different events SHOULD NOT generally be a candidate for grouping (unless otherwise specified between specific events event.name, eg. Event A MAY defined the same field as Event B AND define these as representing the same value, so they could be used for some form of grouping / metric (if required).

Optionally

Each domain specific event event.name is free to define its own set of payload fields
Each field MAY use the fully namespaced global registry defined semantic convention, it is up to the domain specific group defining the event
The payload of an event is NOT REQUIRED, again, it's up to the specific event domain to determine how to best represent the event (eg. there may be some events that do not require a payload and thus do not need to specify an empty payload)
As part of the event definition (whether defined by OpenTelemetry, a vender or application event), MAY identify top level attributes (ie not withing the payload) that MAY be expected to be included with the event
- These additional (globally namespaced) attributes MAY also be defined as required / optional (as required by the event definition)

Optional (additional) Attributes (not the payload fields)

These attributes SHOULD obey the normal Semantic Convention rules for Attributes
An event MAY include as many additional attributes as required by the end-user of that event

jack-berg commented 11 months ago

I’ve stated my opinion on this before and engaged in lengthy discussions in my Event payload proposal document. I’ll recap my stance and offer some old and new arguments:

Event fields should be part of the global attribute registry and should be subject to the same naming rules as other attributes.

In a nutshell, the discussion comes down to a question of whether the constrained environments (limited network bandwidth, CPU, package size) of certain client scenarios should influence how OpenTelemetry models semantic conventions for Events (and presumably other signals as well). The answer should be no. We should clearly decouple modeling the shape of telemetry (i.e. semantic conventions) from how telemetry is sent over a network. Accepting that Event fields are different from attributes on other Signals would result in:

Additional context required to interpret fields. The most important characteristic of the global attribute registry is that it gives the ability to interpret all attributes in the same way regardless of the signal or context. Special casing Events breaks this, requiring the user and backends to understand the semantics of a particular field in the context of the event.name it is attached to. This is a bad user experience and backend experience.
A lack of ability to correlate fields between signals. Spans, Metric, and Log attributes require namespaces and any overlap between Event fields and Span / Metric / Log attributes would be inconsistent. I.e. Event field “foo” would have to be something like “namespace.foo” on a Span / Metric / Log. This breaks part of the core design philosophy of OpenTelemetry, which prioritizes cross cutting concerns with all observability signals sharing certain concepts and working together to form a complete picture.
A performance-justified race to zero. If constrained environments didn’t exist, this issue wouldn’t exist. It would be obvious to treat Event fields the same as other attributes. Letting the constraints of a particular type of environment jeopardize the design coherence of the broader semantic conventions is a slippery slope. If we start omitting namespaces from Event fields justified by smaller payloads, why wouldn’t we take it a step further and question the length of the name of every field? E.g. we could abbreviate “click” with “c” to save 4 characters. Why not take it a step further and apply the same logic to Span / Metric / Log attributes? Surely the constrained environment conditions that justify short field names for Events also prefer short field names for Spans and Metrics? We’re obviously going to include Spans / Metrics / Logs in constrained environments and will have to solve protocol level issues related to payloads for these signals. We should solve the protocol problem and apply the solution (whatever it may be) uniformly to Spans, Metrics, Logs, and Events. We should clearly separate the problem of transmitting data in constrained environments from the data modeling problem of semantic conventions. Semantic conventions should prioritize coherence and clarity, naming fields in a way that helps humans understand their meaning rather than to minimize bytes transmitted.

Another aspect of the discussion is whether including Event fields in the global attribute registry is unnecessarily cumbersome for people looking to contribute Event semantic conventions. Do we want a bunch of domain-specific one-off attributes? Do we want contributors of Event semantic conventions to have to understand the universe of attributes such that they re-use an existing attribute when the concept has the same semantics? Do we want domain experts to have to define their Events in collaboration with semantic convention maintainers? Yes, yes, and yes. The process of defining semantic conventions is intended to be low overhead and accommodate domain specific concepts. Defining Event semantic conventions should be subject to the same process and scrutiny as any other convention. There’s no difference between a group wanting to define domain specific Event conventions and wanting to define domain specific Span conventions. Both require defining how domain specific concepts fit into semantic conventions.

Up to this point, I’ve been expressing my own opinion. However, we talked about this issue in the 11/15/23 TC meeting and there was agreement among the present TC members that Event fields should follow the same process as other attributes defined in semantic-conventions. We discussed that OTLP as currently defined may not be suitable for certain extremely resource constrained environments, and that the Client WG in collaboration with the Event WG should work to prototype changes to the protocol to accommodate these environments. Ideas included various versions of the symbol dictionary, or using a columnar format like the OpenTelemetry Arrow Project, or even a stateful domain specific protocol. Note that this opinion from the TC reflects consistency on a previous vote about whether metric attributes should be namespaced. The last comment on that issue is particularly relevant here:

I think the main decision here was that they are conceptually namespaced and that's how we will describe them in semantic conventions. On this basis, any transmission optimizations can still be built.

An interesting aspect of this problem is that it is not limited in scope to the Client WG, or to Event WG. It involves questions of data modeling that squarely intersect with the Semantic Convention WG, it has implications on how the different telemetry signals interact with each other which is a clear specification level issue which is maintained by the TC, and it has implications on all maintainers since Events will be used outside of the client instrumentation space.

So far, we’ve heard a lot from the perspective of the Client and Event WG, and we’ve discussed this in the TC. I’m interested in whether there are any additional perspectives from other semantic-conventions contributors (other than the maintainers who are also on the TC) and from language maintainers? Are all the arguments on the table or are there additional things we haven’t yet considered?

MSNev commented 11 months ago

Up to this point, I’ve been expressing my own opinion. However, we talked about this issue in the 11/15/23 TC meeting and there was agreement among the present TC members that Event fields should follow the same process as other attributes defined in semantic-conventions. We discussed that OTLP as currently defined may not be suitable for certain extremely resource constrained environments, and that the Client WG in collaboration with the Event WG should work to prototype changes to the protocol to accommodate these environments. Ideas included various versions of the symbol dictionary, or using a columnar format like the OpenTelemetry Arrow Project, or even a stateful domain specific protocol. Note that this opinion from the TC reflects consistency on a previous vote about https://github.com/open-telemetry/semantic-conventions/issues/51 The https://github.com/open-telemetry/semantic-conventions/issues/51#issuecomment-1608853732 on that issue is particularly relevant here:

I think the main decision here was that they are conceptually namespaced and that's how we will describe them in semantic conventions. On this basis, any transmission optimizations can still be built.

I want to document my VERY STRONG disagreement with this approach on several levels that I'm not quite sure where to start...

Strongly agree:

OTLP as currently defined may not be suitable for certain extremely resource constrained environments
Yes, we should look at creating a "better" (smaller) transport payload, but not one that requires any form of special dynamic tables to be calculated during payload serialization. In my opinion OTLP (could) be made better (and slightly more workable) by introducing a few tweaks and defaults (although these may not be supported by existing protobuf definitions / implementations)

Strongly disagree:

Event fields should follow the same process as other attributes defined in semantic-conventions.
- Not all events will be defined by OpenTelemetry
- There will be 1000's more "Custom" events then there will be OpenTelemetry defined events and requiring that every "Custom" event MUST have every "field" for the custom event prefixed is just not viable for most applications.
- This will "force" application custom level events to end up using AnyValue just to avoid the prefix for the EXACT same reasons that I'm pushing back on this for the OpenTelemetry defined events.
Prefixing every field of an event with name of the event (ie. the event.name) is what you are proposing here as that is the ONLY viable option, prefixing with just the namespace of the event WILL cause clashes with different fields in the same domain namespace group.
- Bundle sizes in some languages (runtime constraints on the size of the code - ie browsers), strings are not compressable
- Prefixing values to runtime defined fields just causes additional memory (and CPU) usage which are then batch (holding onto that in memory) and then during transport serialization this generates further excessive bytes (further compounding the OTLP transport problems), and using some sort of client side "compressing" (symbol disctionary, etc) does not help this situation when you have LOTS of unique events with unique namespaces as every field name is now unique and has to be listed in the symbol table (where it's most likely only EVERY used once).
Prototyping some form of client side "compression" during transport
- Transport payload serialization is a SUPER CRITICAL time sensitive operation and additional "time" taken to "compress" MUST be avoided at all costs.
- There are also situations where the final payload "size" is limited (by the runtime) and can be non-deterministic (a maximum of 64kb but may be a lot smaller) so "if" there is some form of "compression" (symbol table, gzip, etc) this can require multiple "passes" (because if the payload is too large you have to perform the entire operation again (potentially checking the usage of every field to see if any "symbol" should be included), and this may need to be repeated multiple times.
Serialization / deserialization can also represent security issues for the libraries that are performing either, to the point that some organizations (including Microsoft) require to usage of only certified serialization libraries, with the preference for JS runtimes to use the platform JSON implementations, which also has the advantage of being the "fastest" (using the native serialization) conversions and therefore having the lowest impact (not withstanding the above comments about the creating of the long strings and the current OTLP format) on any "waiting" user -- server environments DON'T have this issue ALL client environment DO (or CAN - some can offload to another thread -- but not all).

Possible "tweaks" which would make the "logical" prefixing of an event payload "fields"

All of the payload fields for the events are "grouped" as the payload for an event and when present in the global registry they are "listed" with the prefix, BUT are NOT sent in this manner, they are still SENT (without the prefix) in a payload Attribute (event.data / body)

A "general" OTLP transport improvement(s)

All "Attributes" with the same namespace (including ALL of the ones defined in the registry) as sent in a containing AnyValue (should really be a new special type) where the name of the top level Attribute is the namespace with the value being a Map of attributes with the namespace removed (just the name)
- eg. All http conventions would be in a sub-object http: { method: "GET", ... } etc
Remove the "need" for every attribute to have it's type (at least for the JSON representation - binary formats will always need to define the "encoding")). That way "strings", "numbers" (caveat with int / float), "arrays" "map" (represented as an object) and all other types would still "need" there "type. There is also a "special" need required for when transmitting fields that "may" requied metadata, to identify fields that may contain PII information so that backends can handle as required, depending on the provider / vender.

I'm going to (try to) stop now and I'm sure we will (and need) to discuss further in the Event WG (and other SIGs), for reference I have create a draft white paper (not yet public -- being reviewed and may not be able to be fully public) on why Microsoft is not yet using OpenTelemetry for the Web (some of this is summarized above, and I've talked in lots of SIG's about the reasons), this draft internal paper is 17 pages long with all of the issues and what is required.

And for clients breaking things down and trying to address (and / or punt) different concerns down the road doesn't work because they are all inter-related to different degrees because

the specifications drive the internal representations in-memory representation and the code required to support
the transport format along with the specification also drives the in-memory representations and the code required to support
the combination of the above drives the conversions (DTO's) required to go from the user (application) input, to the enforced specifications (readonly in some cases) and the into a form that can be serialized and the serialization process can then require further "processing"

To Date OpenTelemetry (appears) (IMHO) to have been driven by the "final" storage format used by many organizations and then creating wrappings around these "storage" formats for creating on servers to enable the data to be quickly dumped, and for high volume servers this is an excellent approach -- it just doesn't work for ALL client scenarios / languages.

jack-berg commented 11 months ago

There will be 1000's more "Custom" events then there will be OpenTelemetry defined events and requiring that every "Custom" event MUST have every "field" for the custom event prefixed is just not viable for most applications. This will "force" application custom level events to end up using AnyValue just to avoid the prefix for the EXACT same reasons that I'm pushing back on this for the OpenTelemetry defined events.

No. The attribute naming rules use strong normative language OpenTelemetry authors, but we can't enforce what application authors do for custom events. At best, we can define recommendations. For events defined outside OpenTelemetry and bridged over, we should minimally transform the original payload with mapping rules. The naming rules for are OpenTelemetry semantic conventions.

Bundle sizes in some languages (runtime constraints on the size of the code - ie browsers), strings are not compressable Transport payload serialization is a SUPER CRITICAL time sensitive Prefixing values to runtime defined fields just causes additional memory (and CPU) usage which are then batch (holding onto that in memory) and then during transport serialization this generates further excessive bytes

We shouldn't jeopardize the coherence and clarity of the telemetry semantic conventions for the requirements of extremely constrained environments. We must decouple data modeling from transmission, and work to accommodate constrained environments with transmission level solutions.

Possible "tweaks" which would make the "logical" prefixing of an event payload "fields". All of the payload fields for the events are "grouped" as the payload for an event and when present in the global registry they are "listed" with the prefix, BUT are NOT sent in this manner, they are still SENT (without the prefix) in a payload Attribute (event.data / body)

This is an example of a "transmission optimization" described in this comment, which is compatible with the notion of treating event fields as attributes.

Remove the "need" for every attribute to have it's type (at least for the JSON representation - binary formats will always need to define the "encoding")).

You're describing a breaking change to the OTLP JSON protobuf encoding. A variation which doesn't include the type is possible with a new protocol. This is still possible, since as this comment points out, "any transmission optimizations can still be built".

the specifications drive the internal representations in-memory representation and the code required to support the transport format along with the specification also drives the in-memory representations and the code required to support the combination of the above drives the conversions (DTO's) required to go from the user (application) input, to the enforced specifications (readonly in some cases) and the into a form that can be serialized and the serialization process can then require further "processing"

It sounds like your issues go well beyond the representation of Event fields, bubbling into the SDK specification. The SDK specification are stable of course, and its hard to imagine evolving them in a way that would accommodate the extreme requirements you describe. If the issues exist for all client use cases, it may suggest that a domain specific SDK might need to be developed for the specialized conditions which is not subject to the normal SDK specification language. If the issue only occurs in client application of a certain scale (or under certain constraints), it may be worth considering developing a custom SDK for your specific use case. The OpenTelemetry decoupling of API and SDK was designed for this purpose.

martinkuba commented 11 months ago

My view is the following - I agree that we should decouple the conversation about data modeling and performance overhead. There is no guarantee that attribute names without namespaces will be enough to address performance requirements of some applications. This should be addressed by choosing different exporters/protocols and perhaps even different SDK implementations, and application owners can choose which makes the most sense for their requirements. I also think that we need to just focus on data modeling at this point to make progress.

With that said, I am not 100% convinced that event payload fields need to be in the same global registry as regular attributes. Currently, attribute semantic conventions have a global scope, while event fields are scoped to the particular event. The combination of the event name and a field name should be enough to understand what the field represents. And in order to enable reusability, perhaps it should be allowed to use an attribute from the global registry as an event field.

I want to give one example. Assume that there is an event that represents navigation in a browser environment. We want to capture the previous URL and the current URL. We cannot reuse url.full here because its name does not capture the meaning of those fields (even though the value it holds does). The fields should be something like previouls.url and current.url. Does it add a value to add this to the global registry?

I would like to hear from @open-telemetry/specs-semconv-maintainers and @open-telemetry/specs-semconv-approvers about this topic. If there is a consensus that event fields should be in the global registry from this group and the TC, then I will be in favor as well in order to make progress.

pyohannes commented 11 months ago

Should fields in the event body be defined in the global attributes registry?

If the answer to this is "no", where else would those fields be defined? Wouldn't they be defined at all? Or would there be another event-specific registry independent of the semantic conventions attribute registry?

AlexanderWert commented 11 months ago

I agree with everything that @jack-berg wrote in https://github.com/open-telemetry/semantic-conventions/issues/505#issuecomment-1815289100.

Some additional perspective:

We are merging Elastic Common Schema (ECS) into OTel semantic conventions
A big portion of ECS (and the value it provides to the ECS community) comes from the area of describing events and their payload cross signal and cross use cases. And it's specifically useful to have a common, standardised way of mapping data for common, repetitive use cases into semantic fields that have a common meaning also across single use cases (e.g. a client.geo.* has always the same meaning independent whether it's on a span, comes from an NGINX access log, or is attached to a mobile error event).
So, as a learning from the ECS and it's community (that we are merging into the OTel community with the ECS donation) standardising on common fields / attributes (also in payloads) is something that provides high value and is very well appreciated by the community.
Of course there can be custom events with custom fields that are not documented and not standardised. But, when considering common use cases that we explicitly write semantic conventions for (e.g. how to parse NGINX logs into a structured event) we should use semantic attributes.

I want to give one example. Assume that there is an event that represents navigation in a browser environment. We want to capture the previous URL and the current URL. We cannot reuse url.full here because its name does not capture the meaning of those fields (even though the value it holds does). The fields should be something like previouls.url and current.url. Does it add a value to add this to the global registry?

@martinkuba In ECS, we have the concept of reusing field definitions in other namespaces to address exactly this use case. For example, we define geo.* fields here and in addition specify in which other namespaces these fields can be nested. So, on a single event you could have something like client.geo.location and server.geo.location to differentiate the context.

We also discussed in the SemConv WG to introduce a similar concept in OTel semantic conventions and have agreement that it's something we want to do. (Next steps is to work on a proposal for the syntax and tooling to realize it). Maybe that would help with the above concern.

scheler commented 11 months ago

A big portion of ECS (and the value it provides to the ECS community) comes from the area of describing events and their payload cross signal and cross use cases.

@AlexanderWert can you please point to an example in ECS describing the fields of payload for a given event type?

AlexanderWert commented 11 months ago

@AlexanderWert can you please point to an example in ECS describing the fields of payload for a given event type?

@scheler Sure, it's mainly defined by the Elastic integrations which are with their documentation pages to some extent comparable to the semantic conventions in OTel.

Let's consider a very concrete example: NGINX Ingress Controller. That page describes in very detail, how an access log from the ingress controller is parsed into ECS fields. In OTel terms, that page describes the semantic conventions for NGINX Ingress Controller logs.

(Please bear in mind that with ECS there's no differentiation between top-level fields, attributes and payload as it is with OTel / OTLP. But, that doesn't make a difference for the below reasoning regarding payload).

In the "Exported fields table" you will find both:

many ECS fields (related to the above discussion: this is comparable with the attributes registry)
but also NGINX-specific fields (these would be the use case specific fields)

Some inspirations we can get from this example are:

where meaningful, we should use generic semantic conventions fields
sometimes it makes sense to have use-case specific fields
- BUT: those should still have a meaningful namespace / prefix to be easily searchable / filterable without the need of additional attributes to define the context. (e.g. nginx_ingress_controller.access.upstream.name instead of just upstream.name)
- AND: even the domain specific fields need to be documented in semantic conventions in a consistent way (as it helps users to better understand and consume data)

As part of the ECS contribution we also hope to contribute semantic conventions and parsers / integrations for many of these integrations to the OTel ecosystem (which will turn the ECS addition into actual, consumable value for the community). The event / logs signal will be crucial for that.

martinkuba commented 11 months ago

If the answer to this is "no", where else would those fields be defined? Wouldn't they be defined at all? Or would there be another event-specific registry independent of the semantic conventions attribute registry?

@pyohannes They would still be defined here, but along with the definition of the event. For every event name, there would be a list of fields that should be included in the payload for that event. I suppose this is similar to how attributes are associated with metric names right now.

martinkuba commented 11 months ago

@AlexanderWert Thank you for the context around ECS. This makes sense, especially if we are looking to adopt existing proven conventions from ECS.

I think based on that it comes down to two questions: where are event-specific fields defined and should they have unique namespaced names?

I see that ECS can combine common fields (akin to OTel global-registry attributes) and fields specific to the integration. The NGINX-specific fields are defined along with the integration, but not in the ECS schema? In the OTel semantic conventions, if we have a document that describes an event with its payload fields, and one of those fields is specific to this event (similar to nginx_ingress_controller.access.upstream.name), does this field need to be defined elsewhere and only referenced here? Or, would we follow the ECS example where the specific fields are defined along with the integration?

BUT: those should still have a meaningful namespace / prefix to be easily searchable / filterable without the need of additional attributes to define the context. (e.g. nginx_ingress_controller.access.upstream.name instead of just upstream.name)

I think the key takeaway for me is that we want all field names to be self-descriptive without the need to know the event name. Is this correct?

The way I have been thinking about this is that the payload of the event is always interpreted in the context of the specific event. We would not be looking at individual payload fields independently. In theory, the payload could contain other type of data than a list of name/value pairs (the LogRecord body is AnyValue). So, interpreting the payload data will always be done in the context of the given event.

MSNev commented 11 months ago

I think based on that it comes down to two questions: where are event-specific fields defined and should they have unique namespaced names?

I agree, based on what Alex is saying about the repetitive use cases into semantic fields that have a common meaning this is exactly what the additional attributes are for. But this is NOT what the payload should be used for.

The payload is as @martinkuba defines above is for the event-specific fields, and it is up to an event definition to identify what specific fields it requires.

tigrannajaryan commented 11 months ago

Performance

So, first, I think in order to consider performance arguments we need to see the data, measurements that compare different approaches. I don't see that data, which makes it impossible to assign relevant weight to the performance argument. I am going to ignore performance aspects for now until we see the supporting data. We can reconsider the design when there is additional evidence from benchmarking.

Design Principles

For now I think we should base our design on the following principles:

It should be possible to record an event that originates externally to Otel (e.g. Kubernetes event, a browser DOM event) in a lossless manner, preferably exactly as is so that when troubleshooting one does not have to perform mental gymnastics to understand how the Otel's representation maps to the original.
Otel events should be correlateable to other signals. Given an Otel event it should be easy to tell which other telemetry records (non-Events) are related in some way to the event and vice versa.
Otel events should follow the existing requirements for other signals unless there are strong arguments that explain how following such requirements is impossible or creates major problems.
The design must optimize for common cases. Outliers and extreme cases should be possible, but may be suboptimal, difficult to implement and may require different code paths and different solutions.
Performance arguments must be backed by comparison benchmarks.

Proposal

Based on the principles I believe we should do the following.

For events that originate externally:

Event's original data SHOULD be recorded as-is in the Body field of the LogRecord.
Otel instrumentation MAY record additional information in the Attributes field of the LogRecord.
The format and composition of the data in the Body field SHOULD NOT be defined in Otel semantic conventions since it is not under Otel's control. Where there is a formal specification for the external event Otel semantic conventions SHOULD link to such specification.
The Body field may be a primitive type, an array or a map - all values representable by any type are allowed.

For greenfield Otel events:

Define semantic conventions that record relevant information in the Attributes field.
Avoid using Body field to record data, with one exception: relevant, externally originating data with unspecified composition may be placed in the Body. Such externally originating data may also be placed in an attribute with a defined semantic convention. [we may need further guidance for this rule to explain how to choose or we can instead restrict to only one way instead having two ways to record this info].

For all events:

Attributes field SHOULD record attributes that comply with Otel semantic conventions and match the requirements of all other attributes defined in Otel. Attributes are namespaced and are part of the global registry exactly as all other signal attributes.

breedx-splk commented 11 months ago

In the interest of not blowing up the conversation too much, I have added some thoughts on this subject here: https://gist.github.com/breedx-splk/14bcc08a8cbcdef5b635cb5d22989ccc Best move is to read it top to bottom (it's not too too long, I promise) but here are the highlights for the distractible and overworked:

The case for metadata - What is is and why I think it's both important and why we should keep it separate from the event body. There's a discussion about design principles and modeling and separation of concerns.
The options on the table for modeling events
- Option 1 Just attributes
- Option 2 Attributes but event.data
- Option 3 Attributes but event.data with typed schema
- Option 4 Event body in log body
- Option 5 Event body in log body, plus typed schema (where I thought we were 3 weeks ago)

That doc doesn't at all address the idea of putting event bodies in different places depending on whether or not they are otel native. That idea is baffling to me. I think having a single consistent place to look for the event body makes way more sense than having it somehow be conditional.

tigrannajaryan commented 11 months ago

@breedx-splk I do not understand the definition of metadata you have and the example you bring.

What about an example event? Let's make up an event for popping a balloon. We decide that the important things about a balloon popping are: balloon diameter

What about metadata? Surely, there could be some supplemental information about this event, like: balloon color

Why is "balloon diameter" an event data and "balloon color" a metadata? What makes "balloon color" a "data about other data" (what other data?), but "balloon diameter" NOT a "data about other data"?

[UPDATE]: Re-reading the proposal again, it seems the separation is between the required data and supplemental (optional) data. Is that a fair litmus test to use to understand the difference between the "event data" and "metadata" in your proposal? If that is not the correct litmus test can you please define one?

martinkuba commented 11 months ago

I think the CloudEvents spec describes this well: attributes are used for context and custom data, while event data (payload/body) contains "domain-specific" information. By domain-specific I understand the data that is inherent to the event - the event can still be interpreted without its attributes but not without its data.

The button click event in @breedx-splk's examples illustrates this well. Information about the button (e.g. its name) specifically describes the event, while screen dimensions or session ID describe the context in which the event occurred.

With that said, I would like to add an argument in favor of separating the event data from other attributes. Systems that interpret events might (very likely) want to handle the event data separately from the other attributes, whether for storage, displaying in a UI, or doing some additional processing. If the event data is combined with other attributes, how would they be distinguished? The consumer would have to be aware of all field names of all events it wants to handle, or all names that are only used for context. I suppose there could be a convention like everything prefixed with the name of the event, but that does not sound like a clean solution.

tigrannajaryan commented 11 months ago

I think the CloudEvents spec describes this well: attributes are used for context and custom data, while event data (payload/body) contains "domain-specific" information.

The CloudEvents' Context has a special purpose. It is serialized independently from event data. I think Otel already has the equivalent to that: we have our own Context. Saying that the Context (whatever it is) should be recorded in the LogRecord Attributes field does not help with the CloudEvents' stated goal of independent serialization and is not necessary.

I do not see "custom data" mentioned anywhere in CloudEvents spec. Please clarify what you refer to.

@martinkuba It would be very useful to see a clear definition that explains the difference between attributes and event data.

You say:

By domain-specific I understand the data that is inherent to the event - the event can still be interpreted without its attributes but not without its data.

Is this the definition of what constitutes event data vs attributes? If it is than I am afraid it is going to be hinged on what "interpreting" means.

In the examples of @breedx-splk's proposal "session.id" is shown as supplementary (so not "inherent" to the event). I can argue both ways: a) "session.id" is not necessary for interpreting the event, b) "session.id" is absolutely necessary for interpreting the event.

Which way I argue depends on what is it exactly that I am doing when I am "interpreting". If I am trying to build the click heatmaps in my UI, I don't need the "session.id", I just need coordinates of the click events. If I am trying to replay a particular user's actions over time I absolutely need the "session.id", I cannot interpret without it.

So, to use the definition that you brought ("necessary for interpreting") I think we need to say more concretely what "interpreting" means.

martinkuba commented 11 months ago

@tigrannajaryan

I do not see "custom data" mentioned anywhere in CloudEvents spec. Please clarify what you refer to.

I was referring to the Extension Context Attributes. By context, I also meant any data that provides additional context about the event, such as the environment.

jack-berg commented 11 months ago

@breedx-splk I read your writeup - thanks for clearly laying out your thoughts. Some comments:

You try to lay out the distinction between the event and meta data with a couple of examples: a balloon, a person, and an example button click event. In all these cases, the distinction between payload and metadata is not obvious:

Why is the balloon diameter and surface material intrinsic to the balloon but the color is metadata?
Why is a person's address intrinsic but their job title is not?
Why is the button click coordinates intrinsic but the screen name is not?

In all these cases, the difference seems to be somewhat arbitrary. As @tigrannajaryan points out, I think for the distinction to be made between the payload and other attributes, we need a convincing litmus test. So far, the ones I've seen are hand-wavy.

You mention the following:

When handling an entity with a known schema, the handler does not need to try to process unexpected fields or work around unknown types.

How do you reconcile that with the need to be able to evolve event schemas, and add additional fields over time? A consumer may expect version 1.0.0 of the event schema with includes fields foo, bar, but a client might emit events with version 1.1.0 of the schema, which includes foo, bar, baz. Whether the payload is separate from attributes or not, processors will need to be able to handle fields they don't recognize.

The option 5 you present has no listed cons. I can think of a few: Terse fields have the negatives listed here. One thing we discussed in the SIG is the ability to opt into including the original unadulterated click data as provided by the browser in the OpenTelemetry event. The body is the natural place for this as it matches the description from the log data model:

However, a structured body SHOULD be used to preserve the semantics of structured logs emitted by Third-party Applications.

Any solution in which the body is used for the first-party event fields requires another opt-in field in the schema to be able to accommodate this behavior. This is part of what convinced me of @tigrannajaryan's proposal. I now imagine event semantic conventions as each defining a name, set of attributes, and description of where the body comes from (i.e. typically a link to the API used to populate it).

breedx-splk commented 11 months ago

@tigrannajaryan has said

Why is "balloon diameter" an event data and "balloon color" a metadata? What makes "balloon color" a "data about other data" (what other data?), but "balloon diameter" NOT a "data about other data"?

and @jack-berg said something very similar

Why is the balloon diameter and surface material intrinsic to the balloon but the color is metadata?

So the designer of an event schema is tasked with deciding which fields (from the set of all available things in the known universe) belong to this event. For some events, the name may be enough, and the required set of fields is just the empty set. For others, the designer must choose which fields are necessary (required) for for their specific problem domain, which fields are nice to have (optional), and which fields are not included (everything else).

The balloon example is contrived, so I mentioned let's assume that the balloon committee agreed that the above is the important stuff for most purposes.. I didn't explain the reasons why the committee decided (after several long and stressful meetings) to not include the color, but you could imagine some contrived reasons why...

If there needs to be a litmus test for whether something is a core piece of an entity or if it's metadata, I suppose it involves answering these questions:

Is this thing a fundamentally important component of the entity? How do you define fundamentally important? Who knows....philosophers maybe.
Is the entity something less without this component?
Is the entity harder to recognize without this component? Is the entity more recognizable with it?
Is the entity less useful to a consumer without it?
Would a typical consumer be surprised to see this component? Would they be surprised without it?
Is the component critically useful or merely distracting?

I'm not convinced that there exists a singular test that can be performed to make design decisions for data modeling or event design purposes. Design is a craft that requires art, experience, and consideration.

The event designer is deciding what constitutes the event based on the needs of the emitter ("I want to communicate this important thing outward") and the requirements of known receivers ("I want to handle events that have these qualities"). Events should be designed to address something in a problem space.

I think nobody would be surprised to find a coordinate in a mouse click event. They might be surprised though if it was a button click event. Those are different things that serve different purposes. If I receive a person entity and look inside and find a field called lawn_size I will probably laugh...and I've never seen that in the real world...but then again, I've never written software for a landscape architect.

If it does, in fact, turn out that the balloon's color is helpful when processing these pop events, then by all means we should improve our schema design. Similarly, if we think we'll never mess with mylar, let's remove the surface material while we're at it.

breedx-splk commented 11 months ago

@jack-berg said

processors will need to be able to handle fields they don't recognize.

Yeah totally, I agree. I concede that point as a tad flimsy...but there are pathological cases where entities have all this extra stuff that most consumers don't want or need. Heavyweight burdensome cruft. It's nice not to have to deal with that.

The option 5 you present has no listed cons. I can think of a few:

Agreed that those items you called out are applicable to Option 5. I don't agree that all of those things are outright cons, but I understand where you are coming from. I'll try and keep my take short.

Additional context required to interpret fields. - Yes, I do think that event fields are conceptually different than existing attributes. I understand the desire for consistency, but reusing attributes for rich object/event fields feels unnecessarily forced. I do think this "additional" context is fundamentally important.
A lack of ability to correlate fields between signals - I think it's ok to keep the correlatable parts as top-level attributes. It might be too early to say that we need to correlate event fields with other signal attributes (again, I'm using a conceptual difference because I think there is one).
A performance-justified race to zero. - I agree that there's a somewhat counterproductive optimization instinct at play here, but I think the race to zero is the slippery slope.

I've added a link to your comment as "cons" in my linked gist. Thanks for the input!

MSNev commented 11 months ago

Performance

Formatted response on the performance aspects of this (everyone should have comment access) https://docs.google.com/document/d/1fPMCCCsFFPOq02fZ2OfjFDKlZs7xIxLkhl7G3rnJ5Kg/edit?usp=sharing

MSNev commented 11 months ago

@jack-berg said

processors will need to be able to handle fields they don't recognize.

Yeah totally, I agree. I concede that point as a tad flimsy...but there are pathological cases where entities have all this extra stuff that most consumers don't want or need. Heavyweight burdensome cruft. It's nice not to have to deal with that.

And at Microsoft we do exactly that, we support the concept of "Custom Events" (ie. application derived events), we don't control what "fields" they include, we provide the ability to send a "custom event" and all of the payload fields are extracted out effectively into columns that the application owner can use to query however they like across these custom events. The only "mandatory" field we require is the custom event name as any form of visualization / comparison (generally) only make sense within the context of the same event.name. i.e. It would generally not make sense to compare fields within a balloon pop events with production rate of pins, unless of course if you make both pins and ballons in the same location 😃

jmacd commented 11 months ago

I'll bring in a few additional perspectives to see if it may help.

"Performance concerns" mean different things to different people. I don't need to see anyone's data to be convinced that a smaller encoding is better for performance.

Speaking for the OpenTelemetry Protocol with Apache Arrow project, I think we should get away from using the cost of a "fully-expanded" representation to justify abbreviation. In general, we should not be encoding the name of the attribute with every data item on the wire, whether the name is fully qualified or a short, event-specific field name.

Everyone arguing in this thread could have what they want, if OpenTelemetry would get better about encoding schema information. The schema in today's https://github.com/open-telemetry/otel-arrow does not specialize for the logs Body field having recursive structure, but it could be extended in that direction. What we should be aiming to see, I think, is that only the variables of an event need to be encoded, exported, and sent.

This is what you get using Arrow-based transport. By the time you are exporting that event, it will be decomposed into a number of rows; the schema of each row will be known and transmitted once per stream, and from a performance perspective it won't matter whether:

Field names are fully-qualified or implicitly qualified by event "identity"
Fields are nested in the body
Fields are expressed as attribute values

The OTel-Arrow group would be glad to discuss this; it's something @lquerel experimented with before the project donation.

So, should event fields be part of the global attribute registry? No. I think events should have schemas. We should transmit schemas once per stream, we should transmit variables once per event, and OTel-Arrow will take care of reconstituting the fully-expanded representation at the end of the pipeline, where it's somebody else's performance concerns.

It is a separate concern how schemas should be encoded and registered, and so on. This is currently under discussion in relation to a proposal, also by @lquerel, in Introducing Application Telemetry Schema in OpenTelemetry - Vision and Roadmap and the #otel-weaver channel. We think it would be best if the schema was used to compile strongly-typed interfaces, which would compile directly into an efficient implementation, that would alleviate the performance concerns found here.

Reading the gist @breedx-splk, I find myself with a few more questions.

Timestamp: is relevant, just not under debate, is that correct? Otherwise, is there is a way to order events created by a particular resource?

but for the purposes of this discussion, because they are not relevant, let's drop timestamp and severity,

Session ID: I haven't forgotten https://github.com/open-telemetry/opentelemetry-specification/issues/2500 and I still want to see us store session-id as a non-identifying resource attribute. Again, this is something we could accomplish with more schema information. The present ("weak schema") definition for resource prevents us from adding ephemeral and other descriptive information with cross-cutting applications, and I want to see us fix that.

Option 6: speaking from a past life, I could at least imagine more options than those enumerated in the gist linked above. For option six, I propose to add a couple of things to the OTel log record:

A sequence number. Log records need to be globally identifiable, and using a timestamp is not enough. When you later use a schema, it will be by reference to the log record in which it was defined (i.e., "in my own log, refer to the record with sequence 6, the definition named schema/17").
A definitions section for metadata. In the definitions section, a semantic-convention would be created to for schema definitions. The schemas of the application would be logged once -- at first use -- and subsequently known (or presumed known). Anyone interpreting a log record must have made an effort to save those definitions because they'll be referenced again. While I like this story, I find the OTel-Arrow answer is an easier path forward.

MSNev commented 11 months ago

@jmacd "Performance concerns" mean different things to different people. I don't need to see anyone's data to be convinced that a smaller encoding is better for performance.

100% agree.

And for clients there are many additional concerns which are not currently addressed, and unfortunately something like Apache-Arrow (at this point in time) is a non-starter at many different levels which include

The code required on the client https://bundlephobia.com/package/apache-arrow@14.0.1 -- 222.4kb and 50.7kb zipped (just for this library -- this is an outwrite non-starter today) for comparison here is the full SDK that we have (including the full encoding and transmission of the events)
- Application Insights -- 150kb and 55.4kb zipped (this is a general public SDK)
- For most of the internal MS sites they use a different version (because the above is too big), and they use a combination of these 2 packages (when merged into a single bundle it's actually smaller than them seperately)
- 1ds-core-js and 1ds-post-js so just each raw size this is 139Kb and 51.9 zipped (both of these include a bunch of common reused code (like the ts-utils package) so the "real net size is at least 20kb smaller)

There are also some good words and some concerns in your response as well.

I think events should have schemas 👍

We should transmit schemas once per stream, we should transmit variables once per event

If this means that for every "request" getting sent from a client, then this is problematic and depends on "how" the schema is encoded on the client. Ideally, it's not included and instead it's either "inferred" (by a name) or just referenced url. The other issue is that a "client" (browser specifically) will be sending multiple requests (streams?) per page load and each page load will be completely isolated from each other so there is no "context" of what has already been sent.

OpenTelemetry would get better about encoding schema

Encoding is also problematic, as the "time" required to serialize into said encoding is one of those critical performance concerns for client environments. Ideally, this needs to use a native environment library (for browsers the most common is JSON.stringify()) as there is no faster way to convert an object hierarchy into a serialized form.

Could Arrow be a solution (or part the of) -- maybe, but not today (from what the JS and Client Sigs' have currently seen -- moslty because of that required "bundle" size) so the following comment and the result would be key to that

We think it would be best if the schema was used to compile strongly-typed interfaces, which would compile directly into an efficient implementation, that would alleviate the performance concerns found here.

tigrannajaryan commented 11 months ago

Formatted response on the performance aspects of this (everyone should have comment access) https://docs.google.com/document/d/1fPMCCCsFFPOq02fZ2OfjFDKlZs7xIxLkhl7G3rnJ5Kg/edit?usp=sharing

Thanks @MSNev, this is very useful. I have put together a quick comparison test here: https://github.com/tigrannajaryan/exp-jsevent

It generates synthetic events based on github.com stats you provided. It would be great to modify this to use more real event composition to make sure it is a fair comparison.

So far I have this results:

In original, non-encoded form the flattened events are 85896-32550 = 53346 bytes larger (85896/32550=2.64 times larger) than in nested form (both JSON encoded).

In dictionary-encoded form the flattened events are 23389-21426 = 1963 bytes larger (23389/21426=1.09 times larger) than in nested form (both JSON encoded).

Dictionary encoding time: about 1.7ms.

I am probably the worst JS coder in the world. It would be great if someone who knows what they are doing would take this and turn into a proper benchmark (and verify that I don't have measurement bugs and the comparison is fair).

The question I have: is 53KB (or so) extra RAM, 1.7ms extra encoding time and 9% larger JSON payload (before any gzip compression) a dealbreaker?

scheler commented 11 months ago

@tigrannajaryan where did you run this test? Is it on your laptop? Please see the article on how slow some CPUs are in the mobiles of users around the world.

https://calendar.perfplanet.com/2023/wikipedias-worldwide-web-cpu-benchmark/

tigrannajaryan commented 11 months ago

@scheler it would be great if you can help produce benchmarks on relevant devices (and with more realistic payloads, mine is probably not a good approximation of real world). It will make the performance argument stronger.

lquerel commented 11 months ago

@MSNev Currently, the protocol based on Apache Arrow is not optimal for browser environments due to the current size of the Apache Arrow JS library and the specific lifecycle of web pages, which makes it hard to offset the initial cost of the Arrow schema. In the future, these two issues could be addressed by creating a variant of this protocol for constrained environments like browsers. However, this is a separate project that will take time to develop, so I won't focus on it right now.

As @jmacd highlighted, the proposed Application Telemetry Schema could address various issues discussed in this thread.

1) The Application Telemetry Schema allows importing and referencing semantic conventions and also defining application-specific events and attributes. 2) Once "resolved", an Application Telemetry Schema is self-contained and can be referenced by a URL, which can be used when emitting signals (so the schema itself will never be sent from the browser). 3) A Client SDK generator could be used to create an optimized Client SDK specifically for the application's needs, ideally containing only necessary code without any generic code. 4) Regarding the size of field/attribute names, it's possible to establish a dictionary in the resolved schema to maintain OTel conventions and ease of field interpretation without impacting performance. Browser-emitted messages referencing the schema by URL could use the numeric encoding representation of fields defined in the resolved schema. This approach is compatible with the short lifecycle of web pages, with the dictionary being known in advance and present only server-side.

I believe this approach meets performance constraints while preserving semantic conventions and the need to define fields/attributes that are easily interpretable and rich in metadata.

breedx-splk commented 11 months ago

@jmacd

Timestamp: is relevant, just not under debate, is that correct?

Correct. I just wanted it off the table while we discuss the other bits.

MSNev commented 11 months ago

@lquerel

Currently, the protocol based on Apache Arrow is not optimal for browser environments due to the current size of the Apache Arrow JS library and the specific lifecycle of web pages

This was our (OT Sig's) observation as well from looking at the 1000ft level with the details we had, so it good to see that we didn't miss anything.

Application Telemetry Schema

Yes, I agree, the definition of "events" (especially application custom events), has a very high degree of correlation with what we are trying to do in the Event WG

Make it easy to identify that a receiving system "knows" that the thing it's looking at is an event (the presence of an event.name attribute)
Based on this "name" backends can effectively "infer" the "schema" of that event both directly ("shared knowledge of what constitutes the payload fields of the event) and indirectly (having all payload fields separated from the standard attributes (which will provide additional context / grouping but not necessarily part of the event)
Have this "shared" understanding defined in Semantic Conventions so that for any language that wants to create an instrumentation (or any other code) for any "defined" OpenTelemetry defined event that all languages would create it in the same manner. Simple examples for web pages might be both JS and PHP inject code onto the page to produce "Page view" events etc.

Do we "need" a schema, we had been going down the initial path of not "requiring" one. Would it be useful as a generic definition -- absolutely. Should the schema be required to be included in the outbound telemetry for events to be useful -- no (on the "assumption" that we have a set of "well known" (and defined) events where the event.name is unique enough to uniquely identify each event (which I believe we have documented this already in the spec), so then backends could reference any published schema to "handle" these well known event definitions.

lquerel commented 11 months ago

@MSNev

Should the schema be required to be included in the outbound telemetry for events to be useful -- no (on the "assumption" that we have a set of "well known" (and defined) events where the event.name is unique enough to uniquely identify each event (which I believe we have documented this already in the spec), so then backends could reference any published schema to "handle" these well known event definitions.

Yes, events without a strict schema are useful as they are. But to further optimize this scenario, and to facilitate interoperability between systems, I think that yes, a schema is useful and not necessarily constraining. This is essentially what has made protobuf or thrift successful at the backend level, and it is also the direction that REST interfaces have taken with OpenAPI (and TypeSpec).

jmacd commented 11 months ago

@MSNev about the use of "performance arguments", I want to try and maintain the following balance:

If an argument like "1000 extra bytes really matters" is true for a large computer platform, because there really are large numbers that, when multiplied by 1000, cost a very significant amount, then the following should hold.
If you are in a position where the above statement is true, you are required to act the part. For this to be true, you cannot also be afraid of extreme-optimization challenges or expect to use simple, off-the-shelf software solutions.

To say that Arrow is a non-starter, to me, breaks this second principle. No one said you should use an Arrow library to accomplish this. The Arrow data specification combined with OTel-Arrow definitions is what you will use; I expect you to compile a custom event interface that is aware of your schema; I expect you to use WASM, and just as much code as necessary. Also, I won't be surprised if Arrow interfaces become standardized for browser environments. If 1000 bytes is important to you, I expect you have the resources to find a solution to this problem.

tigrannajaryan commented 11 months ago

One more thing regarding the Schema files and Body field.

The Schema files currently allow describing the changes to Attributes field, but not to the Body field. If we put any data in the Body field the schema of which we control we either 1) will have no ability to describe changes that happen to such data or 2) we will need to add the ability to describe changes in the Body field to the Schema File.

The later is additional complication and likely will be more complex than how Schema files describe Attributes (because in Body filed we will allow nesting of fields and it creates more variations of data changes).

MSNev commented 11 months ago

Understand, I think what we are thinking about in terms of schema (specifically along the lines of an application schema), is more akin to the way that Cloud Events describes it for the definition of the payload.

tigrannajaryan commented 10 months ago

@scheler I have rerun the test on the slowest phone that is available to me (Oneplus, 2017 model year) and I get around 3ms of encoding time for that synthetic 121 event batch that I modelled after @MSNev's example from github.com.

Updated benchmarks, with comparison to JSON.stringify, standard OTLP JSON and a new encoding that uses dictionaries: https://github.com/tigrannajaryan/exp-jsevent/blob/main/README.md

Anyone who has access to slower devices feel free to rerun the benchmark. So far the numbers I am seeing look acceptable to me.

scheler commented 10 months ago

@tigrannajaryan thanks for creating the benchmark tests. I checked that Oneplus 2017 is about twice faster than Moto G5 that's used as a reference in the wikipedia cpu benchmark tests, and noting that in South Africa 25% users have devices slower than Moto G5, we can deduce that this encoding could take ~8-10ms on slower devices, compared to 1ms on faster devicds. Martin also pointed out that this is "additional time", on top of the time the original apps already spend, in batching, serializing and sending the content out. People have performance budgets for the whole app and so numbers matter when choosing additional libraries to integrate into their app.

It's interesting you included JSON as-is in your benchmarking tests - it's only slightly larger than "OTLP with key dictionary" when used in the form "Nested, in Body, no namespaces" and also about half the size of OTLP JSON. The payloads will be lot smaller if we use key dictionary on top of JSON as-is. I guess it makes a case to support JSON as-is in OTLP irrespective of any further encoding. This will be helpful for cases where protobuf is not used, for eg., pages that automatically redirect to others give you very little to do anything. I noticed this was discussed in the Event SIG last week - will this be a v2 of OTLP JSON or this will have to be a transport encoding on top of OTLP JSON?

tigrannajaryan commented 10 months ago

It's interesting you included JSON as-is in your benchmarking tests

@scheler yes, I intentionally included it as it is likely the lower bound of what is possible from encoding duration perspective since json.stringify() is just a single call to a native function. The difficulty with as-is encoding is that it essentially fully couples the in-memory representation of events in the SDK with the wire protocol and once published makes difficult/impossible to refactor the SDK's in-memory data models since that will break the wire format. It may be acceptable for hyper-optimized Otel implementations but likely is a no-go for a generally useful Otel SDK and protocol that needs to evolve over time.

The payloads will be lot smaller if we use key dictionary on top of JSON as-is

I suspect "as-is with dictionary" may have roughly the same encoding execution time as the "OTLP with key dictionary". I may be wrong, it is worth experimenting with. I posted the benchmark source codes so that anyone is free to play with it and try adding new formats for comparison.

it makes a case to support JSON as-is in OTLP irrespective of any further encoding. This will be helpful for cases where protobuf is not used, for eg., pages that automatically redirect to others give you very little to do anything.

I am not sure I understand this part. Can you elaborate?

scheler commented 10 months ago

it makes a case to support JSON as-is in OTLP irrespective of any further encoding. This will be helpful for cases where protobuf is not used, for eg., pages that automatically redirect to others give you very little to do anything.

I am not sure I understand this part. Can you elaborate?

For short-lived pages - that immediately redirect to another page or pages that have a high chance of users navigating away immediately - we can deliver a lighter version of javascript bundle that includes just one instrumentation and a minimal SDK with only Events support, no batching, no protobuf, to fire off a pageview event that can be exported immediately. Since it's just one event, OTLP JSON should be okay if JSON as-is is not possible for the reasons you outlined.

martinkuba commented 10 months ago

A few more thoughts on event data as attributes or body:

There will be three kinds of events: OTel-defined events, 3rd-party events, and custom events (added by the application via an API).

Consistency - We have talked about putting the data of a 3rd-party event in the body, and the data of OTel-defined event in attributes. I think this is confusing and inconsistent for backends processing events. If there is such a thing as event data, then it should be in the same place for all events, so that backends do not have to treat different events differently.

Regarding the litmus test of what data goes in event data vs attributes - I think this is independent of where it should go in the data model. For OTel-defined events, we will still need to document what fields are associated with a specific event. I expect that each OTel-defined event will have a document in the semantic conventions documenting its name and the data (e.g. list of fields) that it carries.

IMO, the only confusion of whether to put some data in the body or attributes would happen with custom events. In that case, the end user has control and they can decide where to put their data based on how it is used. If we combine event data with other attributes, and want it to not be confusing, should we not have a concept of event data in the API? But we still need it in the API for 3rd-party events?

I think the main question for me is how events will be processed and stored. Will there be use cases where the backend needs to distinguish between event data and attributes? If yes and we do not separate them in the transport, then we will make this use case impossible, at least for custom events - assuming that a schema could be used to extract event data attributes for OTel events, and for 3rd-party events it would be in the body.

jack-berg commented 10 months ago

Both in today's 11/12/24 event SIG and the previous 12/8/23 meeting, we just discussed the idea of start with a set of principles and letting the design fall out of that. The idea being that designing without a guiding set of principles can end up feeling subjective, be hard to defend, and lack coherence.

I propose the following design principles, ordered in terms of priority (high priority items first):

Symmetry with existing OpenTelemetry signals and concepts
Data modeling is decoupled from the transport
Suitable for different languages, different runtime environments
Intuitive API for users
Intuitive data model for consumers

From these, the following design falls out. I'm being brief because the characteristics of these things has been discussed elsewhere. The numbers in parenthesis indicate the number of the design principle that informs that decision.

Event fields are represented as attributes. The log record body is unused. (1, 4, 5)
Event fields are part of the global attribute registry and subject to the general naming rules, including requiring namespaces. (1, 2, 4, 5)
There is no distinction between an event's payload and an event's attributes. Consumers wishing to validate against the event schemas defined in semantic conventions can drop or error when encountering recognized attributes. (4)
Define an attribute to hold the raw payload (e.g. event.raw or event.source_payload), such as the the kubernetes defined payload produced by the k8seventsreceiver or the raw payload provided by a browser when a click occurs (probably opt in for users). (5)
Evolve OTLP such that constrained environments need not include fully qualified attribute names. This might involve a dictionary encoding, or generated client libraries, or updating semantic conventions to include unique shortened identifiers for all attributes in the global registry which can be used as alternatives on the wire. (3)

martinkuba commented 9 months ago

Here is a counter-proposal, separated into two parts that I think can be decided separately.

Separation of event data and attributes

Design principles:

Consistency: OTel-defined and external events should be handled the same way.
Intuitive / clear design: It should be clear how to understand and process an event without following a complicated set of rules.
Flexibility: the data model should make it possible for backends to process event data separately from other attributes in a clear way. We cannot anticipate all use cases of how backends will handle events. By separating event data from attributes, we make it possible for different use cases of ingesting events to be implemented.

Proposal:

Event data is separate from other attributes in the data model (1, 2, 3)
Event data is transported in the LogRecord body field for all events - external and OTel-defined (1, 2)
Optional attributes event.data_type and event.schema define the data type and schema for reading/validating contents of the body field (2)

Naming of event data fields for OTel-defined events

Design principles:

Make it possible for data of events to be defined in a way that can meet requirements of specific domains (e.g. minimize data size in client applications).

Proposal:

For OTel-defined events, the event data can be any data type. If it is a collection of name-value pairs, the names do NOT need to be part of the global attribute registry since they are defined in the context of the event.

MSNev commented 9 months ago

I've not had time to formally enumerate my design principals, but they will be based on my original comment https://github.com/open-telemetry/semantic-conventions/issues/505#issuecomment-1810865899 which we can talk about in the SIG.

breedx-splk commented 9 months ago

I agree with a lot of the above! I'll offer up my approach:

Design principles:

Simplicity over reuse.
Minimal surprise.
Easy for event designers to use to design new events.
Easy and simple for event consumers to read/parse/validate.
Extensibility is good.

Rather than make a proposal yet (wherein my $0.02 adds a circle to the growing Venn diagram), I want to offer some push back / alternatives to existing/above ideas.

I think events are now already namespaced. We decided that previously in #473 when the domain was added to the name. I love this. It means a user knows exactly what to search for to look up an event semconv/spec, and it makes event definitions unambiguous, analogous to ARN or schemaUrl or xmlns. Because events are defined inherently within a namespace already (based on their, ahem, name!), I don't think it's necessary to namespace the fields within an event.
I believe that forcing a namespace on event members/fields for consistency with existing attribute definitions is a mistake. While I can understand the desire for singular definition, consistency between signal types, and not repeating oneself, I think there's better ways to alleviate this.
If an event designer wants to include an entity/field/attribute/whatever-we're-calling-it-now that has been already specified somewhere else in our specification, they should be able to reference it or maybe even alias it.
I think a reference or alias to an existing field/attribute should be part of the event definition (eventually schema?), and doesn't need to be communicated through the field name.
If an event designer discovers that they want to include a field that is already defined in another event, they have a couple of choices. They can simply reference it (same as saying "it's the same as this other field in this other event") or they can pull the field up to some topmost semconv (like attributes are today) and then reference it in both event definitions. If we hate choice, we can dictate the latter I guess.
I don't think we need to have a formal schema out of the gate. I think it is fine to have events defined in markdown for quite some time. If we want to overlay a schema format in the future, I'm optimistic about that being possible and probably rather easy.

MSNev commented 8 months ago

We will open new issues related to the outstanding issues here as the original question is now answered.

jack-berg commented 8 months ago

What's the resolution?

MSNev commented 7 months ago

The answer to the question posed Should event fields be part of the global attribute registry?

The answer is No, the payload fields should not be part of the global semantic registry, we are moving forward to define how the payload fields will be defined. https://github.com/open-telemetry/semantic-conventions/issues/755

martinkuba commented 7 months ago

Part of what is discussed here is whether the event payload should be represented in the log body or as a number of individual attributes. This has already been addressed in https://github.com/open-telemetry/semantic-conventions/pull/566 by specifying that the event payload should be in the log body.

Modeling the event payload is currently being worked on (https://github.com/open-telemetry/semantic-conventions/issues/755 as @MSNev pointed out).

I am closing this issue since it appears the discussion here has been exhausted, and more specific discussions are now happening in other issues. But if anyone wants to continue this conversation here, please feel free to re-open this issue.