Closed martinkuba closed 7 months ago
Just to document my take on this and this will include some generic definitions on how "events" should be defined.
By Default for the payload
(fields)
fields
it needs to describe its payload)fields
included into the payload (for OpenTelemetry defined events) MUST be defined in semantic conventions, with the event.name
and each field MUST be described (required / optional) and any description.field
MAY reference out to a definition of the global semantic convention, if the field
needs to conform to an already well defined semantic convention, but does not want / need to use the same fully qualified name (eg. due to runtime constraints , there are multiple fields
that would confirm to the same definition or because the field names are defined by an external definition but OpenTelemetry has already documented the definition)events
MAY include the same field
name within their own payload
and this DOES NOT infer that they are the same concept as all fields of a payload
MUST be considered to be unique to each specific event.name
. However, as identified above the field
definition above MAY link out to a global semantic convention to provide this weak assertion (if required).
fields
of different events SHOULD NOT generally be a candidate for grouping (unless otherwise specified between specific events event.name
, eg. Event A MAY defined the same field as Event B AND define these as representing the same value, so they could be used for some form of grouping / metric (if required).Optionally
event.name
is free to define its own set of payload fields
field
MAY use the fully namespaced global registry defined semantic convention, it is up to the domain specific group defining the eventpayload
of an event is NOT REQUIRED, again, it's up to the specific event domain to determine how to best represent the event (eg. there may be some events that do not require a payload and thus do not need to specify an empty payload
)event
definition (whether defined by OpenTelemetry, a vender or application event), MAY identify top level attributes (ie not withing the payload
) that MAY be expected to be included with the event
Optional (additional) Attributes (not the payload
fields)
I’ve stated my opinion on this before and engaged in lengthy discussions in my Event payload proposal document. I’ll recap my stance and offer some old and new arguments:
Event fields should be part of the global attribute registry and should be subject to the same naming rules as other attributes.
In a nutshell, the discussion comes down to a question of whether the constrained environments (limited network bandwidth, CPU, package size) of certain client scenarios should influence how OpenTelemetry models semantic conventions for Events (and presumably other signals as well). The answer should be no. We should clearly decouple modeling the shape of telemetry (i.e. semantic conventions) from how telemetry is sent over a network. Accepting that Event fields are different from attributes on other Signals would result in:
Another aspect of the discussion is whether including Event fields in the global attribute registry is unnecessarily cumbersome for people looking to contribute Event semantic conventions. Do we want a bunch of domain-specific one-off attributes? Do we want contributors of Event semantic conventions to have to understand the universe of attributes such that they re-use an existing attribute when the concept has the same semantics? Do we want domain experts to have to define their Events in collaboration with semantic convention maintainers? Yes, yes, and yes. The process of defining semantic conventions is intended to be low overhead and accommodate domain specific concepts. Defining Event semantic conventions should be subject to the same process and scrutiny as any other convention. There’s no difference between a group wanting to define domain specific Event conventions and wanting to define domain specific Span conventions. Both require defining how domain specific concepts fit into semantic conventions.
Up to this point, I’ve been expressing my own opinion. However, we talked about this issue in the 11/15/23 TC meeting and there was agreement among the present TC members that Event fields should follow the same process as other attributes defined in semantic-conventions. We discussed that OTLP as currently defined may not be suitable for certain extremely resource constrained environments, and that the Client WG in collaboration with the Event WG should work to prototype changes to the protocol to accommodate these environments. Ideas included various versions of the symbol dictionary, or using a columnar format like the OpenTelemetry Arrow Project, or even a stateful domain specific protocol. Note that this opinion from the TC reflects consistency on a previous vote about whether metric attributes should be namespaced. The last comment on that issue is particularly relevant here:
I think the main decision here was that they are conceptually namespaced and that's how we will describe them in semantic conventions. On this basis, any transmission optimizations can still be built.
An interesting aspect of this problem is that it is not limited in scope to the Client WG, or to Event WG. It involves questions of data modeling that squarely intersect with the Semantic Convention WG, it has implications on how the different telemetry signals interact with each other which is a clear specification level issue which is maintained by the TC, and it has implications on all maintainers since Events will be used outside of the client instrumentation space.
So far, we’ve heard a lot from the perspective of the Client and Event WG, and we’ve discussed this in the TC. I’m interested in whether there are any additional perspectives from other semantic-conventions contributors (other than the maintainers who are also on the TC) and from language maintainers? Are all the arguments on the table or are there additional things we haven’t yet considered?
Up to this point, I’ve been expressing my own opinion. However, we talked about this issue in the 11/15/23 TC meeting and there was agreement among the present TC members that Event fields should follow the same process as other attributes defined in semantic-conventions. We discussed that OTLP as currently defined may not be suitable for certain extremely resource constrained environments, and that the Client WG in collaboration with the Event WG should work to prototype changes to the protocol to accommodate these environments. Ideas included various versions of the symbol dictionary, or using a columnar format like the OpenTelemetry Arrow Project, or even a stateful domain specific protocol. Note that this opinion from the TC reflects consistency on a previous vote about https://github.com/open-telemetry/semantic-conventions/issues/51 The https://github.com/open-telemetry/semantic-conventions/issues/51#issuecomment-1608853732 on that issue is particularly relevant here:
I think the main decision here was that they are conceptually namespaced and that's how we will describe them in semantic conventions. On this basis, any transmission optimizations can still be built.
I want to document my VERY STRONG disagreement with this approach on several levels that I'm not quite sure where to start...
Strongly agree:
Strongly disagree:
AnyValue
just to avoid the prefix for the EXACT same reasons that I'm pushing back on this for the OpenTelemetry defined events
.event
(ie. the event.name
) is what you are proposing here as that is the ONLY viable option, prefixing with just the namespace
of the event WILL cause clashes with different fields in the same domain namespace
group.
serialization
this generates further excessive bytes (further compounding the OTLP transport problems), and using some sort of client side "compressing" (symbol disctionary, etc) does not help this situation when you have LOTS of unique events with unique namespaces
as every field name
is now unique and has to be listed in the symbol table (where it's most likely only EVERY used once).serialization
is a SUPER CRITICAL time sensitive operation and additional "time" taken to "compress" MUST be avoided at all costs.JSON
implementations, which also has the advantage of being the "fastest" (using the native serialization) conversions and therefore having the lowest impact (not withstanding the above comments about the creating of the long strings and the current OTLP format) on any "waiting" user -- server environments DON'T have this issue ALL client environment DO (or CAN - some can offload to another thread -- but not all).Possible "tweaks" which would make the "logical" prefixing of an event payload "fields"
payload
Attribute (event.data
/ body
)A "general" OTLP transport improvement(s)
namespace
(including ALL of the ones defined in the registry) as sent in a containing AnyValue
(should really be a new special type) where the name
of the top level Attribute is the namespace
with the value
being a Map of attributes with the namespace
removed (just the name)
http
conventions would be in a sub-object http: { method: "GET", ... }
etctype
(at least for the JSON representation - binary formats will always need to define the "encoding")). That way "strings", "numbers" (caveat with int / float), "arrays" "map" (represented as an object) and all other types would still "need" there "type. There is also a "special" need required for when transmitting fields that "may" requied metadata, to identify fields that may contain PII information so that backends can handle as required, depending on the provider / vender.I'm going to (try to) stop now and I'm sure we will (and need) to discuss further in the Event WG (and other SIGs), for reference I have create a draft white paper (not yet public -- being reviewed and may not be able to be fully public) on why Microsoft is not yet using OpenTelemetry for the Web (some of this is summarized above, and I've talked in lots of SIG's about the reasons), this draft internal paper is 17 pages long with all of the issues and what is required.
And for clients breaking things down and trying to address (and / or punt) different concerns down the road doesn't work because they are all inter-related to different degrees because
To Date OpenTelemetry (appears) (IMHO) to have been driven by the "final" storage format used by many organizations and then creating wrappings around these "storage" formats for creating on servers to enable the data to be quickly dumped, and for high volume servers this is an excellent approach -- it just doesn't work for ALL client scenarios / languages.
There will be 1000's more "Custom" events then there will be OpenTelemetry defined events and requiring that every "Custom" event MUST have every "field" for the custom event prefixed is just not viable for most applications. This will "force" application custom level events to end up using AnyValue just to avoid the prefix for the EXACT same reasons that I'm pushing back on this for the OpenTelemetry defined events.
No. The attribute naming rules use strong normative language OpenTelemetry authors, but we can't enforce what application authors do for custom events. At best, we can define recommendations. For events defined outside OpenTelemetry and bridged over, we should minimally transform the original payload with mapping rules. The naming rules for are OpenTelemetry semantic conventions.
Bundle sizes in some languages (runtime constraints on the size of the code - ie browsers), strings are not compressable Transport payload serialization is a SUPER CRITICAL time sensitive Prefixing values to runtime defined fields just causes additional memory (and CPU) usage which are then batch (holding onto that in memory) and then during transport serialization this generates further excessive bytes
We shouldn't jeopardize the coherence and clarity of the telemetry semantic conventions for the requirements of extremely constrained environments. We must decouple data modeling from transmission, and work to accommodate constrained environments with transmission level solutions.
Possible "tweaks" which would make the "logical" prefixing of an event payload "fields". All of the payload fields for the events are "grouped" as the payload for an event and when present in the global registry they are "listed" with the prefix, BUT are NOT sent in this manner, they are still SENT (without the prefix) in a payload Attribute (event.data / body)
This is an example of a "transmission optimization" described in this comment, which is compatible with the notion of treating event fields as attributes.
Remove the "need" for every attribute to have it's type (at least for the JSON representation - binary formats will always need to define the "encoding")).
You're describing a breaking change to the OTLP JSON protobuf encoding. A variation which doesn't include the type is possible with a new protocol. This is still possible, since as this comment points out, "any transmission optimizations can still be built".
the specifications drive the internal representations in-memory representation and the code required to support the transport format along with the specification also drives the in-memory representations and the code required to support the combination of the above drives the conversions (DTO's) required to go from the user (application) input, to the enforced specifications (readonly in some cases) and the into a form that can be serialized and the serialization process can then require further "processing"
It sounds like your issues go well beyond the representation of Event fields, bubbling into the SDK specification. The SDK specification are stable of course, and its hard to imagine evolving them in a way that would accommodate the extreme requirements you describe. If the issues exist for all client use cases, it may suggest that a domain specific SDK might need to be developed for the specialized conditions which is not subject to the normal SDK specification language. If the issue only occurs in client application of a certain scale (or under certain constraints), it may be worth considering developing a custom SDK for your specific use case. The OpenTelemetry decoupling of API and SDK was designed for this purpose.
My view is the following - I agree that we should decouple the conversation about data modeling and performance overhead. There is no guarantee that attribute names without namespaces will be enough to address performance requirements of some applications. This should be addressed by choosing different exporters/protocols and perhaps even different SDK implementations, and application owners can choose which makes the most sense for their requirements. I also think that we need to just focus on data modeling at this point to make progress.
With that said, I am not 100% convinced that event payload fields need to be in the same global registry as regular attributes. Currently, attribute semantic conventions have a global scope, while event fields are scoped to the particular event. The combination of the event name and a field name should be enough to understand what the field represents. And in order to enable reusability, perhaps it should be allowed to use an attribute from the global registry as an event field.
I want to give one example. Assume that there is an event that represents navigation in a browser environment. We want to capture the previous URL and the current URL. We cannot reuse url.full
here because its name does not capture the meaning of those fields (even though the value it holds does). The fields should be something like previouls.url
and current.url
. Does it add a value to add this to the global registry?
I would like to hear from @open-telemetry/specs-semconv-maintainers and @open-telemetry/specs-semconv-approvers about this topic. If there is a consensus that event fields should be in the global registry from this group and the TC, then I will be in favor as well in order to make progress.
Should fields in the event body be defined in the global attributes registry?
If the answer to this is "no", where else would those fields be defined? Wouldn't they be defined at all? Or would there be another event-specific registry independent of the semantic conventions attribute registry?
I agree with everything that @jack-berg wrote in https://github.com/open-telemetry/semantic-conventions/issues/505#issuecomment-1815289100.
Some additional perspective:
client.geo.*
has always the same meaning independent whether it's on a span, comes from an NGINX access log, or is attached to a mobile error event).I want to give one example. Assume that there is an event that represents navigation in a browser environment. We want to capture the previous URL and the current URL. We cannot reuse url.full here because its name does not capture the meaning of those fields (even though the value it holds does). The fields should be something like previouls.url and current.url. Does it add a value to add this to the global registry?
@martinkuba In ECS, we have the concept of reusing field definitions in other namespaces to address exactly this use case. For example, we define geo.*
fields here and in addition specify in which other namespaces these fields can be nested. So, on a single event you could have something like client.geo.location
and server.geo.location
to differentiate the context.
We also discussed in the SemConv WG to introduce a similar concept in OTel semantic conventions and have agreement that it's something we want to do. (Next steps is to work on a proposal for the syntax and tooling to realize it). Maybe that would help with the above concern.
A big portion of ECS (and the value it provides to the ECS community) comes from the area of describing events and their payload cross signal and cross use cases.
@AlexanderWert can you please point to an example in ECS describing the fields of payload for a given event type?
@AlexanderWert can you please point to an example in ECS describing the fields of payload for a given event type?
@scheler Sure, it's mainly defined by the Elastic integrations which are with their documentation pages to some extent comparable to the semantic conventions in OTel.
Let's consider a very concrete example: NGINX Ingress Controller. That page describes in very detail, how an access log from the ingress controller is parsed into ECS fields. In OTel terms, that page describes the semantic conventions for NGINX Ingress Controller logs.
(Please bear in mind that with ECS there's no differentiation between top-level fields, attributes and payload as it is with OTel / OTLP. But, that doesn't make a difference for the below reasoning regarding payload).
In the "Exported fields table" you will find both:
Some inspirations we can get from this example are:
nginx_ingress_controller.access.upstream.name
instead of just upstream.name
) As part of the ECS contribution we also hope to contribute semantic conventions and parsers / integrations for many of these integrations to the OTel ecosystem (which will turn the ECS addition into actual, consumable value for the community). The event / logs signal will be crucial for that.
If the answer to this is "no", where else would those fields be defined? Wouldn't they be defined at all? Or would there be another event-specific registry independent of the semantic conventions attribute registry?
@pyohannes They would still be defined here, but along with the definition of the event. For every event name, there would be a list of fields that should be included in the payload for that event. I suppose this is similar to how attributes are associated with metric names right now.
@AlexanderWert Thank you for the context around ECS. This makes sense, especially if we are looking to adopt existing proven conventions from ECS.
I think based on that it comes down to two questions: where are event-specific fields defined and should they have unique namespaced names?
I see that ECS can combine common fields (akin to OTel global-registry attributes) and fields specific to the integration. The NGINX-specific fields are defined along with the integration, but not in the ECS schema? In the OTel semantic conventions, if we have a document that describes an event with its payload fields, and one of those fields is specific to this event (similar to nginx_ingress_controller.access.upstream.name
), does this field need to be defined elsewhere and only referenced here? Or, would we follow the ECS example where the specific fields are defined along with the integration?
BUT: those should still have a meaningful namespace / prefix to be easily searchable / filterable without the need of additional attributes to define the context. (e.g. nginx_ingress_controller.access.upstream.name instead of just upstream.name)
I think the key takeaway for me is that we want all field names to be self-descriptive without the need to know the event name. Is this correct?
The way I have been thinking about this is that the payload of the event is always interpreted in the context of the specific event. We would not be looking at individual payload fields independently. In theory, the payload could contain other type of data than a list of name/value pairs (the LogRecord body is AnyValue). So, interpreting the payload data will always be done in the context of the given event.
I think based on that it comes down to two questions: where are event-specific fields defined and should they have unique namespaced names?
I agree, based on what Alex is saying about the repetitive use cases into semantic fields that have a common meaning
this is exactly what the additional attributes are for. But this is NOT what the payload should be used for.
The payload is as @martinkuba defines above is for the event-specific fields
, and it is up to an event
definition to identify what specific fields it requires.
So, first, I think in order to consider performance arguments we need to see the data, measurements that compare different approaches. I don't see that data, which makes it impossible to assign relevant weight to the performance argument. I am going to ignore performance aspects for now until we see the supporting data. We can reconsider the design when there is additional evidence from benchmarking.
For now I think we should base our design on the following principles:
Based on the principles I believe we should do the following.
For events that originate externally:
Body
field of the LogRecord
.Attributes
field of the LogRecord
.Body
field SHOULD NOT be defined in Otel semantic conventions since it is not under Otel's control. Where there is a formal specification for the external event Otel semantic conventions SHOULD link to such specification.Body
field may be a primitive type, an array or a map - all values representable by any
type are allowed.For greenfield Otel events:
Attributes
field.Body
field to record data, with one exception: relevant, externally originating data with unspecified composition may be placed in the Body
. Such externally originating data may also be placed in an attribute with a defined semantic convention. [we may need further guidance for this rule to explain how to choose or we can instead restrict to only one way instead having two ways to record this info].For all events:
Attributes
field SHOULD record attributes that comply with Otel semantic conventions and match the requirements of all other attributes defined in Otel. Attributes are namespaced and are part of the global registry exactly as all other signal attributes.In the interest of not blowing up the conversation too much, I have added some thoughts on this subject here: https://gist.github.com/breedx-splk/14bcc08a8cbcdef5b635cb5d22989ccc Best move is to read it top to bottom (it's not too too long, I promise) but here are the highlights for the distractible and overworked:
That doc doesn't at all address the idea of putting event bodies in different places depending on whether or not they are otel native. That idea is baffling to me. I think having a single consistent place to look for the event body makes way more sense than having it somehow be conditional.
@breedx-splk I do not understand the definition of metadata you have and the example you bring.
What about an example event? Let's make up an event for popping a balloon. We decide that the important things about a balloon popping are: balloon diameter
What about metadata? Surely, there could be some supplemental information about this event, like: balloon color
Why is "balloon diameter" an event data and "balloon color" a metadata? What makes "balloon color" a "data about other data" (what other data?), but "balloon diameter" NOT a "data about other data"?
[UPDATE]: Re-reading the proposal again, it seems the separation is between the required data and supplemental (optional) data. Is that a fair litmus test to use to understand the difference between the "event data" and "metadata" in your proposal? If that is not the correct litmus test can you please define one?
I think the CloudEvents spec describes this well: attributes are used for context and custom data, while event data (payload/body) contains "domain-specific" information. By domain-specific I understand the data that is inherent to the event - the event can still be interpreted without its attributes but not without its data.
The button click event in @breedx-splk's examples illustrates this well. Information about the button (e.g. its name) specifically describes the event, while screen dimensions or session ID describe the context in which the event occurred.
With that said, I would like to add an argument in favor of separating the event data from other attributes. Systems that interpret events might (very likely) want to handle the event data separately from the other attributes, whether for storage, displaying in a UI, or doing some additional processing. If the event data is combined with other attributes, how would they be distinguished? The consumer would have to be aware of all field names of all events it wants to handle, or all names that are only used for context. I suppose there could be a convention like everything prefixed with the name of the event, but that does not sound like a clean solution.
I think the CloudEvents spec describes this well: attributes are used for context and custom data, while event data (payload/body) contains "domain-specific" information.
The CloudEvents' Context
has a special purpose. It is serialized independently from event data. I think Otel already has the equivalent to that: we have our own Context. Saying that the Context
(whatever it is) should be recorded in the LogRecord Attributes
field does not help with the CloudEvents' stated goal of independent serialization and is not necessary.
I do not see "custom data" mentioned anywhere in CloudEvents spec. Please clarify what you refer to.
@martinkuba It would be very useful to see a clear definition that explains the difference between attributes and event data.
You say:
By domain-specific I understand the data that is inherent to the event - the event can still be interpreted without its attributes but not without its data.
Is this the definition of what constitutes event data vs attributes? If it is than I am afraid it is going to be hinged on what "interpreting" means.
In the examples of @breedx-splk's proposal "session.id" is shown as supplementary (so not "inherent" to the event). I can argue both ways: a) "session.id" is not necessary for interpreting the event, b) "session.id" is absolutely necessary for interpreting the event.
Which way I argue depends on what is it exactly that I am doing when I am "interpreting". If I am trying to build the click heatmaps in my UI, I don't need the "session.id", I just need coordinates of the click events. If I am trying to replay a particular user's actions over time I absolutely need the "session.id", I cannot interpret without it.
So, to use the definition that you brought ("necessary for interpreting") I think we need to say more concretely what "interpreting" means.
@tigrannajaryan
I do not see "custom data" mentioned anywhere in CloudEvents spec. Please clarify what you refer to.
I was referring to the Extension Context Attributes. By context, I also meant any data that provides additional context about the event, such as the environment.
@breedx-splk I read your writeup - thanks for clearly laying out your thoughts. Some comments:
You try to lay out the distinction between the event and meta data with a couple of examples: a balloon, a person, and an example button click event. In all these cases, the distinction between payload and metadata is not obvious:
In all these cases, the difference seems to be somewhat arbitrary. As @tigrannajaryan points out, I think for the distinction to be made between the payload and other attributes, we need a convincing litmus test. So far, the ones I've seen are hand-wavy.
You mention the following:
When handling an entity with a known schema, the handler does not need to try to process unexpected fields or work around unknown types.
How do you reconcile that with the need to be able to evolve event schemas, and add additional fields over time? A consumer may expect version 1.0.0 of the event schema with includes fields foo, bar
, but a client might emit events with version 1.1.0 of the schema, which includes foo, bar, baz
. Whether the payload is separate from attributes or not, processors will need to be able to handle fields they don't recognize.
The option 5 you present has no listed cons. I can think of a few: Terse fields have the negatives listed here. One thing we discussed in the SIG is the ability to opt into including the original unadulterated click data as provided by the browser in the OpenTelemetry event. The body is the natural place for this as it matches the description from the log data model:
However, a structured body SHOULD be used to preserve the semantics of structured logs emitted by Third-party Applications.
Any solution in which the body is used for the first-party event fields requires another opt-in field in the schema to be able to accommodate this behavior. This is part of what convinced me of @tigrannajaryan's proposal. I now imagine event semantic conventions as each defining a name, set of attributes, and description of where the body comes from (i.e. typically a link to the API used to populate it).
@tigrannajaryan has said
Why is "balloon diameter" an event data and "balloon color" a metadata? What makes "balloon color" a "data about other data" (what other data?), but "balloon diameter" NOT a "data about other data"?
and @jack-berg said something very similar
Why is the balloon diameter and surface material intrinsic to the balloon but the color is metadata?
So the designer of an event schema is tasked with deciding which fields (from the set of all available things in the known universe) belong to this event. For some events, the name may be enough, and the required set of fields is just the empty set. For others, the designer must choose which fields are necessary (required) for for their specific problem domain, which fields are nice to have (optional), and which fields are not included (everything else).
The balloon example is contrived, so I mentioned let's assume that the balloon committee agreed that the above is the important stuff for most purposes.
. I didn't explain the reasons why the committee decided (after several long and stressful meetings) to not include the color, but you could imagine some contrived reasons why...
If there needs to be a litmus test for whether something is a core piece of an entity or if it's metadata, I suppose it involves answering these questions:
I'm not convinced that there exists a singular test that can be performed to make design decisions for data modeling or event design purposes. Design is a craft that requires art, experience, and consideration.
The event designer is deciding what constitutes the event based on the needs of the emitter ("I want to communicate this important thing outward") and the requirements of known receivers ("I want to handle events that have these qualities"). Events should be designed to address something in a problem space.
I think nobody would be surprised to find a coordinate in a mouse click event. They might be surprised though if it was a button click event. Those are different things that serve different purposes. If I receive a person entity and look inside and find a field called lawn_size
I will probably laugh...and I've never seen that in the real world...but then again, I've never written software for a landscape architect.
If it does, in fact, turn out that the balloon's color is helpful when processing these pop events, then by all means we should improve our schema design. Similarly, if we think we'll never mess with mylar, let's remove the surface material while we're at it.
@jack-berg said
processors will need to be able to handle fields they don't recognize.
Yeah totally, I agree. I concede that point as a tad flimsy...but there are pathological cases where entities have all this extra stuff that most consumers don't want or need. Heavyweight burdensome cruft. It's nice not to have to deal with that.
The option 5 you present has no listed cons. I can think of a few:
Agreed that those items you called out are applicable to Option 5. I don't agree that all of those things are outright cons, but I understand where you are coming from. I'll try and keep my take short.
Additional context required to interpret fields.
- Yes, I do think that event fields are conceptually different than existing attributes. I understand the desire for consistency, but reusing attributes for rich object/event fields feels unnecessarily forced. I do think this "additional" context is fundamentally important.A lack of ability to correlate fields between signals
- I think it's ok to keep the correlatable parts as top-level attributes. It might be too early to say that we need to correlate event fields with other signal attributes (again, I'm using a conceptual difference because I think there is one).A performance-justified race to zero.
- I agree that there's a somewhat counterproductive optimization instinct at play here, but I think the race to zero
is the slippery slope.I've added a link to your comment as "cons" in my linked gist. Thanks for the input!
Performance
Formatted response on the performance aspects of this (everyone should have comment access) https://docs.google.com/document/d/1fPMCCCsFFPOq02fZ2OfjFDKlZs7xIxLkhl7G3rnJ5Kg/edit?usp=sharing
@jack-berg said
processors will need to be able to handle fields they don't recognize.
Yeah totally, I agree. I concede that point as a tad flimsy...but there are pathological cases where entities have all this extra stuff that most consumers don't want or need. Heavyweight burdensome cruft. It's nice not to have to deal with that.
And at Microsoft we do exactly that, we support the concept of "Custom Events" (ie. application derived events), we don't control what "fields" they include, we provide the ability to send a "custom event" and all of the payload
fields are extracted out effectively into columns that the application owner can use to query however they like across these custom events. The only "mandatory" field we require is the custom event name
as any form of visualization / comparison (generally) only make sense within the context of the same event.name
. i.e. It would generally not make sense to compare fields within a balloon pop events with production rate of pins, unless of course if you make both pins and ballons in the same location 😃
I'll bring in a few additional perspectives to see if it may help.
"Performance concerns" mean different things to different people. I don't need to see anyone's data to be convinced that a smaller encoding is better for performance.
Speaking for the OpenTelemetry Protocol with Apache Arrow project, I think we should get away from using the cost of a "fully-expanded" representation to justify abbreviation. In general, we should not be encoding the name of the attribute with every data item on the wire, whether the name is fully qualified or a short, event-specific field name.
Everyone arguing in this thread could have what they want, if OpenTelemetry would get better about encoding schema information. The schema in today's https://github.com/open-telemetry/otel-arrow does not specialize for the logs Body
field having recursive structure, but it could be extended in that direction. What we should be aiming to see, I think, is that only the variables of an event need to be encoded, exported, and sent.
This is what you get using Arrow-based transport. By the time you are exporting that event, it will be decomposed into a number of rows; the schema of each row will be known and transmitted once per stream, and from a performance perspective it won't matter whether:
The OTel-Arrow group would be glad to discuss this; it's something @lquerel experimented with before the project donation.
So, should event fields be part of the global attribute registry? No. I think events should have schemas. We should transmit schemas once per stream, we should transmit variables once per event, and OTel-Arrow will take care of reconstituting the fully-expanded representation at the end of the pipeline, where it's somebody else's performance concerns.
It is a separate concern how schemas should be encoded and registered, and so on. This is currently under discussion in relation to a proposal, also by @lquerel, in Introducing Application Telemetry Schema in OpenTelemetry - Vision and Roadmap and the #otel-weaver channel. We think it would be best if the schema was used to compile strongly-typed interfaces, which would compile directly into an efficient implementation, that would alleviate the performance concerns found here.
Reading the gist @breedx-splk, I find myself with a few more questions.
Timestamp: is relevant, just not under debate, is that correct? Otherwise, is there is a way to order events created by a particular resource?
but for the purposes of this discussion, because they are not relevant, let's drop timestamp and severity,
Session ID: I haven't forgotten https://github.com/open-telemetry/opentelemetry-specification/issues/2500 and I still want to see us store session-id as a non-identifying resource attribute. Again, this is something we could accomplish with more schema information. The present ("weak schema") definition for resource prevents us from adding ephemeral and other descriptive information with cross-cutting applications, and I want to see us fix that.
Option 6: speaking from a past life, I could at least imagine more options than those enumerated in the gist linked above. For option six, I propose to add a couple of things to the OTel log record:
@jmacd "Performance concerns" mean different things to different people. I don't need to see anyone's data to be convinced that a smaller encoding is better for performance.
100% agree.
And for clients there are many additional concerns which are not currently addressed, and unfortunately something like Apache-Arrow (at this point in time) is a non-starter at many different levels which include
There are also some good words and some concerns in your response as well.
I think events should have schemas 👍
We should transmit schemas once per stream, we should transmit variables once per event
If this means that for every "request" getting sent from a client, then this is problematic and depends on "how" the schema is encoded on the client. Ideally, it's not included and instead it's either "inferred" (by a name) or just referenced url. The other issue is that a "client" (browser specifically) will be sending multiple requests (streams?) per page load and each page load will be completely isolated from each other so there is no "context" of what has already been sent.
OpenTelemetry would get better about encoding schema
Encoding is also problematic, as the "time" required to serialize into said encoding is one of those critical performance concerns for client environments. Ideally, this needs to use a native environment library (for browsers the most common is JSON.stringify()) as there is no faster way to convert an object hierarchy into a serialized form.
Could Arrow be a solution (or part the of) -- maybe, but not today (from what the JS and Client Sigs' have currently seen -- moslty because of that required "bundle" size) so the following comment and the result would be key to that
We think it would be best if the schema was used to compile strongly-typed interfaces, which would compile directly into an efficient implementation, that would alleviate the performance concerns found here.
Formatted response on the performance aspects of this (everyone should have comment access) https://docs.google.com/document/d/1fPMCCCsFFPOq02fZ2OfjFDKlZs7xIxLkhl7G3rnJ5Kg/edit?usp=sharing
Thanks @MSNev, this is very useful. I have put together a quick comparison test here: https://github.com/tigrannajaryan/exp-jsevent
It generates synthetic events based on github.com stats you provided. It would be great to modify this to use more real event composition to make sure it is a fair comparison.
So far I have this results:
In original, non-encoded form the flattened events are 85896-32550 = 53346 bytes larger (85896/32550=2.64 times larger) than in nested form (both JSON encoded).
In dictionary-encoded form the flattened events are 23389-21426 = 1963 bytes larger (23389/21426=1.09 times larger) than in nested form (both JSON encoded).
Dictionary encoding time: about 1.7ms.
I am probably the worst JS coder in the world. It would be great if someone who knows what they are doing would take this and turn into a proper benchmark (and verify that I don't have measurement bugs and the comparison is fair).
The question I have: is 53KB (or so) extra RAM, 1.7ms extra encoding time and 9% larger JSON payload (before any gzip compression) a dealbreaker?
@tigrannajaryan where did you run this test? Is it on your laptop? Please see the article on how slow some CPUs are in the mobiles of users around the world.
https://calendar.perfplanet.com/2023/wikipedias-worldwide-web-cpu-benchmark/
@scheler it would be great if you can help produce benchmarks on relevant devices (and with more realistic payloads, mine is probably not a good approximation of real world). It will make the performance argument stronger.
@MSNev Currently, the protocol based on Apache Arrow is not optimal for browser environments due to the current size of the Apache Arrow JS library and the specific lifecycle of web pages, which makes it hard to offset the initial cost of the Arrow schema. In the future, these two issues could be addressed by creating a variant of this protocol for constrained environments like browsers. However, this is a separate project that will take time to develop, so I won't focus on it right now.
As @jmacd highlighted, the proposed Application Telemetry Schema could address various issues discussed in this thread.
1) The Application Telemetry Schema allows importing and referencing semantic conventions and also defining application-specific events and attributes. 2) Once "resolved", an Application Telemetry Schema is self-contained and can be referenced by a URL, which can be used when emitting signals (so the schema itself will never be sent from the browser). 3) A Client SDK generator could be used to create an optimized Client SDK specifically for the application's needs, ideally containing only necessary code without any generic code. 4) Regarding the size of field/attribute names, it's possible to establish a dictionary in the resolved schema to maintain OTel conventions and ease of field interpretation without impacting performance. Browser-emitted messages referencing the schema by URL could use the numeric encoding representation of fields defined in the resolved schema. This approach is compatible with the short lifecycle of web pages, with the dictionary being known in advance and present only server-side.
I believe this approach meets performance constraints while preserving semantic conventions and the need to define fields/attributes that are easily interpretable and rich in metadata.
@jmacd
Timestamp: is relevant, just not under debate, is that correct?
Correct. I just wanted it off the table while we discuss the other bits.
@lquerel
Currently, the protocol based on Apache Arrow is not optimal for browser environments due to the current size of the Apache Arrow JS library and the specific lifecycle of web pages
This was our (OT Sig's) observation as well from looking at the 1000ft level with the details we had, so it good to see that we didn't miss anything.
Application Telemetry Schema
Yes, I agree, the definition of "events" (especially application custom events), has a very high degree of correlation with what we are trying to do in the Event WG
event.name
attribute)Do we "need" a schema, we had been going down the initial path of not "requiring" one.
Would it be useful as a generic definition -- absolutely.
Should the schema be required to be included in the outbound telemetry for events to be useful -- no (on the "assumption" that we have a set of "well known" (and defined) events where the event.name
is unique enough to uniquely identify each event (which I believe we have documented this already in the spec), so then backends could reference any published schema to "handle" these well known event definitions.
@MSNev
Should the schema be required to be included in the outbound telemetry for events to be useful -- no (on the "assumption" that we have a set of "well known" (and defined) events where the
event.name
is unique enough to uniquely identify each event (which I believe we have documented this already in the spec), so then backends could reference any published schema to "handle" these well known event definitions.
Yes, events without a strict schema are useful as they are. But to further optimize this scenario, and to facilitate interoperability between systems, I think that yes, a schema is useful and not necessarily constraining. This is essentially what has made protobuf or thrift successful at the backend level, and it is also the direction that REST interfaces have taken with OpenAPI (and TypeSpec).
@MSNev about the use of "performance arguments", I want to try and maintain the following balance:
To say that Arrow is a non-starter, to me, breaks this second principle. No one said you should use an Arrow library to accomplish this. The Arrow data specification combined with OTel-Arrow definitions is what you will use; I expect you to compile a custom event interface that is aware of your schema; I expect you to use WASM, and just as much code as necessary. Also, I won't be surprised if Arrow interfaces become standardized for browser environments. If 1000 bytes is important to you, I expect you have the resources to find a solution to this problem.
One more thing regarding the Schema files and Body field.
The Schema files currently allow describing the changes to Attributes field, but not to the Body field. If we put any data in the Body field the schema of which we control we either 1) will have no ability to describe changes that happen to such data or 2) we will need to add the ability to describe changes in the Body field to the Schema File.
The later is additional complication and likely will be more complex than how Schema files describe Attributes (because in Body filed we will allow nesting of fields and it creates more variations of data changes).
Understand, I think what we are thinking about in terms of schema (specifically along the lines of an application schema), is more akin to the way that Cloud Events describes it for the definition of the payload.
@scheler I have rerun the test on the slowest phone that is available to me (Oneplus, 2017 model year) and I get around 3ms of encoding time for that synthetic 121 event batch that I modelled after @MSNev's example from github.com.
Updated benchmarks, with comparison to JSON.stringify, standard OTLP JSON and a new encoding that uses dictionaries: https://github.com/tigrannajaryan/exp-jsevent/blob/main/README.md
Anyone who has access to slower devices feel free to rerun the benchmark. So far the numbers I am seeing look acceptable to me.
@tigrannajaryan thanks for creating the benchmark tests. I checked that Oneplus 2017 is about twice faster than Moto G5 that's used as a reference in the wikipedia cpu benchmark tests, and noting that in South Africa 25% users have devices slower than Moto G5, we can deduce that this encoding could take ~8-10ms on slower devices, compared to 1ms on faster devicds. Martin also pointed out that this is "additional time", on top of the time the original apps already spend, in batching, serializing and sending the content out. People have performance budgets for the whole app and so numbers matter when choosing additional libraries to integrate into their app.
It's interesting you included JSON as-is in your benchmarking tests - it's only slightly larger than "OTLP with key dictionary" when used in the form "Nested, in Body, no namespaces" and also about half the size of OTLP JSON. The payloads will be lot smaller if we use key dictionary on top of JSON as-is. I guess it makes a case to support JSON as-is in OTLP irrespective of any further encoding. This will be helpful for cases where protobuf is not used, for eg., pages that automatically redirect to others give you very little to do anything. I noticed this was discussed in the Event SIG last week - will this be a v2 of OTLP JSON or this will have to be a transport encoding on top of OTLP JSON?
It's interesting you included JSON as-is in your benchmarking tests
@scheler yes, I intentionally included it as it is likely the lower bound of what is possible from encoding duration perspective since json.stringify() is just a single call to a native function. The difficulty with as-is encoding is that it essentially fully couples the in-memory representation of events in the SDK with the wire protocol and once published makes difficult/impossible to refactor the SDK's in-memory data models since that will break the wire format. It may be acceptable for hyper-optimized Otel implementations but likely is a no-go for a generally useful Otel SDK and protocol that needs to evolve over time.
The payloads will be lot smaller if we use key dictionary on top of JSON as-is
I suspect "as-is with dictionary" may have roughly the same encoding execution time as the "OTLP with key dictionary". I may be wrong, it is worth experimenting with. I posted the benchmark source codes so that anyone is free to play with it and try adding new formats for comparison.
it makes a case to support JSON as-is in OTLP irrespective of any further encoding. This will be helpful for cases where protobuf is not used, for eg., pages that automatically redirect to others give you very little to do anything.
I am not sure I understand this part. Can you elaborate?
it makes a case to support JSON as-is in OTLP irrespective of any further encoding. This will be helpful for cases where protobuf is not used, for eg., pages that automatically redirect to others give you very little to do anything.
I am not sure I understand this part. Can you elaborate?
For short-lived pages - that immediately redirect to another page or pages that have a high chance of users navigating away immediately - we can deliver a lighter version of javascript bundle that includes just one instrumentation and a minimal SDK with only Events support, no batching, no protobuf, to fire off a pageview event that can be exported immediately. Since it's just one event, OTLP JSON should be okay if JSON as-is is not possible for the reasons you outlined.
A few more thoughts on event data as attributes or body:
There will be three kinds of events: OTel-defined events, 3rd-party events, and custom events (added by the application via an API).
Consistency - We have talked about putting the data of a 3rd-party event in the body, and the data of OTel-defined event in attributes. I think this is confusing and inconsistent for backends processing events. If there is such a thing as event data, then it should be in the same place for all events, so that backends do not have to treat different events differently.
Regarding the litmus test of what data goes in event data vs attributes - I think this is independent of where it should go in the data model. For OTel-defined events, we will still need to document what fields are associated with a specific event. I expect that each OTel-defined event will have a document in the semantic conventions documenting its name and the data (e.g. list of fields) that it carries.
IMO, the only confusion of whether to put some data in the body or attributes would happen with custom events. In that case, the end user has control and they can decide where to put their data based on how it is used. If we combine event data with other attributes, and want it to not be confusing, should we not have a concept of event data in the API? But we still need it in the API for 3rd-party events?
I think the main question for me is how events will be processed and stored. Will there be use cases where the backend needs to distinguish between event data and attributes? If yes and we do not separate them in the transport, then we will make this use case impossible, at least for custom events - assuming that a schema could be used to extract event data attributes for OTel events, and for 3rd-party events it would be in the body.
Both in today's 11/12/24 event SIG and the previous 12/8/23 meeting, we just discussed the idea of start with a set of principles and letting the design fall out of that. The idea being that designing without a guiding set of principles can end up feeling subjective, be hard to defend, and lack coherence.
I propose the following design principles, ordered in terms of priority (high priority items first):
From these, the following design falls out. I'm being brief because the characteristics of these things has been discussed elsewhere. The numbers in parenthesis indicate the number of the design principle that informs that decision.
event.raw
or event.source_payload
), such as the the kubernetes defined payload produced by the k8seventsreceiver or the raw payload provided by a browser when a click occurs (probably opt in for users). (5)Here is a counter-proposal, separated into two parts that I think can be decided separately.
Design principles:
Proposal:
event.data_type
and event.schema
define the data type and schema for reading/validating contents of the body field (2)Design principles:
Proposal:
I've not had time to formally enumerate my design principals, but they will be based on my original comment https://github.com/open-telemetry/semantic-conventions/issues/505#issuecomment-1810865899 which we can talk about in the SIG.
I agree with a lot of the above! I'll offer up my approach:
Design principles:
Rather than make a proposal yet (wherein my $0.02 adds a circle to the growing Venn diagram), I want to offer some push back / alternatives to existing/above ideas.
schemaUrl
or xmlns
. Because events are defined inherently within a namespace already (based on their, ahem, name
!), I don't think it's necessary to namespace the fields within an event.We will open new issues related to the outstanding issues here as the original question is now answered.
What's the resolution?
The answer to the question posed Should event fields be part of the global attribute registry?
The answer is No, the payload fields should not be part of the global semantic registry, we are moving forward to define how the payload fields will be defined. https://github.com/open-telemetry/semantic-conventions/issues/755
Part of what is discussed here is whether the event payload should be represented in the log body or as a number of individual attributes. This has already been addressed in https://github.com/open-telemetry/semantic-conventions/pull/566 by specifying that the event payload should be in the log body.
Modeling the event payload is currently being worked on (https://github.com/open-telemetry/semantic-conventions/issues/755 as @MSNev pointed out).
I am closing this issue since it appears the discussion here has been exhausted, and more specific discussions are now happening in other issues. But if anyone wants to continue this conversation here, please feel free to re-open this issue.
This issue is intended as a discussion about the approach of defining event fields.
Background
Events are a new concept in OpenTelemetry that is in the process of being defined. An event will consist of a name and a body (payload). The name will be represented as the event.name attribute. The body payload will typically consist of a collection of fields (name/value pairs). An event may also have additional attributes that are separate from the body.
The discussion about field names in this issue refers to fields that are part of the event body, not the additional attributes. Additional attributes will follow the same rules as attributes on spans.
Discussion
Should fields in the event body be defined in the global attributes registry?
Here is an example event representing a browser page load (from this PR). The field names are not namespaced and their definition would be only in the scope of this specific event.
Pros:
Cons:
Fields defined in the attribute registry
If the fields for the same event were defined in the registry, they could hypothetically look like this:
Note that both
url.full
is an existing attribute that is re-used in this event. However, thebrowser.document.referrer
also contains a full URL, but must have a different name here to distinguish from page URL and also to provide the meaning of the field.Pros:
Cons:
Hybrid
There is a third option where an event payload could contain both fields defined only for the specific event and attributes from the global registry.
Event fields defined externally
There might be cases when it's better to not define the content of a payload in OpenTelemetry semantic conventions, but instead link to an external definition. An example of this might be the data from the W3C ResourceTiming API. In this case, if the field definitions were duplicated in the attributes registry, they may get out of sync with the external definition.
If fields defined externally are used in the event body, they would not be namespaced.