open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0
271 stars 174 forks source link

Should event fields be part of the global attribute registry? #505

Closed martinkuba closed 7 months ago

martinkuba commented 1 year ago

This issue is intended as a discussion about the approach of defining event fields.

Background

Events are a new concept in OpenTelemetry that is in the process of being defined. An event will consist of a name and a body (payload). The name will be represented as the event.name attribute. The body payload will typically consist of a collection of fields (name/value pairs). An event may also have additional attributes that are separate from the body.

The discussion about field names in this issue refers to fields that are part of the event body, not the additional attributes. Additional attributes will follow the same rules as attributes on spans.

Discussion

Should fields in the event body be defined in the global attributes registry?

Here is an example event representing a browser page load (from this PR). The field names are not namespaced and their definition would be only in the scope of this specific event.

{
    "attributes": {
        "event.name": "browser.page_view" 
    },
    "body": {
        "referrer": "",
        "type": "",
        "title": "",
        "url": "",
    }
}

Pros:

Cons:

Fields defined in the attribute registry

If the fields for the same event were defined in the registry, they could hypothetically look like this:

{
    "browser.document.referrer": "",
    "browser.page.type": "",
    "browser.page.title": "",
    "url.full": "",
}

Note that both url.full is an existing attribute that is re-used in this event. However, the browser.document.referrer also contains a full URL, but must have a different name here to distinguish from page URL and also to provide the meaning of the field.

Pros:

Cons:

Hybrid

There is a third option where an event payload could contain both fields defined only for the specific event and attributes from the global registry.

{
    "referrer": "",
    "type": "",
    "title": "",
    "url.full": "", 
}

Event fields defined externally

There might be cases when it's better to not define the content of a payload in OpenTelemetry semantic conventions, but instead link to an external definition. An example of this might be the data from the W3C ResourceTiming API. In this case, if the field definitions were duplicated in the attributes registry, they may get out of sync with the external definition.

If fields defined externally are used in the event body, they would not be namespaced.

MSNev commented 12 months ago

Just to document my take on this and this will include some generic definitions on how "events" should be defined.

By Default for the payload (fields)

Optionally

Optional (additional) Attributes (not the payload fields)

jack-berg commented 11 months ago

I’ve stated my opinion on this before and engaged in lengthy discussions in my Event payload proposal document. I’ll recap my stance and offer some old and new arguments:

Event fields should be part of the global attribute registry and should be subject to the same naming rules as other attributes.

In a nutshell, the discussion comes down to a question of whether the constrained environments (limited network bandwidth, CPU, package size) of certain client scenarios should influence how OpenTelemetry models semantic conventions for Events (and presumably other signals as well). The answer should be no. We should clearly decouple modeling the shape of telemetry (i.e. semantic conventions) from how telemetry is sent over a network. Accepting that Event fields are different from attributes on other Signals would result in:

Another aspect of the discussion is whether including Event fields in the global attribute registry is unnecessarily cumbersome for people looking to contribute Event semantic conventions. Do we want a bunch of domain-specific one-off attributes? Do we want contributors of Event semantic conventions to have to understand the universe of attributes such that they re-use an existing attribute when the concept has the same semantics? Do we want domain experts to have to define their Events in collaboration with semantic convention maintainers? Yes, yes, and yes. The process of defining semantic conventions is intended to be low overhead and accommodate domain specific concepts. Defining Event semantic conventions should be subject to the same process and scrutiny as any other convention. There’s no difference between a group wanting to define domain specific Event conventions and wanting to define domain specific Span conventions. Both require defining how domain specific concepts fit into semantic conventions.

Up to this point, I’ve been expressing my own opinion. However, we talked about this issue in the 11/15/23 TC meeting and there was agreement among the present TC members that Event fields should follow the same process as other attributes defined in semantic-conventions. We discussed that OTLP as currently defined may not be suitable for certain extremely resource constrained environments, and that the Client WG in collaboration with the Event WG should work to prototype changes to the protocol to accommodate these environments. Ideas included various versions of the symbol dictionary, or using a columnar format like the OpenTelemetry Arrow Project, or even a stateful domain specific protocol. Note that this opinion from the TC reflects consistency on a previous vote about whether metric attributes should be namespaced. The last comment on that issue is particularly relevant here:

I think the main decision here was that they are conceptually namespaced and that's how we will describe them in semantic conventions. On this basis, any transmission optimizations can still be built.

An interesting aspect of this problem is that it is not limited in scope to the Client WG, or to Event WG. It involves questions of data modeling that squarely intersect with the Semantic Convention WG, it has implications on how the different telemetry signals interact with each other which is a clear specification level issue which is maintained by the TC, and it has implications on all maintainers since Events will be used outside of the client instrumentation space.

So far, we’ve heard a lot from the perspective of the Client and Event WG, and we’ve discussed this in the TC. I’m interested in whether there are any additional perspectives from other semantic-conventions contributors (other than the maintainers who are also on the TC) and from language maintainers? Are all the arguments on the table or are there additional things we haven’t yet considered?

MSNev commented 11 months ago

Up to this point, I’ve been expressing my own opinion. However, we talked about this issue in the 11/15/23 TC meeting and there was agreement among the present TC members that Event fields should follow the same process as other attributes defined in semantic-conventions. We discussed that OTLP as currently defined may not be suitable for certain extremely resource constrained environments, and that the Client WG in collaboration with the Event WG should work to prototype changes to the protocol to accommodate these environments. Ideas included various versions of the symbol dictionary, or using a columnar format like the OpenTelemetry Arrow Project, or even a stateful domain specific protocol. Note that this opinion from the TC reflects consistency on a previous vote about https://github.com/open-telemetry/semantic-conventions/issues/51 The https://github.com/open-telemetry/semantic-conventions/issues/51#issuecomment-1608853732 on that issue is particularly relevant here:

I think the main decision here was that they are conceptually namespaced and that's how we will describe them in semantic conventions. On this basis, any transmission optimizations can still be built.

I want to document my VERY STRONG disagreement with this approach on several levels that I'm not quite sure where to start...

Strongly agree:

Strongly disagree:

Possible "tweaks" which would make the "logical" prefixing of an event payload "fields"

A "general" OTLP transport improvement(s)

I'm going to (try to) stop now and I'm sure we will (and need) to discuss further in the Event WG (and other SIGs), for reference I have create a draft white paper (not yet public -- being reviewed and may not be able to be fully public) on why Microsoft is not yet using OpenTelemetry for the Web (some of this is summarized above, and I've talked in lots of SIG's about the reasons), this draft internal paper is 17 pages long with all of the issues and what is required.

And for clients breaking things down and trying to address (and / or punt) different concerns down the road doesn't work because they are all inter-related to different degrees because

To Date OpenTelemetry (appears) (IMHO) to have been driven by the "final" storage format used by many organizations and then creating wrappings around these "storage" formats for creating on servers to enable the data to be quickly dumped, and for high volume servers this is an excellent approach -- it just doesn't work for ALL client scenarios / languages.

jack-berg commented 11 months ago

There will be 1000's more "Custom" events then there will be OpenTelemetry defined events and requiring that every "Custom" event MUST have every "field" for the custom event prefixed is just not viable for most applications. This will "force" application custom level events to end up using AnyValue just to avoid the prefix for the EXACT same reasons that I'm pushing back on this for the OpenTelemetry defined events.

No. The attribute naming rules use strong normative language OpenTelemetry authors, but we can't enforce what application authors do for custom events. At best, we can define recommendations. For events defined outside OpenTelemetry and bridged over, we should minimally transform the original payload with mapping rules. The naming rules for are OpenTelemetry semantic conventions.

Bundle sizes in some languages (runtime constraints on the size of the code - ie browsers), strings are not compressable Transport payload serialization is a SUPER CRITICAL time sensitive Prefixing values to runtime defined fields just causes additional memory (and CPU) usage which are then batch (holding onto that in memory) and then during transport serialization this generates further excessive bytes

We shouldn't jeopardize the coherence and clarity of the telemetry semantic conventions for the requirements of extremely constrained environments. We must decouple data modeling from transmission, and work to accommodate constrained environments with transmission level solutions.

Possible "tweaks" which would make the "logical" prefixing of an event payload "fields". All of the payload fields for the events are "grouped" as the payload for an event and when present in the global registry they are "listed" with the prefix, BUT are NOT sent in this manner, they are still SENT (without the prefix) in a payload Attribute (event.data / body)

This is an example of a "transmission optimization" described in this comment, which is compatible with the notion of treating event fields as attributes.

Remove the "need" for every attribute to have it's type (at least for the JSON representation - binary formats will always need to define the "encoding")).

You're describing a breaking change to the OTLP JSON protobuf encoding. A variation which doesn't include the type is possible with a new protocol. This is still possible, since as this comment points out, "any transmission optimizations can still be built".

the specifications drive the internal representations in-memory representation and the code required to support the transport format along with the specification also drives the in-memory representations and the code required to support the combination of the above drives the conversions (DTO's) required to go from the user (application) input, to the enforced specifications (readonly in some cases) and the into a form that can be serialized and the serialization process can then require further "processing"

It sounds like your issues go well beyond the representation of Event fields, bubbling into the SDK specification. The SDK specification are stable of course, and its hard to imagine evolving them in a way that would accommodate the extreme requirements you describe. If the issues exist for all client use cases, it may suggest that a domain specific SDK might need to be developed for the specialized conditions which is not subject to the normal SDK specification language. If the issue only occurs in client application of a certain scale (or under certain constraints), it may be worth considering developing a custom SDK for your specific use case. The OpenTelemetry decoupling of API and SDK was designed for this purpose.

martinkuba commented 11 months ago

My view is the following - I agree that we should decouple the conversation about data modeling and performance overhead. There is no guarantee that attribute names without namespaces will be enough to address performance requirements of some applications. This should be addressed by choosing different exporters/protocols and perhaps even different SDK implementations, and application owners can choose which makes the most sense for their requirements. I also think that we need to just focus on data modeling at this point to make progress.

With that said, I am not 100% convinced that event payload fields need to be in the same global registry as regular attributes. Currently, attribute semantic conventions have a global scope, while event fields are scoped to the particular event. The combination of the event name and a field name should be enough to understand what the field represents. And in order to enable reusability, perhaps it should be allowed to use an attribute from the global registry as an event field.

I want to give one example. Assume that there is an event that represents navigation in a browser environment. We want to capture the previous URL and the current URL. We cannot reuse url.full here because its name does not capture the meaning of those fields (even though the value it holds does). The fields should be something like previouls.url and current.url. Does it add a value to add this to the global registry?

I would like to hear from @open-telemetry/specs-semconv-maintainers and @open-telemetry/specs-semconv-approvers about this topic. If there is a consensus that event fields should be in the global registry from this group and the TC, then I will be in favor as well in order to make progress.

pyohannes commented 11 months ago

Should fields in the event body be defined in the global attributes registry?

If the answer to this is "no", where else would those fields be defined? Wouldn't they be defined at all? Or would there be another event-specific registry independent of the semantic conventions attribute registry?

AlexanderWert commented 11 months ago

I agree with everything that @jack-berg wrote in https://github.com/open-telemetry/semantic-conventions/issues/505#issuecomment-1815289100.

Some additional perspective:

I want to give one example. Assume that there is an event that represents navigation in a browser environment. We want to capture the previous URL and the current URL. We cannot reuse url.full here because its name does not capture the meaning of those fields (even though the value it holds does). The fields should be something like previouls.url and current.url. Does it add a value to add this to the global registry?

@martinkuba In ECS, we have the concept of reusing field definitions in other namespaces to address exactly this use case. For example, we define geo.* fields here and in addition specify in which other namespaces these fields can be nested. So, on a single event you could have something like client.geo.location and server.geo.location to differentiate the context.

We also discussed in the SemConv WG to introduce a similar concept in OTel semantic conventions and have agreement that it's something we want to do. (Next steps is to work on a proposal for the syntax and tooling to realize it). Maybe that would help with the above concern.

scheler commented 11 months ago

A big portion of ECS (and the value it provides to the ECS community) comes from the area of describing events and their payload cross signal and cross use cases.

@AlexanderWert can you please point to an example in ECS describing the fields of payload for a given event type?

AlexanderWert commented 11 months ago

@AlexanderWert can you please point to an example in ECS describing the fields of payload for a given event type?

@scheler Sure, it's mainly defined by the Elastic integrations which are with their documentation pages to some extent comparable to the semantic conventions in OTel.

Let's consider a very concrete example: NGINX Ingress Controller. That page describes in very detail, how an access log from the ingress controller is parsed into ECS fields. In OTel terms, that page describes the semantic conventions for NGINX Ingress Controller logs.

(Please bear in mind that with ECS there's no differentiation between top-level fields, attributes and payload as it is with OTel / OTLP. But, that doesn't make a difference for the below reasoning regarding payload).

In the "Exported fields table" you will find both:

Some inspirations we can get from this example are:

  1. where meaningful, we should use generic semantic conventions fields
  2. sometimes it makes sense to have use-case specific fields
    • BUT: those should still have a meaningful namespace / prefix to be easily searchable / filterable without the need of additional attributes to define the context. (e.g. nginx_ingress_controller.access.upstream.name instead of just upstream.name)
    • AND: even the domain specific fields need to be documented in semantic conventions in a consistent way (as it helps users to better understand and consume data)

As part of the ECS contribution we also hope to contribute semantic conventions and parsers / integrations for many of these integrations to the OTel ecosystem (which will turn the ECS addition into actual, consumable value for the community). The event / logs signal will be crucial for that.

martinkuba commented 11 months ago

If the answer to this is "no", where else would those fields be defined? Wouldn't they be defined at all? Or would there be another event-specific registry independent of the semantic conventions attribute registry?

@pyohannes They would still be defined here, but along with the definition of the event. For every event name, there would be a list of fields that should be included in the payload for that event. I suppose this is similar to how attributes are associated with metric names right now.

martinkuba commented 11 months ago

@AlexanderWert Thank you for the context around ECS. This makes sense, especially if we are looking to adopt existing proven conventions from ECS.

I think based on that it comes down to two questions: where are event-specific fields defined and should they have unique namespaced names?

I see that ECS can combine common fields (akin to OTel global-registry attributes) and fields specific to the integration. The NGINX-specific fields are defined along with the integration, but not in the ECS schema? In the OTel semantic conventions, if we have a document that describes an event with its payload fields, and one of those fields is specific to this event (similar to nginx_ingress_controller.access.upstream.name), does this field need to be defined elsewhere and only referenced here? Or, would we follow the ECS example where the specific fields are defined along with the integration?

BUT: those should still have a meaningful namespace / prefix to be easily searchable / filterable without the need of additional attributes to define the context. (e.g. nginx_ingress_controller.access.upstream.name instead of just upstream.name)

I think the key takeaway for me is that we want all field names to be self-descriptive without the need to know the event name. Is this correct?

The way I have been thinking about this is that the payload of the event is always interpreted in the context of the specific event. We would not be looking at individual payload fields independently. In theory, the payload could contain other type of data than a list of name/value pairs (the LogRecord body is AnyValue). So, interpreting the payload data will always be done in the context of the given event.

MSNev commented 11 months ago

I think based on that it comes down to two questions: where are event-specific fields defined and should they have unique namespaced names?

I agree, based on what Alex is saying about the repetitive use cases into semantic fields that have a common meaning this is exactly what the additional attributes are for. But this is NOT what the payload should be used for.

The payload is as @martinkuba defines above is for the event-specific fields, and it is up to an event definition to identify what specific fields it requires.

tigrannajaryan commented 11 months ago

Performance

So, first, I think in order to consider performance arguments we need to see the data, measurements that compare different approaches. I don't see that data, which makes it impossible to assign relevant weight to the performance argument. I am going to ignore performance aspects for now until we see the supporting data. We can reconsider the design when there is additional evidence from benchmarking.

Design Principles

For now I think we should base our design on the following principles:

Proposal

Based on the principles I believe we should do the following.

For events that originate externally:

For greenfield Otel events:

For all events:

breedx-splk commented 11 months ago

In the interest of not blowing up the conversation too much, I have added some thoughts on this subject here: https://gist.github.com/breedx-splk/14bcc08a8cbcdef5b635cb5d22989ccc Best move is to read it top to bottom (it's not too too long, I promise) but here are the highlights for the distractible and overworked:

That doc doesn't at all address the idea of putting event bodies in different places depending on whether or not they are otel native. That idea is baffling to me. I think having a single consistent place to look for the event body makes way more sense than having it somehow be conditional.

tigrannajaryan commented 11 months ago

@breedx-splk I do not understand the definition of metadata you have and the example you bring.

What about an example event? Let's make up an event for popping a balloon. We decide that the important things about a balloon popping are: balloon diameter

What about metadata? Surely, there could be some supplemental information about this event, like: balloon color

Why is "balloon diameter" an event data and "balloon color" a metadata? What makes "balloon color" a "data about other data" (what other data?), but "balloon diameter" NOT a "data about other data"?

[UPDATE]: Re-reading the proposal again, it seems the separation is between the required data and supplemental (optional) data. Is that a fair litmus test to use to understand the difference between the "event data" and "metadata" in your proposal? If that is not the correct litmus test can you please define one?

martinkuba commented 11 months ago

I think the CloudEvents spec describes this well: attributes are used for context and custom data, while event data (payload/body) contains "domain-specific" information. By domain-specific I understand the data that is inherent to the event - the event can still be interpreted without its attributes but not without its data.

The button click event in @breedx-splk's examples illustrates this well. Information about the button (e.g. its name) specifically describes the event, while screen dimensions or session ID describe the context in which the event occurred.

With that said, I would like to add an argument in favor of separating the event data from other attributes. Systems that interpret events might (very likely) want to handle the event data separately from the other attributes, whether for storage, displaying in a UI, or doing some additional processing. If the event data is combined with other attributes, how would they be distinguished? The consumer would have to be aware of all field names of all events it wants to handle, or all names that are only used for context. I suppose there could be a convention like everything prefixed with the name of the event, but that does not sound like a clean solution.

tigrannajaryan commented 11 months ago

I think the CloudEvents spec describes this well: attributes are used for context and custom data, while event data (payload/body) contains "domain-specific" information.

The CloudEvents' Context has a special purpose. It is serialized independently from event data. I think Otel already has the equivalent to that: we have our own Context. Saying that the Context (whatever it is) should be recorded in the LogRecord Attributes field does not help with the CloudEvents' stated goal of independent serialization and is not necessary.

I do not see "custom data" mentioned anywhere in CloudEvents spec. Please clarify what you refer to.

@martinkuba It would be very useful to see a clear definition that explains the difference between attributes and event data.

You say:

By domain-specific I understand the data that is inherent to the event - the event can still be interpreted without its attributes but not without its data.

Is this the definition of what constitutes event data vs attributes? If it is than I am afraid it is going to be hinged on what "interpreting" means.

In the examples of @breedx-splk's proposal "session.id" is shown as supplementary (so not "inherent" to the event). I can argue both ways: a) "session.id" is not necessary for interpreting the event, b) "session.id" is absolutely necessary for interpreting the event.

Which way I argue depends on what is it exactly that I am doing when I am "interpreting". If I am trying to build the click heatmaps in my UI, I don't need the "session.id", I just need coordinates of the click events. If I am trying to replay a particular user's actions over time I absolutely need the "session.id", I cannot interpret without it.

So, to use the definition that you brought ("necessary for interpreting") I think we need to say more concretely what "interpreting" means.

martinkuba commented 11 months ago

@tigrannajaryan

I do not see "custom data" mentioned anywhere in CloudEvents spec. Please clarify what you refer to.

I was referring to the Extension Context Attributes. By context, I also meant any data that provides additional context about the event, such as the environment.

jack-berg commented 11 months ago

@breedx-splk I read your writeup - thanks for clearly laying out your thoughts. Some comments:

You try to lay out the distinction between the event and meta data with a couple of examples: a balloon, a person, and an example button click event. In all these cases, the distinction between payload and metadata is not obvious:

In all these cases, the difference seems to be somewhat arbitrary. As @tigrannajaryan points out, I think for the distinction to be made between the payload and other attributes, we need a convincing litmus test. So far, the ones I've seen are hand-wavy.

You mention the following:

When handling an entity with a known schema, the handler does not need to try to process unexpected fields or work around unknown types.

How do you reconcile that with the need to be able to evolve event schemas, and add additional fields over time? A consumer may expect version 1.0.0 of the event schema with includes fields foo, bar, but a client might emit events with version 1.1.0 of the schema, which includes foo, bar, baz. Whether the payload is separate from attributes or not, processors will need to be able to handle fields they don't recognize.

The option 5 you present has no listed cons. I can think of a few: Terse fields have the negatives listed here. One thing we discussed in the SIG is the ability to opt into including the original unadulterated click data as provided by the browser in the OpenTelemetry event. The body is the natural place for this as it matches the description from the log data model:

However, a structured body SHOULD be used to preserve the semantics of structured logs emitted by Third-party Applications.

Any solution in which the body is used for the first-party event fields requires another opt-in field in the schema to be able to accommodate this behavior. This is part of what convinced me of @tigrannajaryan's proposal. I now imagine event semantic conventions as each defining a name, set of attributes, and description of where the body comes from (i.e. typically a link to the API used to populate it).

breedx-splk commented 11 months ago

@tigrannajaryan has said

Why is "balloon diameter" an event data and "balloon color" a metadata? What makes "balloon color" a "data about other data" (what other data?), but "balloon diameter" NOT a "data about other data"?

and @jack-berg said something very similar

Why is the balloon diameter and surface material intrinsic to the balloon but the color is metadata?

So the designer of an event schema is tasked with deciding which fields (from the set of all available things in the known universe) belong to this event. For some events, the name may be enough, and the required set of fields is just the empty set. For others, the designer must choose which fields are necessary (required) for for their specific problem domain, which fields are nice to have (optional), and which fields are not included (everything else).

The balloon example is contrived, so I mentioned let's assume that the balloon committee agreed that the above is the important stuff for most purposes.. I didn't explain the reasons why the committee decided (after several long and stressful meetings) to not include the color, but you could imagine some contrived reasons why...

If there needs to be a litmus test for whether something is a core piece of an entity or if it's metadata, I suppose it involves answering these questions:

  1. Is this thing a fundamentally important component of the entity? How do you define fundamentally important? Who knows....philosophers maybe.
  2. Is the entity something less without this component?
  3. Is the entity harder to recognize without this component? Is the entity more recognizable with it?
  4. Is the entity less useful to a consumer without it?
  5. Would a typical consumer be surprised to see this component? Would they be surprised without it?
  6. Is the component critically useful or merely distracting?

I'm not convinced that there exists a singular test that can be performed to make design decisions for data modeling or event design purposes. Design is a craft that requires art, experience, and consideration.

The event designer is deciding what constitutes the event based on the needs of the emitter ("I want to communicate this important thing outward") and the requirements of known receivers ("I want to handle events that have these qualities"). Events should be designed to address something in a problem space.

I think nobody would be surprised to find a coordinate in a mouse click event. They might be surprised though if it was a button click event. Those are different things that serve different purposes. If I receive a person entity and look inside and find a field called lawn_size I will probably laugh...and I've never seen that in the real world...but then again, I've never written software for a landscape architect.

If it does, in fact, turn out that the balloon's color is helpful when processing these pop events, then by all means we should improve our schema design. Similarly, if we think we'll never mess with mylar, let's remove the surface material while we're at it.

breedx-splk commented 11 months ago

@jack-berg said

processors will need to be able to handle fields they don't recognize.

Yeah totally, I agree. I concede that point as a tad flimsy...but there are pathological cases where entities have all this extra stuff that most consumers don't want or need. Heavyweight burdensome cruft. It's nice not to have to deal with that.

The option 5 you present has no listed cons. I can think of a few:

Agreed that those items you called out are applicable to Option 5. I don't agree that all of those things are outright cons, but I understand where you are coming from. I'll try and keep my take short.

I've added a link to your comment as "cons" in my linked gist. Thanks for the input!

MSNev commented 11 months ago

Performance

Formatted response on the performance aspects of this (everyone should have comment access) https://docs.google.com/document/d/1fPMCCCsFFPOq02fZ2OfjFDKlZs7xIxLkhl7G3rnJ5Kg/edit?usp=sharing

MSNev commented 11 months ago

@jack-berg said

processors will need to be able to handle fields they don't recognize.

Yeah totally, I agree. I concede that point as a tad flimsy...but there are pathological cases where entities have all this extra stuff that most consumers don't want or need. Heavyweight burdensome cruft. It's nice not to have to deal with that.

And at Microsoft we do exactly that, we support the concept of "Custom Events" (ie. application derived events), we don't control what "fields" they include, we provide the ability to send a "custom event" and all of the payload fields are extracted out effectively into columns that the application owner can use to query however they like across these custom events. The only "mandatory" field we require is the custom event name as any form of visualization / comparison (generally) only make sense within the context of the same event.name. i.e. It would generally not make sense to compare fields within a balloon pop events with production rate of pins, unless of course if you make both pins and ballons in the same location 😃

jmacd commented 11 months ago

I'll bring in a few additional perspectives to see if it may help.

"Performance concerns" mean different things to different people. I don't need to see anyone's data to be convinced that a smaller encoding is better for performance.

Speaking for the OpenTelemetry Protocol with Apache Arrow project, I think we should get away from using the cost of a "fully-expanded" representation to justify abbreviation. In general, we should not be encoding the name of the attribute with every data item on the wire, whether the name is fully qualified or a short, event-specific field name.

Everyone arguing in this thread could have what they want, if OpenTelemetry would get better about encoding schema information. The schema in today's https://github.com/open-telemetry/otel-arrow does not specialize for the logs Body field having recursive structure, but it could be extended in that direction. What we should be aiming to see, I think, is that only the variables of an event need to be encoded, exported, and sent.

This is what you get using Arrow-based transport. By the time you are exporting that event, it will be decomposed into a number of rows; the schema of each row will be known and transmitted once per stream, and from a performance perspective it won't matter whether:

  1. Field names are fully-qualified or implicitly qualified by event "identity"
  2. Fields are nested in the body
  3. Fields are expressed as attribute values

The OTel-Arrow group would be glad to discuss this; it's something @lquerel experimented with before the project donation.

So, should event fields be part of the global attribute registry? No. I think events should have schemas. We should transmit schemas once per stream, we should transmit variables once per event, and OTel-Arrow will take care of reconstituting the fully-expanded representation at the end of the pipeline, where it's somebody else's performance concerns.

It is a separate concern how schemas should be encoded and registered, and so on. This is currently under discussion in relation to a proposal, also by @lquerel, in Introducing Application Telemetry Schema in OpenTelemetry - Vision and Roadmap and the #otel-weaver channel. We think it would be best if the schema was used to compile strongly-typed interfaces, which would compile directly into an efficient implementation, that would alleviate the performance concerns found here.

Reading the gist @breedx-splk, I find myself with a few more questions.

Timestamp: is relevant, just not under debate, is that correct? Otherwise, is there is a way to order events created by a particular resource?

but for the purposes of this discussion, because they are not relevant, let's drop timestamp and severity,

Session ID: I haven't forgotten https://github.com/open-telemetry/opentelemetry-specification/issues/2500 and I still want to see us store session-id as a non-identifying resource attribute. Again, this is something we could accomplish with more schema information. The present ("weak schema") definition for resource prevents us from adding ephemeral and other descriptive information with cross-cutting applications, and I want to see us fix that.

Option 6: speaking from a past life, I could at least imagine more options than those enumerated in the gist linked above. For option six, I propose to add a couple of things to the OTel log record:

MSNev commented 11 months ago

@jmacd "Performance concerns" mean different things to different people. I don't need to see anyone's data to be convinced that a smaller encoding is better for performance.

100% agree.

And for clients there are many additional concerns which are not currently addressed, and unfortunately something like Apache-Arrow (at this point in time) is a non-starter at many different levels which include

There are also some good words and some concerns in your response as well.

I think events should have schemas 👍

We should transmit schemas once per stream, we should transmit variables once per event

If this means that for every "request" getting sent from a client, then this is problematic and depends on "how" the schema is encoded on the client. Ideally, it's not included and instead it's either "inferred" (by a name) or just referenced url. The other issue is that a "client" (browser specifically) will be sending multiple requests (streams?) per page load and each page load will be completely isolated from each other so there is no "context" of what has already been sent.

OpenTelemetry would get better about encoding schema

Encoding is also problematic, as the "time" required to serialize into said encoding is one of those critical performance concerns for client environments. Ideally, this needs to use a native environment library (for browsers the most common is JSON.stringify()) as there is no faster way to convert an object hierarchy into a serialized form.

Could Arrow be a solution (or part the of) -- maybe, but not today (from what the JS and Client Sigs' have currently seen -- moslty because of that required "bundle" size) so the following comment and the result would be key to that

We think it would be best if the schema was used to compile strongly-typed interfaces, which would compile directly into an efficient implementation, that would alleviate the performance concerns found here.

tigrannajaryan commented 11 months ago

Formatted response on the performance aspects of this (everyone should have comment access) https://docs.google.com/document/d/1fPMCCCsFFPOq02fZ2OfjFDKlZs7xIxLkhl7G3rnJ5Kg/edit?usp=sharing

Thanks @MSNev, this is very useful. I have put together a quick comparison test here: https://github.com/tigrannajaryan/exp-jsevent

It generates synthetic events based on github.com stats you provided. It would be great to modify this to use more real event composition to make sure it is a fair comparison.

So far I have this results:

In original, non-encoded form the flattened events are 85896-32550 = 53346 bytes larger (85896/32550=2.64 times larger) than in nested form (both JSON encoded).

In dictionary-encoded form the flattened events are 23389-21426 = 1963 bytes larger (23389/21426=1.09 times larger) than in nested form (both JSON encoded).

Dictionary encoding time: about 1.7ms.

I am probably the worst JS coder in the world. It would be great if someone who knows what they are doing would take this and turn into a proper benchmark (and verify that I don't have measurement bugs and the comparison is fair).

The question I have: is 53KB (or so) extra RAM, 1.7ms extra encoding time and 9% larger JSON payload (before any gzip compression) a dealbreaker?

scheler commented 11 months ago

@tigrannajaryan where did you run this test? Is it on your laptop? Please see the article on how slow some CPUs are in the mobiles of users around the world.

https://calendar.perfplanet.com/2023/wikipedias-worldwide-web-cpu-benchmark/

tigrannajaryan commented 11 months ago

@scheler it would be great if you can help produce benchmarks on relevant devices (and with more realistic payloads, mine is probably not a good approximation of real world). It will make the performance argument stronger.

lquerel commented 11 months ago

@MSNev Currently, the protocol based on Apache Arrow is not optimal for browser environments due to the current size of the Apache Arrow JS library and the specific lifecycle of web pages, which makes it hard to offset the initial cost of the Arrow schema. In the future, these two issues could be addressed by creating a variant of this protocol for constrained environments like browsers. However, this is a separate project that will take time to develop, so I won't focus on it right now.

As @jmacd highlighted, the proposed Application Telemetry Schema could address various issues discussed in this thread.

1) The Application Telemetry Schema allows importing and referencing semantic conventions and also defining application-specific events and attributes. 2) Once "resolved", an Application Telemetry Schema is self-contained and can be referenced by a URL, which can be used when emitting signals (so the schema itself will never be sent from the browser). 3) A Client SDK generator could be used to create an optimized Client SDK specifically for the application's needs, ideally containing only necessary code without any generic code. 4) Regarding the size of field/attribute names, it's possible to establish a dictionary in the resolved schema to maintain OTel conventions and ease of field interpretation without impacting performance. Browser-emitted messages referencing the schema by URL could use the numeric encoding representation of fields defined in the resolved schema. This approach is compatible with the short lifecycle of web pages, with the dictionary being known in advance and present only server-side.

I believe this approach meets performance constraints while preserving semantic conventions and the need to define fields/attributes that are easily interpretable and rich in metadata.

breedx-splk commented 11 months ago

@jmacd

Timestamp: is relevant, just not under debate, is that correct?

Correct. I just wanted it off the table while we discuss the other bits.

MSNev commented 11 months ago

@lquerel

Currently, the protocol based on Apache Arrow is not optimal for browser environments due to the current size of the Apache Arrow JS library and the specific lifecycle of web pages

This was our (OT Sig's) observation as well from looking at the 1000ft level with the details we had, so it good to see that we didn't miss anything.

Application Telemetry Schema

Yes, I agree, the definition of "events" (especially application custom events), has a very high degree of correlation with what we are trying to do in the Event WG

Do we "need" a schema, we had been going down the initial path of not "requiring" one. Would it be useful as a generic definition -- absolutely. Should the schema be required to be included in the outbound telemetry for events to be useful -- no (on the "assumption" that we have a set of "well known" (and defined) events where the event.name is unique enough to uniquely identify each event (which I believe we have documented this already in the spec), so then backends could reference any published schema to "handle" these well known event definitions.

lquerel commented 11 months ago

@MSNev

Should the schema be required to be included in the outbound telemetry for events to be useful -- no (on the "assumption" that we have a set of "well known" (and defined) events where the event.name is unique enough to uniquely identify each event (which I believe we have documented this already in the spec), so then backends could reference any published schema to "handle" these well known event definitions.

Yes, events without a strict schema are useful as they are. But to further optimize this scenario, and to facilitate interoperability between systems, I think that yes, a schema is useful and not necessarily constraining. This is essentially what has made protobuf or thrift successful at the backend level, and it is also the direction that REST interfaces have taken with OpenAPI (and TypeSpec).

jmacd commented 11 months ago

@MSNev about the use of "performance arguments", I want to try and maintain the following balance:

To say that Arrow is a non-starter, to me, breaks this second principle. No one said you should use an Arrow library to accomplish this. The Arrow data specification combined with OTel-Arrow definitions is what you will use; I expect you to compile a custom event interface that is aware of your schema; I expect you to use WASM, and just as much code as necessary. Also, I won't be surprised if Arrow interfaces become standardized for browser environments. If 1000 bytes is important to you, I expect you have the resources to find a solution to this problem.

tigrannajaryan commented 11 months ago

One more thing regarding the Schema files and Body field.

The Schema files currently allow describing the changes to Attributes field, but not to the Body field. If we put any data in the Body field the schema of which we control we either 1) will have no ability to describe changes that happen to such data or 2) we will need to add the ability to describe changes in the Body field to the Schema File.

The later is additional complication and likely will be more complex than how Schema files describe Attributes (because in Body filed we will allow nesting of fields and it creates more variations of data changes).

MSNev commented 11 months ago

Understand, I think what we are thinking about in terms of schema (specifically along the lines of an application schema), is more akin to the way that Cloud Events describes it for the definition of the payload.

tigrannajaryan commented 10 months ago

@scheler I have rerun the test on the slowest phone that is available to me (Oneplus, 2017 model year) and I get around 3ms of encoding time for that synthetic 121 event batch that I modelled after @MSNev's example from github.com.

Updated benchmarks, with comparison to JSON.stringify, standard OTLP JSON and a new encoding that uses dictionaries: https://github.com/tigrannajaryan/exp-jsevent/blob/main/README.md

Anyone who has access to slower devices feel free to rerun the benchmark. So far the numbers I am seeing look acceptable to me.

scheler commented 10 months ago

@tigrannajaryan thanks for creating the benchmark tests. I checked that Oneplus 2017 is about twice faster than Moto G5 that's used as a reference in the wikipedia cpu benchmark tests, and noting that in South Africa 25% users have devices slower than Moto G5, we can deduce that this encoding could take ~8-10ms on slower devices, compared to 1ms on faster devicds. Martin also pointed out that this is "additional time", on top of the time the original apps already spend, in batching, serializing and sending the content out. People have performance budgets for the whole app and so numbers matter when choosing additional libraries to integrate into their app.

It's interesting you included JSON as-is in your benchmarking tests - it's only slightly larger than "OTLP with key dictionary" when used in the form "Nested, in Body, no namespaces" and also about half the size of OTLP JSON. The payloads will be lot smaller if we use key dictionary on top of JSON as-is. I guess it makes a case to support JSON as-is in OTLP irrespective of any further encoding. This will be helpful for cases where protobuf is not used, for eg., pages that automatically redirect to others give you very little to do anything. I noticed this was discussed in the Event SIG last week - will this be a v2 of OTLP JSON or this will have to be a transport encoding on top of OTLP JSON?

tigrannajaryan commented 10 months ago

It's interesting you included JSON as-is in your benchmarking tests

@scheler yes, I intentionally included it as it is likely the lower bound of what is possible from encoding duration perspective since json.stringify() is just a single call to a native function. The difficulty with as-is encoding is that it essentially fully couples the in-memory representation of events in the SDK with the wire protocol and once published makes difficult/impossible to refactor the SDK's in-memory data models since that will break the wire format. It may be acceptable for hyper-optimized Otel implementations but likely is a no-go for a generally useful Otel SDK and protocol that needs to evolve over time.

The payloads will be lot smaller if we use key dictionary on top of JSON as-is

I suspect "as-is with dictionary" may have roughly the same encoding execution time as the "OTLP with key dictionary". I may be wrong, it is worth experimenting with. I posted the benchmark source codes so that anyone is free to play with it and try adding new formats for comparison.

it makes a case to support JSON as-is in OTLP irrespective of any further encoding. This will be helpful for cases where protobuf is not used, for eg., pages that automatically redirect to others give you very little to do anything.

I am not sure I understand this part. Can you elaborate?

scheler commented 10 months ago

it makes a case to support JSON as-is in OTLP irrespective of any further encoding. This will be helpful for cases where protobuf is not used, for eg., pages that automatically redirect to others give you very little to do anything.

I am not sure I understand this part. Can you elaborate?

For short-lived pages - that immediately redirect to another page or pages that have a high chance of users navigating away immediately - we can deliver a lighter version of javascript bundle that includes just one instrumentation and a minimal SDK with only Events support, no batching, no protobuf, to fire off a pageview event that can be exported immediately. Since it's just one event, OTLP JSON should be okay if JSON as-is is not possible for the reasons you outlined.

martinkuba commented 10 months ago

A few more thoughts on event data as attributes or body:

There will be three kinds of events: OTel-defined events, 3rd-party events, and custom events (added by the application via an API).

Consistency - We have talked about putting the data of a 3rd-party event in the body, and the data of OTel-defined event in attributes. I think this is confusing and inconsistent for backends processing events. If there is such a thing as event data, then it should be in the same place for all events, so that backends do not have to treat different events differently.

Regarding the litmus test of what data goes in event data vs attributes - I think this is independent of where it should go in the data model. For OTel-defined events, we will still need to document what fields are associated with a specific event. I expect that each OTel-defined event will have a document in the semantic conventions documenting its name and the data (e.g. list of fields) that it carries.

IMO, the only confusion of whether to put some data in the body or attributes would happen with custom events. In that case, the end user has control and they can decide where to put their data based on how it is used. If we combine event data with other attributes, and want it to not be confusing, should we not have a concept of event data in the API? But we still need it in the API for 3rd-party events?

I think the main question for me is how events will be processed and stored. Will there be use cases where the backend needs to distinguish between event data and attributes? If yes and we do not separate them in the transport, then we will make this use case impossible, at least for custom events - assuming that a schema could be used to extract event data attributes for OTel events, and for 3rd-party events it would be in the body.

jack-berg commented 10 months ago

Both in today's 11/12/24 event SIG and the previous 12/8/23 meeting, we just discussed the idea of start with a set of principles and letting the design fall out of that. The idea being that designing without a guiding set of principles can end up feeling subjective, be hard to defend, and lack coherence.

I propose the following design principles, ordered in terms of priority (high priority items first):

  1. Symmetry with existing OpenTelemetry signals and concepts
  2. Data modeling is decoupled from the transport
  3. Suitable for different languages, different runtime environments
  4. Intuitive API for users
  5. Intuitive data model for consumers

From these, the following design falls out. I'm being brief because the characteristics of these things has been discussed elsewhere. The numbers in parenthesis indicate the number of the design principle that informs that decision.

martinkuba commented 9 months ago

Here is a counter-proposal, separated into two parts that I think can be decided separately.

Separation of event data and attributes

Design principles:

  1. Consistency: OTel-defined and external events should be handled the same way.
  2. Intuitive / clear design: It should be clear how to understand and process an event without following a complicated set of rules.
  3. Flexibility: the data model should make it possible for backends to process event data separately from other attributes in a clear way. We cannot anticipate all use cases of how backends will handle events. By separating event data from attributes, we make it possible for different use cases of ingesting events to be implemented.

Proposal:

Naming of event data fields for OTel-defined events

Design principles:

  1. Make it possible for data of events to be defined in a way that can meet requirements of specific domains (e.g. minimize data size in client applications).

Proposal:

MSNev commented 9 months ago

I've not had time to formally enumerate my design principals, but they will be based on my original comment https://github.com/open-telemetry/semantic-conventions/issues/505#issuecomment-1810865899 which we can talk about in the SIG.

breedx-splk commented 9 months ago

I agree with a lot of the above! I'll offer up my approach:

Design principles:

Rather than make a proposal yet (wherein my $0.02 adds a circle to the growing Venn diagram), I want to offer some push back / alternatives to existing/above ideas.

MSNev commented 8 months ago

We will open new issues related to the outstanding issues here as the original question is now answered.

jack-berg commented 8 months ago

What's the resolution?

MSNev commented 7 months ago

The answer to the question posed Should event fields be part of the global attribute registry?

The answer is No, the payload fields should not be part of the global semantic registry, we are moving forward to define how the payload fields will be defined. https://github.com/open-telemetry/semantic-conventions/issues/755

martinkuba commented 7 months ago

Part of what is discussed here is whether the event payload should be represented in the log body or as a number of individual attributes. This has already been addressed in https://github.com/open-telemetry/semantic-conventions/pull/566 by specifying that the event payload should be in the log body.

Modeling the event payload is currently being worked on (https://github.com/open-telemetry/semantic-conventions/issues/755 as @MSNev pointed out).

I am closing this issue since it appears the discussion here has been exhausted, and more specific discussions are now happening in other issues. But if anyone wants to continue this conversation here, please feel free to re-open this issue.