open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0

Proposal: provide guidance on when to use Event API #3254

Closed jack-berg closed 3 months ago

jack-berg commented 1 year ago

In the 2/22/23 Log SIG we were discussing use cases for the Event API. It was suggested that some of the ideas could be modeled as regular LogRecords with semantic conventions rather than as events. While this is true, I do believe there are scenarios when using the Event API is more appropriate than the other telemetry signals. I think that by clarifying the properties of Events, we'll reduce confusion for users and improve confidence that the Event API is necessary.

How I see Events

To see where Events are appropriate, let's first examine the characteristics of the various telemetry signals:

These signals naturally cover a lot of ground, and if you squint at them just right they can cover even more. What types of use cases are left for Events? Events are NOT a good fit if:

Events might be the right tool for the job if:

Use cases

What types of use cases fit these criteria? Admittedly, there are fewer concrete uses for events than for spans, metrics, or logs, which is a testament to the versatility of those signals. However, there are some that don't quite fit into any of those buckets.

There are the original use cases submitted with the Event API OTEP, which include RUM events, Kubernetes events, collector entity events, and mappings from other event systems.

Additionally, events tend to be useful for representing changes of state in a system. Some that come to mind include:

Another set of use cases is to report data made available from the runtime:

Wrapping things up

Let's come to an agreement on when Events are appropriate to use, and provide this and example use cases as supplementary guidance.

Do other folks think differently about when and how to use Events? Are there more concrete use cases you know of? If so, let's discuss. We've got to come to a shared understanding and articulate that to users.

breedx-splk commented 1 year ago

Client-side use case: Crash Reports

In client instrumentation, we are working toward creating a data model for things like crash reports. When an application crashes, it is valuable to report back much of the device state for troubleshooting purposes. There is a doc that describes some of the existing vendor data models here.

In addition to obvious things that could be trivially flattened into shallow attributes, some implementations like to send information about all running threads. Each thread has thread details (name, id, stack trace), and thus the event model contains an array of thread info objects. This is hard to represent in the existing event data model without serializing the thread info to and from String representations.

For example, a demonstrative (non-comprehensive) crash report might look something like this, assuming complex/heterogeneous attributes:

event {
  timestamp: 1693953342,
  attributes: [
    name: 'crash',
    domain: 'device',
    event.data: {   // map with heterogeneous values
      device.name: 'pixel 6',
      battery.percent: 92,
      [...other details],
      threads: [ // list of maps
        {
          id: 123,
          name: 'main',
          is_crash_cause: true,
          stack: '<list of newline delimited stack frames>' // alternately rich object representation
        },
        {
          id: 902,
          name: 'scheduledWorker',
          is_crash_cause: false,
          stack: '<list of newline delimited stack frames>' // alternately rich object representation
        },
        [...]
      ]
    }
  ]
}
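To make the serialization workaround concrete, here is a minimal Python sketch. It assumes attribute values are limited to primitives, so the nested thread list has to travel as a JSON string. The helper names (`to_primitive_attributes`, `threads_from_attributes`) and the sample values are hypothetical and not part of any OpenTelemetry API.

```python
import json

# Illustrative thread info mirroring the crash report example above.
threads = [
    {"id": 123, "name": "main", "is_crash_cause": True, "stack": "frameA\nframeB"},
    {"id": 902, "name": "scheduledWorker", "is_crash_cause": False, "stack": "frameC"},
]


def to_primitive_attributes(device_name, battery_percent, thread_info):
    """Build attributes with primitive values only (hypothetical helper).

    The list of thread maps cannot be stored directly, so it is encoded
    as a single JSON string.
    """
    return {
        "device.name": device_name,
        "battery.percent": battery_percent,
        "threads": json.dumps(thread_info),
    }


def threads_from_attributes(attributes):
    """Decode the thread list again on the consuming side."""
    return json.loads(attributes["threads"])


attrs = to_primitive_attributes("pixel 6", 92, threads)
restored = threads_from_attributes(attrs)
```

Both the producer and every consumer have to agree on this encoding out of band, which is the cost the comment above is pointing at.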

Alternatives

The above example event is purely demonstrative; the actual implementation might be several layers deep, especially if thread call stack details (stack frame, class/module, line number, etc.) are available. As an alternative, this might also be flattened, but that is much more tedious and clunky:

event {
  timestamp: 1693953342,
  attributes: [
    name: 'crash',
    domain: 'device',
    event.data: {   // map with heterogeneous values
      device.name: 'pixel 6',
      battery.percent: 92,
      [...other details],
      thread.count: 4,
      thread.ids: [123, 902, 657, 202], // array containing all thread ids
      thread.names: ['main', 'scheduledWorker', 'clickTrap1', 'gc1'], // array containing all thread names
      thread.stacks: ['<stack1 text>', '<stack2 text>', '<stack3 text>', '<stack4 text>'],
      thread.crash_cause_index: 0,
    }
  ]
}

Like all things, there are pros and cons. Pros include reducing the number of times attribute key names appear in the output (e.g. the string "is_crash_cause" only appears once). Cons include increased complexity in reassembling the original structure from the flattened wad of arrays.
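As a rough illustration of that reassembly cost, the sketch below rebuilds per-thread objects from the flattened arrays. The key names come from the flattened example above; the function name `unflatten_threads` and the length check are assumptions made for this sketch.

```python
def unflatten_threads(data):
    """Rebuild a list of per-thread maps from the parallel arrays (hypothetical helper)."""
    ids = data["thread.ids"]
    names = data["thread.names"]
    stacks = data["thread.stacks"]
    count = data["thread.count"]

    # The parallel arrays are only meaningful if they line up index by index.
    if not (len(ids) == len(names) == len(stacks) == count):
        raise ValueError("thread arrays are inconsistent with thread.count")

    cause_index = data["thread.crash_cause_index"]
    return [
        {
            "id": thread_id,
            "name": name,
            "is_crash_cause": index == cause_index,
            "stack": stack,
        }
        for index, (thread_id, name, stack) in enumerate(zip(ids, names, stacks))
    ]


# Values copied from the flattened example above.
flattened = {
    "thread.count": 4,
    "thread.ids": [123, 902, 657, 202],
    "thread.names": ["main", "scheduledWorker", "clickTrap1", "gc1"],
    "thread.stacks": ["<stack1 text>", "<stack2 text>", "<stack3 text>", "<stack4 text>"],
    "thread.crash_cause_index": 0,
}

rebuilt = unflatten_threads(flattened)
```

Every consumer needs logic like this, including the consistency check, whereas the nested form carries its structure with it.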

tedsuo commented 6 months ago

It feels like this issue has been resolved? Events are simply logs that have a consistent name and attributes defined by conventions. That's what we landed on. Given that events are logs, I don't believe there can be any confusion between events and spans.

jack-berg commented 6 months ago

When is it appropriate to use the event API vs. a log API which is bridged into opentelemetry via log appenders? If I'm writing an application, should I consider replacing my log API with the event API?

That's the crux of this issue.

Merging #3858 does help clarify some things, since one litmus test for whether to capture a particular thing with events / logs, metrics, or traces is whether or not it contains complex structured data. If yes, a log record is currently the only choice.

> consistent name and attributes defined by conventions.

I thought event fields weren't attributes? https://github.com/open-telemetry/semantic-conventions/issues/505

tedsuo commented 6 months ago

> When is it appropriate to use the event API vs. a log API which is bridged into opentelemetry via log appenders? If I'm writing an application, should I consider replacing my log API with the event API?
>
> That's the crux of this issue.

That is an interesting question! So interesting that I fear it may be a philosophical bike shed. As a point of process, I suggest that we try to stick to practical guidance, and be wary of any deep dive into what constitutes the fundamental nature of an event vs a log. The fact that events are logs in OTel helps us avoid this debate.

> I thought event fields weren't attributes? open-telemetry/semantic-conventions#505

Sorry, what I meant to say is that semantically speaking, the rule is that both the attributes and the structure of the event payload must be consistent for each event name. (Events can have attributes in addition to a payload if it is helpful to have that kind of metadata.)

At any rate, the point I would like to make is that the only rule we have for events is consistency. We have no rules at all for logs. Therefore, the practical guidance we could provide is that you should use events when consistency is important to your use case. Otherwise, use logs and do whatever you like. Since our goal for defining semantic conventions is to provide consistency, we always use events for that purpose.
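One way to picture the consistency rule is a check that every event with a given name has the same payload structure. The sketch below is a hypothetical illustration in plain Python, not an OpenTelemetry API: `shape_of` and `ConsistencyChecker` are made-up names, and "same structure" is assumed here to mean the same keys and value types.

```python
def shape_of(value):
    """Return a hashable description of a payload's structure (keys and value types)."""
    if isinstance(value, dict):
        return ("map", tuple(sorted((key, shape_of(item)) for key, item in value.items())))
    if isinstance(value, list):
        # A list is described by the set of distinct element shapes it contains.
        return ("list", tuple(sorted({shape_of(item) for item in value}, key=repr)))
    return type(value).__name__


class ConsistencyChecker:
    """Remember the first payload shape seen per event name and compare later ones to it."""

    def __init__(self):
        self._shapes = {}

    def is_consistent(self, event_name, payload):
        shape = shape_of(payload)
        expected = self._shapes.setdefault(event_name, shape)
        return shape == expected


checker = ConsistencyChecker()
first = checker.is_consistent("crash", {"device.name": "pixel 6", "battery.percent": 92})
same_shape = checker.is_consistent("crash", {"device.name": "pixel 7", "battery.percent": 40})
different_shape = checker.is_consistent("crash", {"device.name": "pixel 7", "battery.percent": "40"})
```

In practice the expected structure would come from a semantic convention rather than from the first payload observed; the sketch only shows what "consistent for each event name" could mean operationally.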

jack-berg commented 6 months ago

> Therefore, the practical guidance we could provide is that you should use events when consistency is important to your use case.

👍

So define the properties of the output of the event API, and state that it's appropriate to use the event API when these properties fit your requirements:

If these properties fit your requirements, we recommend using the Event API.

To answer the specific question:

> If I'm writing an application, should I consider replacing my log API with the event API?

Probably not, since log records produced from the Event API (generally) won't go to stdout / console like you normally expect with logs. (Maybe this is a hint that we should enhance the log SDK so that Event API logs can be routed back to existing log frameworks?)

tedsuo commented 6 months ago

Great, I agree with the details you listed. It accurately defines the features that OTel events provide.

It's an interesting point about the logging pipeline. I expect that users could remove stdout as a log sink on their logger, then write to stdout from the OTel SDK? They could still leave all of their log processors in place. Feels a little confusing and circular to have the SDK loop back out to their logger for events.

tedsuo commented 6 months ago

I created a PR https://github.com/open-telemetry/opentelemetry-specification/pull/3969. @jack-berg I deviated a bit from what you proposed above because I realized many of those features are present on all LogRecords, not just Events.