jack-berg commented 1 year ago

In the 2/22/23 Log SIG we were discussing use cases for the Event API. It was suggested that some of the ideas could be modeled as regular LogRecords with semantic conventions rather than as events. While this is true, I do believe there are scenarios when using the Event API is more appropriate than the other telemetry signals. I think that by clarifying the properties of Events, we'll reduce confusion for users and improve confidence that the Event API is necessary.

How I see Events

To see where Events are appropriate, let's first examine the characteristics of the various telemetry signals:

Traces
- Consist of hierarchically arranged spans which time some unit of work.
- Contain rich, high cardinality data.
- Cross application boundaries.
- Have first class support for sampling to limit data volume.
Metrics
- Aggregations of individual measurements.
- By reducing the data footprint, you can represent the whole population of measurements.
- Data footprint is a function of the cardinality of the attributes which are recorded. Constraints on cardinality limit attributes to those necessary for analysis.
Logs
- Most commonly a string payload indicating that something happened at a point in time.
- It can be hard to identify a particular class of event, since you're reliant on pattern matching.
- Many useful logs are produced by libraries outside of the application's control. The user is out of luck if the library doesn't use structured logs or doesn't include the contextual data needed.

These signals naturally cover a lot of ground, and if you squint at them just right they can cover even more ground. What types of use cases are left for Events? Events are NOT a good fit if:

- The data is timed or hierarchical. Spans are more appropriate.
- The volume of data is high and it's acceptable to analyze the data in aggregate. Metrics are more appropriate.
- Traditional application logs already exist and have sufficient structure and context for analysis.

Events might be the right tool for the job if:

- It's useful to retain the entire population for analysis. Sampling or aggregation would significantly diminish the usefulness of the data.
- It's useful to have an unambiguous identifier for querying / alerting, e.g. alert when event.domain = foo AND event.name = bar AND attribute.value > 10.
- It's useful to include high cardinality contextual data. Identifiers, stack traces, floating point values, etc. are essential for analysis.
- The logs from traditional log libraries lack the structure or data needed for analysis.
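
To make the querying / alerting point concrete, here is a minimal sketch in Python. It treats events as plain dicts; the layout, the `value` attribute key, and the `matches_alert` helper are all assumptions for illustration and not part of any OpenTelemetry API.

```python
from typing import Any, Mapping


def matches_alert(event: Mapping[str, Any]) -> bool:
    """Return True when an event should trigger the example alert.

    Mirrors: event.domain = foo AND event.name = bar AND attribute.value > 10.
    The dict layout is an assumption made for illustration.
    """
    attributes = event.get("attributes", {})
    return (
        attributes.get("event.domain") == "foo"
        and attributes.get("event.name") == "bar"
        and attributes.get("value", 0) > 10
    )


# Three hypothetical events; only the first satisfies all three conditions.
events = [
    {"attributes": {"event.domain": "foo", "event.name": "bar", "value": 42}},
    {"attributes": {"event.domain": "foo", "event.name": "bar", "value": 3}},
    {"attributes": {"event.domain": "foo", "event.name": "baz", "value": 99}},
]

alerts = [e for e in events if matches_alert(e)]
```

The match on domain and name is an exact comparison, so no pattern matching over a string payload is involved.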

Use cases

What types of use cases fit these criteria? Admittedly, there are fewer concrete uses for events than for spans, metrics, or logs - a testament to the versatility of those signals. However, there are some that don't quite fit into any of those buckets.

There are the original use cases submitted with the Event API OTEP, which include RUM events, Kubernetes events, collector entity events, and mappings from other event systems.

Additionally, events tend to be useful to represent changes of state to a system. Some that come to mind include:

Application lifecycle.
- Start / shutdown. It would be useful to have a definitive marker for applications starting up and shutting down. Sure, there might be a log you can key off of in many cases, but that will vary by framework and makes dashboarding difficult. The event could include all sorts of rich information about the runtime which is too verbose to include in resource attributes. Shutdown events will be tricky from an export timing perspective, but could include details about the cause.
- Environment changes. Some frameworks have key lifecycle state changes that are important to observe. For example, the Spring Framework has a hook for changing the environment at runtime, and for listening for such changes. If such an event occurs, operators would surely be interested in when it happened and what changed.
Changes in data source state.
- Partition reassignment. Various things can trigger the partitions assigned to a Kafka consumer to be updated. These events can provide important context while diagnosing issues.
- Database migration. Tools like Flyway are useful for ensuring applications are running against the expected version of database schemas. Knowing that an application performed a database migration on startup is useful from an observability perspective.
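
As a rough sketch of the start marker idea, the Python snippet below assembles a startup event as a plain dict carrying runtime details. The event name `app.start`, the body keys, and `build_startup_event` are hypothetical names chosen for illustration; nothing here is an existing semantic convention or an OpenTelemetry API call.

```python
import platform
import sys
import time
from typing import Any, Dict


def build_startup_event() -> Dict[str, Any]:
    """Assemble a hypothetical 'application start' event as a plain dict."""
    return {
        "timestamp_unix_nano": time.time_ns(),
        "name": "app.start",  # hypothetical event name, not a convention
        "body": {
            # Runtime details that would be verbose as resource attributes.
            "runtime.name": platform.python_implementation(),
            "runtime.version": platform.python_version(),
            "os.type": platform.system(),
            "argv": list(sys.argv),
        },
    }


startup_event = build_startup_event()
```

Because the name is fixed, a dashboard could key off `app.start` regardless of which framework produced the surrounding log output.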

Another set of use cases is to report data made available from the runtime:

- Detect deadlocks. Java provides tools which can be used to detect deadlocks. As prototyped here, these are very actionable events which would be natural candidates for alerts.
- Java Flight Recorder events. Java emits all sorts of events via Java Flight Recorder. Some are best aggregated into metrics (as we do here), but others would be useful to report as is. These events are a great fit for the OpenTelemetry data model, with a name identifier and a well defined schema that each follows.
- Android runtime events. Splunk has an Android instrumentation artifact which may be donated to OTel. Amongst the instrumentation it provides are crash detection, slow rendering detection, and network change detection, which seem like good candidates for events.

Wrapping things up

Let's come to an agreement on when Events are appropriate to use, and provide this and example use cases as supplementary guidance.

Do other folks think differently about when and how to use Events? Are there more concrete use cases you know of? If so, let's discuss. We've got to come to a shared understanding and articulate that to users.

breedx-splk commented 1 year ago

Client-side use case: Crash Reports

In client instrumentation, we are working toward creating a data model for things like crash reports. When an application crashes, it is valuable to report back much of the device state for troubleshooting purposes. There is a doc that describes some of the existing vendor data models here.

In addition to obvious things that could be trivially flattened into shallow attributes, some implementations like to send information about all running threads. Each thread has thread details (name, id, stack trace), and thus the event model contains an array of thread info objects. This is challenging to represent in the existing event data model without serializing to/from String representations of the thread info.

For example, a demonstrative (non-comprehensive) crash report might look something like this, assuming complex/heterogeneous attributes:

event {
  timestamp: 1693953342,
  attributes: [
    name: 'crash',
    domain: 'device',
    event.data: {   // map with heterogenous values
      device.name: 'pixel 6',
      battery.percent: 92,
      [...other details],
      threads: [ // list of maps
        {
          id: 123,
          name: 'main',
          is_crash_cause: true,
          stack: '<list of newline delimited stack frames>' // alternately rich object representation
        },
        {
          id: 902,
          name: 'scheduledWorker',
          is_crash_cause: false,
          stack: '<list of newline delimited stack frames>' // alternately rich object representation
        },
        [...]
      ]
    }
  ]
}

Alternatives

The above example event is purely demonstrative -- the actual implementation might be several layers deep, especially if thread call stack details (stack frame, class/module, line number, etc.) are available. As an alternative, this might also be flattened, but the result is much more tedious/clunky:

event {
  timestamp: 1693953342,
  attributes: [
    name: 'crash',
    domain: 'device',
    event.data: {   // map with heterogenous values
      device.name: 'pixel 6',
      battery.percent: 92,
      [...other details],
      thread.count: 4,
      thread.ids: [123, 902, 657, 202], // array containing all thread ids
      thread.names: ['main', 'scheduledWorker', 'clickTrap1', 'gc1'], // array containing all thread names
      thread.stacks: ['<stack1 text>', '<stack2 text>', '<stack3 text>', '<stack4 text>'],
      thread.crash_cause_index: 0,
    }
  ]
}

Like all things, there are pros and cons. Pros include reducing the number of times attribute key names appear in the output (e.g. the string "is_crash_cause" only appears once). Cons include increased complexity in reassembling the original structure from the flattened wad of arrays.
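
To give a feel for the reassembly cost, here is a small Python sketch that converts between the nested and flattened thread representations shown above. The key names follow the two examples; `flatten_threads` and `unflatten_threads` are hypothetical helpers, and the sketch assumes at most one thread is marked as the crash cause.

```python
from typing import Any, Dict, List


def flatten_threads(threads: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Turn a list of thread maps into parallel arrays (the flattened form)."""
    # Index of the first thread flagged as the crash cause, or -1 if none.
    crash_index = next(
        (i for i, t in enumerate(threads) if t["is_crash_cause"]), -1
    )
    return {
        "thread.count": len(threads),
        "thread.ids": [t["id"] for t in threads],
        "thread.names": [t["name"] for t in threads],
        "thread.stacks": [t["stack"] for t in threads],
        "thread.crash_cause_index": crash_index,
    }


def unflatten_threads(flat: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Rebuild the list of thread maps from the parallel arrays."""
    rows = zip(flat["thread.ids"], flat["thread.names"], flat["thread.stacks"])
    return [
        {
            "id": thread_id,
            "name": name,
            "is_crash_cause": i == flat["thread.crash_cause_index"],
            "stack": stack,
        }
        for i, (thread_id, name, stack) in enumerate(rows)
    ]


threads = [
    {"id": 123, "name": "main", "is_crash_cause": True, "stack": "<stack1 text>"},
    {"id": 902, "name": "scheduledWorker", "is_crash_cause": False, "stack": "<stack2 text>"},
]
flat = flatten_threads(threads)
restored = unflatten_threads(flat)
```

Every consumer of the flattened form needs the equivalent of `unflatten_threads`, and it silently depends on the arrays staying the same length and in the same order.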

tedsuo commented 7 months ago

It feels like this issue has been resolved? Events are simply logs that have a consistent name and attributes defined by conventions. That's what we landed on. Given that events are logs, I don't believe there can be any confusion between events and spans.

jack-berg commented 7 months ago

When is it appropriate to use the event API vs. a log API which is bridged into opentelemetry via log appenders? If I'm writing an application, should I consider replacing my log API with the event API?

That's the crux of this issue.

Merging #3858 does help make some things clearer, since one litmus test of whether you use events / logs or metrics or traces to capture a particular thing is whether or not it contains complex structured data. If yes, a log record is currently the only choice.

> consistent name and attributes defined by conventions.

I thought event fields weren't attributes? https://github.com/open-telemetry/semantic-conventions/issues/505

tedsuo commented 7 months ago

> When is it appropriate to use the event API vs. a log API which is bridged into opentelemetry via log appenders? If I'm writing an application, should I consider replacing my log API with the event API?
>
> That's the crux of this issue.

That is an interesting question! So interesting that I fear it may be a philosophical bike shed. As a point of process, I suggest that we try to stick to practical guidance, and be wary of any deep dive into what constitutes the fundamental nature of an event vs a log. The fact that events are logs in OTel helps us avoid this debate.

> I thought event fields weren't attributes? open-telemetry/semantic-conventions#505

Sorry, what I meant to say is that semantically speaking, the rule is that both the attributes and the structure of the event payload must be consistent for each event name. (Events can have attributes in addition to a payload if it is helpful to have that kind of metadata.)

At any rate, the point I would like to make is that the only rule we have for events is consistency. We have no rules at all for logs. Therefore, the practical guidance we could provide is that you should use events when consistency is important to your use case. Otherwise, use logs and do whatever you like. Since our goal for defining semantic conventions is to provide consistency, we always use events for that purpose.
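
One way to picture the consistency rule is a check that flags event names whose payloads differ in shape. The Python sketch below is only an illustration under simplifying assumptions: events are plain dicts with `name` and `body` keys, "shape" means the set of top-level body keys, and `inconsistent_event_names` is a hypothetical helper rather than anything defined by the specification.

```python
from typing import Any, Dict, FrozenSet, Iterable, List, Mapping, Set


def inconsistent_event_names(events: Iterable[Mapping[str, Any]]) -> List[str]:
    """Return event names whose payloads do not all share the same keys."""
    shapes: Dict[str, Set[FrozenSet[str]]] = {}
    for event in events:
        # Only top-level keys are compared; nested structure is ignored.
        shape = frozenset(event["body"].keys())
        shapes.setdefault(event["name"], set()).add(shape)
    return sorted(name for name, seen in shapes.items() if len(seen) > 1)


events = [
    {"name": "device.crash", "body": {"device.name": "pixel 6", "battery.percent": 92}},
    {"name": "device.crash", "body": {"device.name": "pixel 7", "battery.percent": 40}},
    {"name": "app.start", "body": {"runtime.version": "17"}},
    {"name": "app.start", "body": {"version": "17"}},  # key differs from the line above
]

problems = inconsistent_event_names(events)
```

With the sample data, only `app.start` is reported, because its two payloads use different keys.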

jack-berg commented 7 months ago

> Therefore, the practical guidance we could provide is that you should use events when consistency is important to your use case.

👍

So define the properties of the output of the event API, and state that it's appropriate to use the event API when these properties fit your requirements:

- Events occur at a specific point in time.
- Events are optionally associated with a trace / span.
- Events are not aggregated (i.e. unlike metrics) and are not hierarchical (i.e. unlike spans).
- Events are built on top of logs, and share aspects of the log data model including severity, AnyValue body, etc.
- Events are recorded in a way that promotes consistency. You should be able to quickly find all events with the same name (i.e. type / class), and benefit from all events with the same name being structurally similar.
- Events route through the OTel log SDK, which lacks features out of the box that popular log frameworks are expected to have (pattern logging, file rotation, rich set of network appenders, etc.).

If these properties fit your requirements, we recommend using the Event API.
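
Read as a data shape, those properties might look roughly like the Python sketch below. `EventRecord` and `events_named` are hypothetical names for illustration only; this is not the OpenTelemetry SDK type and the field names are assumptions.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass(frozen=True)
class EventRecord:
    """Hypothetical shape of an event; not an OpenTelemetry SDK type."""

    name: str                            # identifies the type / class of event
    timestamp_unix_nano: int             # a single point in time, no duration
    body: Any = None                     # arbitrary structured payload
    severity_text: Optional[str] = None  # shared with the log data model
    trace_id: Optional[str] = None       # optional trace association
    span_id: Optional[str] = None        # optional span association


def events_named(events: List[EventRecord], name: str) -> List[EventRecord]:
    """Find all events of one type by exact name match."""
    return [e for e in events if e.name == name]


events = [
    EventRecord("device.crash", 1693953342000000000, {"device.name": "pixel 6"}, "ERROR"),
    EventRecord("app.start", 1693953300000000000),
    EventRecord(
        "device.crash",
        1693953400000000000,
        {"device.name": "pixel 6"},
        trace_id="0af7651916cd43dd8448eb211c80319c",
        span_id="b7ad6b7169203331",
    ),
]

crashes = events_named(events, "device.crash")
```

There is no parent reference and no aggregation step in this shape, which is what separates it from a span or a metric point.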

To answer the specific question:

> If I'm writing an application, should I consider replacing my log API with the event API?

Probably not, since log records produced from the Event API (generally) won't go to stdout / console like you normally expect with logs. (Maybe this is a hint that we should enhance the log SDK so that Event API logs can be routed back to existing log frameworks?)

tedsuo commented 7 months ago

Great, I agree with the details you listed. It accurately defines the features that OTel events provide.

It's an interesting point about the logging pipeline. I expect that users could remove stdout as a log sink on their logger, then write to stdout from the OTel SDK? They could still leave all of their log processors in place. Feels a little confusing and circular to have the SDK loop back out to their logger for events.

tedsuo commented 7 months ago

I created a PR https://github.com/open-telemetry/opentelemetry-specification/pull/3969. @jack-berg I deviated a bit from what you proposed above because I realized many of those features are present on all LogRecords, not just Events.

open-telemetry / opentelemetry-specification

Proposal: provide guidance on when to use Event API #3254
