
Profiling SIG: No place for decodable auxiliary/binary data for further detailed analysis #3979

Open beaubelgrave opened 2 months ago

beaubelgrave commented 2 months ago

Our proprietary profiling format has space for both sampled data, such as CPU samples, and supporting data, like when each thread was readied (and by whom), among many other things. Several existing documented formats/technologies for profiling allow for this. For example, ETW on Windows allows a profiler to get events about CPU sampling, but it can also receive non-sampled data like context switches or page faults as they occur. The perf kernel subsystem (and the matching CLI tool) on Linux also supports capturing both profiling data and supporting information, such as other tracepoints (uprobes, user_events, kprobes, etc.) on the system.

Typically, auxiliary data comes in at a higher rate and doesn't have callstacks (although some of it does). These events don't quite map to a sample linked to addresses. They are more like a set of metadata linked to several instances of binary data that can be decoded after collection using the linked metadata.

I expected to see a way to store this type of auxiliary data within the profile format; however, I don't see a way to store binary data with metadata for decoding it, which would support understanding performance at a deeper level. While this auxiliary data might not be useful for aggregation, it is very useful when a profile comes in that is anomalous and a deep investigation is needed for that specific profile/time period.

For example, when CPU samples show a hot function, we may want to see which branches are being mispredicted or whether cache-line misses are occurring. These events will not have callstacks or even an associated instruction pointer; an event may be linked solely to a CPU at a point in time, and that data needs to be further mixed with other auxiliary data to truly understand the system.

felixge commented 2 months ago

Thanks for raising this. In my mind this use case sounds similar to the one of preserving JFR events that currently don't map well to the profiling spec. Go execution traces are in a very similar situation.

Right now the most natural "extension point" to transport such data as part of OTel is the original_payload field in the spec. However, we're still having debates in the SIG on how flexible we want to be with this field. Generally speaking the OTel architecture would prefer all data to be explicitly converted into OTLP protobuf messages. However, creating a format suitable to hold a superset of JFR/Go Execution Traces/Microsoft's Proprietary Format/etc. is problematic from a complexity as well as efficiency perspective.

That being said, if you have specific ideas for supporting the data you have in mind, please sketch them out here!

beaubelgrave commented 2 months ago

I agree, it's similar to JFR (and CLR) events that don't map well.

Personally, I would prefer to have this within the OTLP protobuf message as a separate section, like the extended pprof is. For efficiency/complexity, I'm leaning toward how Linux has done tracepoints: you have a set of metadata that describes each event (by ID), and then you have data that is simply an event ID + payload. A set of metadata (which could be as simple as a set of KeyValue objects) just needs to be defined per ID.

Typically, the fields in those events are pretty basic data types (string, UTF-8/UTF-16, u16, u32, u64, s16, s32, s64, etc.). The metadata points to an offset within the event binary data that holds that type (the type also defines the length). There is one special type in the Linux tracepoint architecture (__rel_loc/__data_loc) that offers a header for handling variable-length types (string, char, struct, etc.). See this.
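For reference, here is a small Go sketch of how a __data_loc field is conventionally resolved on Linux: it is a 32-bit value whose low 16 bits hold the payload-relative offset of the variable-length data and whose high 16 bits hold its length (__rel_loc differs only in that the offset is relative to the field itself).

    package aux

    // dataLoc resolves a Linux __data_loc field: low 16 bits = offset of
    // the variable-length data within the payload, high 16 bits = length.
    func dataLoc(payload []byte, loc uint32) []byte {
        off := int(loc & 0xffff)
        size := int(loc >> 16)
        return payload[off : off+size]
    }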

An alternative to the metadata -> event ID approach is an entirely self-described payload (we use both approaches in our formats). Check out this.

Depending on the approach, this could be as simple as just an array of byte sequences (the EventHeader approach) or a set of KeyValues linked by event ID + byte sequence.
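To make the shape concrete, here is a minimal Go sketch (hypothetical names, not a spec proposal) of the tracepoint-style split: metadata defined once per event ID, and captured records that are simply event ID + payload bytes.

    package aux

    // FieldMeta could be as simple as a KeyValue set; here it is a
    // tracefs-style description: name, type, offset, and size.
    type FieldMeta struct {
        Name   string
        Type   string // "u32", "s64", "char[16]", ...
        Offset int
        Size   int
    }

    // EventMeta is defined once per event ID.
    type EventMeta struct {
        ID     uint32
        Name   string
        Fields []FieldMeta
    }

    // Record is what gets captured at runtime: the raw payload linked to
    // its metadata by ID, with no re-shaping of the bytes.
    type Record struct {
        EventID uint32
        Payload []byte
    }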

tigrannajaryan commented 2 months ago

We need input from the Profiling SIG on this. Will add it to the Profiling SIG agenda.

felixge commented 2 months ago

@beaubelgrave how do you imagine the proposed encoding working for JFR or Go Execution Traces? Both formats are emitted by the runtimes and are heavily optimized for their respective data payloads. Converting them to an alternate representation is going to cause a significant amount of decoding/re-encoding overhead. The resulting data is likely to be less compact than before.

I'm assuming that for the data you're interested in, you can control the code that is producing the data?

beaubelgrave commented 2 months ago

I don't think you change them at all; you just create a "decoder ring" once for those events. The minimal way to achieve this is to first add metadata that describes where JFR or Go has put the various fields of each event you are capturing. Then, when those events are captured, you simply append them tagged with the appropriate metadata ID.

We don't often control the payload format; however, we do know these formats' details, which is enough to create the metadata. On Linux, this can be found via the tracepoint/trace_event definition within tracefs (/sys/kernel/tracing/events). On Windows, this can be determined from the ETW manifest, or it can be self-described within the event itself (in that case, we'd need the metadata to state that it's using some well-known format instead of giving metadata field descriptions).

For a concrete example, let's take a look at the sched_waking event on Linux, which tells us when a thread is waking up.

You can get the format from /sys/kernel/tracing/events/sched/sched_waking/format:

name: sched_waking
ID: 404
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:char comm[16];    offset:8;       size:16;        signed:0;
        field:pid_t pid;        offset:24;      size:4; signed:1;
        field:int prio; offset:28;      size:4; signed:1;
        field:int target_cpu;   offset:32;      size:4; signed:1;

In the above, each field has a type, name, offset, and finally size. We only need to capture the above format details once; then, when the payload for "sched_waking" is captured, we simply copy the bytes and link them to the above metadata. There is no re-shaping of the data.
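To illustrate, here is a minimal Go sketch (assuming a little-endian capture; field names and offsets are taken from the format file above) of how such a "decoder ring" could decode a copied payload after collection:

    package aux

    import (
        "bytes"
        "encoding/binary"
        "fmt"
    )

    // Field mirrors one line of the tracefs format file.
    type Field struct {
        Name   string
        Offset int
        Size   int
        Signed bool
    }

    // schedWakingMeta is captured once, from the format file (event ID 404).
    var schedWakingMeta = []Field{
        {"comm", 8, 16, false},
        {"pid", 24, 4, true},
        {"prio", 28, 4, true},
        {"target_cpu", 32, 4, true},
    }

    // decode interprets a verbatim payload using the stored metadata; the
    // capture path only copied bytes, so all interpretation happens here.
    func decode(meta []Field, payload []byte) {
        for _, f := range meta {
            raw := payload[f.Offset : f.Offset+f.Size]
            switch {
            case f.Size == 16: // char[16], NUL-padded string
                fmt.Printf("%s=%q ", f.Name, bytes.TrimRight(raw, "\x00"))
            case f.Size == 4 && f.Signed: // int / pid_t
                fmt.Printf("%s=%d ", f.Name, int32(binary.LittleEndian.Uint32(raw)))
            }
        }
        fmt.Println()
    }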

I'm unfamiliar with the JFR and Go events you think would be hard to describe here, but I am familiar with kernel events and user_events on Linux, as well as the CLR runtime events for C#. I believe those can all be described in some metadata block. The hardest case is when dynamically sized data (strings, etc.) sits in the middle of the event (instead of at the end). The metadata format needs the proper language to describe this so it can be decoded properly.

With this approach, the write/capture path is very fast. The decoding can be done later and, depending on the complexity of the events, may be costly. Typically, runtime events are small and don't have dynamic data within them when they come at high rates. However, it could be that Go or JFR didn't take that path.

@felixge Can you share the details of the events you think would be hard to describe in this format?

felixge commented 2 months ago

@beaubelgrave ah, I think I understand your idea better now. But I think that even creating this metadata will be challenging because of the following:

  1. Both formats use LEB-128 encoding.
  2. Both runtimes buffer their events in per-thread buffers before flushing them to the underlying data stream.

That means the data becomes available to user space in batches of events. Adding a metadata ID in front of every event (or in a separate section) requires splitting the batches into individual events. This means that the events have to be LEB128-decoded to some degree, which is not cheap. The next problem is that neither format has a specification, and the runtimes may change the encoding details between minor versions.
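To illustrate the cost, here is a minimal Go sketch assuming a simplified batch layout (a varint/LEB128 length prefix per event, which is not the actual JFR/Go encoding): even the cheapest possible split has to decode every length varint before per-event metadata IDs could be attached.

    package aux

    import (
        "encoding/binary"
        "errors"
    )

    // splitEvents walks a batch and slices out each event payload.
    func splitEvents(batch []byte) ([][]byte, error) {
        var events [][]byte
        for len(batch) > 0 {
            size, n := binary.Uvarint(batch) // LEB128-style length prefix
            if n <= 0 || size > uint64(len(batch)-n) {
                return nil, errors.New("truncated or malformed batch")
            }
            events = append(events, batch[n:n+int(size)])
            batch = batch[n+int(size):]
        }
        return events, nil
    }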

JFR is also already self-describing; it's just really complex.

So I'm a bit unsure whether adding a meta layer over these data sources is a good idea. But I'm open to considering it further!

beaubelgrave commented 2 months ago

Regarding "The next problem is that both formats don't have a specification and the runtime may change the encoding details between minor versions.". I agree on this, perhaps those runtimes should have a stable event documentation like the CLR does.

@felixge Is there an alternative approach you were thinking about?

beaubelgrave commented 2 months ago

One possible approach could be for the metadata to also describe a set of formats. It may be able to describe per-event details, but it could also say format = "JFR", in which case the byte blob from the per-thread buffers is just copied. Tools that understand "JFR" could then parse it.

While not ideal, it would allow storing a mix of well-described data and data that can't be described with basic field/type/offset metadata.
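Extending the earlier sketch (again with hypothetical names, reusing FieldMeta from above), the dual mode could look like this: either per-event field descriptions, or an opaque well-known format whose bytes are passed through verbatim.

    package aux

    // EventMeta describes one event ID; exactly one of Fields or Format
    // is set.
    type EventMeta struct {
        ID     uint32
        Fields []FieldMeta // set when the event is fully described
        Format string      // set for opaque pass-through payloads, e.g. "JFR"
    }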

felixge commented 2 months ago

> perhaps those runtimes should have stable event documentation like the CLR does.

CLR ETW looks nice. What's the strategy when it comes to runtime internals changing and some events no longer making sense?

I'm asking because both JFR and Go are not specifying their format because it's a maintenance PITA for them when it comes to evolving the internals of the runtime and the format.

> @felixge Is there an alternative approach you were thinking about?

Yeah, I was thinking of just using OpenTelemetry as an envelope format for these payloads. So, like you suggest above, just say "format = JFR", and then follow it with the raw data. IMO that can be done for the whole recording without worrying about the batches or other format internals.

This puts the burden of interpreting the data on upstream receivers. E.g. for Go there is an official library. For JFR, similar unofficial (AFAIK) libraries exist.

It's not ideal, but the alternatives are even less appealing IMO. But again, I'm open to ideas.

beaubelgrave commented 2 months ago

> > perhaps those runtimes should have stable event documentation like the CLR does.

> CLR ETW looks nice. What's the strategy when it comes to runtime internals changing and some events no longer making sense?

> I'm asking because both JFR and Go are not specifying their format because it's a maintenance PITA for them when it comes to evolving the internals of the runtime and the format.

@noahfalk, you want to take this? In general, ETW has a version byte that is used to version events if they need to change/append new data. However, I'm unsure about entirely deprecating events.

> > @felixge Is there an alternative approach you were thinking about?

> Yeah, I was thinking of just using OpenTelemetry as an envelope format for these payloads. So, like you suggest above, just say "format = JFR", and then follow it with the raw data. IMO that can be done for the whole recording without worrying about the batches or other format internals.

> This puts the burden of interpreting the data on upstream receivers. E.g. for Go there is an official library. For JFR, similar unofficial (AFAIK) libraries exist.

> It's not ideal, but the alternatives are even less appealing IMO. But again, I'm open to ideas.

I would like a way for technology that is mature enough to have well-described events to be able to represent them clearly in OTel. However, I totally understand the need for some opaque pass-through models as well. I think the metadata format would allow for both. If it's simply pass-through and nothing else, you'd just have a single byte array with a single metadata entry stating format = JFR. For well-described cases, you'd have an array of metadata and binary blobs. I think it can handle both.

noahfalk commented 2 months ago

> I'm asking because both JFR and Go are not specifying their format because it's a maintenance PITA for them when it comes to evolving the internals of the runtime and the format.

> @noahfalk, you want to take this? In general, ETW has a version byte that is used to version events if they need to change/append new data. However, I'm unsure about entirely deprecating events.

CLR has a couple different approaches to this.

  1. We defined two different providers, a public one and a private one. The public one is intended to be relatively stable, the private one is intended for random internal details that might change at any time.
  2. Our events can be versioned in a back-compatible way by increasing a version number and appending new fields to the end. The reader can ignore trailing fields it doesn't understand (see the sketch after this list).
  3. If the runtime changed in a way that makes some old event useless, we could stop generating that event and start generating a new one. No examples of us doing this are coming to mind though, so I'm guessing it has been rare. I wasn't in charge of the event portion until ~5 years ago, so maybe it happened more in the past.
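Here is a minimal Go sketch of the back-compatible rule in item 2, using a hypothetical event layout (not the actual CLR encoding): the reader decodes the fields it knows and ignores any trailing bytes a newer writer appended.

    package aux

    import "encoding/binary"

    // decodeV1 reads the hypothetical v1 fields of an event. A v2 writer
    // may have appended fields after offset 12; the append-only rule makes
    // it safe for a v1 reader to ignore payload[12:].
    func decodeV1(payload []byte) (threadID uint32, heapBytes uint64) {
        threadID = binary.LittleEndian.Uint32(payload[0:4])
        heapBytes = binary.LittleEndian.Uint64(payload[4:12])
        return // any trailing bytes from newer versions are ignored
    }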

I think there are two different levels to the format. The doc you pointed at, Beau, covers .NET-specific semantic conventions defining specific fields and the meaning of those fields for different event types. There is also the Nettrace format, which describes how data for arbitrary events gets encoded in a file. This is CLR's platform-neutral tracing format and might be the analog of JFR, pprof, or the new format being standardized here. CLR can write the same events into the Nettrace format on any platform, or write to ETW on Windows / LTTng on Linux.

beaubelgrave commented 4 days ago

There's also the CTF2 open format that could be used to store both the field descriptions (metadata stream) and the actual payloads (data streams).

felixge commented 3 days ago

> There's also the CTF2 open format that could be used to store both the field descriptions (metadata stream) and the actual payloads (data streams).

Thanks, I studied CTF1.x at some point and concluded it was too complex. I'll take a closer look at CTF2 to see if it's more approachable. Do you think it would be suitable for your needs?

Also thanks for the information on CLR above, it's very interesting.

beaubelgrave commented 3 days ago

Yeah, CTF1 was pretty complex. CTF2 seems vastly simplified (JSON metadata, etc.). It seems suitable for our needs, yes.

felixge commented 3 days ago

Do you know if there are any standalone encoder/decoder implementations of CTF2?

The Babeltrace 2 release notes (Jan 2020) predate CTF2 (March 2024), so I'm unsure if CTF2 is supported these days. I'm also worried about the fact that the project featured 120k lines of C code back then 🙈.

To be considered for OpenTelemetry, there would have to be at least a Go implementation available I think.

beaubelgrave commented 3 days ago

I believe Babeltrace 2 can load CTF2; I'll ask some EfficiOS people to comment here if they have time.

For the encoder, you could look at barectf, which is geared toward low-overhead environments. The generated code is very small, but it is C.

felixge commented 3 days ago

> I believe Babeltrace 2 can load CTF2; I'll ask some EfficiOS people to comment here if they have time.

Thanks.

> For the encoder, you could look at barectf, which is geared toward low-overhead environments. The generated code is very small, but it is C.

Does barectf support CTF2? I found one GH issue that seems to indicate that the answer is no.

compudj commented 3 days ago

> Do you know if there are any standalone encoder/decoder implementations of CTF2?

We are actively working on the CTF2 source/sink within the Babeltrace project. The ongoing work is being reviewed and upstreamed into the babeltrace master branch at the moment. It is planned for release in Babeltrace 2.1 around Q4 2024.

> The Babeltrace 2 release notes (Jan 2020) predate CTF2 (March 2024), so I'm unsure if CTF2 is supported these days. I'm also worried about the fact that the project featured 120k lines of C code back then 🙈.

Babeltrace 2.0 indeed only covers CTF 1.8, because we needed more time to finalize the CTF 2 specification. The final version of the CTF 2 specification was released in March 2024. A few more months are needed to complete the source/sink reference implementation in Babeltrace. We did most of the implementation as the specification was drafted, but wanted to take some time to finalize the Babeltrace API before releasing it.

The 120k lines of C code include all the code needed to expose a C library API, Python bindings, a trace processing graph engine, various filter plugins, and tests. Only a rather small subset is needed if you are only interested in decoding CTF 2 traces. The code to purely parse CTF 2 metadata and decode data streams, excluding Babeltrace 2 specifics, but including all our common C++ utilities, is 21,600 lines of C++ without comments.

> To be considered for OpenTelemetry, there would have to be at least a Go implementation available I think.

We are very much open to discussing the possible ways to either re-implement encoding/decoding natively in Golang, create Babeltrace wrappers to leverage the Babeltrace infrastructure from Golang, or both.

A third possibility is a hybrid solution: have a src.ctf.fs query which accepts a CTF metadata file/string and returns a structured program to decode any data stream of the same trace. This could run as a standalone process distinct from the Go runtime.

Such a program could contain high-level instructions such as (in English here):

• Decode one 32-bit unsigned integer as field id.
• Start decoding a UTF-16BE null-terminated string.
• Save last integral value to slot #37.
• Read value of slot #37 as current variant field tag.
• Set current packet sequence number to last integral value.
• Set current event record class ID to last integral value.
• Start decoding an event record.

and so on.

The main benefit of that approach would be to avoid CTF logic implementation redundancy, including dealing with field class aliases, resolving field locations, and more. Then each language may have its own maximum performance CTF virtual machine without having to interpret the metadata stream.
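To make the idea concrete, here is a minimal Go sketch (with hypothetical opcodes; the real instruction set would come from the src.ctf.fs query described above) of such a CTF decoding virtual machine:

    package ctfvm

    import "encoding/binary"

    // Opcode is one high-level decoding instruction.
    type Opcode int

    const (
        OpReadU32         Opcode = iota // decode one 32-bit unsigned integer
        OpSaveSlot                      // save last integral value to a slot
        OpSetEventClassID               // use last value as event record class ID
    )

    type Instr struct {
        Op  Opcode
        Arg int // e.g. slot number
    }

    // VM executes a decoding program over a data stream. Only this small
    // interpreter, not the CTF metadata logic, is re-implemented per language.
    type VM struct {
        last    uint64
        slots   map[int]uint64
        classID uint64
    }

    func (vm *VM) Run(prog []Instr, data []byte) []byte {
        if vm.slots == nil {
            vm.slots = make(map[int]uint64)
        }
        for _, in := range prog {
            switch in.Op {
            case OpReadU32:
                // byte order is fixed to little-endian for this sketch
                vm.last = uint64(binary.LittleEndian.Uint32(data))
                data = data[4:]
            case OpSaveSlot:
                vm.slots[in.Arg] = vm.last
            case OpSetEventClassID:
                vm.classID = vm.last
            }
        }
        return data // remaining undecoded bytes
    }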

Note that encoding CTF2 traces from Babeltrace is mainly meant for trace conversion. A more important encoding aspect is encoding CTF2 trace data from instrumented applications at high speed. We have the LTTng and barectf projects for that purpose. The encoding of CTF2 traces from LTTng is implemented in a feature branch, which is being reviewed at the moment.

compudj commented 3 days ago

> Does barectf support CTF2? I found one GH issue that seems to indicate that the answer is no.

As far as the barectf project is concerned, we are very much interested in extending it to cover CTF2, but it has not been a priority for our sponsors yet.