open-telemetry / oteps

OpenTelemetry Enhancement Proposals
https://opentelemetry.io
Apache License 2.0

Introduces Profiling Data Model #237

Open petethepig opened 10 months ago

petethepig commented 10 months ago

OTel Profiling SIG has been working on this for a while and we're ready to present our Data Model. So far we've done a number of things to validate it:

We're seeking feedback and hoping to get this approved.


For (a lot) more details, see:

lizthegrey commented 10 months ago

Promising work so far. I am curious what input the go pprof team has on this design for the data format / whether they'd consider native export in this format - if not, why not?

mtwo commented 9 months ago

@aalexand FYI

felixge commented 9 months ago

Promising work so far. I am curious what input the go pprof team has on this design for the data format / whether they'd consider native export in this format - if not, why not?

@mknyszek who works on the Go runtime team joined the Dec 1st 2022 meeting (notes) to give his perspective. He spoke under the disclaimer that he was giving his best guesses, as any decisions would be made by the Go team following their usual processes, but the summary was:

  1. Go is doubling down on pprof as part of the new PGO compiler feature.
  2. The runtime is unlikely to be an early adopter of any new profiling format.
  3. As a workaround, the runtime might consider proposals that would expose profiling data in a more structured format than what is currently being offered. This would allow profiling client libraries to directly export profiling data in any format they want rather than having to go through pprof as an intermediate data format.

For Java we didn't get an official representative to give a statement, but IIRC several Java experts that joined the discussions felt that Java is highly unlikely to ever support a profiling/observability data format other than JFR natively.

Let us know if you have more questions on this. It's definitely an important topic.

rsc commented 9 months ago

Go and pprof are already using a fairly general protobuf-based format. The original, careful design of that format was informed by over a decade of experience with earlier formats, and since then it has had a second decade of refinement in response to real-world use. Inside Google it replaced two competing formats ("old pprof" and "cpprof"), which was an important win for us but also a testament to its overall generality.

It is unclear to me at a glance why OpenTelemetry is defining a new general protobuf-based format instead of using or adapting pprof's. It would be helpful if someone could summarize that, especially since so much of the OpenTelemetry ecosystem overlaps with the Go ecosystem, where the pprof format is well established.

In fact, the OpenTelemetry proto seems obviously forked from the Google proto, although without any acknowledgement and with the Google copyright removed. That only makes the question even more relevant.

What specific design decisions in the pprof format does OpenTelemetry need to be different and why? If there are just a few small issues, then it might make sense to discuss adjusting the pprof format instead of defining a whole new format. As @felixge mentioned below, pprof has a non-trivial ecosystem of profiling tools associated with it, and OpenTelemetry would strongly benefit by joining and growing that ecosystem instead of building an isolated, separate ecosystem.

petethepig commented 9 months ago

I'd like to address some questions and concerns that have been raised about using the pprof format for OTLP profiles:

Divergence of Objectives:

First, it's important to recognize the differing objectives between the pprof format and the proposed OTLP profiles format:

Why Not Use pprof Format for OTLP Profiles?

There are both technical and non-technical reasons behind this decision:

Technical Reasons:

Support for Linking Between Signals: While it is possible to express links between signals in pprof using labels, the format is not optimized for this use-case within the broader telemetry ecosystem.

Support for Timestamps: Similar to the previous point, pprof can handle timestamps, but doing so efficiently would require representing each timestamp as a separate Sample, which is inefficient due to the format's lack of optimization for this use-case. For more context, see this PR from @felixge.

Miscellaneous Improvements: The OTLP profiles format introduces several improvements, such as the use of arrays-of-integers instead of arrays-of-structs representation (which reduces memory allocations), the use of indices instead of IDs to refer to structs (eliminating confusion and reducing payload size), and the use of semantic conventions to define profile types and units, among others.
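
To make the arrays-of-integers point concrete, here is a rough sketch in Go; the field names are illustrative only and are not the actual OTLP schema:

```go
package profilesketch

// Array-of-structs, pprof-style: each sample owns its own slice of location
// ids, so a decoder ends up allocating one slice (and one struct) per sample.
type SampleAoS struct {
	LocationIDs []uint64
	Values      []int64
}

// Arrays-of-integers: samples reference ranges in one flat pool of location
// indices, so a decoder can allocate a handful of large slices per payload.
// Hypothetical field names, for illustration only.
type ProfileSoA struct {
	LocationIndices []uint32 // flat pool of indices into a location table
	LocationStart   []uint32 // sample i uses LocationIndices[LocationStart[i] : LocationStart[i]+LocationLen[i]]
	LocationLen     []uint32
	Values          []int64 // one value per sample (single value type, for simplicity)
}
```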

Non-Technical Reasons:

There are also non-technical considerations at play:

Interoperability and Backwards Compatibility: Interoperability is a key strength of pprof. Tools that read pprof files from a decade ago still work with files generated today and vice versa. Introducing changes to pprof to align with OTLP's objectives could disrupt this longstanding interoperability.

Ability to Make Changes to the Format: While using the same format would be ideal, the divergence in objectives between pprof and OTLP profiles could hinder effective collaboration and progress in introducing profiling support to OpenTelemetry.

Additional Considerations

While this was not raised directly in the comments above, the following pprof-related concerns have come up in past discussions and are worth describing here:

Interoperability between OTLP and pprof

One of the stated requirements in this OTEP is that existing profiling formats (pprof) can be unambiguously mapped to this data model. This means that existing tools that generate pprof files are automatically compatible with any OTLP backend.

It works in the opposite direction as well — any OTLP backend should be able to generate pprof files, although some functionality (like timestamp support) may be limited. This is important because it allows users to use existing tools to visualize profiles generated by OpenTelemetry.
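
As a rough illustration of the pprof-to-OTLP direction, the sketch below parses a pprof file with the existing github.com/google/pprof/profile package and walks its samples; the otlpProfile type and addSample method are hypothetical stand-ins for whatever the OTLP model ends up being, not real API:

```go
package main

import (
	"fmt"
	"os"

	"github.com/google/pprof/profile"
)

// otlpProfile is a hypothetical stand-in for the OTLP profiles data model.
type otlpProfile struct {
	sampleCount int
}

func (p *otlpProfile) addSample(frames []string, values []int64) {
	// A real converter would intern frames into lookup tables and copy values.
	p.sampleCount++
}

func main() {
	f, err := os.Open("cpu.pprof")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	src, err := profile.Parse(f) // handles gzip-compressed pprof transparently
	if err != nil {
		panic(err)
	}

	var dst otlpProfile
	for _, s := range src.Sample {
		var frames []string
		for _, loc := range s.Location {
			for _, line := range loc.Line {
				if line.Function != nil {
					frames = append(frames, line.Function.Name)
				}
			}
		}
		dst.addSample(frames, s.Value)
	}
	fmt.Println("converted samples:", dst.sampleCount)
}
```

The reverse direction would walk the OTLP tables and emit profile.Profile structs, dropping OTLP-only details such as per-sample timestamps.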

Performance implications in Go ecosystem

There are concerns that using OTLP profiles will have a negative impact on performance in the Go ecosystem. We discussed this with the Go runtime team in one of the meetings, and the direction we landed on is that the Go runtime could provide lower-level APIs that would let us avoid the overhead of parsing pprof and re-encoding it into OTLP. FWIW, APIs like that already exist for memory profiles.
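
For reference, runtime.MemProfile is presumably the kind of structured API being alluded to; the snippet below (using the standard retry pattern from the runtime docs) returns records that an exporter could encode into any format:

```go
package main

import (
	"fmt"
	"runtime"
)

// heapRecords returns the current heap profile as structured records,
// without going through the pprof encoder.
func heapRecords() []runtime.MemProfileRecord {
	// Standard pattern: retry with a larger slice until the profile fits.
	var records []runtime.MemProfileRecord
	n, ok := runtime.MemProfile(nil, true)
	for !ok {
		records = make([]runtime.MemProfileRecord, n+50)
		n, ok = runtime.MemProfile(records, true)
	}
	return records[:n]
}

func main() {
	for _, r := range heapRecords() {
		// Each record carries raw values plus the call stack, so an exporter
		// could encode it directly into OTLP (or any other format).
		fmt.Println(r.InUseBytes(), r.InUseObjects(), len(r.Stack()))
	}
}
```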

Addressing Copyright Concerns:

I acknowledge the concerns about copyright regarding the OpenTelemetry proto, which is based on the Google proto. We have been transparent about this relationship from the beginning, and it has been discussed openly in our group discussions and Profiling SIG Meeting Notes. I apologize for the oversight in not providing an official copyright notice in the proto file, and I have opened a pull request to address this issue.


Hope this helps clarify the rationale behind the decision. Please feel free to reach out if you have any questions or concerns. I also updated the OTEP itself to include this information.

cc @rsc, @lizthegrey, @felixge

morrisonlevi commented 9 months ago

What specific design decisions in the pprof format does OpenTelemetry need to be different and why?

I was not involved in this proposal, but here is my take:

One deficiency of Go's pprof format is that Samples hold location ids in an array and it is not deduplicated. This means if you have the same array of location ids for multiple samples, and they differ only on labels, then there will be significant waste. There are many labels that could cause this to happen: span ids, thread ids, timestamps, etc. You can see from the proposed model's Stacktrace that this doesn't happen in the proposal.

Another deficiency is that Go's pprof uses 64-bit ids when 32-bit ids would definitely suffice... at least in cases where you have to generate some kind of id. I would guess that in Go, many of these ids are not generated but are instead stable memory addresses, so picking a 64-bit id wouldn't cost anything from a CPU perspective and the extra memory was considered fine. For some languages, I can imagine this is not possible because the runtime doesn't guarantee that the respective addresses are stable, or that they live long enough for aggregation. Again, I wasn't involved in this proposal, but I can see it went the route of eliminating "ids" generally and instead making them implicit in the structure (the index the item lives at in the table is its id, just like pprof's string table).
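
A toy Go sketch of that pattern (indices as implicit ids, which also de-duplicates identical stacks); the names are mine, not taken from the proposal:

```go
package main

import "fmt"

// StacktraceTable interns whole stacks of location indices: the position in
// the table is the stack's id, so equal stacks are stored exactly once and
// a 32-bit index is enough to reference them from every sample.
type StacktraceTable struct {
	stacks [][]uint32
	index  map[string]uint32 // stack encoded as a string key -> table index
}

func NewStacktraceTable() *StacktraceTable {
	return &StacktraceTable{index: map[string]uint32{}}
}

func (t *StacktraceTable) Intern(locationIdx []uint32) uint32 {
	key := fmt.Sprint(locationIdx) // crude but stable key, fine for a sketch
	if id, ok := t.index[key]; ok {
		return id // same stack seen before: reuse its index
	}
	id := uint32(len(t.stacks))
	t.stacks = append(t.stacks, append([]uint32(nil), locationIdx...))
	t.index[key] = id
	return id
}

func main() {
	t := NewStacktraceTable()
	a := t.Intern([]uint32{1, 2, 3})
	b := t.Intern([]uint32{1, 2, 3}) // e.g. differs only by labels/timestamps upstream
	fmt.Println(a == b, len(t.stacks)) // true 1 — one stored stack, two references
}
```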

Given these two changes and ignoring the others, I already don't see how Go would adopt this. Tools could handle new fields, but the structures are too different to be feasible, I think. At least from my perspective, these two changes are well justified and, multiplied by their potential impact across languages, are already enough reason to fork; but if you want to convince me otherwise and think it's too noisy for here, send me an email. I'm happy to hear more opinions as well as technical arguments.

aalexand commented 9 months ago

this PR from @felixge

One deficiency of Go's pprof format is that Samples hold location ids in an array and it is not deduplicated. This means if you have the same array of location ids for multiple samples, and they differ only on labels, then there will be significant waste. There are many labels that could cause this to happen: span ids, thread ids, timestamps, etc. You can see from the proposed model's Stacktrace that this doesn't happen in the proposal.

Not perfect and has compatibility caveats, but at one point something like this was discussed to address this.

felixge commented 9 months ago

@petethepig thanks for the detailed response!

Why Not Use pprof Format for OTLP Profiles?

There are both technical and non-technical reasons behind this decision:

Technical Reasons:

Support for Linking Between Signals: While it is possible to express links between signals in pprof using labels, the format is not optimized for this use-case within the broader telemetry ecosystem.

What do you mean by "optimized for this use case"? I assume it's about making the signal correlation link more clear to implementers?

Support for Timestamps: Similar to the previous point, pprof can handle timestamps, but doing so efficiently would require representing each timestamp as a separate Sample

You can use a single timestamps: [10, 20, ...] pprof label that assigns multiple timestamps to the same Sample without breaking it up.
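
For concreteness, here is roughly what that looks like with the github.com/google/pprof/profile package; the "timestamps" label key and the unit chosen here are illustrative, not an established pprof convention:

```go
package main

import (
	"os"

	"github.com/google/pprof/profile"
)

func main() {
	f, err := os.Open("cpu.pprof")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	p, err := profile.Parse(f)
	if err != nil {
		panic(err)
	}
	if len(p.Sample) == 0 {
		return
	}

	// Attach several timestamps to one Sample via a numeric label instead of
	// splitting the sample up. "timestamps" is an illustrative key only.
	s := p.Sample[0]
	if s.NumLabel == nil {
		s.NumLabel = map[string][]int64{}
	}
	if s.NumUnit == nil {
		s.NumUnit = map[string][]string{}
	}
	s.NumLabel["timestamps"] = []int64{10, 20, 30} // e.g. offsets since profile start
	s.NumUnit["timestamps"] = []string{"nanoseconds", "nanoseconds", "nanoseconds"}

	out, err := os.Create("cpu-with-timestamps.pprof")
	if err != nil {
		panic(err)
	}
	defer out.Close()
	if err := p.Write(out); err != nil { // Write gzips the encoded profile
		panic(err)
	}
}
```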

which is inefficient due to the format's lack of optimization for this use-case. For more context, see this PR from @felixge.

Having first-class support for timestamps in the schema is nice 👍 . But I closed my PR above because I concluded that optimizing the timestamp representation in pprof wasn't big enough of a win (especially after gzip) over labels to justify the compatibility drawbacks. Your benchmarks seem to have replicated this finding?

Miscellaneous Improvements: The OTLP profiles format introduces several improvements, such as the use of arrays-of-integers instead of arrays-of-structs representation (which reduces memory allocations),

Can you elaborate on the reduced memory allocations? In my experience it's possible to encode pprof with zero allocations. Or are we talking about decoding now?

the use of indices instead of IDs to refer to structs (eliminating confusion and reducing payload size)

Using indices is nice, I agree. But I don't think it matters much for payload size?

and the use of semantic conventions to define profile types and units, among others.

That could be done with pprof as well, right?


I also really like @aalexand's comment for summarizing the differences. I certainly like a lot of them.

Looking forward to more comments from other folks here!

rsc commented 9 months ago

Thanks for the details about pprof vs OLTP. I want to clarify one thing: while Go uses pprof and has done a lot to popularize it, pprof is not "Go's format". At Google, C++ and Java make extensive use of this format as well. The team that maintains pprof primarily works on C++ performance, not Go performance. It's just a good language-neutral format. The Go and pprof teams collaborate when the profile format changes, as of course any two teams sharing a format must, but it's not tied to Go.

The technical details seem addressable if we wanted to collaborate and converge. As one of the co-designers of the current pprof format and the author of the Go pprof runtime support, I think there's a real opportunity here to converge to a single format instead of having to juggle two formats. Both OpenTelemetry and pprof would benefit from interoperability with the other's tooling.

I don't think I understand why the objectives distinctions or non-technical reasons are blockers either.

For the objectives:

On the other hand, OpenTelemetry has a broader mission, encompassing the collection and management of telemetry data across various languages and platforms. This mission emphasizes cross-linking between various signals (traces, metrics, logs, profiles), vendor-neutrality, and extensibility. OpenTelemetry aims to provide not only the wire format and a reference implementation for one runtime but also comprehensive tooling for various languages and runtimes, including client SDKs and collectors.

The pprof format is already used by multiple languages and platforms, and it too has hooks like labels for various cross-linking and extensibility. I understand that OpenTelemetry provides code as well, like collectors and SDKs, but that does not seem like a fundamental reason the code can't use the pprof format.

It's important to note that pprof primarily serves as a file format, with users directly interacting with pprof files. In contrast, OTLP serves as a wire format, and users don't interact directly with OTLP data. Moreover, OTLP format may evolve over time without affecting user experience. OTLP profiles are not intended to replace pprof. Instead, they serve as a means to transport profiling data, linked to other signals, across the wire.

Are you saying here that no OpenTelemetry users would ever store or load an OpenTelemetry profile from a file? They'd never send one from one developer to another as part of diagnosing a problem, and so on? That this profile format is somehow a completely invisible internal detail of OpenTelemetry? I can't reconcile that with the "broader mission" part. And of course the discussion about the Go runtime generating this format natively means that it can't evolve incompatibly without affecting user experience, at least not users on an incompatible version of Go.

It still seems like there would be strong ecosystem reasons to arrange to use a single format. Then runtimes like Go wouldn't have to choose between Pprof and OpenTelemetry. They could just generate the one standard format. That's a win for everyone.

For the non-technical considerations:

Interoperability and Backwards Compatibility: Interoperability is a key strength of pprof. Tools that read pprof files from a decade ago still work with files generated today and vice versa. Introducing changes to pprof to align with OTLP's objectives could disrupt this longstanding interoperability.

Any changes would have to be made in a backward compatible way to provide for a graceful transition, and we do want 2023 pprof to read 2013 profiles, but it is absolutely not a requirement for 2013 pprof to read 2023 profiles. I'm not even sure that's true today.

Ability to Make Changes to the Format: While using the same format would be ideal, the divergence in objectives between pprof and OTLP profiles could hinder effective collaboration and progress in introducing profiling support to OpenTelemetry.

As I noted above, once you have collectors generating this format that may be out of sync with the rest of the tooling, the format is going to have to pay some respect to backwards compatibility in how it evolves. This is true whether the format is shared or not shared with pprof.

For the general pprof notes:

Interoperability between OTLP and pprof

One of the stated requirements in this OTEP is that existing profiling formats (pprof) can be unambiguously mapped to this data model. This means that existing tools that generate pprof files are automatically compatible with any OTLP backend. It works in the opposite direction as well — any OTLP backend should be able to generate pprof files, although some functionality (like timestamp support) may be limited. This is important because it allows users to use existing tools to visualize profiles generated by OpenTelemetry.

This part seems to say that pprof can be converted to OpenTelemetry losslessly and the reverse as well, perhaps with some loss of OpenTelemetry-specific information. That implies they're already very close (which they are). If instead we can arrange to share one format, all this conversion doesn't have to happen at all.

Performance implications in Go ecosystem

There are concerns that using OTLP profiles will have a negative impact on performance in the Go ecosystem. We discussed this with the Go runtime team in one of the meetings, and the direction we landed on is that the Go runtime could provide lower-level APIs that would let us avoid the overhead of parsing pprof and re-encoding it into OTLP. FWIW, APIs like that already exist for memory profiles.

That's an option but again if there was one profile format, all that complexity goes away. And not just for Go but for any language that can already generate pprof format. I'm also skeptical about the lower-level APIs. In general memory profiles tend to be much smaller than CPU profiles, so it is less problematic to provide that data-structure-based API for memory profiles. Maybe it works out now that memories are so large, but it still seems like a waste, especially compared to just generating it directly with the same code that generates pprof.

lizthegrey commented 9 months ago

Just wanted to say that I'm really glad this conversation is happening, thank you @rsc and @petethepig.

tigrannajaryan commented 9 months ago

It still seems like there would be strong ecosystem reasons to arrange to use a single format. Then runtimes like Go wouldn't have to choose between Pprof and OpenTelemetry. They could just generate the one standard format. That's a win for everyone.

@rsc I think this is a great suggestion and can be definitely a win for everyone. We would need to find the right way to make this happen while staying true to OpenTelemetry's vendor-neutrality mission. This is likely in the purview of Otel's Technical Committee and Governance Committee. It would be great to meet and talk if you are interested.

thomasdullien commented 9 months ago

Hey all,

First off, apologies for the long silence -- I was largely out-of-commission for most of this year due to family/health related issues, and was not involved much, so sorry if I am missing some details.

I'd like to chime in on the 'why not use pprof' discussion. In general, I think that the idea of an industry-wide profiling format (that is both good on the wire and good on disk) is a nice one.

That said, the first thing to be clear about is what that means (for Go): profiling serves many different runtimes (often in the same process), and Go is only one of them. There are also use-cases that profiling supports that Go's pprof will not support without additional information (calculating inferred spans from high-frequency profiling stack traces being one example).

So the obvious organisational question that crops up: How would the co-evolution of a format that serves many customers work?

On the technical side, there are a number of things that are important to have in a format that pprof does not currently provide (FWIW, the OTel draft does not provide all of them either yet, as we're still in discussions around it):

(Note before I start: I am not an expert in Pprof, so @rsc and @felixge and everybody else please correct me on any points in which I misunderstand the format)

  1. Support for efficient high-fidelity timestamps. This is important for a variety of latency-relevant use cases (inferring spans being one of many). Those should be a first-order citizen (e.g. placing them in "Label" seems like a terrible idea), but optional.
  2. Support for "location type"/"frame type": If you profile any mixed workloads (e.g. Python calling into native libraries, NodeJS calling into native libraries etc.) you will by necessity require the ability to distinguish between the different runtimes from which you've obtained stack frames. If you look at the kernel commits Meta has been doing, it's pretty clear that they have mixed stack traces for their PHP/HHVM workloads in their tooling (btw, do we have someone from Meta in the discussion?)
  3. The BuildID in the Mapping proto needs to be changed to be a sort-of-hash of parts of the binary. There are too many examples out there that abuse the ELF BuildID in a way that kills any (even probabilistic) notions of uniqueness (or presence, see Alpine). My ideal perspective is we standardize the way to calculate it from a given ELF mapping (my suggestion is to hash the ELF header plus a few deterministically-randomly selected pages; see the sketch after this list).
  4. Having a way of de-duplicating identical stack unwinds would be ideal, as @morrisonlevi mentioned. Particularly when doing high-frequency sampling, it is pretty common to have many samples that end up identical (even more so if you'd ignore the leaf frame of course).
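
Regarding point 3, here is a sketch of one possible scheme; the page-selection rule and constants are made up for illustration and would need to be standardized for real use:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

// syntheticBuildID hashes the ELF header page plus a few deterministically
// chosen pages of the file. The stride rule below is an arbitrary choice for
// this sketch, not a proposed standard.
func syntheticBuildID(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		return "", err
	}
	const pageSize = 4096
	h := sha256.New()

	// Hash the first page, which contains the ELF header (and usually the
	// program headers).
	if _, err := io.CopyN(h, f, pageSize); err != nil && err != io.EOF {
		return "", err
	}

	// Deterministically "random" page selection: a stride derived from the size.
	pages := info.Size() / pageSize
	if pages > 1 {
		stride := pages/8 + 1
		for p := stride; p < pages; p += stride {
			buf := make([]byte, pageSize)
			n, err := f.ReadAt(buf, p*pageSize)
			if err != nil && err != io.EOF {
				return "", err
			}
			binary.Write(h, binary.LittleEndian, p) // mix in the page index
			h.Write(buf[:n])
		}
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
	id, err := syntheticBuildID("/bin/ls")
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```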

Lastly, there is the "wire format vs file format" question: The main difference will be "streaming vs bulk data", and it should be kept in mind that some folks want to stream profiling data continuously from machines, perhaps aggregated for 2-3 seconds on each machine. Is the proposal to just send a pprof file blob? Wouldn't the overhead of resending all the mappings every 2-3 seconds be a bad idea? Are there other side effects from file-format-vs-wire-format that we haven't thought about?

We've had customers that were pretty continuously mapping / unmapping different ELF files, with thousands of mappings in an address space and ~4GB+ of text, and stack traces spanning many files.

My (possibly unfounded) worry is that pprof makes a very Google/Go-specific assumption: stack traces that do not span very large numbers of executables. This is true if you assume a toolchain like Google/Go's that strongly favors a few enormous binaries; it is not true in other scenarios (the most brutal case would be a system where someone uses fine-grained ASLR, but I'll admit that's a very extreme scenario).

There's also an argument to be made that details about the memory mappings are actually pretty uninteresting (aside from "what offset of the original text section was the sample collected"), so they could be optional?

(Also apologies if not all of my worries are addressed by the current Otel draft -- me & my team argued vehemently for the stateful approach, and the shift to non-stateful happened while I was out)

felixge commented 9 months ago

@thomasdullien thanks for your comments. Here are some thoughts from my end:

  1. Support for efficient high-fidelity timestamps. This is important for a variety of latency-relevant use cases (inferring spans being one of many). Those should be a first-order citizen (e.g. placing them in "Label" seems like a terrible idea), but optional.

Using labels for timestamps is definitely sub-optimal, and it's definitely a weakness of pprof today. But after gzip compression it's not as terrible as one might expect. See Efficiency section in this PR.

Anyway, based on the discussion above, it doesn't seem out of the question to evolve pprof in this direction.

  2. Support for "location type"/"frame type": If you profile any mixed workloads (e.g. Python calling into native libraries, NodeJS calling into native libraries etc.) you will by necessity require the ability to distinguish between the different runtimes from which you've obtained stack frames. If you look at the kernel commits Meta has been doing, it's pretty clear that they have mixed stack traces for their PHP/HHVM workloads in their tooling (btw, do we have someone from Meta in the discussion?)

👍 This would probably be a very easy change for pprof. I'd guess putting this info on the Mapping would make sense.

  3. The BuildID in the Mapping proto needs to be changed to be a sort-of-hash of parts of the binary. There are too many examples out there that abuse the ELF BuildID in a way that kills any (even probabilistic) notions of uniqueness (or presence, see Alpine). My ideal perspective is we standardize the way to calculate it from a given ELF mapping (my suggestion is to hash the ELF header plus a few deterministically-randomly selected pages).

👍

  4. Having a way of de-duplicating identical stack unwinds would be ideal, as @morrisonlevi mentioned. Particularly when doing high-frequency sampling, it is pretty common to have many samples that end up identical (even more so if you'd ignore the leaf frame of course).

👍 Yes. But Gzip seems to do a fairly good job "fixing" this problem as well.

Lastly, there is the "wire format vs file format" question: The main difference will be "streaming vs bulk data", and it should be kept in mind that some folks want to stream profiling data continuously from machines, perhaps aggregated for 2-3 seconds on each machine. Is the proposal to just send a pprof file blob? Wouldn't the overhead of resending all the mappings every 2-3 seconds be a bad idea? Are there other side effects from file-format-vs-wire-format that we haven't thought about?

I think this proposal implicitly assumes that clients will typically send profiling data every ~60s, but doesn't mention it anywhere. @petethepig maybe this could be added to the OTEP?

Generally speaking, the larger the profiling interval, the better the efficiency. Using a profiling period of 2-3s would be very inefficient and arguably not practical under the current proposal.

We've had customers that were pretty continuously mapping / unmapping different ELF files, with thousands of mappings in an address space and ~4GB+ of text, and stack traces spanning many files.

That's interesting. I haven't encountered such a situation before. If you think it's a common problem, could you share a bit more about this?

There's also an argument to be made that details about the memory mappings are actually pretty uninteresting (aside from "what offset of the original text section was the sample collected"), so they could be optional?

In my experience pprof tools only use the mapping for displaying profiling info on an instruction level. But that's an optional feature as it only works if the binary is available. So I think clients can certainly omit Mapping details if they want.

tigrannajaryan commented 9 months ago

It still seems like there would be strong ecosystem reasons to arrange to use a single format. Then runtimes like Go wouldn't have to choose between Pprof and OpenTelemetry. They could just generate the one standard format. That's a win for everyone.

@rsc I think this is a great suggestion and can be definitely a win for everyone. We would need to find the right way to make this happen while staying true to OpenTelemetry's vendor-neutrality mission. This is likely in the purview of Otel's Technical Committee and Governance Committee. It would be great to meet and talk if you are interested.

OpenTelemetry Technical Committee discussed this topic today. Here are the decisions made:

thomasdullien commented 9 months ago

Quick warning: this post is written in a semi-brainfried state. I may need to correct myself later. I'll also follow up with a separate post that outlines some concerns I have about the general design we've arrived at.

Using labels for timestamps is definitely sub-optimal, and it's definitely a weakness of pprof today. But after gzip compression it's not as terrible as one might expect. See Efficiency section in this PR.

Oh, shiny, that's a super valuable PR and discussion. Thanks!! My aversion to labels-for-timestamps is one of type safety; I think "labels for timestamps" is a hack at best. I'm also sceptical of the "let gzip sort it out" approach; more on this in a separate reply later. There's a political consideration, even: I consider timestamps to be a sufficiently important feature that if it turns out we can't get them into pprof, we probably don't want to hitch our wagon to that train :-)

Anyhow; I'll read the thread you posted in more detail, and mull this over a bit. More in a future reply.

Anyway, based on the discussion above, it doesn't seem out of the question to evolve pprof in this direction.

True, and I think that this will be key.

  2. Support for "location type"/"frame type": If you profile any mixed workloads (e.g. Python calling into native libraries, NodeJS calling into native libraries etc.) you will by necessity require the ability to distinguish between the different runtimes from which you've obtained stack frames. If you look at the kernel commits Meta has been doing, it's pretty clear that they have mixed stack traces for their PHP/HHVM workloads in their tooling (btw, do we have someone from Meta in the discussion?)

👍 This would probably be a very easy change for pprof. I'd guess putting this info on the Mapping would make sense.

I don't know - Mapping pretty clearly references memory areas, and that isn't necessarily applicable for the mixed frame mechanism. Think about bytecode interpreters (for example the Baseline compiler for a JS engine such as V8) that reside somewhere on the heap; using Mapping here to indicate that seems incorrect. It'd seem cleaner to add "type" as a field to the "Location" message?

  4. Having a way of de-duplicating identical stack unwinds would be ideal, as @morrisonlevi mentioned. Particularly when doing high-frequency sampling, it is pretty common to have many samples that end up identical (even more so if you'd ignore the leaf frame of course).

👍 Yes. But Gzip seems to do a fairly good job "fixing" this problem as well.

Fair, but this approach honestly gives me the creeps. I'll discuss this in the follow-on post.

I think this proposal implicitly assumes that clients will typically send profiling data every ~60s, but doesn't mention it anywhere. @petethepig maybe this could be added to the OTEP?

We should definitely document that assumption. My reasoning for strongly preferring "stream many messages continuously" over "accumulate large memory buffers to compress" is partly driven by trying to push work from the production machines on which the profiler runs to the backend, and partly by avoiding having to keep large buffers of events around.

We've had customers that were pretty continuously mapping / unmapping different ELF files, with thousands of mappings in an address space and ~4GB+ of text, and stack traces spanning many files.

That's interesting. I haven't encountered such a situation before. If you think it's a common problem, could you share a bit more about this?

So this is a FaaS vendor that allows their customers to spin up specialized FaaS (essentially tiny ELFs, I presume) on their servers on the Edge. So if I understand correctly, when the FaaS is invoked, they create the mapping, link it to other areas in the address space, run the FaaS, and (I don't know based on what criteria) unmap again. The net result was thousands of mappings, and lots and lots of executable sections.

So this isn't a common setup, but I don't think we should design a protocol that unnecessarily restricts the design space of people writing their code.

In my experience pprof tools only use the mapping for displaying profiling info on an instruction level. But that's an optional feature as it only works if the binary is available. So I think clients can certainly omit Mapping details if they want.

Well, in the current pprof design, you need the mapping information if you want to do any form of ex-post symbolization, too, as the pprof design does not place an executable identifier into the Location message. So this would force on-prod-machine symbolization, which is another anti-pattern we shouldn't force users of the protocol into.

thomasdullien commented 9 months ago

Hey all,

ok, so the following post will be a bit longer, and I hope I don't upset people, but I think I have a number of concerns with the design that we've so far converged on. The fact that we have converged on something sufficiently similar to pprof to debate whether pprof could do the job reminded me of all the implicit assumptions we've made, and all the things this will imply for implementations.

So I'll outline my concerns here; forgive me if I am re-hashing things that were discussed previously, and also if this isn't the ideal place to document the concerns.

This will be both lengthy and highly subjective; apologies for that.

Philosophy section

Design philosophy for profiling tools

Profiling tools (if timestamps are included) allow the synthetic generation of fine-grained spans (see @felixge's great description in UC1 here). The upshot is that the ability to perform pretty high-frequency sampling on demand has high value: 10000Hz is better than 100Hz in such a scenario. While you won't always need this, you don't want to limit yourself carelessly.

So this means that profiling tools should strive to do their jobs with the minimum number of instructions, data movement, etc. - the more lightweight the data collection is, the more profiling can be afforded given a certain budget, and the increased value does not taper off quickly. This is very different from e.g. metrics; you do not gain much benefit from collecting those at a higher frequency. If you make the collection of a stack trace 3x as expensive, you limit your max frequency to 1/3rd of what it could be, depriving users of value.

Design philosophy for a profiling wire format

In my view, a profiling wire format should be designed as to enable the profiling code on the machines-to-be-profiled to be lightweight in the sense of "working in a slim CPU and memory budget". Whenever sensible, work should be pushed to the backend.

The protocol should also be designed in a way that it can be easily implemented by both userspace runtime-specific profilers, and by eBPF-based kernelspace ones, without forcing one of the two into compromising their advantages.

Concrete section

What's good about using eBPF vs. the old perf way of copying large chunks of the stack?

The big advantage of eBPF is that it allows cheap in-kernel aggregation of measurements (including profiling data). The old perf way of doing stack collections when no frame pointers were present was just dumping a large chunk of stack to disk, and then trying to recover the stack later on (this is a bad idea for obvious reasons).

If your profiling can be done by the OS's timer tick just handing control to some code in kernel space that unwinds the stack in kernel space and aggregates data there without having to round-trip to userspace or copy a lot of data into userspace, you have a pretty lean operation. You walk the stack, you hash the relevant bits, you increment a kernel-level counter or alternatively concatenate a timestamp with the hash into a data structure that userspace can later read (ring buffer, eBPF map, whatnot).

Such a construct is always preferable to having to keep the entire stacktrace around each time it happens, and having to copy it into userspace in its entirety to be processed there. Let's say we have 40 layers of stack; the difference between copying a 16-byte identifier or 40 stack frames isn't trivial.
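
The aggregation being described boils down to something like the following; plain Go maps stand in for what an eBPF profiler would keep in kernel-side maps, and the hashing choice is illustrative:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// stackHash reduces a raw stack (return addresses) to a small fixed-size key,
// which is what gets counted or timestamped instead of the full trace.
func stackHash(frames []uint64) uint64 {
	h := fnv.New64a()
	var buf [8]byte
	for _, pc := range frames {
		for i := 0; i < 8; i++ {
			buf[i] = byte(pc >> (8 * i))
		}
		h.Write(buf[:])
	}
	return h.Sum64()
}

func main() {
	counts := map[uint64]uint64{}   // stack hash -> sample count
	stacks := map[uint64][]uint64{} // stack hash -> one stored copy of the frames

	observe := func(frames []uint64) {
		key := stackHash(frames)
		if _, seen := stacks[key]; !seen {
			stacks[key] = append([]uint64(nil), frames...) // pay the copy only once
		}
		counts[key]++ // or, for timestamped samples, append (key, timestamp) to a ring buffer
	}

	observe([]uint64{0x401000, 0x402000, 0x403000})
	observe([]uint64{0x401000, 0x402000, 0x403000})
	fmt.Println(len(stacks), counts) // 1 unique stack, counted twice
}
```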

Implications of stateless vs. stateful protocol

Early on in the discussion, we proposed a stateful protocol (e.g. a protocol where the individual machine is allowed to send just the ID of a stack trace vs. the entire stack trace, provided it has sent the entire stack trace previously). The reasons for this design in our infrastructure were:

  1. Reduction of network traffic (important particularly when streaming profiling data out of a cloud deployment, but also for high-frequency sampling of Java processes that feature very deep stacks).
  2. Making it "on average unnecessary" for the profiler to copy around a lot of data; if the profiler can determine that it has in recent memory sent the stack trace with this hash, it can immediately stop copying the entire trace around and just handle the ID.

This means that a kernel-based / eBPF-based profiler can simply increment a counter, and be done with it; or send out an identifier and a timestamp. Both are very cheap, but also imply that you don't have to copy lots of data from kernel space to user space (where you also have to manage it, and then potentially gzip it etc.).

If you move to a stateless protocol where you always have to send the entire stack trace, the profiler itself has to move the entire stack trace around, and also pay for the memory of the stack trace until it has been sent.

This is fine if you're sampling at 20Hz. But let's say the average trace has 40 frames and representing a frame takes 32 bytes: a trace is then ~1KB, which is still 20-40 times more data than you'd handle if you were just dealing with an ID. This has immediate repercussions for the frequency at which you'll be able to sample.

You will not only need to copy more data around, you'll also need to run some LZ over the data thereafter; you'll pay when you send the data out etc.

Implications of "file format" vs "streaming protocol"

pprof is a file format, in the sense that it is designed for a scenario where a machine collects a bunch of profiling data over a certain timeframe, and then writes the data into a pprof file, which is self-contained enough that it can be processed elsewhere.

File formats assume that the recipient is a disk, something that cannot or will not do any processing, and just stores the data for further processing in the future. Designing a file format means you expect the creator of the file to keep all the data around to write it in a coherent and self-contained manner.

A streaming protocol can assume some amount of processing and memory capability on the receiving side, and can hence move some work from the writer of the data to the recipient of the data. If your recipient is smarter, you can be more forgetful; if your recipient is smart and aggregates data from many writers, each individual writer can be less reliable.

Another aspect is that file formats tend not to be designed for a scenario where a machine tries to continuously send out profiling events without accumulating many of them in memory.

If we look at the existing OTLP Spec, my cursory reading is that the protobufs do not penalize you much for sending data in a too-fine-grained manner (it's not free, obviously, but it seems to be less problematic than using pprof for a handful of samples).

A spec that is designed for "larger packages of data" forces the profiler to accumulate more events in memory, send them out more rarely, and perform de-duplication by LZ'ing over the data once more. This also has repercussions for the frequency at which you'll be able to sample.

Implications for kernel-based eBPF profilers

The current trajectory of the protocol design is highly disadvantageous to kernel-based eBPF profilers. Instead of being able to accumulate data in the kernel where it is cheap, it forces data to be

  1. Collected in the kernel.
  2. Unnecessarily copied into userspace.
  3. Unnecessarily de-duplicated by running an LZ compressor (gzip or LZ4 or zstd) over it.

This may not matter if you're already in userspace, already in a separate process, and you're already copying a lot of data back and forth between different address spaces; but if you have successfully avoided wasteful copies like that, it's not great. It forces implementers into objectively worse implementations.

At almost every decision point, the protocol opts for "let's push any work that the backend might normally be expected to do onto the profiler on the production machine, because we are fearful of assuming the backend is more sophisticated than a dumb disk drive", leading to more computing cycles being spent on the production machines than necessary.

In some sense, we're optimizing the entire protocol for backend simplicity, and we're doing it by pushing the cost onto the future user of the protocol. This feels backward to me; in my opinion, we should optimize for minimum cost to the future user of the protocol (as there will be many more users of profiling than backend implementers).

Summary

I would clearly prefer a protocol design that doesn't force profilers to copy a lot of data around unnecessarily, and that prioritizes "can be implemented efficiently on the machine-to-be-profiled" and "uses cycles sparingly so as to maximize the theoretical sampling frequency that can be supported by the protocol".

I think that by always optimizing for a "dumb backend", we converged on pprof-with-some-modifications, as we were subjecting ourselves to the same design constraint ("the recipient is a disk, not a server"). I am convinced it's the wrong design constraint.

felixge commented 9 months ago

Thanks for your writeup @thomasdullien, that was a very interesting read.

One question I had while reading was whether or not it's possible for eBPF profilers to aggregate ~60s worth of profiling data (stack traces and their frames) in kernel memory? If yes, I don't understand why you'd have to copy this data to user space all the time? You could just deduplicate stack traces and frames in kernel and only send the unique data into user space every once in a while (~60s)?

Another thing I'm curious about is your perception on the cost of unwinding ~40 frames relative to copying ~1KiB between user space and kernel (assuming it's needed)? Unless you're unwinding with frame pointers, I'd expect the unwinding to be perhaps up to 90% of the combined costs. I'm asking because while I love the idea of building a very low overhead profiling protocol, I think we also need to keep the bigger overhead picture in mind during these discussions. Related: How much memory do eBPF profilers use for keeping unwinding tables in memory?

Edit: Unrelated to overhead, but also very important: using a stateful protocol would make it very hard for the collector to support exporting profiling data in other formats, which is at odds with the design goals of OpenTelemetry; see below.

tigrannajaryan commented 9 months ago

@thomasdullien you may want to add one more factor to your analysis: the Collector. It is often an intermediary between the producer of the telemetry and the backend.

By design the Collector operates on self-contained messages. If a stateful protocol is used, the Collector receiver has to restore that state and turn the received data into self-contained pdata entries so that the processors can operate on them internally. When sending data out, the exporter has to do the opposite and encode the self-contained messages back into the stateful protocol for the next leg (typically the backend, but possibly another Collector). This is possible, but it adds to the processing cost.

We went through this with OTLP and Otel Arrow. OTLP operates with self-contained messages. Otel Arrow was proposed later and uses gRPC streaming and is stateful. I think a similar stateful, streaming protocol is possible to add in the future for profiles.

If you want to go this route I would advise you to make a proposal for a streaming, stateful protocol for profiling data in the form of an OTEP with benchmarks showing what its benefits are and what the penalty (if any) for extra processing in the Collector is.

thomasdullien commented 9 months ago

@thomasdullien you may want to add one more factor to your analysis: the Collector. It is often an intermediary between the producer of the telemetry and the backend.

Good point, I'll read up on it and then get back to you :)

thomasdullien commented 8 months ago

Hey all,

a longer discussion between @jhalliday, @brancz, @athre0z, myself, and some others ensued on Slack, and the consensus was that it would be useful to archive the conversation here:

Thomas Dullien: with regards to input data whose conversions into the different formats we're testing: Do we have a standardized data set? I was thinking about generating some profiles for a 15-20 minute period on a few reasonably beefy machines running a representative elasticsearch benchmark; that'd give us data that involves both native code (kernel etc.) and java, and a little bit of python (...) Jonathan Halliday: Why 15-20 minutes? The other OTel signals buffer a lot less that that on the nodes before pushing to the collector, for which message size is capped at 4MB. Likewise why a beefy machine? Is a small VM not more representative of typical cloud workloads? I'm not necessarily against either of those choices, I'm against 'standardizing' on them without rationale. Thomas Dullien: ah, perhaps both in that case? so 15-20 minutes is relevant because it doesn't only matter how much is buffered locally, but also to estimate the total amount of traffic reduction I think and a "beefy" machine is relevant because we'll need to make sure our protocol works for modern machines, too? like, as core counts increase, so does the amount of profiling data emitted from these machines but yes, it'll make sense to look at data both from small and large machines and see how it trends (I recently started profiling on an 80-core altera ARM box, and the data volumes are clearly different there) Jonathan Halliday Right. Whilst there undoubtedly are 80 core workloads in the cloud, the greater likelihood is that hardware being split into e.g. 20x 4core VMs, so you get more, smaller streams into the collector instead. Likewise the total traffic flow over time depends how you buffer at source - you can compress better if you send fewer, larger messages so an hour of data sent as 1 message will use less bandwidth than 60 messages with 1 minute of data. However, when troubleshooting a perf issue in prod I don't have 60 minutes to waste waiting for telemetry data, I need it RIGHT NOW and that's not even taking into account the problem of memory pressure on node, or the collector if all the nodes are sending it huge messages at once. I also wonder how much sampling is from all cores - running JVM workloads we tend to be sampling a thread at random on each occasion, not stopping the world to sample all simultaneously, because that's painful from a latency spike perspective. With that model the problems of splitting 80 core hardware into 20 VMs are actually worse - you're then running 20 samples per interval as each VM samples separately, instead of 1 from the 80 core box, and you can't aggregate them even at the collector, because they are potentially for different workloads and users. So in some cases perhaps the largest monolith is not the biggest optimization issue to watch out for. I'm all in favor of benchmarking a range of scenarios, but I think some of those you're modelling feel closer to edge cases and I'd prefer a protocol optimized for the more dominant mainstream cases, so yes, please try a range of values for relevant parameters and if you happen to base it on data about e.g. the distribution of core counts your customer population actually deploys so we can empirically determine what a typical use case actually is, then so much the better. 
Thomas Dullien: so the protocol needs to work for people that compute which in my mind includes people like fastly or facebook so the idea that you'll split an 80 core machine into 20 4-core vms isn't a reality, vertical scaling exists and is used for good reasons computing will also further move to more cores, not fewer and if you design a protocol that cannot work properly on a reasonably normal on-prem server, I don't think that's good design work sampling from all cores is the norm for bpf-based profilers, you're not stopping the world to do so, the cores can generate their own timer ticks iirc btw, this brings me to another worry: With the existing 4 megabyte message size limit, and a goal of 1 minute aggregation times, a 64-core or 80-core box will already now require the 1-minute aggregation to be split into smaller pieces. And as you split the aggregations smaller, the current protocol's overhead for re-sending the mappings will go up. as for the issue of buffering on the server: My argument has been the entire time that we should buffer much less, not more, so I think we're in agreement that nobody wants to wait 60 minutes to get their data. anyhow, what I'll do is: I'll wrap my head around the data we have so far in the benchmarks then try to obtain representative data should be able to update everybody on progress on the next meeting perhaps also in the github case (?) Jonathan Halliday: Yes, vertical scaling is a thing and needs to be supported, but not necessarily optimized for. Here in CNCF land it's not the dominant thing. Cloud native workloads tend towards horizontally scaled design. Hardware is adding more cores, the workloads on it ironically using fewer per deployment as fragmentation into microservices gains more mindshare. I recall spending a lot of time making JavaEE app servers scale, especially with GC on large heaps, whereas now the effort is around making Quarkus fit in ever smaller VMs... I'm inclined to agree with you on less buffering. Targeting something around 1-2minutes feels about right, though less may be better. I'll also note that part of the existing proposal includes allowing for the original data (e.g. JFR file) to be included in the message, so the actual space available for the other fields may be substantially less than 4MB. Realistically I think the practical answer is probably sampling less, not raising message size. There is a limit to how much utility you get from the additional samples and at some point it's outweighed by the data handling costs. I'm just not yet clear on where that point is. If enough of our users start abandoning JFR for eBPF because they want all the threads instead of a random one, I guess I'll have my answer. Thomas Dullien: so I think I disagree. K8s clusters tend toward bigger machines, and microservices do not imply VMs; VMs are disproportionately heavily used in the Java space because the JVM and K8s/docker played poorly together in the past. So for anyone that will do whole-system profiling, supporting large machines is a must. The JFR/eBPF juxtaposition is not really applicable. JFR is a JVM-world construct, without a spec outside of a java-centric reference implementation; I am not aware of anyone outside of the Java ecosystem really using it (? -- I may be wrong). People that work on Rust, Go, NodeJS, C/C++ etc. really don't. The precise number of samples you need can be calculated pretty easily; there's good math for how many samples you need to estimate a flamechart with acceptable accuracy. 
Reducing the number of samples then increases the latency that users will have to tolerate until their profiling results are useful. In general, the utility curve for samples is kinda U-shaped: You want enough samples to generate flamecharts in reasonable time for your services, then there is a long trough where additional samples don't provide much value, and then utility goes up again as your sampling rate is so high that it can be used to perform latency analysis (this requires fine-grained timestamps on samples though). All this said: Perhaps it'd have been a good idea to -- before embarking on protocol design -- flesh out the user personas we think the protocol needs to support :slightly_smiling_face:. My view is clearly informed by the infrastructures I have seen/worked with:People running reasonable-sized K8s clusters, possibly even large K8s clusters, with heterogeneous workloads, many different runtimes, many different services, and the desire to not have to do per-service instrumentation. Jonathan Halliday: Hmm. Very interesting and useful discussion. I'm not sure I agree with you on the cluster sizes, but perhaps we're just working from different data. Hardware used to build them is certainly trending larger as chip design is, but the isolation unit, which is the scope for profiling, is trending smaller. Doesn't really matter if it's VMs in the vmware/KVM sense, containers, or some lightweight isolation (firecracker, gvisor etc), or even language runtimes like the JVM, there tends to be a several of them on the physical node and profiling of each is independent because the workloads they are running are. Our user's Java workloads are increasingly K8s (actually OpenShift) and for new build apps more than a handful of cores per isolation unit is unusual. There are legacy workloads shifted to the cloud with a 'just wrap it in a VM' approach and those can be bigger, but the dominant logical deployment unit size is not big as it was a few years ago, even as hardware has grown in the meanwhile. Whilst JFR is very much a Java thing, the design pattern of picking a thread at random to observe is decidedly not. Until eBPF came along it was pretty common for sampling profilers, because when all the kernel gives you is a SIGPROF sized hammer, then everything must be a randomly chosen nail. The optimal number of samples you need is indeed rooted in math, but also somewhat influenced by practical economics. Plus what you're getting is not quite as straightforward to determine in all use cases, because for some cases you can e.g. stitch together samples from different instances of the same app. If you deploy three replicas, do you want to sample from all of them at the same time? With sufficient fidelity that you can reason about individual instance performance, or only just enough that you can work from the aggregate? From one randomly chosen one in some form of rotation? There is a side discussion to be had on dynamic configuration and the ability to push profiler config changes to node that probably need to wait until OpAMP matures, but stepping up the sampling on a suspect node is definitely an interesting direction to reduce costs. Users wanting to avoid having to do per-service instrumentation is missing the point. They do want per-service data and extracting it cleanly from whole-system samples is a pain (and in multi-tenant environments often completely inappropriate). Users will already be deploying the OTel agent on a per-service basis for other telemetry signal types. 
That's the logical deployment model for profiling too. There is no additional problem to using it that way, beyond what they have already accepted when adopting it for other signal types, whereas switching to a whole-system model or mix and matching it with the per-service model is a pain. The aggregation point is at the collector, not the originating physical hardware, because you want to aggregate things that are logically related, not that just happen to be living on the same silicon. Beau Belgrave: This is indeed an interesting discussion, maybe I missed it, but are we really only interested in continuous profiling and aggregations? We (and I assume most folks in production systems) have periods of time where we want to take a single profile and mark it as "bad" or "anomalous", and due to the size of the machines running, it is not cost efficient to always be collecting this data. I'd like to ensure something like the above is possible with the protocol, and I agree that 4MB will be very limiting for these scenarios, especially on large machines that are doing system wide profiling (That's what we do for every machine). Joel Höner Assuming that a “service” would typically correspond to a container, the per-service deployment model that you (Jonathan) are proposing there is quite fundamentally incompatible with any profiler built around eBPF. Any kernel instrumentation with eBPF is always system-wide (per VM) and needs root permissions on the host system. Even if that wasn’t the case: we consider the user experience where you just deploy the profiling agent across your fleet and then get profiling for your entire workload one of the major upsides why you would want to do continuous profiling in general: it gets rid of the classic “oh, surely this service is the one that needs optimization” fallacy and gives you an unbiased overview of what is burning cycles without requiring making any prior assumptions. (edited) Thomas Dullien: Agreed on this being an interesting and useful discussion :slightly_smiling_face: (I sometimes think we should perhaps have these discussions in a Github case, because Slack will not be discoverable, and when people try to understand the why of the spec, it'll be useful for them to have the context). I think the claim that "users want per-service data" is not as clear-cut. Companies like Spotify run regular efficiency weeks where they seek out the heavy hitters in their infrastructure and then optimize, and if you look at the things GWP did at Google (and Facebooks fleet-wide profiler does): There's a tipping point where your heaviest libraries will be heavier than your heaviest application. And unless you can do system-wide profiling and have both the ability to slice per service and to aggregate in total, you won't see your heaviest libraries. The biggest impacts at Google were in optimizing common libraries. It is also exceedingly common to find "unknown unknowns" wasting your CPU time. Without joking, we have a customer that runs an enormous cloud footprint and intends to haggle with the cloud provider about clawing back money for CPU time spent by the cloud providers in-VM agents. And to some extent this loops back to what I said earlier about "perhaps we should've defined our user personas". We didn't, and that leads to us having extremely divergent viewpoints of what the protocol needs to provide... Fleet-wide whole-system profiling is pretty clearly something that is here to stay (and will become more important now that BOLT is upstream in LLVM etc.). 
Parca, Pyroscope, and us are all seeing a good bit of adoption, and if you factor in GWP and Facebook, continuous whole-system profiling is pretty important. Designing and standardizing a protocol should not prescribe a particular implementation; my worry is that we're standardizing on something because it happens to be convenient to implement for some existing tools, while locking in design decisions for the profiler that simply don't serve a large fraction of the profiling use-case. Extracting per-service profiles from whole-system samples has also not been a pain for us (a sample is a sample)... (edited)

Austin Parker: this is a fascinating discussion that I would strongly suggest get recorded in a GitHub issue for posterity :)

Frederic Branczyk: I guess unsurprisingly, I mostly agree with Thomas, since Parca is also mostly a whole-system profiler. A few more datapoints: when I was running OpenShift telemetry (all OpenShift clusters send a decent amount of data about clusters and cluster health back), there were a lot of clusters with very large nodes, so the claim that small nodes are the primary case is simply factually false. I'm not allowed to share numbers, but large nodes are very common. Second, extracting per-service samples from whole-system profiling is not a protocol concern; the storage/query engine should handle this. Inserting service semantics is adding storage/query-engine semantics into the protocol. (edited)

Thomas Dullien: At least Google-internally, even when I was leaving in 2018, the trend was toward big nodes on which to run Borg (read: K8s, even if it's not quite the same) workloads. The reason people running DCs choose big nodes is per-core power efficiency; the container orchestrators then shuffle workloads around between machines. The trend toward large nodes accelerated noticeably between 2011 (when I joined Goog) and 2018 when I left. From what I know of other companies that have a section of their infrastructure on-prem, the setup and trends seem similar (FB, Crowdstrike, Walmart, Spotify, etc.).

Jonathan Halliday: Bigger nodes yes, but smaller pods. To fit with the OTel architecture, the instrumentation runs as part of the guest workload inside the pod, not as part of the platform. The profiler is deployed via the workload language SDK, not with the kubelet. The profiler will never see the entire physical hardware. At most it sees a VM/container (almost) the same size as the hardware, in the case where a single pod occupies the whole node. More usually there will be N heterogeneous profilers, one per pod, perhaps for different languages, and never, ever seeing one another's data. It's unacceptable to break the isolation even in single-tenant clusters, as the other guests may belong to different apps with e.g. regulatory or security requirements. Profiles are always per-pod. If they are whole-system, then the profiler must be intimately connected with the hypervisor/container runtime so it can label, at per-thread granularity, which pod the capture relates to, so that the backend can break profiles out in a way that makes sense for users who want to (or should only be permitted to) inspect their own workloads. That's a very different engineering and maintenance challenge from bundling the profiler with the OTel language SDK.
Frederic Branczyk: I don't understand why that has to influence the wire protocol, though. For what it's worth, the last part is exactly what we do in Parca Agent: we attach Kubernetes metadata and labels to profiling data. It's not easy, but it is certainly possible and actively being done. We have highly regulated customers doing multi-tenancy with labeling and/or routing of data on top of this. On an entirely separate note, and maybe this is just an OTel requirement that we specify a proto, but it appears that most of the OTel project is gravitating towards defining things in Apache Arrow, while the current proposal is in proto. Why not start out with the thing that appears to be the future of OTel?

Thomas Dullien: Is there an official statement that "all of OTel is always app-specific"? Because it's news to me, and then the question is why us, or Parca, or Pyroscope, or anyone that supports perf record is even participating. And fwiw, yes, we label on a per-pod granularity which pod the traces belong to (...). Btw, Jon, could you explain a bit more how whole-system profiling breaks isolation? Because it's not like any service sees any data for any other service. (Perhaps I am confused by things like https://github.com/open-telemetry/opentelemetry-ebpf, which clearly gathers whole-system data anyhow.) Stepping back for the moment: is the argument "the protocol doesn't need to support 64 cores"? :slightly_smiling_face:

Jonathan Halliday: Profiling is per isolation unit (in K8s terms, per-pod), because how and what to observe is a per-deployment choice and independent of what platform that deployment targets. Observation is not something the platform provides for things deployed to it. It provides isolation (as a core feature, not a bug), and that's it. I should be able to move my deployment from bare metal to VM to public cloud and back to private cloud, all without ever breaking my obs instrumentation or having to integrate it with the platform. The platform, particularly on public cloud, should not be observing my workload, in much the same way there should not be a camera in my hotel room unless I put it there myself. Sure, it's possible to make other technical, architectural (or societal/cultural) choices, but these are the ones current cloud design has come up with, and changing them is beyond the scope of our effort here. In time perhaps the expectation will shift to having built-in profiling be an important feature of the platform, but that's not where we are today, or even where we're trying to get to. What we have currently is a user expectation that they deploy the OTel agent with their app and it gives them observability data, regardless of where they deploy that bundle. A profiling standard developed under OTel, which in turn is under the CNCF, should fit the established architectural model of those organizations. That is also why fairly early on we ruled out the model where profiling data is held at the point of origin until needed, and possibly even then queried in situ, rather than being pushed eagerly to a backend. It's a perfectly viable architecture with some attractive qualities, particularly in bandwidth saving, but it's alien to the cloud model of 'things are stateless by default' and the OTel model of 'data is periodically shunted from the agent to the collector'. Making a pod stateful just for observability data is not worth the pain of the mismatch it creates. Making the collector stateful just for the profiling wire protocol is likewise not worth the pain.
Breaking isolation, or platform independence of deployments, is not worth the pain. If we want to ever ship something, trying not to swim against the tide may make life easier.

Thomas Dullien: FWIW, I think there's a lot of "citation required" in what you are writing. I'll respond in more depth after lunch :slightly_smiling_face:.

Jonathan Halliday: yeah, I should prefix every other sentence of it with 'It is my opinion that...', but I'd need a keyboard macro to automate it or I'd get very tired of typing :slightly_smiling_face: Some of it is also deliberately provoking debate, because we could use a bit more of that, and likewise provoking others to prove me wrong with hard data, because that's even better. If it's demonstrated empirically that e.g. a stateful protocol would save ${x}% bandwidth, I'm still not necessarily going to agree it's worth the pain, but I'm going to be very happy indeed to know what the value of x actually is.

Thomas Dullien: Awesome, thanks for that clarification. I don't think the case that "profiling is per pod" is clear-cut. I can deploy system-wide profilers like ours into my K8s cluster without problem, and I can have these K8s clusters live on bare metal, a VM, public cloud, private cloud, etc. -- and the benefit is that the person running the cluster gets full observability for all workloads. There is zero breaking of obs instrumentation, no special integration with the platform, and no breaking of inter-app isolation. In general, there are three "user personas" for profiling data: (1) SREs doing troubleshooting, (2) app developers optimizing their own app, and (3) platform/performance teams that manage the overall efficiency of the fleet. For both persona (1) and persona (3), fleet-wide system-level profiling is more valuable than per-app profiling, and these are important user personas. Google has done fleet-wide profiling since 2011, Facebook a little later, commercial products have seen adoption since 2021, and I am not sure how you intend to provide data about what the wider computing community expects from profiling. There are strong reasons for fleet-wide, system-wide profiling as soon as your footprint exceeds a certain size, and large benefits on both the economic and the ecological level. The established architectural model of the CNCF involves a service mesh; if Cilium or any eBPF-based service mesh fits the CNCF model, so does a cluster-wide profiler. When we discuss stateful vs. stateless, we really need to come up with a clear definition of what we even mean by that. APM traces aren't stateless in OTel if we define stateless as "the data doesn't need to be reassembled from multiple sources to be useful". Perhaps better terminology for what we're discussing is "send full inline stack traces vs. send fingerprints of stack traces"; this more accurately reflects the question. In the broader sense, I'd like to decouple the question of "full inline stack traces vs. fingerprints" from the broader discussion of faults in the current protocol spec.
When I pointed out that the current spec won't be able to support 1-minute aggregation of data on a beefy host, and that the current protocol spec will add much more overhead once you have to fragment messages more, I didn't necessarily mean this as a strong argument for fingerprints, but rather as a "oh wow, this protocol needs more eyes, more effort, better-documented and wider benchmarks, otherwise we run the risk of standardizing a dog" (or perhaps not a dog, but a protocol with pitfalls and bad assumptions/corner cases that could have been avoided). (edited)

Joel Höner: It's also perhaps notable at this point that profiling traces for native applications will always inherently be stateful in the sense that you'll typically not be able to assign any meaningful symbol information until they are ingested and processed in the observability system. The alternative requires shipping all executables with debug information, which is very inefficient in terms of disk and bandwidth use.

Thomas Dullien: I think we really need to define what we even mean by "stateful" :slightly_smiling_face: because we're all using that word, and I am not sure we agree on its meaning.

Jonathan Halliday: Stateful to me means that, to process the data in the same way it does for other signals, the collector needs to hold (in memory, since it doesn't necessarily have disk) more than one message worth of data. Or to put it another way, it can't necessarily handle one message without knowledge of prior messages or reference to external data. With that comes a larger footprint and headaches with crash/restart and load balancing, none of which the collector currently worries about and all of which add undesirable engineering effort or deployment cost. The bit we haven't really settled is: what processing should the collector be expected to do? If it's pure pass-through, it doesn't care whether the stream tunneling through it is stateful or stateless; it just passes it along. If it's adding metadata, it likewise probably doesn't need to care much. If it's trying to do really strict security filtering, it cares a lot. Personally I think the sweet spot is 'can annotate or filter based on OTel metadata such as KeyValues, per sample', which is why I'm in favor of an API that exposes and handles the Sample as an annotated entity (which is not necessarily the same as the way the proto encodes it) but doesn't expose the trace data in that sample. As such, I don't much care whether the trace in that sample is symbolized or otherwise resolved via some previously sent hash lookup table, nor is that encoding exposed to the collector as pdata, so it can't tell and won't care whether extra information was required to resolve it. In keeping with my previous 'don't deviate from the existing architectural choices' position, it's worth noting this ignores my own advice. For other OTel signal types, the collector has near 100% visibility into the data, at least to whatever extent e.g. string pattern matching is sufficient. This gives it a lot of expressive power to manipulate the data. I'm willing to take the deviation on the basis that keeping strictly in line with the existing model would require a lot more work and potentially use a lot more resources at runtime, with no apparent use case beyond 'because that's just how it's done'.
Even then, as noted, without symbolization you're limited to matching on things that aren't really meaningfully matchable, much like trying to filter compressed strings without decompressing them. Where this is going to break is if you have a backend that is non-stateful but the data flowing into the collector is stateful: then the collector needs to buffer and transform the stream before forwarding it. Likewise in the other direction, but I anticipate the dominant case will be using the OTel protocol between the originating node and the collector, then a foreign export protocol to a backend that doesn't speak native OTel yet. For that you'll need the collector to host an exporter that can act as a pseudo-backend to reassemble the stream, which brings its internal pipeline back to being stateful overall even if its transformation stages aren't. At least that means the problem is limited to just the users wanting to do the transform, not all users of the collector.
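To make the "annotate or filter per sample without seeing the trace" idea a bit more concrete, here is a minimal, purely hypothetical Go sketch. The `Sample` type and `FilterSamples` function are illustrative placeholders, not the collector's actual pdata API: the point is only that per-sample attributes are visible while the trace itself stays opaque.

```go
package main

import "fmt"

// Sample is a hypothetical, simplified stand-in for a collector-side view of
// one profiling sample: OTel-style attributes are visible, but the stack
// trace itself is just an opaque reference.
type Sample struct {
	Attributes map[string]string // e.g. k8s.pod.name, service.name
	Value      int64             // e.g. CPU sample count
	TraceRef   []byte            // opaque: inline frames or a fingerprint
}

// FilterSamples drops samples whose attributes match a deny predicate,
// without ever looking inside TraceRef. This is the "annotate or filter
// based on OTel metadata, per sample" sweet spot from the discussion above.
func FilterSamples(in []Sample, deny func(map[string]string) bool) []Sample {
	out := in[:0]
	for _, s := range in {
		if !deny(s.Attributes) {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	samples := []Sample{
		{Attributes: map[string]string{"service.name": "checkout"}, Value: 10},
		{Attributes: map[string]string{"service.name": "internal-agent"}, Value: 3},
	}
	kept := FilterSamples(samples, func(attrs map[string]string) bool {
		return attrs["service.name"] == "internal-agent"
	})
	fmt.Println(len(kept), "samples kept")
}
```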

This is the current state of the discussion. The most important point of agreement for me is: we need to precisely define what computation the collector is expected to perform on the data as it comes in. I also think we should replace the word "stateful" in our discussion with terminology around "self-contained messages", because it is more precise.

As a next step, for the meeting on Thursday, I will provide some data on the implications of the gRPC message size limit combined with the desire for self-contained messages, because I see a lot of problems arising from precisely this interaction.

thomasdullien commented 8 months ago

Joint work with @christos68k and @athre0z. Hey all,

as promised, a bit more data/revisiting of the discussion around self-contained messages vs. non-self-contained messages.

The goal of this post is to document the exact trade-offs involved in a self-contained vs. a non-self-contained protocol design.

Bottom line up front (BLUF) - details below

Before we get into the meat of things, I noticed a small design issue with the current proposal:

How to represent two versions of the same Java (or other HLL) code in the current proposal?

The current proposal does not have an obvious way of distinguishing between two versions of the same HLL code. For native code, the mappings contain a build ID, and according to our discussions, this build ID should be a hash of some parts of the .text section for executables, so this will distinguish different versions of the same native code. For HLL code, there is no .text section in memory, and AFAICT the only way to distinguish two versions of the same Java method in memory is by hashing the bytecode (which is what we do). Reflecting the hash of the bytecode in the pprof-style proto design isn't trivial though -- one could use the Mapping fields, but this would mean that you end up creating a separate Mapping per Java method that is hit, and Mappings have to be re-sent on each collection interval. We should discuss how to address this design issue, and where to best add extra information that allows distinguishing two different versions of the same HLL code.
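For illustration only, here is a hedged Go sketch of the workaround described above, built on the `github.com/google/pprof/profile` package: hash a method's bytecode and place it in a per-method `Mapping.BuildID`. The helper names are made up; the sketch exists to show why this pattern forces one Mapping per hot HLL method, which then has to be re-sent every interval.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"

	"github.com/google/pprof/profile"
)

// buildIDForMethod hashes the JIT'd method's bytecode so two versions of the
// same Java method get different identifiers, mirroring how native build IDs
// are derived from (parts of) the .text section.
func buildIDForMethod(bytecode []byte) string {
	sum := sha256.Sum256(bytecode)
	return hex.EncodeToString(sum[:])
}

// mappingForMethod shows the workaround under discussion: abusing a pprof
// Mapping per HLL method, with the bytecode hash as BuildID. The cost is one
// Mapping per hot method, re-sent on each collection interval.
func mappingForMethod(id uint64, className, methodName string, bytecode []byte) *profile.Mapping {
	return &profile.Mapping{
		ID:      id,
		File:    className + "." + methodName, // no real file backs JIT'd code
		BuildID: buildIDForMethod(bytecode),
	}
}

func main() {
	m := mappingForMethod(1, "com/example/Foo", "bar", []byte{0x2a, 0xb7, 0x00, 0x01, 0xb1})
	fmt.Println(m.File, m.BuildID[:12])
}
```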

The test machine and workload

I am worried about us designing a protocol that can't deal with a real-world server running a whole-system profiler, so for this test setup we are running tests on an Altera ARM server with 80 cores. This generates a fair bit of data, but there are 224-core machines out there, so this isn't even near the top of the line, and we can expect a top-end server to generate 2.5x+ more data. Workload-wise, we tested an Elasticsearch benchmark (esrally with http_logs) running locally, plus a run of some multicore workloads from the Renaissance JVM benchmark suite. We focused on the JVM because in our experience JVM workloads generate more profiling data than most other workloads: the stacks are deeper than in many other languages; the function names, file names, and class names are longer; and all of the work of processing this data has to happen on the production machine.

What we did for testing

We did three things:

  1. Converting our existing profiling data into the pprof format (work in progress). Once this is done, we will also be able to break out more statistics about which fields in the protobuf contribute how much, which will hopefully help us identify areas for improvement. I hope to have this working by the meeting tonight, if not, tomorrow. (A generic sketch of what such a conversion looks like follows this list.)

  2. Hacking a self-contained version of our existing protocol, comparing traffic both compressed and uncompressed (thank you Christos, this has been invaluable). The protocol spec is described here: https://docs.google.com/document/d/1oh5UmV5ElQXIKG6TX0jvvkM--tRUoWrtZ7LXlJUNF1s/edit

  3. Estimating the additional gains we'd obtain for a "non-self-contained protocol which separates out the leaf frame".
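To make point 1 above more concrete, here is a generic, hedged sketch of converting agent-side stack traces into the pprof format with the `github.com/google/pprof/profile` Go package. This is not the actual converter referenced above (that code is still work in progress); the `Frame` type and `toPprof` helper are illustrative, and mappings, inlined frames, and labels are deliberately omitted.

```go
package main

import (
	"os"
	"time"

	"github.com/google/pprof/profile"
)

// Frame is a minimal stand-in for one frame of an agent's stack trace.
type Frame struct {
	Function string
	File     string
	Line     int64
}

// toPprof converts aggregated (stack trace -> sample count) data into a pprof
// profile. Illustrative only: no mappings, inlined frames, or labels.
func toPprof(traces map[string][]Frame, counts map[string]int64, period time.Duration) *profile.Profile {
	p := &profile.Profile{
		SampleType: []*profile.ValueType{{Type: "samples", Unit: "count"}},
		PeriodType: &profile.ValueType{Type: "cpu", Unit: "nanoseconds"},
		Period:     period.Nanoseconds(),
		TimeNanos:  time.Now().UnixNano(),
	}
	funcs := map[string]*profile.Function{}
	var nextID uint64 = 1
	for key, frames := range traces {
		s := &profile.Sample{Value: []int64{counts[key]}}
		for _, f := range frames {
			fn, ok := funcs[f.Function+"|"+f.File]
			if !ok {
				fn = &profile.Function{ID: nextID, Name: f.Function, Filename: f.File}
				nextID++
				funcs[f.Function+"|"+f.File] = fn
				p.Function = append(p.Function, fn)
			}
			loc := &profile.Location{ID: nextID, Line: []profile.Line{{Function: fn, Line: f.Line}}}
			nextID++
			p.Location = append(p.Location, loc)
			s.Location = append(s.Location, loc)
		}
		p.Sample = append(p.Sample, s)
	}
	return p
}

func main() {
	traces := map[string][]Frame{
		"t1": {{Function: "leaf", File: "Leaf.java", Line: 42}, {Function: "main", File: "Main.java", Line: 7}},
	}
	p := toPprof(traces, map[string]int64{"t1": 128}, 50*time.Millisecond)

	f, err := os.Create("converted.pb.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := p.Write(f); err != nil { // writes gzip-compressed pprof wire format
		panic(err)
	}
}
```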

Exceeding the default gRPC message size limit (possibly by 32x)

It's almost certain that the current design will exceed the default gRPC message size limit. For uncompressed self-contained messages, our preliminary testing showed ~3 megs of uncompressed data for a 5-second interval on our 80-core machine running at approximately 60% load. This implies that a 1-minute interval may reach 60 megs, and if you factor in full load and higher core counts, 128 megs is well within reach. Caveat: this is our internal self-contained protocol, not the current proposal. I am still converting our own profiling data into the proposed format to get a better estimate of the average size of a stack trace for our workload in that format, and expect to have it later today or tomorrow, but the per-trace overhead alone is already ~100 bytes before any frame data is added.
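For context on the limit being discussed: gRPC's default maximum receive message size is 4 MiB. It can be raised, but that only trades the hard error for extra memory pressure on the receiver. Below is a small, hedged Go sketch of how a hypothetical agent/collector pair might raise the limits using the standard grpc-go options; the 128 MiB figure simply mirrors the upper estimate above and is not a recommendation.

```go
package main

import (
	"google.golang.org/grpc"
)

// newServerAndClientOpts sketches raising gRPC's default 4 MiB receive limit
// on the server and the send limit on the client. Bigger limits do not make
// huge messages cheap; they just stop them from being rejected outright.
func newServerAndClientOpts() (*grpc.Server, []grpc.DialOption) {
	const maxMsgSize = 128 << 20 // 128 MiB: the upper end estimated above

	srv := grpc.NewServer(
		grpc.MaxRecvMsgSize(maxMsgSize),
	)
	dialOpts := []grpc.DialOption{
		grpc.WithDefaultCallOptions(grpc.MaxCallSendMsgSize(maxMsgSize)),
	}
	return srv, dialOpts
}

func main() {
	srv, _ := newServerAndClientOpts()
	defer srv.Stop()
}
```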

Network traffic, compressed and uncompressed, empirical results

In our testing, the difference in network traffic after compression is a factor of approximately 2.3-2.4x: for every 100 megs of compressed profiling traffic from the non-self-contained protocol, the self-contained protocol generates 230-240 megs after compression does its magic.

How much magic is the compressor doing here?

time="2023-11-02T15:08:08.546674827+01:00" level=warning msg="NON-SELF-CONTAINED  OUT: 32837007  270815603 8.25"
time="2023-11-02T15:08:08.546716588+01:00" level=warning msg="SELF-CONTAINED OUT: 77053361 1983014544 25.74"

The compressor achieves a roughly 25x reduction here: we compress 1.9 gigs of data down to 77 megs of traffic. FWIW, this will also be reflected on the backend: you'll need to process vastly more data; the uncompressed-data ratio, on both the backend and the individual hosts, is about 7.3x.
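For clarity, the ratios quoted above can be reproduced directly from the two log lines, reading the numbers as compressed bytes out, uncompressed bytes in, and compression ratio. The short Go snippet below just redoes that arithmetic.

```go
package main

import "fmt"

func main() {
	// Values taken from the two log lines above:
	// compressed bytes sent, uncompressed bytes fed to the compressor.
	nonSelfContainedOut, nonSelfContainedIn := 32837007.0, 270815603.0
	selfContainedOut, selfContainedIn := 77053361.0, 1983014544.0

	fmt.Printf("compression ratio, non-self-contained: %.2fx\n", nonSelfContainedIn/nonSelfContainedOut) // ~8.25
	fmt.Printf("compression ratio, self-contained:     %.2fx\n", selfContainedIn/selfContainedOut)       // ~25.7
	fmt.Printf("on-the-wire overhead (compressed):     %.2fx\n", selfContainedOut/nonSelfContainedOut)   // ~2.35
	fmt.Printf("data-to-(de)compress overhead:         %.2fx\n", selfContainedIn/nonSelfContainedIn)     // ~7.3
}
```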

Further reductions from separating out leaf frames

The above non-self-contained design hashes the entire stack trace. Most of the variation between stack traces is in the leaf frame, so the design can be improved further by separating the leaf frame out of the trace and hashing "everything but the leaf frame". This cuts the number of traces that need to be sent to roughly 0.67x for this workload. It means that for ~67 megs of profiling traffic in the non-self-contained protocol, the self-contained protocol still sends about 230-240 megs, so the likely network traffic overhead is about 3.5x post compression.
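A hedged sketch of the leaf-frame separation idea, with made-up types: hash every frame except the leaf, so that traces differing only at the leaf collapse onto a single stem fingerprint, and only small (stem hash, leaf, count) tuples need to be sent repeatedly.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// Frame is a minimal stand-in for one frame of a stack trace; a real profiler
// would carry a file/mapping identifier plus an address or line.
type Frame struct {
	FileID  uint64
	Address uint64
}

// stemHash fingerprints everything *except* the leaf frame (frames[0]).
// Because most of the variation between traces is in the leaf, many sampling
// events share one stem, whose frames only need to be sent once.
func stemHash(frames []Frame) [sha256.Size]byte {
	h := sha256.New()
	buf := make([]byte, 8)
	for i, f := range frames {
		if i == 0 {
			continue // the leaf frame is sent separately
		}
		binary.LittleEndian.PutUint64(buf, f.FileID)
		h.Write(buf)
		binary.LittleEndian.PutUint64(buf, f.Address)
		h.Write(buf)
	}
	var sum [sha256.Size]byte
	copy(sum[:], h.Sum(nil))
	return sum
}

func main() {
	a := []Frame{{1, 0x10}, {1, 0x20}, {1, 0x30}}
	b := []Frame{{1, 0x18}, {1, 0x20}, {1, 0x30}} // differs only in the leaf
	fmt.Println(stemHash(a) == stemHash(b))       // true: same stem, two leaves
}
```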

The ratio of uncompressed data to process grows to about 10x (the 7.3x from before divided by the ~0.67-0.68 factor, i.e. roughly 10.7x).

Summary of the data:

Next step

thomasdullien commented 8 months ago

Ok, I think I converted data from our example workload to pprof. I took a few shortcuts (e.g. I skipped inlined frames, which account for about 10% of all frames), and I end up with an average cost per trace / cost per sample of 180-200 bytes.

When sampling at 20Hz per core, we see somewhere between a 2x and a 4x reduction from sampling events to distinct traces, e.g. for 100 sampling events you get between 25 and 50 different stack traces (often differing only near the leaves).

Note: This data does not attach any labels, metadata, etc. to any samples so it may be underestimating the real size by a bit. I'd wager that 200 bytes will be near the norm for Java workloads, with things like heterogeneity of the workload and some parts of the code adding variance. I suspect it'll be rare to end up with more than 400 bytes per trace, unless someone begins to attach stuff like transaction IDs to samples.

This means our message size will be approximately as follows:

| Cores | Max samples per minute | Max msg size (200 bytes/sample) | Max msg size (400 bytes/sample) | Discounted by 0.6 (load) | Discounted by 0.5 (dedup) |
|------:|-----------------------:|--------------------------------:|--------------------------------:|-------------------------:|--------------------------:|
| 64    | 76,800                 | 14.65 mbytes                    | 29.3 mbytes                     | 17.6 mbytes              | 8.7 mbytes                |
| 80    | 96,000                 | 18.31 mbytes                    | 36.62 mbytes                    | 21.97 mbytes             | 10.99 mbytes              |
| 128   | 153,600                | 29.3 mbytes                     | 58.6 mbytes                     | 35.16 mbytes             | 17.57 mbytes              |
| 256   | 307,200                | 58.6 mbytes                     | 117.18 mbytes                   | 70.31 mbytes             | 35.15 mbytes              |

This means - as discussed in the meeting - it will be possible (though very unlikely) for a single message to reach 100+ megs on a large machine, and it is likely that we will routinely exceed 8-10 megabytes on reasonably common machine configurations.
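For transparency, the table's arithmetic can be reproduced as follows, assuming 20 Hz per-core sampling over a 60-second interval, 400 bytes per sample for the upper bound, and the same 0.6 load and 0.5 dedup discounts (the 200-byte column is just half of the raw figure):

```go
package main

import "fmt"

func main() {
	const (
		hz             = 20  // sampling frequency per core
		seconds        = 60  // reporting interval
		bytesPerSample = 400 // upper estimate from above; use 200 for the lower bound
		loadFactor     = 0.6 // machine is not fully busy
		dedupFactor    = 0.5 // many events share a stack trace
	)
	for _, cores := range []int{64, 80, 128, 256} {
		samples := cores * hz * seconds
		raw := float64(samples * bytesPerSample)
		fmt.Printf("%3d cores: %6d samples, %7.2f MiB raw, %7.2f MiB after load, %7.2f MiB after dedup\n",
			cores, samples, raw/(1<<20), raw*loadFactor/(1<<20), raw*loadFactor*dedupFactor/(1<<20))
	}
}
```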

I think the really crucial thing to understand as a next step is: what is the memory overhead of converting a gRPC proto into its in-memory representation?

My big worry is that big messages of this type can easily lead to a lot of memory pressure on collectors, and my expectation of a good protocol design is that a single collector can easily service hundreds of machines being profiled (the protocol shown by Christos above easily dealt with hundreds to a thousand-plus nodes and tens of thousands of cores on, IIRC, 2-3 collector processes -- I'd have to dig up the details though).
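One way to start answering the memory-overhead question, sketched below under the assumption that a serialized pprof profile is a reasonable stand-in for an OTLP profiles message: decode a captured profile and compare heap growth against its on-wire size. This is a rough measurement idea, not a benchmark harness, and a GC during decode will skew the numbers.

```go
package main

import (
	"fmt"
	"os"
	"runtime"

	"github.com/google/pprof/profile"
)

// heapDelta returns the (rough) heap growth caused by running f.
func heapDelta(f func()) int64 {
	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)
	f()
	runtime.ReadMemStats(&after)
	return int64(after.HeapAlloc) - int64(before.HeapAlloc)
}

func main() {
	path := os.Args[1] // e.g. a .pb.gz pprof file captured from an agent
	raw, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}

	var p *profile.Profile
	delta := heapDelta(func() {
		p, err = profile.ParseData(raw) // decompress + decode into structs
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("on-wire: %d bytes, decoded heap growth: ~%d bytes, samples: %d\n",
		len(raw), delta, len(p.Sample))
}
```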

tigrannajaryan commented 7 months ago
  • Network traffic overhead for self-contained messages is estimated to be about 3.5x, after compression (and neglecting any cost of compression and decompression).
  • The uncompressed data size is about 7.3x-11x, which means each host on which we are profiling needs to compress 7.3x-11x more data (and the backend needs to do the decompression, too).

@thomasdullien these numbers show the performance difference between stateless and stateful formats described in this doc, right?

christos68k commented 7 months ago
  • Network traffic overhead for self-contained messages is estimated to be about 3.5x, after compression (and neglecting any cost of compression and decompression).
  • The uncompressed data size is about 7.3x-11x, which means each host on which we are profiling needs to compress 7.3x-11x more data (and the backend needs to do the decompression, too).

@thomasdullien these numbers show the performance difference between stateless and stateful formats described in this doc, right?

Correct (we made some small changes to the benchmarked protocols, namely removing some fields we no longer use, such as SourceIDs; I updated the document to reflect these changes).

felixge commented 7 months ago

@thomasdullien thanks for the analysis in your two comments above. I spent some time reviewing it today, and here are my summarized conclusions (I hope to discuss them in the SIG meeting starting now).

To be clear: I would take a 3.5x increase in efficiency on the wire very seriously when it comes to the stateful vs. stateless discussion. So I wouldn't mind being convinced of it :).

thomasdullien commented 7 months ago

The data used for the 2nd comment (the pprof stuff) is 1-minute intervals of profiling data in JSON format that I converted to the pprof format by hacking around in the existing benchmarking code. I have attached an example JSON file here (about 400k zstd-compressed for 1 minute of data). I have 1 hour's worth of these files for the testing.

I'll have to dig to find the Go code I hacked together to convert it; give me a day or two - it's been a few weeks.

elasticsearch-server-profiles-minute-1.json.zst.gz

felixge commented 7 months ago

Thanks @thomasdullien, this is great. About the code for the pprof stuff: don't worry too much. You clarified in the SIG meeting that the pprof experiment was mostly about the gRPC message size issue rather than comparing stateful against stateless in general, and I'm fully convinced by that argument – we need to investigate this problem further.

What I'm really looking forward to is seeing the results of terminating the stateful protocol every 60s (flushing the client caches) and comparing the data size this produces against normal operation of the stateful protocol.
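For reference, the experiment being asked for amounts to something like the following hedged sketch (the types and names are illustrative, not any existing agent's API): keep a client-side cache of stack-trace fingerprints already sent in the current epoch, and clear it every 60 seconds so each interval becomes self-contained; the interesting measurement is the data-volume delta with and without the periodic flush.

```go
package main

import (
	"fmt"
	"time"
)

// sentCache tracks which stack-trace fingerprints have already been sent in
// the current epoch. Clearing it every interval makes each interval
// self-contained, which is the experiment suggested above.
type sentCache struct {
	seen map[[32]byte]bool
}

func newSentCache() *sentCache { return &sentCache{seen: make(map[[32]byte]bool)} }

// needsFullTrace reports whether the full stack trace must accompany this
// fingerprint, i.e. whether it has not yet been sent in this epoch.
func (c *sentCache) needsFullTrace(fp [32]byte) bool {
	if c.seen[fp] {
		return false
	}
	c.seen[fp] = true
	return true
}

// flush forgets everything, simulating "terminating the stateful protocol".
func (c *sentCache) flush() { c.seen = make(map[[32]byte]bool) }

func main() {
	c := newSentCache()
	ticker := time.NewTicker(60 * time.Second)
	defer ticker.Stop()

	fp := [32]byte{1}
	fmt.Println(c.needsFullTrace(fp)) // true: first time in this epoch
	fmt.Println(c.needsFullTrace(fp)) // false: fingerprint already sent

	select {
	case <-ticker.C:
		c.flush() // next epoch starts from scratch
	default:
	}
}
```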