open-telemetry / opentelemetry-collector

OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
4.52k stars 1.48k forks

What is an OpenTelemetry Collector, what is a distribution? #8555

Open jpkrohling opened 1 year ago

jpkrohling commented 1 year ago

We had a discussion recently around what is an OpenTelemetry Collector and what is a distribution of the Collector. I would like to gather your opinions.

@dyladan proposed that only what the SIG Collector produces can be called an "OpenTelemetry Collector" and that a distribution has to fulfill the following requirements:

I tend to agree with him, but I'm eager to hear your opinions. The GC might have the right to make the final decision if we can't get an agreement, but I think we can indeed reach a consensus, at least between the GC and the Collector maintainers (core and contrib).


Update - 2024-07-17: based on the state of the discussion so far, here are the issues we identified:

dashpole commented 1 year ago

Here was my take from 2020: https://docs.google.com/document/d/1jHOYTRRI91UdyMEfqV7WNPEAxSQKP13b_jPcQX4oe9I/edit?usp=sharing

TL;DR

Other projects (prometheus, kubernetes) have successfully created conformance programs by testing conformant behavior, rather than requiring the use of certain code packages. An example of "conformant behavior" could be:

The easiest way to construct a "conformant" collector distribution would be to simply use collector libraries, or the collector builder, but it wouldn't necessarily require it.

djaglowski commented 1 year ago

I like the idea of defining conformance to a standard but it's unclear to me what we are suggesting will be the effect of being conformant. In other words, let's say we define what it means to be an "OpenTelemetry Collector", and someone has a product which meets all the requirements. Isn't it still a trademark issue for them to say that their product is an OpenTelemetry Collector?

IANAL but as I understand it, The Linux Foundation has a trademark on the term OpenTelemetry and their trademark guidelines define how the trademark may and may not be used.

e.g. It would be a trademark violation for a company to name their product "Company OpenTelemetry Collector" because the trademark may not be used in a product name. However, it is ok to use the phrase "Company Distribution for OpenTelemetry Collector" because it is a reference to the trademark and does not imply that the trademark is part of the product name.

I don't mean to nitpick but I can't figure out how one would communicate the fact that they officially have an OpenTelemetry Collector without violating the trademark guidelines.

atoulme commented 1 year ago

What does this clarification do and how does it help the project? I am unclear on why this is coming up. Is this impacting the OpenTelemetry project's ability to graduate within the CNCF?

bryan-aguilar commented 1 year ago

I like the idea of defining conformance to a standard but it's unclear to me what we are suggesting will be the effect of being conformant.

I think in this case the effect would be that you cannot call yourself a "Collector distribution" without passing X,Y,Z conformance tests.

I think the trademark issue is separate though and has already been enforced in the past.

includes only plugins/components which are compatible with the collector framework. they don't need to be in the otel repos, but you should be able to point the upstream collector builder at them

I'm not sure I fully understand this one. Would this by proxy mean that "a collector distribution" must be built, or be able to be built, with OCB? I think this may be too limiting. Consider this scenario: Contributor X builds a new Collector component type. It is ideal for their specific use case, and they don't plan on contributing upstream, but they build it on top of the collector framework. OCB does not recognize this component type and thus fails to build it. Would this not qualify as a distribution?

codeboten commented 1 year ago

Just linking this other issue here that suggests a distribution should be added to the spec: https://github.com/open-telemetry/opentelemetry-specification/issues/2873

As the issue points out, distribution is already in the official documentation: https://opentelemetry.io/docs/concepts/distributions/

codeboten commented 1 year ago

Note the doc linked above also includes a link to the definition of the collector today: https://opentelemetry.io/docs/concepts/components/#collector

The OpenTelemetry Collector is a vendor-agnostic proxy that can receive, process, and export telemetry data. It supports receiving telemetry data in multiple formats (for example, OTLP, Jaeger, Prometheus, as well as many commercial/proprietary tools) and sending data to one or more backends. It also supports processing and filtering telemetry data before it gets exported.

austinlparker commented 1 year ago

I guess my question would be if Collector SIG disagrees with the definition of distribution that's currently on the website.

bryan-aguilar commented 1 year ago

One thing that came up during discussions today at the Operator SIG, and separately in discussions with @Aneurysm9, is command support.

Should collector distributions be required to support both the Collector validate and components commands? Do we need to ensure that any future commands are able to be supported by distributions that do not use OCB?

cc: @jaronoff97
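
A minimal sketch of how such a command-support requirement might be checked, assuming a distribution ships a binary that accepts the `validate` and `components` subcommands the same way the upstream `otelcol` binary does. The names `missing_commands`, `run_process`, and `REQUIRED_COMMANDS` are hypothetical. Note that a non-zero exit from `validate` can also mean the config itself is invalid, so a real check would need to tell those cases apart.

```python
import subprocess
from typing import Callable, List, Sequence

# Subcommands named in the question above; the list itself is an assumption
# about what a future requirement might contain.
REQUIRED_COMMANDS = ("validate", "components")

# A runner takes an argv list and returns the process exit code.
Runner = Callable[[Sequence[str]], int]


def run_process(argv: Sequence[str]) -> int:
    """Run argv and return its exit code (127 if it cannot be started)."""
    try:
        return subprocess.run(argv, capture_output=True, timeout=30).returncode
    except (OSError, subprocess.TimeoutExpired):
        return 127


def missing_commands(binary: str, config_path: str,
                     run: Runner = run_process) -> List[str]:
    """Return the required subcommands that exit non-zero for this binary."""
    invocations = {
        "validate": [binary, "validate", f"--config={config_path}"],
        "components": [binary, "components"],
    }
    return [name for name in REQUIRED_COMMANDS if run(invocations[name]) != 0]
```

The runner is injected so the check can be exercised without a real binary; calling `missing_commands("otelcol", "config.yaml")` with the default runner would invoke whatever `otelcol` is on the path.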

jaronoff97 commented 1 year ago

My expectation as someone building features on top of the collector is that any collector distribution uses the collector builder or at least can be marshalled into a struct that matches the collector Go framework. Being able to adhere to that would ensure that how we design Kubernetes features will always work for any distribution.
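
One possible reading of "can be marshalled into a struct" is that a distribution's configuration maps onto the same top-level shape the upstream Collector uses (`receivers`, `processors`, `exporters`, and `service.pipelines`). The sketch below illustrates that idea only; it is not the Collector's actual Go types. The class and function names are hypothetical, it ignores extensions and connectors, and it assumes the YAML has already been parsed into a plain dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

PIPELINE_KINDS = ("receivers", "processors", "exporters")


@dataclass
class Pipeline:
    receivers: List[str] = field(default_factory=list)
    processors: List[str] = field(default_factory=list)
    exporters: List[str] = field(default_factory=list)


@dataclass
class CollectorConfig:
    receivers: Dict[str, Optional[dict]] = field(default_factory=dict)
    processors: Dict[str, Optional[dict]] = field(default_factory=dict)
    exporters: Dict[str, Optional[dict]] = field(default_factory=dict)
    pipelines: Dict[str, Pipeline] = field(default_factory=dict)


def load_config(raw: dict) -> CollectorConfig:
    """Map a parsed config onto the typed shape.

    Raises TypeError if a pipeline uses a key the shape does not know.
    """
    service = raw.get("service") or {}
    pipelines = {
        name: Pipeline(**{key: list(ids or []) for key, ids in (body or {}).items()})
        for name, body in (service.get("pipelines") or {}).items()
    }
    return CollectorConfig(
        receivers=raw.get("receivers") or {},
        processors=raw.get("processors") or {},
        exporters=raw.get("exporters") or {},
        pipelines=pipelines,
    )


def undeclared_components(cfg: CollectorConfig) -> List[str]:
    """List pipeline references with no matching top-level declaration."""
    problems = []
    for name, pipeline in cfg.pipelines.items():
        for kind in PIPELINE_KINDS:
            declared = getattr(cfg, kind)
            for component_id in getattr(pipeline, kind):
                if component_id not in declared:
                    problems.append(
                        f"{name}: {kind[:-1]} {component_id!r} is not declared"
                    )
    return problems
```

A tool such as an operator could, under this assumption, reject or flag any distribution config that fails to load into the shared shape before trying to manage it.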

trask commented 5 months ago

What does this clarification do and how does it help the project?

I think this is a great question to help anchor this discussion.

Here's one scenario that comes to mind.

Consider if (hypothetically) Google offers an OpenTelemetry Collector Distro for GCP that has lots of great 1st party GCP support.

But their distro doesn't include (hypothetically) the Honeycomb Marker Exporter, because they don't want to be on the hook for supporting that exporter.

This situation seems somewhat unavoidable, as I'm not sure we want to force all distros to include all components, both for size and support reasons.

If the OpenTelemetry Collector could support dynamic linking, then users could just drop the Honeycomb Marker Exporter into their GCP distro, and the problem is solved, but it sounds like dynamic linking is a no go because of Go.

So we would need another way to ensure that OpenTelemetry Collector distros can be extended and don't lock users into the distro's ecosystem.

[just for one example, potentially we could say that anything called an OpenTelemetry Collector distro must be built using the OpenTelemetry Collector Builder and that all the distro components must be publicly available so that users can extend the distro themselves]

yurishkuro commented 5 months ago

@trask I don't think your example answers the question, at least for me. And we had an hour-long discussion on the call where we still didn't explicitly enumerate what problems we're trying to address by the discussion. I heard at least two problems, one on the call, another in your answer:

  1. OTEL Collector maintainers are concerned with getting a lot of user questions in the official Slack related to 3rd-party collector distros, all because they are calling themselves "OTEL collector ..."
  2. (from your comment) A user who's running a 3rd party distro needs to add another component to the collector; what do they do?

Some thoughts on (2):

trask commented 5 months ago

if the 3rd party distro is fully open source, then user can just build their own flavor that includes additional components

it's not very user friendly and about 100x more painful than the plugin-based ecosystems I've worked with before where I can just upload a pre-built component into my existing system. I guess I was hoping we could get as close to the convenience that other plugin-based ecosystems offer, within the constraints of Golang.

whatever the solution, the discussion of "what is collector" seems quite tangential to the problem

I think the connection is that we have an opportunity to make requirements on something that wants to call itself an OpenTelemetry Collector distro, and so it's our chance to enforce something like this (if we want)

fwiw, the example I gave

[just for one example, potentially we could say that anything called an OpenTelemetry Collector distro must be built using the OpenTelemetry Collector Builder and that all the distro components must be publicly available so that users can extend the distro themselves]

aligns with the definition proposed by @dyladan and @jpkrohling above:

that a distribution has to fulfill the following requirements:

  • uses the collector framework (upstream not a fork)
  • includes only plugins/components which are compatible with the collector framework. they don't need to be in the otel repos, but you should be able to point the upstream collector builder at them

yurishkuro commented 5 months ago

includes only plugins/components which are compatible with the collector framework.

This ^ already excludes existing distros that use proprietary code. More importantly, it doesn't answer the question which problem a definition like this solves. I see no reason to debate the criteria without deciding why we're doing it. To quote a good book:

  • “Would you tell me, please, which way I ought to go from here?”
  • “That depends a good deal on where you want to get to.”
  • “I don't much care where.”
  • “Then it doesn't much matter which way you go.”

trask commented 5 months ago

I see no reason to debate the criteria without deciding why we're doing it.

I totally agree which is why I tried to provide one possible "why" above. I'm looking forward to seeing what other "whys" people have in mind.

djaglowski commented 5 months ago

The primary reason I care about a definition here is that users are advised to limit the collector to contain only the components necessary for an environment. In the absence of a dynamic plugin model (which to my knowledge no collector maintainer believes is feasible), we are recommending that users deploy a "collector" that we have not built ourselves. Since we are not recommending a concrete binary, I believe we need to define precisely what we are recommending. Additionally, we expect that as a user's needs evolve they will migrate to another "collector" that contains a different set of components. Therefore, a definition would serve to establish expectations for what stays the same between "collectors" vs what may be different.

I would like to highlight that the issue asks for two definitions, but there appear to be at least three categories of collectors which have been discussed. Very roughly:

  1. "Official" collectors - those produced by the Collector SIG
  2. "Custom" collectors - those produced by users following our recommendation to limit components for their environment
  3. "Distributions" - those published by vendors or organizations

The conversation so far seems to have blurred (2) and (3), and we might explicitly conclude that this is not an important distinction. However, for now, I'm drawing this distinction because the "whys" I've described above specifically apply to (2).

tedsuo commented 5 months ago

I have two problems that I would like to see resolved.

Problem one: remove confusion about what a Collector is

The first problem is basic confusion about "what a Collector is." Not a Collector distro, but the term Collector itself.

If someone points to a binary and calls it a Collector, just about everyone in the community would assume that the binary is a build of the collector codebase plus some plugins. Even if a binary was described as some kind of "Vendor Specific Collector Distribution," that core assumption would still be there.

That seems a bit obvious, but we're now starting to see projects pop up which don't match this definition. One example is Grafana Alloy. My understanding is that Alloy is basically the pre-existing Grafana agent, plus some additional components that it shares with the Collector codebase. Which is a totally fine thing to be! But when I first came across it, it was described as a "vendor neutral OpenTelemetry Collector distribution." Like everyone else in the community, that description made me think it was something completely different – that it was the Collector codebase plus some Grafana-specific plugins. I was super confused when I discovered that wasn't the case!

Again, no disrespect to Grafana or the Alloy project; it seems like a totally fine project to me. But the naming threw me for a loop. Imagine if CouchDB started calling itself Redis because it shared some Redis code in order to add a feature. That would be really confusing!

I'm sure the Grafana folks are reasonable, and we can just talk to them about it. But I imagine that there may be more instances of this in the future, so it seems prudent that we provide some kind of official definition of a Collector that roughly matches community expectations, in order to avoid confusion. Namely, that a Collector is a build of the collector framework plus some plugins.

Problem two: who do I talk to for technical support?

At the heart of all the various collector distro discussions is the question "who is responsible for helping me with this thing?"

We have users who come into our slack channels asking for technical support. What technical support do we want to give? Who do we point them to if we don't want to give them support? Do we just support the core and contrib builds of the Collector? What if a user makes their own build, but it only contains a subset of plugins in the contrib build? What if they add just one plugin that they wrote themselves? What if a vendor provided the build? What if the vendor build only contains contrib plugins? What if it's the contrib build but their configuration file is absolutely insane? Technical support is really important, and telling someone "no we won't help you" is disappointing. So we need a really clear-cut definition for what we are willing to support.

Maybe there are additional problems, but those are the two where I am currently seeing real world issues related to a lack of clear definitions around the Collector.

yurishkuro commented 5 months ago

Problem one: remove confusion about what a Collector is

I don't think this in itself is a problem. Whatever someone calls their binary doesn't concern me unless I have an actual problem to solve and their naming creates confusion preventing me from solving the problem (like coming to OTEL support group when the actual "collector" is something else entirely). So your #2 is an actual problem, but #1 is not, it's more like a possible root cause for #2. But #2 could be caused by other things too - a distro may actually be a "collector" as you want to define it, yet the question is about a custom or even proprietary plugin.

In other words, if #2 is the only problem you want to solve, it needs a policy of what is appropriate scope for support questions. There may be a definition of collector that helps this policy, but doesn't help other problems, such as one in https://github.com/open-telemetry/opentelemetry-collector/issues/8555#issuecomment-2166935956. And there may be other approaches to the policy rather than relying on "what is collector" question. Such as: go talk to your vendor who provided the binary, irrespective of whether it matches any definition of collector or not. I would actually be a strong proponent of that exact rule - vendors have paying customers, they can allocate resources for tech support, instead of putting this burden on oss volunteers in OTEL.

tedsuo commented 5 months ago

@yurishkuro number one is definitely a problem. We are actively addressing an example of it right now. It is related to number two, but it causes other fundamental confusions.

I agree that for most projects, #1 is not an issue – no one is going to name their project Redis. But perhaps because OpenTelemetry is something of a standard, there seems to be a natural inclination to imply that projects which process OTLP are part of OpenTelemetry even when they are maintained outside of the project, with the Collector being the main target. I don't think that defining what a Collector is needs to be difficult or complicated, but we should write it down anyways. We have other problems, but their solutions don't need to be related to the sign on the wall we need to put up declaring that the term Collector only refers to this codebase.

jpkrohling commented 5 months ago

Thank you all for the renewed interest in defining Collector and Collector distributions. I watched the recording from last Thursday and spoke to several of you on Slack (GC and Collector leads). Here’s a summary of the situation as I understand it.

We already have a few definitions in place, such as:

Commercial vendors are being asked to support the "OTel Collector" by their customers, as evidenced by the number of commercial vendors listed as having a distribution of the Collector:

Each vendor has a different approach to meeting this demand. Some assist customers using a curated list of upstream components, others offer support (with SLAs) for their official binaries with vetted upstream components, and others provide extra features at different levels. These approaches are categorized on the distribution definition page as "Pure," "Plus," and "Minus."

However, not all of these approaches resonate equally within the GC and with Collector maintainers: we accept some approaches as distributions but not others. We can't pinpoint why they are different, making it harder for vendors to comply with the (non-existent) requirements to be called a distribution. The GC has politely asked one of these vendors to stop calling itself a Collector, without providing a clear path forward for the project to regain the right to be called a distribution. Lack of knowledge about these projects adds to the confusion. For instance, I have seen inaccurate claims about ADOT and Alloy.

@atoulme, @bogdandrutu, and @yurishkuro have questioned the actual problem we are aiming to solve. While their question might seem odd, there wasn’t a clear articulation of the problem: we feel that something is off but can't pinpoint why we don't want certain projects to be called a distribution of the Collector. One argument by @djaglowski was well-received: we want users to have a consistent experience and be able to reuse their knowledge when switching between "flavors" of the Collector, whether custom-built, vendor-built, or community-built.

I have also heard a few other arguments, which I'll address here:

To me, it's clear that we need an objective set of rules in addition to our existing subjective definitions, so the ecosystem can thrive with options for our users while retaining their ability to reuse their knowledge and switch between flavors without getting locked-in. If we can agree on this need, here’s what I propose as an initial draft, with the promise to develop it further elsewhere:

tedsuo commented 5 months ago

Thanks @jpkrohling that's a great layout. My only suggestion is that I think Collector Build and Collector Distro can be combined. Anything that can be reproduced by the builder can be called a Distro, regardless of who issued it.

jpkrohling commented 5 months ago

In my previous message, I should have stressed more that we didn't have a consensus on whether we had a problem to solve. Before addressing why I think we need a build and a distribution, I'd like to take a step back and have a consensus.

Community, Collector leads, TC, GC: please vote on this issue. The options are:

  • ❤️ No problem to solve at the moment. Let the ecosystem use our subjective definitions (status quo)
  • 👍🏽 We have a problem with the subjective definitions and need a concrete set of rules

Note that you are NOT voting on my draft proposal.

yurishkuro commented 5 months ago

Let me try one last time. You cannot solve a "problem" of "what is collector" without deciding why, i.e. what success criteria you want to meet by "solving" it. The poll above provides exactly zero answers to that question.

cartermp commented 5 months ago

Not sure how helpful this is, but this is my take from working with several hundred customers adopting OTel:

So I guess my experience is that there isn't a terrible problem here to resolve, but there is quite a bit of variation in what people use, and that sometimes leads to confusion or a bad experience depending on what they're using.

I see here echoes of what it means to adopt OTel. If you propose an alternative API, but still emit semantic conventions and OTLP data under the hood, is that OTel? I'd say yes. Is your binary, Acme Corp. Collector, capable of accepting and emitting OTLP, and also uses the batchprocessor with some different defaults under the hood? I'd call that a collector as well.

jpkrohling commented 5 months ago

@yurishkuro, please bear with us. Your input has been valuable and I think we are now in a better position because of your questions. I'll try again, starting with what I see as the problems we are trying to solve:

If we define we want to work on those problems, here are the goals for me:

austinlparker commented 5 months ago

I think the simplest way to conceptualize the 'problem' is that the only thing that the project defines as hard requirements for 'what is an OpenTelemetry ' is what's in the specification. This falls apart when you start talking about things like the collector - there's not really a specification for the collector. This can lead to not only user confusion (see above), but also confusion for vendors and integrators building in the ecosystem.

Ultimately, we need to be able to provide some guarantees to both of these groups -- to users, we need to be able to have clear guidance for questions like:

To builders, we need guidance around:

yurishkuro commented 5 months ago

@jpkrohling

Provide clarity: establish clear, objective criteria for what constitutes an OpenTelemetry Collector distribution so that both vendors and the community know what is and what isn't a distribution.

Don't you see that this is a pure tautology? "We want to know because we want to know". Any definition will match that. E.g. the following definition is clear and objective, and completely beside the point as it does not address the unspoken problems:

Consistent user experience: by establishing objective criteria, we ensure that users can have a consistent experience across different distributions if they stick to the aspects we establish, enabling users to switch between distributions without relearning or facing incompatibilities, while at the same time being able to use distribution-specific components or features.

This is getting closer to the issue, but it's very hand-wavy. @austinlparker 's comment https://github.com/open-telemetry/opentelemetry-collector/issues/8555#issuecomment-2181033886 is more concrete. Basically, we can approach this as a product requirement spec. Try to phrase everything as a use case:

"as a {user role} I want to {perform an action} so that I can {achieve an outcome}".

For example, with one of Austin's bullet points:

Phrased like this, an immediate question from me - is that what we actually want? How is that even possible? It means that the two distros are 100% functionally equivalent (at least on the features I already used with distro X), which defeats the purpose of distros in the first place. Ability to swap implementations is a nice theoretical goal, but there are other goals users may have, like I don't want to run binaries 100s of MBs in size bundling every possible feature.

So rather than keep debating completely arbitrary definitions of collector, let's first

  1. list what use cases we want to satisfy (aka "problems to solve"),
  2. whether we indeed agree that we want to satisfy them,
  3. and whether it's even possible to satisfy many of them at once (as a compromise).

Doing so will implicitly inform the definition of the collector, based on actual problems / goals / user needs, not based on a tautological definition of a problem.

austinlparker commented 5 months ago

@yurishkuro There is an immediate need for the collector, as a SIG, to define what the requirements of another piece of software calling itself an 'OpenTelemetry Collector' must align with. This is, as you said, a product requirement. I stated my rationale above, but I would like to expand on it with the bigger issue here.

As OpenTelemetry continues to mature and graduates, we (the GC and project leadership more generally) will need to create requirements around certification and compatibility. This is both easy, and hard. For instance, it is relatively easy to set a requirement around something like OTLP. If you write OTLP, then you must write valid OTLP to any compliant OTLP receiver. It is also somewhat easy to say 'Supports OpenTelemetry API' by ensuring that you can get the active span from context and modify it, etc.
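
To illustrate why a wire-level requirement is the "relatively easy" kind, here is a minimal sketch of a behavior-based check on a single OTLP/HTTP export request using the JSON encoding. It covers only a few surface rules (POST, the default per-signal paths, the `application/json` content type, and the top-level field being an array). The function name is hypothetical, it does not handle the protobuf encoding, and it is nowhere near a full conformance suite.

```python
import json
from typing import Dict, List

# Default OTLP/HTTP path per signal, mapped to the top-level JSON field
# of the corresponding export request.
OTLP_HTTP_PATHS = {
    "/v1/traces": "resourceSpans",
    "/v1/metrics": "resourceMetrics",
    "/v1/logs": "resourceLogs",
}


def otlp_json_violations(method: str, path: str,
                         headers: Dict[str, str], body: bytes) -> List[str]:
    """Return a list of problems found in one OTLP/HTTP JSON export request."""
    violations = []
    if method != "POST":
        violations.append("export requests must use POST")

    top_level = OTLP_HTTP_PATHS.get(path)
    if top_level is None:
        violations.append(f"unexpected path {path!r}")

    # Header names are case-insensitive; parameters such as charset are ignored.
    lowered = {key.lower(): value for key, value in headers.items()}
    media_type = lowered.get("content-type", "").split(";")[0].strip()
    if media_type != "application/json":
        violations.append("this sketch only checks application/json bodies")
        return violations

    try:
        payload = json.loads(body)
    except ValueError:
        violations.append("body is not valid JSON")
        return violations

    if not isinstance(payload, dict):
        violations.append("body must be a JSON object")
    elif top_level and not isinstance(payload.get(top_level, []), list):
        violations.append(f"{top_level} must be an array")
    return violations
```

A check like this depends only on observable behavior, not on which codebase produced the request, which is the property that makes it hard to reuse as a definition of "Collector".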

The collector, however, is much more difficult to quantify by these standards. I agree, in principle, that it might not be desirable for non-specced config files to be portable. I would generally agree that a receiver written for upstream may not necessarily work with other implementations. With that said, what is the distinction that we are going to use? You can hopefully understand my reluctance to say "Ok, well, you can just call anything that receives OTLP a Collector" because that could be very confusing for users, especially as management tools proliferate. Similarly, it does not benefit users to remove one source of lock-in (the API/SDK) then replace it with another (the pipeline/collector layer).

I would honestly be fine saying 'there is only one thing called an OpenTelemetry Collector, and it is anything that is built with upstream ocb'. Everyone else in the ecosystem can be 'OTLP compatible' or whatever other words we come up with.

edit: By 'non-specced' config files above, I mean configuration files that do not align with a published specification (eg, the upcoming file-based config options)

austinlparker commented 5 months ago

Just to be crystal clear -- I think an entirely acceptable outcome of this is stating the following:

yurishkuro commented 5 months ago

There is an immediate need for the collector, as a SIG, to define what the requirements of another piece of software calling itself an 'OpenTelemetry Collector' must align with. This is, as you said, a product requirement. I stated my rationale above, but I would like to expand on it with the bigger issue here.

@austinlparker I am sure you're familiar with the monitoring principle "don't alert on root causes, alert on symptoms". The motivation is that there may be many different root causes of an issue, but if it does not affect user-facing behavior it's not worthy of an alert, and vice versa, if the user experience is affected you should be alerted regardless of the root cause. In your paragraph, "what to call a collector" is a (possible) "root cause". I am interested in the "symptoms".

If my previous quote (https://github.com/open-telemetry/opentelemetry-collector/issues/8555#issuecomment-2167135929) didn't hit the spot, here's another one:

“What's in a name? That which we call a rose by any other name would smell just as sweet.”

tedsuo commented 5 months ago

@yurishkuro I have three concrete issues that can be resolved with definitions.

The first is support. I would like us to only handle technical support requests for the code that we own. We issue two binaries, so we support those binaries. Anything beyond that we don't really want to support.

The second is end user confusion. I continuously get questions from end users about "what is a Collector Distro?" Many users think that these are forks of the project. For example, I have heard people complain more than once that OTel is a fractured project because every vendor has forked the Collector. Defining what a "Collector Distro" is would help a lot with these misconceptions. I am really tired of answering these questions and putting the same misconceptions to rest over and over again. And yes, I often get asked the question "where is this all defined?" when I explain this.

There is now even further confusion, as projects are now appearing that contain some Collector functionality, but also contain significant additional functionality that has nothing to do with the Collector. If these projects primarily refer to themselves as "Collectors" or "Collector Distros," then there really is some kind of fracturing happening, as those projects contain functionality that could not be added to other Collectors – you must use that specific non-collector codebase in order to access those features. The same goes for Collector binaries that contain private plugins – it's a fork because you are now completely dependent on this third party organization for this functionality. If projects like these can be considered Collector Distros, then the term would be meaningless and cannot solve the first problem. So I want another term to describe these projects.

@jpkrohling's proposal is very close to what I want, as it differentiates between:

Those definitions would go a long way to resolve the end user confusions I have encountered to-date around the Collector.

tedsuo commented 5 months ago

And to clarify a point: if the Collector project was completely rebuilt in Rust, that would also be labeled as "Collector Compatible," because it is not a Collector! Even if it was the OpenTelemetry organization that rebuilt it, I would still require that we give the rust project a different name, because once again it would be confusing as hell for us to have two separate codebases that were called the same thing.

codefromthecrypt commented 5 months ago

TL;DR; let's split off support policy from branding guidelines, solve support first.

Hello folks I know and don't yet! I have some comments as it is near and dear, and relevant at Elastic as we have distributions etc. My personal opinion on this is grounded by a bias towards precision on what is and not supported, driven by the desire to both scale support and also not confuse users. I accept I have no first hand pain in this org, yet, but the pains are familiar in past lives. I want things to work out as otel is my day job now, but please understand my comments are not grounded in the practice here, yet.

In Zipkin, what is and isn't supported was incredibly important due to so few hands available. We said effectively no support of custom builds, it didn't matter if that was a saas variant or a customer provided one. Binaries we publish are supported and that's it. So, to get support you at least have to reproduce it on a standard build, even if in prod you use something else. The latter is a neat trick to get support on a custom build really ;) I think most of this thread, and the most grounded and actionable parts are very very similar, even if the scale of otel is a lot larger and will have different outcomes.

What's less clear is both the concerns and what to do about brand misuse and abuse. I'm sensing a desire to create specific brand advice or even a mandate on the Collector term, with focus on third party vs vendor offerings. My gut feel is that if this is desired, it is a different issue and should be driven by tests. In fact I would go so far as to say rust could be a valid collector if it is managed here and passes the same tests. In such a way a private or 3rd party build could know if they conform to a known build and there is a chance of future certifications for commercial providers. I feel this should be a separate issue, as an issue should result in something specific to solve with a clear rationale.

Finally, and this is just a conjecture, but I don't think we benefit by trying to limit how people can integrate technology without very clear whys, such as where it causes a support problem or undue confusion. For example, a service that embeds a collector may not be able to say it is exclusively a collector, but.. is that a bad thing? Even if it is unadvised, do we want to police it? This particular topic is more a conversation than an issue to me. I feel it is related to the two above, but possibly something to revisit when both above are solved. Actually, I feel it might be best to table this part.

My 2p and hope it helps

svrnm commented 5 months ago

+1 for the suggestions @tedsuo made in this comment for definitions of "Collector", "Collector Distros" and "Collector Compatible"

I also agree that support policies and branding (or wording) are related issues, but need to be treated independently. The service we do for support by being precise in our words is that we remove confusion and can more easily identify situations that are out of support, but there will always be end-users using terminology incorrectly and confusing things. I would also add that ultimately the support policies are in the LICENSE: Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND.

Note, that this discussion here is an example for why we need a process on how can we encourage consistent terminology across the OpenTelemetry project:

tedsuo commented 5 months ago

Hi @codefromthecrypt, welcome back!!

Let me clarify why there should be a difference between "Certified Collector Compatible" and "Collector Distro."

Certified Collector Compatible would definitely be a test suite. I would suggest modeling it on Certified Kubernetes Software Conformance. And it would not be an attempt to make integrated collector technology a second class citizen. Quite the opposite, I see it as a positive thing. Through the CNCF, we would have a Certified Collector Compatible trademark with a logo that projects can place on their website. I think this would help encourage adoption of OTLP and other protocols like OpAMP. And since it is an API conformance test, it would not limit projects to Go, or needing to use our codebase.

The issue with such a test suite, and why it would be different from a Distro, is that we need to pick a set of APIs that are required to pass this test. The Collector is almost entirely plugin based. You could build a Collector with no functionality at all. So Collector Compatible would involve deciding what minimal set of protocols and features we would like to see embedded in other projects. Alloy is a perfect example of this – it's fantastic that the Grafana agent has been extended to process OTLP and use OpAMP for remote configuration.
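To illustrate the "pick a minimal set of APIs" idea, here is a minimal sketch of a pass/fail check. The capability names and the required set are hypothetical, since the thread does not define them, and a real conformance suite would exercise behavior rather than trust a declared list.

```python
from dataclasses import dataclass

# Hypothetical capability names; the thread does not define the required set.
REQUIRED_CAPABILITIES = frozenset(
    {"otlp/grpc-ingest", "otlp/http-ingest", "opamp-remote-config"}
)


@dataclass(frozen=True)
class ConformanceResult:
    passed: bool
    missing: frozenset


def check_conformance(declared, required=REQUIRED_CAPABILITIES):
    """Pass only if every required capability is present; extras are ignored."""
    missing = frozenset(required) - frozenset(declared)
    return ConformanceResult(passed=not missing, missing=missing)


# Any implementation, in any language, passes if it exposes the required set.
full = {"otlp/grpc-ingest", "otlp/http-ingest", "opamp-remote-config", "prometheus-scrape"}
partial = {"otlp/grpc-ingest"}

print(check_conformance(full).passed)
print(sorted(check_conformance(partial).missing))
```

The check says nothing about how the software was built, which is the distinction being drawn from a Distro.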

So, given we have a Certified Collector Compatible program, what is the purpose of Distros? I originally coined the term "Collector Distro" to mean a set of publicly available collector plugins. End users should be able to write their own plugins and roll their own Collectors using the build tool. We want to encourage vendors to make this possible. If vendors add functionality that cannot be included using the build tool, then end users are now forced to use that specific binary to gain certain functionality. This means that they can no longer pick and choose what plugins they want to use. That limits a very important feature of the Collector, and would represent a fracturing of our ecosystem.

Distros don't have any requirement to include any specific plugins or APIs. For example, it's extremely normal for users to build Collectors that do not include OTLP, because they are not using that protocol. They might not even include tracing at all. But they are still a Collector, built out of Collector plugins using the build tool.

I do not want to lose this definition of a distro, as it has been very helpful to the community over the years. But I also want to encourage other projects to implement a set of "OTel best practices" that would make them work well in our ecosystem, regardless of how they are implemented.

Is that a helpful explanation for why both terms are needed? I see it as a net positive to encourage both behaviors – adoption of the OTLP protocol and other features for projects which are not Collectors, and a "roll your own" plugin-only architecture for projects which are Collectors.

codefromthecrypt commented 5 months ago

(Hi, Ted and Yuri, and all ;) good to be on the same team!)

Some takeaways as the person with least context until I lose that benefit.

  1. @svrnm forked an issue about terminology concerns and some real impacts of them. This issue is centered on one part of that, related to the collector.
  2. Ted raises a good point on difficulties of validation of collectors esp around protocols that may not end up in use.

I personally can get behind "Collectors" and "Collector Distros" as I think it would be easy enough to explain to end users, and also ties into the current use of distro even if later that definition may need polishing. The part more difficult and not implemented yet to my knowledge is the "Certified Collector Compatible" thing. See below, but..

Could we make a call on changes or refinements to the terms "Collectors" and "Collector Distros" in support of.. end user support, and close this with a PR of those changes which links to 2165


We still have 3: "Certified Collector Compatible" and that's ok, but a different thread.

I think "Collector compatible" or whatever that turns into is a larger topic for reasons discussed and still believe it should be forked into a new issue. Personally, I like feature based stuff and let that drive the name, ideally not another name that has "collector" in it ;) I may be wrong, but I sense the third overload of a term could be diminishing returns especially in how tests might pan out. "Semantic compatible" feels a bikeshed I would wander into.

In that issue, I would blather about the highest value in compatibility being data compatibility, and cloud escape. I would ramble about OTLP and structure being important, but switching is really about being able to pull the same semantic data out as you put in. This seems more important than which plugins are enabled, or even if the same codebase is in use (if you could even tell!). In the docs resulting from it, we can say the easy path is to use tech like my company does, the collector builder, and that a practice of not relying on custom distros is a great way to remain certified and cause least grief to your users. I know this wouldn't work for tools that have lossy conversions, but anyway we have to choose what to optimize for, if the goal is coherence and a sentiment that these collectors are not forks.
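A rough sketch of what "pull the same semantic data out as you put in" could mean as a check: compare the attributes sent with the attributes read back. The function name and the lossy backend are hypothetical, made up for illustration.

```python
_MISSING = object()


def lost_attributes(sent, received):
    """Return the attributes from `sent` that are absent or altered in `received`."""
    return {
        key: value
        for key, value in sent.items()
        if received.get(key, _MISSING) != value
    }


sent = {"service.name": "checkout", "http.request.method": "GET"}

# Hypothetical lossy backend: renames one attribute on the way through.
received = {"service": "checkout", "http.request.method": "GET"}

print(lost_attributes(sent, received))
print(lost_attributes(sent, dict(sent)))
```

An empty result means nothing was lost, regardless of which plugins or codebase produced the output.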

My 2p.

jpkrohling commented 5 months ago

> End users should be able to write their own plugins and roll their own Collectors using the build tool. We want to encourage vendors to make this possible. If vendors add functionality that cannot be included using the build tool, then end users are now forced to use that specific binary to gain certain functionality.

People seemed surprised when I brought this up elsewhere, but only 3 out of the 7 distributions I listed previously would comply with this. In fact, AWS Distribution of OTel Collector was already available before I created the builder. The ones that are currently being built with ocb are:

Conversely, as pointed out by @djaglowski a couple of times, the Collector framework is so flexible that you could have a valid builder manifest file and a binary that doesn't look like a Collector at all.

This is why I think we need all four definitions I linked in the proposal and reserve "distributions" for our internal usage, meaning that it's a "distribution of OTel Collector components" (or "plugins" in @tedsuo's message). To recap:

This way, vendors can make it clear to their audience that:

tylerbenson commented 5 months ago

(You can add Datadog to the list now too...)

mattdurham commented 5 months ago

Grafana Alloy developer here, would love the conformance tests. This would allow a binary pass/fail decision. Interested in seeing where the definition lands for a system that supports otel config even if not built with the ocb. This seems to be what Datadog has done and is what Alloy has limited support for.

codefromthecrypt commented 5 months ago

@jpkrohling so what I understand from your opinion is that OpenTelemetry should be the only one that can use the word "distribution", and the ones you mentioned that are already using the word distribution should rename.

If I understood incorrectly, correct me. Regardless, can you suggest what names these products would use instead? While I like the idea in concept of a universal binary, some challenges of this have been noted, including pressure for all potential vendors to become a part of that binary.

You could imagine 30-40 vendor plugins in the main "distribution" if somehow we become unfriendly towards extension. This can create some interesting dependency problems. Imagine a worst case of a several gigabyte binary and all the maintenance of version conflicts as people try to be a part of the only way out. Imagine if all experiments had to be shipped in the only valid distribution prior to being usable. Another challenge is realities such as FIPS builds, are we planning to do all the things vendor builds often have to? Basically, I'm not saying to make the "official" binary strong is a problem, but there are costs to it especially if we restrict terminology.

Even if we did, I think people tend to know "official" is different than "distribution", even if I personally agree the latter isn't crystal clear. Just I think that doing something like restricting the word "distribution" has a lot of side effects, especially as used elsewhere and already here. It could come off draconian and unfriendly to more contributors and cause displacement of efforts into renaming instead of productivity.

Basically, I feel like if we are worried about brand abuse, we shouldn't "throw the baby out with the bathwater". Find a way to incentivize alignment of branding and use clear words that the binary we ship is the official one. My suggestion of a test suite was around this.

Basically if you have a conformance suite of some kind then we don't need to punish folks for using the very frameworks we suggest. We also acknowledge that special-case builds (FIPS etc.) should be encouraged, and the core should remain lean.

I know I've used some extreme ends to make a point, but it is mainly to highlight there is some damage re-purposing the word "distribution" in a perhaps surprisingly exclusive way. I think the TL;DR; is let's step softly into this space, pitch example impacts and remember why we are doing things.

codefromthecrypt commented 5 months ago

P.S. If we are suggesting terminology changes, we should practice them, because that could help identify some strangeness.

e.g. OpenTelemetry distribution of OpenTelemetry Collector is for lack of better terms weird because if you are OpenTelemetry you don't need to qualify anything. I wouldn't expect "Grafana distribution of Grafana" on a downloads page either ;)

So, basically if we have some changes, let's practice both with real world existing distributions, and if we repurpose distribution to also cover our own, then clarify how that would look. Possibly compare it to other similar tools in the ecosystem that would have the same concerns as ours.

More musings below

I do get the issue with "collector builder" being insufficient to represent something being a canonical collector. It is sort of like saying all things spring boot are distributions of spring boot. I'm not saying this topic isn't tricky, but I think we should practice more if changing things as the blast radius of impact isn't tiny in a project this large. To be objective means how to measure and we should focus on that before disallowing names already used.

Issues are best with concrete outcomes and while my opinions are non-binding, I think if the outcome is to restrict the word distribution, a quick poll would lead to this issue being closed. Especially as this is a part of the Linux Foundation and distribution is literally well understood in this context. That might be the best thing as this issue seems more a discussion than a concrete thing to do with a specific rationale otherwise. We could then salvage movements forward in more focused issues.

tedsuo commented 4 months ago

Getting back to this now. There were further discussions at OTel Community Day, and I believe I have a clearer understanding now about what makes a Collector Distro a Collector, vs a piece of software that happens to use Collector code.

The answer is pluggability. The Collector's flexibility is at the heart of what the Collector is. It is hugely beneficial for operators to have the ability to mix and match plugins and features.

A particular distro might have some features that an operator would want. But they may want to add additional features via plugins, including plugins that they create themselves. If a piece of software allows them to add these plugins, that software has preserved this critical feature.

On the other hand, if a piece of software offers features that can only be gained by that particular build, and additional plugins cannot be added, then users are faced with a choice: use the features of this software, or forgo those features in favor of a Collector where they can add their own plugins. In other words, they can either use this software or use a Collector.

From a practical perspective, this is an incredibly useful definition of a Collector Distro:

Note that this definition does not define how the software is built, or how additional features may have been added to it beyond what Collector plugins could provide. This definition also avoids tying the distro to the specific builder that OpenTelemetry maintains. It just focuses on the features that make a Collector a Collector: configuration and pluggability.
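As a sketch of the pluggability property, the snippet below adds a user's own component to a vendor's build manifest without modifying the vendor's copy. The dict layout loosely mirrors the builder manifest (component sections holding `gomod` entries); the function name, module paths, and versions are placeholders invented for illustration.

```python
import copy

COMPONENT_KINDS = ("receivers", "processors", "exporters", "connectors", "extensions")


def with_extra_component(manifest, kind, gomod):
    """Return a copy of `manifest` with one more component; the original is untouched."""
    if kind not in COMPONENT_KINDS:
        raise ValueError(f"unknown component kind: {kind}")
    extended = copy.deepcopy(manifest)
    entries = extended.setdefault(kind, [])
    # Skip the append if the same module is already listed.
    if not any(entry.get("gomod") == gomod for entry in entries):
        entries.append({"gomod": gomod})
    return extended


# Hypothetical vendor manifest; module paths and versions are placeholders.
vendor_manifest = {
    "dist": {"name": "vendor-collector"},
    "receivers": [{"gomod": "example.com/vendor/receiver v0.0.0"}],
    "exporters": [{"gomod": "example.com/vendor/exporter v0.0.0"}],
}

mine = with_extra_component(
    vendor_manifest, "processors", "example.com/me/myprocessor v0.0.0"
)
print(mine["processors"])
print("processors" in vendor_manifest)
```

If a distro can accept an extended manifest like `mine` in whatever build tool it ships, users keep the vendor's features and their own plugins, with no fork required.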

Looking at the landscape of software that has already been acknowledged as Collector distros, I believe that this definition would apply. ADOT would fall under this definition; even though it has a different main file and lives in a different repo, it does not offer any additional functionality that could not be obtained by the OTel builder. The definition we are using is practical, not technical. This leaves the door open for the widest possible adoption of the Collector model (good) while avoiding a situation where these additional projects fracture the landscape from an OTel perspective (bad). It also matches my original intention when I coined the term OpenTelemetry Distro way back when, which is important as it doesn't pull the rug out from under anyone.

For any distro that currently cannot be built with additional plugins, my intuition tells me that adding a build tool would not be an unreasonable burden. @mattdurham would this be the case with Alloy? Again, I don't mean that Alloy would need to use the OTel builder, just that Grafana would provide an Alloy builder that accepts additional Collector plugins.

This issue is kind of a blizzard, so I'd like to discuss this proposal a bit and then move it to a PR if there seems like general consensus that this is the right direction. I also want to open a separate issue to focus on Otel Collector support, as that is clearly a separate topic at this point. There is other discussion being raised here and that's fine, I just want to keep things a little organized so that we can move forwards without multiple important conversations happening in the same thread. :)

tedsuo commented 4 months ago

Okay, support discussion has been moved here #10561

mattdurham commented 4 months ago

It's possible, though it's hard to say how difficult. Purposefully ignoring code level issue discussions, like go.mod issues. Would the collector configuration only include items within the yaml file or include items that exist outside the yaml?

Do you foresee that there would be backwards compatibility promises around the Collector Distribution? If the expectation is that users can switch between distributions, then such promises feel like a requirement for making that switch easy.

From my perspective the main drive would be to talk OpenTelemetry protocols and accept Collector configuration. Allowing users to switch with minimal burden. Maybe this is not a distribution but some sort of lesser label.

jpkrohling commented 4 months ago

Sorry for only coming back to this now, I wanted to process all the comments from here and see if we had further comments from the community.

I think it's worth pausing here to see if we have clarity on the problem we want to solve (thanks, @yurishkuro!) before proceeding. It's clear we have a problem to solve, but which one?

Based on the thread so far, I believe these are the things we identified as problems to solve:

During today's SIG Collector call, not everyone agreed that everything on the list was a problem, or a problem worth solving. @atoulme suggested that we select one problem and solve that first. In that spirit, I'm splitting each problem on the list above into its own issue. The suggestion was to start with the "lock-in" one (the last one), but I think we can work on them separately. I'm changing this issue to be a tracker for them, and I encourage everyone to open issues for problems that I missed. My hope is that working on the individual issues will make it clear to us what the solution could be. For instance, it might be the case that conformance tests solve a couple of those problems.

I also wanted to address some of the points and questions from the previous comments:


@codefromthecrypt:

> can you suggest what names these products would use instead?

Looking at the current distributions, I think I would be OK with seeing a distribution called "Liatrio Collector" (just to pick a random one from this thread). Just not "Liatrio OpenTelemetry Collector". The marketing materials could then have something like, "Liatrio Collector is [compatible|certified|...] with OpenTelemetry Collector". My very personal opinion is that they can also be creative and get product names for their products, making the same "compatible|certified" comment later on. Of course, if we decide to provide the compatibility framework.

> While I like the idea in concept of a universal binary

I don't think the universal binary is something we should aim for, also because of the reasons you've given. I believe that if we had better tooling, we'd have more users building their collectors. One idea I had in the past and that I never implemented was to have something like start.spring.io, where users could select which components they want, and they would end up with a manifest (or, why not, binaries/containers).

For most of the same reasons, I see us splitting contrib (the repo) further in the future. And I hope we can deprecate contrib (the distribution) in the future as well.

> if somehow we become unfriendly towards extension

I'm a firm believer that OTel is also about choice: I don't think we should ever be unfriendly towards a healthy ecosystem of downstream projects and products, most of which are built by the people working upstream as well.

> Basically if you have a conformance suite of some kind then we don't need to punish folks for using the very frameworks we suggest

This is what we might end up with; it's the point we all keep returning to.

@tedsuo:

> mix and match plugins and features

Can you clarify what you mean by those in OTel Collector terminology? Are those what we call "components" (receivers, processors, connectors, exporters, extensions)?

> Must accept a collector config

We'd have to define which components are part of this config, which means we are defining a minimum set of components that have to be part of a distribution. Or do you mean just the skeleton of the config file, without components being specified?
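One way to read "just the skeleton" is a structural check that does not care which components exist, only that every component a pipeline uses is defined in the matching top-level section. The sketch below assumes the config is already parsed into a dict; the function name is hypothetical and this is not how the Collector itself validates configs.

```python
def undefined_references(config):
    """List (pipeline, kind, component_id) triples used in a pipeline but never defined."""
    connectors = set(config.get("connectors") or {})
    problems = []
    pipelines = (config.get("service") or {}).get("pipelines") or {}
    for pipeline_name, pipeline in pipelines.items():
        for kind in ("receivers", "processors", "exporters"):
            defined = set(config.get(kind) or {})
            if kind != "processors":
                # Connectors may sit at either end of a pipeline.
                defined |= connectors
            for component_id in pipeline.get(kind) or []:
                if component_id not in defined:
                    problems.append((pipeline_name, kind, component_id))
    return problems


config = {
    "receivers": {"otlp": {}},
    "exporters": {"debug": {}},
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["batch"],  # used but not defined above
                "exporters": ["debug"],
            }
        }
    },
}

print(undefined_references(config))
```

A check like this accepts any component set, so it would not by itself imply a minimum list of components.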

> Must be buildable to include additional Collector plugins, without forking

That's the ADOT distro + Honeycomb component that was used as an example during the GC call, right? It means that ADOT would have to provide a tool for users to include Honeycomb's component there, without requiring users to fork ADOT. Is that the spirit of the proposal?

@mattdurham:

> Would the collector configuration only include items within the yaml file or include items that exist outside the yaml?

I might be missing a nuance to your question, but yes: the Collector is aware only of things that are defined or used in the config. The only exception could be the config providers (for things like "value: ${env:SOME_ENV}")
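For readers unfamiliar with the `${env:...}` form, here is a simplified illustration of that kind of substitution. It is not the Collector's implementation: the regex, the function name, and the choice to raise on an unset variable are all assumptions made for the sketch.

```python
import os
import re

# Matches ${env:NAME}; a simplified pattern, assumed for illustration.
_ENV_REF = re.compile(r"\$\{env:([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, environ=None):
    """Replace each ${env:NAME} in `value` with the value of NAME."""
    environ = os.environ if environ is None else environ

    def replace(match):
        name = match.group(1)
        if name not in environ:
            # Assumption: fail loudly rather than substitute an empty string.
            raise KeyError(f"environment variable {name} is not set")
        return environ[name]

    return _ENV_REF.sub(replace, value)


print(expand_env("value: ${env:SOME_ENV}", {"SOME_ENV": "localhost:4317"}))
```

The point for the question above is that the value lives outside the YAML file, but the reference to it is still declared inside the config.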

> Do you foresee that there would be backwards compatibility promises around the Collector Distribution?

We have some already, both for users and for downstream distributions. I expect this to get even stronger with v1.

adrielp commented 4 months ago

As someone who has built, currently maintains, and encourages clients to do the same with their own distros, I greatly appreciate the utility of OCB. When I started that endeavour, I wanted to keep my distros as close to the source in structure as possible, to make maintenance easy.

I think it'd be amazing if there was an official OpenTelemetry GitHub Template Repository (or something similar) that included all the GitHub Action Workflows, Makefile, directory structure, manifest, Dockerfile, etc. for folks to build a distro that has a repo structure matching core & contrib (helps with consistency and understanding).

In this case the distro would be easily configurable (OCB manifest), easily understandable in terms of what components are there (OCB manifest), easily update-able through automated dep manager, and I'm sure more.

The byproduct is that it would make it easy for end-users to build OTEL distros that match the "what is a distro" opinion.

A good platform makes it easy to do things the preferred way, but leaves room for other ways too.

I've looked at several distros & the way they're built and there are some out there that are hard to grok. Not easy for the end-user. And even harder for an end-user to build their own distro using components that may be exclusively in the vendor's "distro." That leads to vendor lock-in, i.e. "Oh, I'm in this Cloud platform and to get my telemetry I really have to use their distro and not my own or others."

Additionally, we could provide a template for generating all the boiler-plate for components. My coworker and I have already done this very thing for creating new scraper based receivers. We dubbed the CLI utility compgen.

Hopefully I didn't detract from the core of this thread too much. Feel free to say it's a bad idea 😊

jpkrohling commented 4 months ago

Agree on your two proposals, @adrielp. There was an attempt a couple of Outreachy mentorships ago to develop a component creator, but I don't know what its current state is, unfortunately: https://github.com/Chinwendu20/otel_components_generator . The biggest challenge is that it would be one more thing for Collector maintainers to take care of.

About the template repository, I'm in favor of having it, as it would be a separate repo anyway, and we can have different code owners/maintainers for that (you could be one of those).

Could you open new issues to track those two proposals?

adrielp commented 4 months ago

Done @jpkrohling - issues #10681 and #10682 have been created!

codeboten commented 3 days ago

I'd like to move forward with closing this issue. After many discussions at KubeCon with various OTel collaborators, I'll try to capture here what was agreed upon as "What is a Collector".

The goal here is to document expectations for both end users and for vendors.

A Collector is a mechanism that:

A Collector MUST allow users to bring their own components, to ensure no vendor lock-in can occur.

A Distribution is a package that is produced by utilizing open source tooling maintained by the OpenTelemetry project and contains any combination of components.

An OpenTelemetry Collector is a distribution of a Collector that is developed, distributed, and maintained by OpenTelemetry contributors and maintainers.

I will propose including the above definition in the spec, and close this issue in 1 week if there's no objection. We can always further discuss the definition details in the spec PR

mattdurham commented 3 days ago

Functionally today does that mean that a Distribution must be built with the OpenTelemetry Collector Builder? Under the clause MUST be produced using configuration or a manifest that is compatible with open source tooling developed, distributed, and maintained by the OpenTelemetry project

codeboten commented 3 days ago

> Functionally today does that mean that a Distribution must be built with the OpenTelemetry Collector Builder? Under the clause MUST be produced using configuration or a manifest that is compatible with open source tooling developed, distributed, and maintained by the OpenTelemetry project

It doesn't have to be packaged with OCB, but it must provide a mechanism that is compatible with OCB