open-telemetry / community

OpenTelemetry community content
https://opentelemetry.io
Apache License 2.0

Advocate with cloud providers for pure OTLP endpoint implementations #984

Open atrauzzi opened 2 years ago

atrauzzi commented 2 years ago

I'm not sure if this is the best place to at least offer a starting point for my concern, but at least wherever things end up, it'll be tracked here so that it's searchable...

I'm in the process of getting my company established on Google Cloud. One sticking point which I also encountered while on Azure is that all the major cloud providers seem to be under the impression that in order to support OpenTelemetry, they all have to provide a vendor-specific exporter library.

My understanding is that part of the whole point of OpenTelemetry is not just to offer a consistent and agnostic API surface area, but to also offer the OTLP wire protocol for traces to be exported.
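To make the "wire protocol" point concrete, here is a minimal sketch of what a one-span trace export looks like in OTLP's JSON encoding, built with only the Python standard library. It is an illustration, not the SDK's implementation: the function name, the scope name `example.manual`, and the 5 ms duration are all made up for the example. In OTLP/HTTP, a body like this is sent as a POST to the endpoint's `/v1/traces` path with `Content-Type: application/json`; real applications would leave this to the standard OTLP exporter.

```python
import json
import secrets
import time


def build_otlp_trace_payload(service_name: str, span_name: str) -> dict:
    """Build a one-span trace export request body in OTLP's JSON encoding.

    Hypothetical helper for illustration only.
    """
    end_ns = time.time_ns()
    start_ns = end_ns - 5_000_000  # pretend the span took 5 ms
    return {
        "resourceSpans": [
            {
                "resource": {
                    "attributes": [
                        {"key": "service.name", "value": {"stringValue": service_name}}
                    ]
                },
                "scopeSpans": [
                    {
                        "scope": {"name": "example.manual"},  # hypothetical scope name
                        "spans": [
                            {
                                "traceId": secrets.token_hex(16),  # 16 random bytes as hex
                                "spanId": secrets.token_hex(8),    # 8 random bytes as hex
                                "name": span_name,
                                "kind": 1,  # SPAN_KIND_INTERNAL
                                # 64-bit integers are carried as decimal strings here
                                "startTimeUnixNano": str(start_ns),
                                "endTimeUnixNano": str(end_ns),
                            }
                        ],
                    }
                ],
            }
        ]
    }


# Serialized request body; nothing is sent over the network in this sketch.
body = json.dumps(build_otlp_trace_payload("checkout", "charge-card"))
```

The point of the sketch is that nothing in the payload is vendor-specific: any backend that accepts this format at a known URL can ingest it without a custom exporter in the application.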

I feel like it's a glaring failure of advocacy and communication from OpenTelemetry (as an overall initiative) that all vendors seem to be requiring application developers to modify their source code to get support for tracing.

Is there no way that OpenTelemetry can reach out to all its senior engineering contacts at the various cloud providers which I know participate in the project and help guide them to establish well-known OTLP endpoint conventions on all their compute resources for applications to export to?


For example: If I'm running a C# application on Google Cloud Run, I should be able to instrument my application using the standard vendor-agnostic OpenTelemetry libraries, while also configuring my application to use the vendor-agnostic OTLP exporter. When configuring my application to run on Google Cloud Run, I should be able to simply update the endpoint URI that my application exports its telemetry data to, at which point all of Google's cloud monitoring infra would begin receiving trace data from my application.

In this scenario, Google would not be responsible for providing ecosystem-specific exporters, but instead able to focus on integrating at an infrastructure level by exposing well known telemetry endpoints on their various offerings.
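As a rough sketch of what "simply update the endpoint URI" means in practice, the snippet below resolves the traces URL for OTLP/HTTP from the standard exporter environment variables `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` and `OTEL_EXPORTER_OTLP_ENDPOINT`. It only approximates the resolution rules and is not taken from any SDK; the function name is hypothetical, and the `example.com` hostnames in the usage lines are placeholders, not real provider endpoints.

```python
import os

# Default OTLP/HTTP endpoint: a Collector (or agent) on the local machine.
DEFAULT_OTLP_HTTP_ENDPOINT = "http://localhost:4318"


def resolve_traces_url(env=None) -> str:
    """Return the URL that trace exports would be POSTed to (hypothetical helper)."""
    env = os.environ if env is None else env

    # A signal-specific endpoint is used exactly as given.
    specific = env.get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT")
    if specific:
        return specific

    # Otherwise the generic endpoint is a base URL and the signal path is appended.
    base = env.get("OTEL_EXPORTER_OTLP_ENDPOINT") or DEFAULT_OTLP_HTTP_ENDPOINT
    return base.rstrip("/") + "/v1/traces"


# Moving between platforms would then be a deployment setting, not a code change.
local_url = resolve_traces_url({})
hosted_url = resolve_traces_url(
    {"OTEL_EXPORTER_OTLP_ENDPOINT": "https://telemetry.example.com/"}
)
```

With a well-known provider endpoint, the only difference between `local_url` and `hosted_url` is one environment variable set at deploy time.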


Instead, what we currently have is that applications have to install vendor-specific exporters which means that if you don't control the source code for an application, you're basically left without an option for getting telemetry out. If your platform isn't directly supported yet by the vendor, then you also have no recourse and simply cannot adopt OpenTelemetry.

Overall, this ticket is to raise concern that OpenTelemetry also has some obligation to give suggested guidance not just on how developers can operate with its own deliverables, but how service providers are to capitalize on the standard. Not just saving developers time and frustration, but also saving themselves the effort of having to maintain libraries for each possible combination of services and ecosystems!

tigrannajaryan commented 2 years ago

Instead, what we currently have is that applications have to install vendor-specific exporters which means that if you don't control the source code for an application, you're basically left without an option for getting telemetry out.

This is not entirely true. You can send from OpenTelemetry SDK to the Collector using OTLP and from the Collector to the vendor using vendor-specific exporters that we have in the Collector. The typical configuration is an OTel-instrumented application sending via OTLP to a Collector running on localhost, which is reflected in the default SDK settings.

I do agree with the general sentiment that we can do better in promoting OTLP.

Keep in mind that until recently only the traces portion of OTLP was stable; metrics became stable recently, and logs are not stable yet. This has made vendors somewhat reluctant to add OTLP support. Nevertheless, metrics are stable now and logs are nearing stability, so we are now better positioned to market OTLP more widely.

atrauzzi commented 2 years ago

You can send from OpenTelemetry SDK to the Collector using OTLP and from the Collector to the vendor...

I'm aware of this scenario but didn't specifically mention it, as it would be impractical in my Cloud Run scenario (or any other PaaS-like scenario).

I'd also offer - warmly :heart: - that such a suggestion really doesn't do the cause any justice as it allows cloud vendors to push compounding complexity and maintenance burden on people who simply want to be able to extract the benefit out of a standard. Like, we really mustn't make light of "running a collector". That's basically a VM or sidecar in a deployment that may not even be able to support a VM or sidecar.

Otherwise, yeah... I just think that effort needs to hit hard and hit fast because vendors are entrenching around the worst-case scenario I described. Even so far as having their support channels regurgitating it as convention. Which also makes it very difficult to communicate the desire to see the proper approaches taken.

tigrannajaryan commented 2 years ago

I agree with you. We should recommend OTLP on all hops, with or without Collector, and at the vendors' ingest endpoints.

As an example, I think we need to highlight OTLP in the docs at https://opentelemetry.io/docs/. We don't really do it justice: there are just a couple of scattered references to OTLP exporters on some pages.

dyladan commented 2 years ago

Maybe a list on the website of tracing backends known to support native OTLP helps this issue? I would suggest something similar to the openmetrics compliance program but I think that might be a lot of work and a simple list is a quick win. I know some vendors at least already have it and many would jump at the chance to be listed on the website in an official capacity of some kind.

atrauzzi commented 2 years ago

It would be nice to see some proactive outreach as part of whatever else is done. If only because vendors are already running with the ball in the wrong direction. A direction which of course also implies lock-in :cry:

I'd also say it's a must for compliance. Like, this is exactly the thing that people are adopting OTEL for...

dyladan commented 2 years ago

If only because vendors are already running with the ball in the wrong direction.

At least the vendors that contribute to otel seem very willing to enable OTLP ingest. There are several vendors who already do this. I don't want to list them for fear of missing someone and making them sad but they're out there.

jsuereth commented 2 years ago

@atrauzzi Regarding your comment on Google Cloud OTLP support, I can confirm the message is heard.

atrauzzi commented 2 years ago

@jsuereth - That's really awesome to hear. I know timelines are hard to provide so let's say I'm not looking for one, but is there a good place to track progress or even make myself available for discussions and testing? Basically would just like to keep up and offer input.

Again, we're using .NET 6 and OTEL, and I'd really like to be able to send all that data into monitoring. Right now it seems like Honeycomb is the only vendor that got it right :tm:

sharrmander commented 2 years ago

Hi @atrauzzi -

It's great to see end-user feedback directly so thank you for your engagement!

I made an assumption when you said 'cloud providers' that you meant the big three cloud platforms - so I didn't think your statement applied to observability-focused vendors like Honeycomb. I know a lot of us are strongly opinionated that pure/native OTLP ingest is the best path forward for the industry (and end-users!) and to that, there's a growing number of backend vendors that natively accept telemetry via OTLP. I can't vouch for all these experiences, but here's a centralized list of vendors offering native OTLP support, no collector needed :)

I'm curious about what you mean when you say:

Honeycomb is the only vendor that got it right™️

What about your experiences with Honeycomb gave you such a positive impression?

atrauzzi commented 2 years ago

@sharrmander -- No problem, engagement is something I do, I'm sure it's both refreshing to some and frustrating to others. :sweat_smile:

Thank you for the list! I was not aware that it existed. Generally I do mean the big vendors, although my specific experience is with Azure and GCP. Both of which managed to get this wrong in exactly the same way.

As for Honeycomb, I've never actually used the product, though I've seen a lot of Charity's advocacy. If a cloud vendor has proper OTEL support on the roadmap, I wouldn't really be able to justify the extra spend for the egress and service for something like Honeycomb. But that said, it's clear to me that it's a good product for those who can make use of it. Particularly because the company gets out in front of everything and seems to have a well-honed and progressive technical instinct. Again, that's all just by its public persona and community presence, not as anyone who has used it...ever. :laughing:

atrauzzi commented 2 years ago

On a separate note, I think it's disingenuous to use the phrase "native support" on this page for anything less than "Native OTLP". That is to say, Azure and AWS should not be on that list because they simply cannot be considered as having an offering that's at all desirable.

As a developer, their OpenTelemetry stories are full of landmines if the support isn't as dead-simple as "Configure this endpoint for your OTLP exporter in your application, wipe hands on pants."

sharrmander commented 2 years ago

Thanks for clarifying @atrauzzi.

I appreciate that OpenTelemetry is many things, so while the data transmission protocol is important, it is only part of the value prop of OTel. So I can get behind 'native support' on that opentelemetry.io webpage, especially for end-users who need, for example, the assurances that the upstream project has been performance tested to AWS's standards as part of their ADOT distribution.

atrauzzi commented 2 years ago

I don't know, that doesn't make sense to me. "ADOT distribution" contradicts what OpenTelemetry was supposed to be in the first place which was one way to instrument and thus one way to export.

Allowing vendors to shoehorn caveats in and then taking credit for only partial execution erases the value that everyone should rightfully assume is implied by the name "OpenTelemetry". Performance testing AWS-side should be that they performance test a setup that uses pure OTLP. I don't think any vendor should be rewarded for twisting what most assuredly will be broadly assumed when the word "supported" is used.

Letting vendors get away with any less cheapens the OpenTelemetry "brand" (for lack of better terms).

mtwo commented 2 years ago

Thanks for creating the issue @atrauzzi! @yurishkuro raised this during today's governance committee call.

We discussed the following:

atrauzzi commented 2 years ago

@mtwo Amazing! Thank you so much.

Is there any chance the vendor list can de-emphasize vendors that aren't "pure" OTEL? Perhaps have two lists, so that any vendor who wants top-billing on the page has to go all the way?

Honestly, it would send a really strong message. Bigger list at the top with a richer entry, smaller list lower down of "aspirants". Makes it clear that they have work to be done and that simply making it to the page doesn't mean "job done".

yurishkuro commented 2 years ago

I think it's fine to have one list with green checks / red crosses in the columns.

atrauzzi commented 2 years ago

Do you have any reason why "you think"? I've listed some good reasons so far that go far beyond just "I think".

Homogenizing the list won't incentivize better support because the most minimal effort will get equal recognition as a more complete effort. WHO does it serve to systematically reward that? OTEL should show some self-interest here.

yurishkuro commented 2 years ago
dyladan commented 2 years ago

We could have something like ⚠️ for "OTLP ingest may require a collector, custom exporter, or custom SDK distribution; please check vendor docs for details"

atrauzzi commented 2 years ago

It's better, but not great. I will continue to emphasize the risk of underestimating the optics.

As someone who has consumed several cloud services from different vendors, I can say that all vendors are in the business of saying "yes", even if it means being disingenuous about it. The most frustrating thing we could do to the people who are the target audience of the list (developers!) is to give vendors a way to co-opt OTEL in conveying a false impression. That also undermines the brand and reputation of OTEL itself.

Two lists is best. They can be structurally the same. Just make the second list a little smaller than the first one and put it further down.

Again, remember who these lists are for, who will be consuming them and why.

nerochiaro commented 2 years ago

Hello, I hate to be "that guy" but... the list still has no explanation of what "Native OTLP" means.

In my mind, it means they run an OTLP receiver, and support most of the features, at least for things like traces that are at a good level of maturity.

But it doesn't seem to be the case. Take for example datadog. They are listed as "native" but (as far as I can tell) you need an exporter to send them traces, and they don't seem to support relatively established stuff like span links or span events.

So, what does "native" mean, and how is that list supposed to help anyone choose a vendor?

tigrannajaryan commented 2 years ago

I believe "native OTLP" means the backend is able to receive OTLP. If datadog doesn't then it needs to be fixed in the list.

yurishkuro commented 2 years ago

@tigrannajaryan I agree with people that the current page is pretty unhelpful since it doesn't even define what the columns mean. At best it's "this vendor has 'something' related to OTEL".

We need a concrete proposal of what additional columns to add to the table, and how to go about populating those columns. Because the existing "native OTLP" is already misused, I would suggest resetting all vendors in the list to a question mark and asking them to file a PR that changes the values as needed while providing the evidence. We also need to provide a clear definition of what the ❌ ⚠️ ✅ values mean for each column.
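One way to pin down such definitions is to keep each symbol and its meaning together in a single place, so the table and its legend cannot drift apart. The sketch below is purely illustrative: the enum name and the wording of each meaning are assumptions (the ⚠️ text borrows the phrasing suggested earlier in this thread), not definitions the project has agreed on.

```python
from enum import Enum


class OtlpSupport(Enum):
    """Hypothetical support levels for the 'native OTLP' column."""

    NO = ("❌", "No OTLP ingest; a vendor-specific exporter is required")
    PARTIAL = (
        "⚠️",
        "OTLP ingest may require a collector, custom exporter, "
        "or custom SDK distribution; check vendor docs",
    )
    NATIVE = ("✅", "Vendor endpoint accepts OTLP directly from the official SDK exporter")
    UNKNOWN = ("?", "Not yet verified with evidence from the vendor")

    def __init__(self, symbol: str, meaning: str):
        # Enum passes the tuple value's items to __init__ as arguments.
        self.symbol = symbol
        self.meaning = meaning


def legend() -> str:
    """Render a legend with one line per support level, in definition order."""
    return "\n".join(f"{level.symbol} {level.meaning}" for level in OtlpSupport)
```

Printing `legend()` under the table would give readers the definitions alongside the symbols.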

yurishkuro commented 2 years ago

Concrete proposal:

Language is an important dimension for the table, so just having SDK or Distribution is not sufficient.

tigrannajaryan commented 2 years ago

  • 📦 - Distribution

So this is about an SDK distribution for the particular language, right? Vendors can also have a Collector distribution, so perhaps have a separate column for that.

tigrannajaryan commented 2 years ago

  • ✅ - OSS SDK can be used (requires native OTLP)

Maybe label this differently than "OSS". Typically vendor distributions are also open-source.

yurishkuro commented 2 years ago

Maybe label this differently than "OSS". Typically vendor distributions are also open-source.

+1 - Official OpenTelemetry SDK

yurishkuro commented 2 years ago

So this is about an SDK distribution for the particular language, right? Vendors can also have a Collector distribution, so perhaps have a separate column for that.

Perhaps do the same for collector as for SDKs, a single column with

svrnm commented 2 years ago

While I agree that https://opentelemetry.io/vendors/ needs to change, I have my issues with running & maintaining such a complex list: we can of course ask vendors to update that list once or from time to time, but eventually the burden of maintaining the list lies with the Comms SIG, which takes away bandwidth from other things we urgently need to do.

So, what does "native" mean, and how is that list supposed to help anyone choose a vendor?

I don't think it is the responsibility of the community to help end-users choose which vendor to use.

cc @open-telemetry/docs-approvers

nerochiaro commented 2 years ago

I don't think it is the responsibility of the community to help end-users choose which vendor to use.

Fair enough. But in that case, why not just delete that list? I think either it provides useful information, or it's better for it to not be there at all.

yurishkuro commented 2 years ago

Deleting the list is also a viable solution. But as was argued here earlier, the list not only benefits vendors; the project also receives value from it by showing industry adoption and steering users towards vendors supporting native OTLP. If we focus just on this aspect, we can simplify the table to have just the vendor name with a link to their own description of OTEL support, and the Native OTLP column (but clearly defined). I.e., I would remove the distro column.

tigrannajaryan commented 2 years ago

the project also receives value from it by showing industry adoption and steering users towards vendors supporting native OTLP.

This is very important for us (for Otel). Precisely for this reason we should not delete the list. I am OK with rethinking it and simplifying maintenance, but I think it needs to stay in some reasonable form.

svrnm commented 2 years ago

Keeping the list simple (native OTLP: yes, distributions: yes, ... etc) is OK with me. My worry was with the all-languages + collector table, which is an explosion of data that I am not keen to maintain.

Here's my proposal:

  1. We ask vendors to revalidate their row by DATE, as follows:
  2. Bring proof that your backend supports native OTLP, that you have a distribution, or that you require an exporter. The proof has to be a link to the vendor's docs showing OTLP support & a link to their distribution/exporter. Those links will be included in the table.
  3. Once the deadline DATE has passed, all vendors without an update will be removed until they come back with that data.
  4. If a link is broken, we will set it back to "NO" and let the vendor know that they need to update.

Additionally we will remove the "Learn More" column since the links brought as proof will have all the end-user needs to know. If we like we can add additional columns eventually, e.g. if there's a vendor-specific collector, or if there's a fork/blog/doc around using the otel demo, etc.
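The revalidation steps above can be sketched as a small routine. This is only one possible reading of the proposal: the `VendorRow` fields, the `apply_policy` name, and the injected `link_ok` checker are all hypothetical, and a real version would need an actual HTTP link check and a way to contact vendors.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Optional, Tuple


@dataclass
class VendorRow:
    """One row of the vendor table; None in a link column means "NO"."""

    name: str
    native_otlp_docs: Optional[str] = None   # evidence link for native OTLP
    exporter_link: Optional[str] = None      # vendor-specific exporter
    distribution_link: Optional[str] = None  # vendor distribution
    revalidated_on: Optional[date] = None    # when the vendor last confirmed the row


LINK_COLUMNS = ("native_otlp_docs", "exporter_link", "distribution_link")


def apply_policy(
    rows: List[VendorRow], deadline: date, link_ok: Callable[[str], bool]
) -> Tuple[List[VendorRow], List[str]]:
    """Return (rows to keep, names of vendors to notify about broken links).

    Rows are updated in place when a broken link is reset to "NO".
    """
    kept, to_notify = [], []
    for row in rows:
        # Step 3: drop vendors that did not revalidate by the deadline.
        if row.revalidated_on is None or row.revalidated_on > deadline:
            continue
        # Step 4: reset broken evidence links to "NO" and flag the vendor.
        broken = False
        for column in LINK_COLUMNS:
            url = getattr(row, column)
            if url is not None and not link_ok(url):
                setattr(row, column, None)
                broken = True
        if broken:
            to_notify.append(row.name)
        kept.append(row)
    return kept, to_notify
```

Passing the link checker in as a function keeps the policy itself testable without network access, which matters if the check is meant to run periodically with little maintainer effort.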

tigrannajaryan commented 2 years ago

you have a distribution

Collector distribution or SDK distributions?

nerochiaro commented 2 years ago

Keeping the list simple (native OTLP: yes, distributions: yes, ... etc) is OK with me. My worry was with the all-languages + collector table, which is an explosion of data that I am not keen to maintain.

I second this. Keeping it simple is a good idea. But please add a few lines about "native OTLP" meaning that the vendor supports receiving telemetry using an OTLP endpoint and not requiring a custom exporter.

Here's my proposal:

1. We ask vendors to revalidate their row by `DATE`, as follows:

2. Bring proof that your backend supports native OTLP, that you have a distribution, or that you require an exporter. The proof has to be a link to the vendor's docs showing OTLP support & a link to their distribution/exporter. Those links will be included in the table.

3. Once the deadline `DATE` has passed, all vendors without an update will be removed until they come back with that data.

4. If a link is broken, we will set it back to "NO" and let the vendor know that they need to update.

Seems perfect to me.

svrnm commented 2 years ago

So, adding your feedback, the table could look something like this:

| Name | backend with native OTLP support | vendor-specific exporter | Distribution |
| --- | --- | --- | --- |
| Vendor A | [link to docs] | NO | [link to collector distro] [link to SDK distro] |
| Vendor B | NO | [link to collector exporter] | [link to collector distro] |
| Vendor C | NO | [link to collector exporter] [link to SDK exporters] | NO |
| Vendor D | [link to docs] (only traces) | NO | NO |

(As an alternative, there could be separate columns for Collector/SDK in exporter & Distro.)

cartermp commented 2 years ago

Note that [link to SDK distro] will need to be plural, since several vendors have several SDK distributions.

nerochiaro commented 2 years ago

So, adding your feedback, the table could look something like this:

| Name | backend with native OTLP support | vendor-specific exporter | Distribution |
| --- | --- | --- | --- |
| Vendor A | [link to docs] | NO | [link to collector distro] [link to SDK distro] |
| Vendor B | NO | [link to collector exporter] | [link to collector distro] |
| Vendor C | NO | [link to collector exporter] [link to SDK exporters] | NO |
| Vendor D | [link to docs] (only traces) | NO | NO |

(As an alternative, there could be separate columns for Collector/SDK in exporter & Distro.)

Isn't "Native OTLP endpoint" less ambiguous and shorter than "backend with native OTLP support"? Other than that, seems good.

atrauzzi commented 2 years ago

Agreed, we need to be super explicit about whether users can just use a community library in their processes and send telemetry to a well-known endpoint.

Vendors are going to dance around with these concepts and it's important for this list to help people identify which vendors are playing nicely.