Closed srikanthccv closed 3 years ago
OTLP/HTTP is slightly more CPU efficient than OTLP/gRCP so it may be preferable. However, it may be that most SDKs implement OTLP/gRCP already, but not OTLP/HTTP so making OTLP/HTTP may not work.
This likely needs a discussion in Maintainers meeting. Adding to the next week's agenda.
From the maintainers call: general agreement that gRPC should be the default on each artifact, unless the HTTP implementation is substantially better on that artifact. Most maintainers in the call were already following this pattern.
Here is some informal performance data for Go implementation (benchmarks an OTLP client+server):
32000 nano batches, 1 spans per batch, 10 attrs per span
OTLP/HTTP1.1/1 32000 spans| 6.2 cpusec| 4.7 wallsec| 5144.7 batches/cpusec| 6832.8 batches/wallsec|194.4 cpuμs/span
OTLP/GRPC-Unary/Sequential 32000 spans| 13.2 cpusec| 6.0 wallsec| 2418.7 batches/cpusec| 5370.4 batches/wallsec|413.4 cpuμs/span
8000 tiny batches, 10 spans per batch, 10 attrs per span
OTLP/HTTP1.1/1 80000 spans| 4.4 cpusec| 3.8 wallsec| 1830.7 batches/cpusec| 2097.9 batches/wallsec| 54.6 cpuμs/span
OTLP/GRPC-Unary/Sequential 80000 spans| 6.0 cpusec| 4.1 wallsec| 1322.3 batches/cpusec| 1932.3 batches/wallsec| 75.6 cpuμs/span
4000 small batches, 100 spans per batch, 10 attrs per span
OTLP/HTTP1.1/1 400000 spans| 16.6 cpusec| 16.0 wallsec| 241.5 batches/cpusec| 250.5 batches/wallsec| 41.4 cpuμs/span
OTLP/GRPC-Unary/Sequential 400000 spans| 17.5 cpusec| 15.4 wallsec| 228.2 batches/cpusec| 259.3 batches/wallsec| 43.8 cpuμs/span
400 large batches, 500 spans per batch, 10 attrs per span
OTLP/HTTP1.1/1 200000 spans| 7.2 cpusec| 7.1 wallsec| 55.2 batches/cpusec| 56.5 batches/wallsec| 36.2 cpuμs/span
OTLP/GRPC-Unary/Sequential 200000 spans| 7.3 cpusec| 6.9 wallsec| 55.1 batches/cpusec| 57.7 batches/wallsec| 36.3 cpuμs/span
50 very large batches, 1000 spans per batch, 10 attrs per span
OTLP/HTTP1.1/1 50000 spans| 1.8 cpusec| 1.7 wallsec| 27.9 batches/cpusec| 28.6 batches/wallsec| 35.8 cpuμs/span
OTLP/GRPC-Unary/Sequential 50000 spans| 1.8 cpusec| 1.7 wallsec| 28.6 batches/cpusec| 29.4 batches/wallsec| 35.0 cpuμs/span
OTLP/HTTP is about 2x more CPU efficient for batches that contain a single span. For larger batches (>=100 spans) the difference between OTLP/HTTP and OTLP/gRPC is insignificant.
I am fine with making/keeping gRPC the default given that it is already what we do unless someone feels very strong about the performance impact for small batches (and keep in mind that in other languages the benchmarks may show a different result, this is just for Go).
It seems like supporting OTLP/gRPC in Ruby is not straightfoward. @fbogsany please comment here.
I am personally slightly more favour of OTLP/HTTP but it seems like other languages preferred gRPC.
If we believe it is more complicated (and undesirable) to make OTLP/gRPC the default for Ruby then we either make OTLP/HTTP the default and required for al languages (I would prefer this) or allow each language to choose a default (this is a bit less desirable since we loose uniformness and require receivers to support both transports to be able to receive from variety of Otel SDKs).
Copied from https://github.com/open-telemetry/opentelemetry-specification/pull/1969#issuecomment-928100335
gRPC is less commonly used in the Ruby ecosystem than in other languages and Ruby appears to be a lower priority than other languages for gRPC maintainers. This means that we have much less experience with gRPC in the Ruby SIG and little demand for it so far. Battle scars from spectacular Ruby + gRPC production failures also make some of us quite wary of the support burden.
We have had considerable success with OTLP/HTTP in OpenTelemetry Ruby. Is there a strong reason to force gRPC as a default transport rather than leaving the default up to individual SIGs? Similarly, is there a strong reason to require gRPC as a transport? Both of these are challenging for the Ruby SIG.
From the maintainers call: general agreement that gRPC should be the default on each artifact, unless the HTTP implementation is substantially better on that artifact. Most maintainers in the call were already following this pattern.
@mtwo (and anybody else who was in the Maintainers call) can you please list the arguments that the Maintainers had in favour of OTLP/gRPC as the default?
There is evidence that it is difficult to implement (Ruby) and performs worse that OTLP/HTTP (see my comment).
From a theoretical perspective, gRPC needs a HTTP/2 library underneath, so technically you should be able to implement OTLP/HTTP at least everywhere where you can implement OTLP/gRPC. The opposite is not true, since there might not be a (good) gRPC implementation for your target language/framework/version.
Is there any reason why gRPC should be preferred over HTTP as a minimal requirement?
From the compliance matrix, it seems that C++, Rust, Swift and .NET have implemented gRPC but not HTTP. For .NET there is a open PR already (open-telemetry/opentelemetry-dotnet#2316). I don't know about the other languages. I don't expect that implementing the HTTP export would be an obstacle though, it's probably a matter of priority why only gRPC is implemented. In doubt we should ask the maintainers of these languages though.
(PHP also has not implemented OTLP/HTTP, but it also doesn't have OTLP/gRPC).
Taking a quick look at Rust, it actually seems to be implemented (for traces only): https://github.com/open-telemetry/opentelemetry-rust/blob/main/opentelemetry-otlp/src/exporter/http.rs and https://github.com/open-telemetry/opentelemetry-rust/blob/e7ecf58e0b0d1eeff213e458e47e2238575826b4/opentelemetry-otlp/src/lib.rs#L253-L261 (CC @TommyCpp @open-telemetry/rust-approvers please update the compliance matrix if that is correct)
C++ also seems to support OTLP/HTTP actually: https://github.com/open-telemetry/opentelemetry-cpp/tree/main/exporters/otlp#configuration-options--otlp-http-exporter- (both JSON and binary according to that documentation) (CC @owent @lalitb @reyang)
That leaves only swift as a language with gRPC support but no HTTP support. On the other hand we have Ruby where we know that supporting gRPC would not be easy while HTTP is supported.
We should also be clear about OTLP/HTTP+protobuf (what I think we are talking about) vs OTLP/HTTP+JSON (what some seem to view as default for OTLP/HTTP)
As far as I remember, we only supported OTLP/HTTP+protobuf but not OTLP/HTTP+JSON. Will update spec accordingly.
For .NET there is a open PR already (open-telemetry/opentelemetry-dotnet#2316).
For .NET, we'd be keen on defaulting to gRPC simply because it would be a breaking change to change it to OTLP/HTTP once this PR lands.
general agreement that gRPC should be the default on each artifact, unless the HTTP implementation is substantially better on that artifact.
If the spec could be flexible and recommend gRPC as default but allow for HTTP as default when it makes sense, I think this would be best. I'm not a JS expert, but my understanding is that gRPC is not a thing in browsers, so this may be another use case where HTTP is preferred (or required).
From a theoretical perspective, gRPC needs a HTTP/2 library underneath, so technically you should be able to implement OTLP/HTTP at least everywhere where you can implement OTLP/gRPC. The opposite is not true, since there might not be a (good) gRPC implementation for your target language/framework/version.
+1. Up until recently if I remember correctly AWS load balancers did not even let gRPC connections through (I think they do now), so there may be real production cases when you simply cannot use gRPC.
Is there any reason why gRPC should be preferred over HTTP as a minimal requirement?
The only reason I can think of is that more SDKs today implement gRPC than HTTP, so it appears making HTTP is more work for Otel maintainers. It does not seem to be a huge deal though, if you have a working OTLP/gRPC exporter, adding OTLP/HTTP should not be very complicated.
We should also be clear about OTLP/HTTP+protobuf (what I think we are talking about) vs OTLP/HTTP+JSON (what some seem to view as default for OTLP/HTTP)
The spec currently strongly hints that OTLP/HTTP+JSON is optional:
SDKs MUST support either grpc or http/protobuf and SHOULD support both. They also MAY support http/json.
From the compliance matrix, it seems that C++, Rust, Swift and .NET have implemented gRPC but not HTTP. For .NET there is a open PR already (open-telemetry/opentelemetry-dotnet#2316). ~I don't know about the other languages.~ I don't expect that implementing the HTTP export would be an obstacle though, it's probably a matter of priority why only gRPC is implemented. In doubt we should ask the maintainers of these languages though. (PHP also has not implemented OTLP/HTTP, but it also doesn't have OTLP/gRPC).
Taking a quick look at Rust, it actually seems to be implemented (for traces only): https://github.com/open-telemetry/opentelemetry-rust/blob/main/opentelemetry-otlp/src/exporter/http.rs and https://github.com/open-telemetry/opentelemetry-rust/blob/e7ecf58e0b0d1eeff213e458e47e2238575826b4/opentelemetry-otlp/src/lib.rs#L253-L261 (CC @TommyCpp @open-telemetry/rust-approvers please update the compliance matrix if that is correct)
C++ also seems to support OTLP/HTTP actually: https://github.com/open-telemetry/opentelemetry-cpp/tree/main/exporters/otlp#configuration-options--otlp-http-exporter- (both JSON and binary according to that documentation) (CC @owent @lalitb @reyang)
That leaves only swift as a language with gRPC support but no HTTP support. On the other hand we have Ruby where we know that supporting gRPC would not be easy while HTTP is supported.
@Oberon00 very useful research, thank you. I think this reinforced that overall adding support for OTLP/HTTP where it is missing is not a big effort, while OTLP/gRPC where it is missing looks problematic.
Given this new evidence I now am now more strongly incline towards making OTLP/HTTP the required protocol. Let's hear some strong arguments against if there are any.
If the spec could be flexible and recommend gRPC as default but allow for HTTP as default when it makes sense, I think this would be best. I'm not a JS expert, but my understanding is that gRPC is not a thing in browsers, so this may be another use case where HTTP is preferred (or required).
We could make OTLP/HTTP support a SHOULD
clause, so strongly recommended but still allowed to deviate from. The only problem with this approach is that it makes impossible for OTLP receivers to only support OTLP/HTTP and claim full OTLP compatibility. Perhaps not a big deal, but I would prefer to simply the life of receivers if possible by doing a bit more work on our end.
Java's been released for several months now and has always used gRPC as the default for the env variable - it would be extremely disruptive to users to change the default at this point I think. This issue doesn't seem all that different from #1923 and I suspect allowing SIGs to have their default is going to be the most reasonable, it's one of those values that needed to be set early on or never set.
Making OTLP/HTTP the required protocol seems ok though to me - I agree that every language should be able to support HTTP easily if they can support gRPC, while otherwise may not even be able to support gRPC at all. It just leaves the somewhat awkward situation where languages like Java would be defaulting to the non-required protocol. It's not terrible though IMO.
@anuraaga so just to clarify, you suggest that:
I think this is OK, just a bit more work on the receivers side, but otherwise not a big problem.
OTLP/HTTP/JSON (do we allow OTLP/HTTP/JSON to be default since it is not stable yet?). This necessitates receivers to support all 3 if they want to be fully OTLP compatible.
Everything aligns with me except this part - it depends on if JS in browser absolutely requires JSON. My personal experience with grpc-web is binary protobuf (not grpc proper) in browsers is OK - I suspect that browsers that did poorly with binary payloads are not supported by the vast majority of sites now, and only search engines, being core infrastructure, would still target such old ones. I don't expect such code to actually migrate to use OTel in practice, they generally delegate old browsers to old heirloomed code.
If this is not true based on the JS SIG's experience, then that's that since they know best - for a receiver to be fully compatible even with browsers, it would need to support JSON. I think it is worth confirming this is really true in 2021. And if it is, special language for just the browser - so grpc and http are enough to "support OpenTelemetry" but JSON may be needed if that receiver specifically supports browsers - seems appropriate. Forcing support for web browsers on any backend implementing OTel seems like an overreach.
One last note - if we do require receivers to support JSON, please please please require parsing to support reading trace IDs either as hex or base64. Protobuf-json doesn't support hex so it's irresponsible to require implementing the entire parsing logic by hand just because of this, especially given SDKs won't implement this since they are only concerned with export.
For reference, this is how we "parse" for SDK unit tests
Edit: Sorry after rereading my comment I realized it's not "require parsing to support reading trace IDs either as hex or base64" because that is already too hard, the hex path needs to be implemented. IMO it is either 1) Only support base64, just like proto or 2) require SDKs to provide parsing logic so backends can use it and not struggle so much on this deviation from proto3.
Forcing support for web browsers on any backend implementing OTel seems like an overreach.
On second thought I'm not so confident in this. It does seem a little excessive, but in the proposed vision doc Compatibility
is one of the engineering values - so forcing support for JSON (assuming it is required for browsers) does seem correct.
One last note - if we do require receivers to support JSON, please please please require parsing to support reading trace IDs either as hex or base64. Protobuf-json doesn't support hex so it's irresponsible to require implementing the entire parsing logic by hand just because of this, especially given SDKs won't implement this since they are only concerned with export.
I was able to override just the parsing to trace ids/span id and let Protobuf's default implementation parse the rest: https://github.com/open-telemetry/opentelemetry-collector/blob/afad36e3785b78906f6eb5f9e91354a9e05a274c/receiver/otlpreceiver/otlp_test.go#L76 At least in Go this is possible, not sure if Protobuf libraries for other languages allow this.
I added this to next Maintainers' meeting agenda. I am not sure I will be able to join, but this needs to be discussed. So far I believe we are inclined to say that each language is free to choose their own default (we can also in addition recommend that OTLP/HTTP/Protobuf is the default).
Just checked: AWS still does not fully support gRPC for all load balancer types: https://aws.amazon.com/elasticloadbalancing/features/ It works with ALB, but ~does not work with NLB~ (UPDATE: see comment below) or other load balancer types.
@tigrannajaryan NLB is just a TCP/IP load balancer so it is not tied to an application protocol like gRPC. Actually before ALB supported gRPC, NLB was the only way to implement a gRPC API with a load balancer on AWS. It had the limitation of not being able to integrate things like TLS, so the support in ALB made it much nicer :) But I would consider this to be full support for gRPC on the cloud side.
@tigrannajaryan NLB is just a TCP/IP load balancer so it is not tied to an application protocol like gRPC. Actually before ALB supported gRPC, NLB was the only way to implement a gRPC API with a load balancer on AWS. It had the limitation of not being able to integrate things like TLS, so the support in ALB made it much nicer :) But I would consider this to be full support for gRPC on the cloud side.
All right, I guess I was misled by the gRPC
label present in the ALB column and missing in the rest of the columns in the AWS doc I linked to.
@tigrannajaryan It seems the "Layer 7" section header type of thing is the scapegoat for that :) I think it basically means "Classic Load Balancer in L7 mode" doesn't support gRPC. FWIU, it's EoL so probably not needed for OTLP https://aws.amazon.com/blogs/aws/ec2-classic-is-retiring-heres-how-to-prepare/
Discussed again in Maintainers meeting. Raw meeting notes:
Feedback from Bob: +1, gRPC has great implementations in C++, Go, Python, and Java, but the other language clients have usability and performance challenges. HTTP is ubiquitous.
Ted: we should still ensure that languages support both, even if they have a preference for HTTP or gRPC!
Proposal: maintainers / the TC no longer have a preference between gRPC or HTTP being the default, but OpenTelmetry components should support both.
No objections in the group
To summarize this is what we want to do:
I am going to rewrite this PR according to the plan above. If there are any objections to the plan please speak up.
What are you trying to achieve?
Exporter selection https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/sdk-environment-variables.md#exporter-selection part of spec has default for jaeger and zipkin but there is no default for otlp. It is unclear whether to use
otlp/grpc
orotlp/http
for the client libraries that have support for both.