openzipkin / brave

Java distributed tracing implementation compatible with Zipkin backend services.
Apache License 2.0
2.36k stars 714 forks source link

Consider following OpenTelemetry's broken implementation of trace-context #1287

Open codefromthecrypt opened 3 years ago

codefromthecrypt commented 3 years ago

The W3C trace-context format has 2 headers. The primary one is tracestate which holds a durable copy of what would be in B3 single format. This is used in case the next hop doesn't report into the same tracing system. By prioritizing tracestate, you can reliably resume. traceparent was a copy of that data used for correlation purposes.

OpenTelemetry are also the same people who maintain the W3C trace-context format, but they have implemented it backwards. This has happened elsewhere including Jaeger and Sleuth and starting to get all over the ecosystem. The backwards variant doesn't implement tracestate at all, or only parses it. It uses traceparent as primary and doesn't look at tracestate as it doesn't know what to look in there!

The problem this causes is you can no longer know which of tracestate or traceparent is the more recent reference to your position in the tree. Since implementations are updating traceparent regardless of updating tracestate it relegates a conformant implementation as broken!! For example, if you had placed an entry into tracestate, yet downstream decided to send data to the same system without updating that.. if you had the next span use the parent in tracestate, it would be higher in the trace tree. Far worse than status quo without trace-context at all.

My suggestion is we consider ignoring the spec similar to how others do, and treat traceparent as a lossy variant of b3 single and tracestate as copy-through baggage. In such case, we would probably delete https://github.com/openzipkin-contrib/brave-propagation-w3c as it is a lost cause at this point. People literally are using this backwards code in production, so maybe best to conform to what people are doing instead of what the spec was written to do.

cc @openzipkin/core

jcchavezs commented 3 years ago

It seems OTel people (those also controlling W3C trace context) are happy with this so I don't see the point in us trying to do the right implementation of the format when not even them follow it. I'd just do the same thing and add a huge disclaimer saying why we also have a broken implementation.

tor. 24. des. 2020 kl. 13:24 skrev Adrian Cole notifications@github.com:

The W3C trace-context format has 2 headers. The primary one is tracestate which holds a durable copy of what would be in B3 single format. This is used in case the next hop doesn't report into the same tracing system. By prioritizing tracestate, you can reliably resume. traceparent was a copy of that data used for correlation purposes.

OpenTelemetry are also the same people who maintain the W3C trace-context format, but they have implemented it backwards. This has happened elsewhere including Jaeger and Sleuth and starting to get all over the ecosystem. The backwards variant doesn't implement tracestate at all, or only parses it. It uses traceparent as primary and doesn't look at tracestate as it doesn't know what to look in there!

The problem this causes is you can no longer know which of tracestate or traceparent is the more recent reference to your position in the tree. Since implementations are updating traceparent regardless of updating tracestate it relegates a conformant implementation as broken!! For example, if you had placed an entry into tracestate, yet downstream decided to send data to the same system without updating that.. if you had the next span use the parent in tracestate, it would be higher in the trace tree. Far worse than status quo without trace-context at all.

My suggestion is we consider ignoring the spec similar to how others do, and treat traceparent as a lossy variant of b3 single and tracestate as copy-through baggage. In such case, we would probably delete https://github.com/openzipkin-contrib/brave-propagation-w3c as it is a lost cause at this point. People literally are using this backwards code in production, so maybe best to conform to what people are doing instead of what the spec was written to do.

cc @openzipkin/core https://github.com/orgs/openzipkin/teams/core

— You are receiving this because you are on a team that was mentioned. Reply to this email directly, view it on GitHub https://github.com/openzipkin/brave/issues/1287, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAXOYARRWTKQHQ3C3R4GULDSWMXGDANCNFSM4VIEZISQ .