twitter / finagle

A fault tolerant, protocol-agnostic RPC system
https://twitter.github.io/finagle
Apache License 2.0
8.79k stars 1.46k forks source link

Support propagating 128-bit trace identifiers #564

Open codefromthecrypt opened 8 years ago

codefromthecrypt commented 8 years ago

When Finagle receives a 128-bit trace identifier, it should be able to propagate it without reducing its size to 64-bits.

Expected behavior

TraceId should have an extra field to carry the high bits of a 128-bit trace id. In other instrumentation, this is called traceIdHigh (or traceIdHi). This should be initialized to zero, and encoded if set.

For example, if serializing to the X-B3-TraceId header, when non-zero, traceIdHigh becomes the left-most 16 hex characters. If serializing to a buffer, it prefixes the existing traceId bytes at position 16 of a 40 (extended from 32) byte buffer.

Actual behavior

Currently, TraceId.traceId is a SpanId which is a 64-bit long in representation. This is serialized as a 16character hex X-B3-TraceId header or 8 bytes in the TraceId binary form (offset 16).

Until TraceId.traceIdHigh or similar exists, and the serialized forms are updated accordingly, any application propagating through finagle will downgrade from 128 to 64bit trace identifiers. While this is fine in the current state of zipkin, eventually, this will become a problem (ex when propagating out to a different tracing system that requires the original 128-bit id)

codefromthecrypt commented 8 years ago

@kevinoliver @mosesn while the timing of implementing this isn't critical, I'd like buy-in on the approach used for binary serialization of TraceId (when 128-bits are in use). This is so that other libraries can update in anticipation of similar updates here.

The two most sensible choices seem to be..

kevinoliver commented 7 years ago

@adriancole What approach are other systems taking here? Its not obvious to me what the tradeoffs are between those two choices.

codefromthecrypt commented 7 years ago

Brave (who's binary form in cassandra tracing context) uses the former (keep the two halves of the ID next to eachother in byte order). I think this makes most sense, but you do need to guard on length to see if you should read a traceIdHigh or not.

The latter was mainly to show two options :) Hypothetically, some demarshaller which works today might work by just choosing not to read the last 8 bytes. That said, it really depends on how the bytes are carried.

codefromthecrypt commented 7 years ago

So I was chatting with @ellispritchard who maintains https://github.com/Financial-Times/tapper, and @fishcakez who is using it in a mixed environment with finagle. Also, I've thought about this a while now.

I think the following is the best way for binary propagation:

For others, we'd follow existing approaches that are working well for nearly a year now:

I'm happy to help with this. Now we've a business case (existing user needing it), seems a good time to start. Sound good?

mosesn commented 7 years ago

Sounds good to me!

jimschubert commented 7 years ago

My company uses 128bit trace ids within our tracing system (based on zipkin). My team's stack is exclusively based on top of Finagle (Finatra, finagle-mysql, finagle-redis) and we regularly have colliding ids because of the reduction to Long within Finagle.

I'd be interested in evaluating the above suggestion when it's ready.

mosesn commented 7 years ago

@jimschubert would you be interested in taking a stab at it?

jimschubert commented 7 years ago

@mosesn I don't have an lot of free time outside of work right now. I can see about allocating time for this at work, but identifying alternatives is low priority on our backlog at the moment. If I can get get the time, I'd love to take a stab at it. I'll let you know here if I can, and I'll hit you up on gitter if I get stuck.

mosesn commented 7 years ago

@jimschubert sounds good to me!

jimschubert commented 7 years ago

@adriancole @mosesn I took an initial stab at this. I started down the road of adding support for 128-bit SpanID and ParentID, but I think it's overkill. Could you guys check out the linked PR #651 and give feedback?

codefromthecrypt commented 7 years ago

Thx for the start. Widening X-B3-TraceId seems the most pragmatic start.

For binary, we could consider the check length approach of growing thrift later or adopt census (aka dapper2) format though it doesnt send parent ID https://github.com/census-instrumentation/opencensus-specs/blob/master/encodings/BinaryEncoding.md