mdsol / Medidata.ZipkinTracerModule

[Deprecated] Zipkin Request Tracing for .Net Apps
MIT License
53 stars 20 forks source link

[MCC-261721] Don't break on 128bit X-B3-TraceId by tossing high bits #100

Closed bvillanueva-mdsol closed 7 years ago

bvillanueva-mdsol commented 7 years ago

Information about the problem: https://github.com/mdsol/Medidata.ZipkinTracerModule/issues/96

This PR will prevent the zipkin client to create a new trace if it encounters a 128 bit trace id. It will accept the trace but only recognizes the lower 16 characters (64 bit).

@kenyamat @lschreck-mdsol @hkurniawan-mdsol @jcarres-mdsol Please review and merge

FYI @adriancole

bvillanueva-mdsol commented 7 years ago

@adriancole Thanks for the comments. Please check out new updates addressing the comments f9dc4ff. Only the traceId is being "sanitized" to get lower 16 characters of the incoming traceid. Thanks!

bvillanueva-mdsol commented 7 years ago

Please merge if ok :+1:

codefromthecrypt commented 7 years ago

heh I don't have merge rights. merge away!

jcarres-mdsol commented 7 years ago

This will accept 128bits and propagate 64. I think it is better to accept both and propagate whatever we get so eventually when 128 go throught the system, everything is 128

bvillanueva-mdsol commented 7 years ago

@adriancole @jcarres-mdsol Updated code 8247ac4 Changes:

Please review and merge if you are cool with these changes. Thanks!

FYI @kenyamat @hkurniawan-mdsol @lschreck-mdsol

jcarres-mdsol commented 7 years ago

The zipkin server is ready to accept 128bits or 64 bits. Just send anything you get

bvillanueva-mdsol commented 7 years ago

@adriancole @jcarres-mdsol Question:

  1. When sending a request to server with these conditions: path: /api/v1/spans method "POST" traceId: 128bit

Is the server expecting a hex string for the traceid part?

FYI: @lschreck-mdsol @kenyamat @hkurniawan-mdsol

jcarres-mdsol commented 7 years ago

@bvillanueva-mdsol good question, the new api doc is not published it seems but I would expect it to be the same format than now BTW, you can explore the api docs here http://zipkin.io/zipkin-api/#/paths/%252Fspans/post/parameters/span

jcarres-mdsol commented 7 years ago

The new docs when published will say this 8-byte or 16-byte identifier respectively encoded as 16 or 32 lowercase hex characters.

codefromthecrypt commented 7 years ago

yep same hex encoding. only thing that changes is length (in json/http headers anyway)

On Sun, Nov 13, 2016 at 10:49 PM, Jordi Polo Carres < notifications@github.com> wrote:

The new docs when published will say this "'8-byte or 16-byte identifier respectively encoded as 16 or 32 lowercase hex characters.'"

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mdsol/Medidata.ZipkinTracerModule/pull/100#issuecomment-260260636, or mute the thread https://github.com/notifications/unsubscribe-auth/AAD618_Sx7aZlJL__3vSy2UkTy6NVyo9ks5q-AR9gaJpZM4Kmz1u .

bvillanueva-mdsol commented 7 years ago

@adriancole @jcarres-mdsol Thanks, Im trying to change the implementation to send 128 or 64 bit to zipkin server. Will update you guys if it is done :)

codefromthecrypt commented 7 years ago

great. so initiating128-bit traces should probably be a config flag, propagating them shouldn't require config.

ex in brave we say traceId128Bit(true) to start traces as 128-bit (32-character length in headers/json or set traceIdHigh in thrift)

https://github.com/openzipkin/b3-propagation/issues/6

On Sun, Nov 13, 2016 at 11:48 PM, Brent Villanueva notifications@github.com wrote:

@adriancole @jcarres-mdsol Thanks, Im trying to change the implementation to send 128 or 64 bit to zipkin server. Will update you guys if it is done :)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

codefromthecrypt commented 7 years ago

my 2p is that unit tests should be where the code exists. punting all the way to the server to tell if the id is well formatted seems against that principle.

On Wed, Nov 16, 2016 at 10:15 AM, Herry Kurniawan notifications@github.com wrote:

@hkurniawan-mdsol commented on this pull request.

In src/Medidata.ZipkinTracer.Core/TraceProvider.cs https://github.com/mdsol/Medidata.ZipkinTracerModule/pull/100#pullrequestreview-8739040 :

@@ -72,11 +73,11 @@ internal TraceProvider(IZipkinConfig config, IOwinContext context = null) headerSampled = context.Request.Headers[SampledHeaderName]; }

  • TraceId = Parse(headerTraceId) ? headerTraceId : GenerateHexEncodedInt64FromNewGuid();
  • SpanId = Parse(headerSpanId) ? headerSpanId : TraceId;
  • ParentSpanId = Parse(headerParentSpanId) ? headerParentSpanId : string.Empty;
  • TraceId = headerTraceId.IsParsableTo128Or64Bit() ? headerTraceId : GenerateHexFromNewGuid(config.Create128BitTraceId);

I don't think we need to check the validity of the trace id here ( IsParsableTo128Or64Bit). The only thing that we need to check is whether headerTraceId is null or empty and generate a new one if it does. Generating a new one due to invalid will break the request flow. We should just give it to Zipkin server and if it blows up we can at least trace it back to the original request that created the invalid trace id.

What do you think @adriancole https://github.com/adriancole, @jcarres-mdsol https://github.com/jcarres-mdsol, @kenyamat https://github.com/kenyamat, @lschreck-mdsol https://github.com/lschreck-mdsol?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mdsol/Medidata.ZipkinTracerModule/pull/100#pullrequestreview-8739040, or mute the thread https://github.com/notifications/unsubscribe-auth/AAD61z55jlD-RgDYAafL5navod7GBw52ks5q-mdbgaJpZM4Kmz1u .

bvillanueva-mdsol commented 7 years ago

@adriancole Updated unit test to check lowercase hex strings are generated from traceprovider 05c90b6 8d222f7

codefromthecrypt commented 7 years ago

Regarding 2, I'm not suggesting the middleware to throw exceptions. I think the middleware should just accept whatever trace id it receives and sends it to Zipkin server. Errors will occur on the server but this should not affect requests and we can see the log (if any) on who initiated the malformed id.

This is what I worry about. [image: image] https://cloud.githubusercontent.com/assets/21052449/20336050/b75ea3e8-ac0a-11e6-95ea-9a8b285cab2a.png Say that the trace id is malformed when calling Svc 1 and a new one is created. Apparently the total call of Service A violates the SLA. How can we determine which service under Service A that causes the slowdown?

If a malformed ID is sent to the server, then you won't see trace at all as the span will be dropped. Your only signal would be a metric increment for span drop (due to parse failure). Restarting the trace allows you to see the operation even if its predecessor was buggy. always encoding malformed IDs will result in seeing nothing.

codefromthecrypt commented 7 years ago

PS you can go ahead and test this, too. just start zipkin and send bad trace ids to its endpoint. Nothing will show up, so you will have no means to debug the diagram mentioned.

codefromthecrypt commented 7 years ago

ps feel free to hop on https://gitter.im/openzipkin/zipkin if you like (could be quicker).

Seems the idea here is to have a 3rd option, here's all three

In ideal case, the third scenario is best, since upstream is written properly, there are no problems.

Ex. let's say we are using the trust approach and service A calls B calls C

In ideal case, this doesn't have any impact, as the correct format is written in the first place, when zipkin gets it, it can parse it. We have seen bugs where people use invalid format, such as signed numbers and even an - sign by accident. These bugs are rare and easily fixed, but here's the impact on exception.

Alternatively, let's say we restart a trace when its ID is malformed.

When we restart, not only do we at least see a partial trace, but we also can increment a counter or log the calling context of who propagated a malformed ID. This would include any metadata in the HTTP request. This could help identify and troubleshoot service A faster than seeing no data at all.

codefromthecrypt commented 7 years ago

Will try adhoc-ing this with @bvillanueva-mdsol https://github.com/bvillanueva-mdsol.

Seems like Zipkin is not logging or returning BadRequest when it receives malformed data. In that case re-starting the trace is a solution. This means by swallowing the malformed id we would never know that we have a bug somewhere and we would have requests with broken traces. If everyone is okay with this I'm okay as well.

It is worse than this, actually. In most libraries, reporting to zipkin is asynchronous, which means the call that had the malformed ID already took place, sometime later it is sent to zipkin. Moreover, many users send via Kafka or other async transports which would offer no up-front validation.

codefromthecrypt commented 7 years ago

Seems like Zipkin is not logging or returning BadRequest when it receives malformed data.

if you are running latest zipkin, it does actually show an error when there's malformed data. I think this was recently noticed.

$ curl -s localhost:9411/api/v1/spans -H'Content-Type: application/json' -d 'hello world'

Malformed reading List from json: hello world

Also, there's a counter..

$ curl -s localhost:9411/metrics|jq '."counter.zipkin_collector.messages_dropped.http"'

2

bvillanueva-mdsol commented 7 years ago

It is worse than this, actually. In most libraries, reporting to zipkin is asynchronous, which means the call that had the malformed ID already took place, sometime later it is sent to zipkin. Moreover, many users send via Kafka or other async transports which would offer no up-front validation.

@adriancole @hkurniawan-mdsol The same situation with zipkin dot net client as it is submitted later on another thread. Spans are submitted by batch.

codefromthecrypt commented 7 years ago

It is worse than this, actually. In most libraries, reporting to zipkin is asynchronous, which means the call that had the malformed ID already took place, sometime later it is sent to zipkin. Moreover, many users send via Kafka or other async transports which would offer no up-front validation.

@adriancole https://github.com/adriancole @hkurniawan-mdsol https://github.com/hkurniawan-mdsol The same situation with zipkin dot net client as it is submitted later on another thread. Spans are submitted by batch

ok yeah, so in this case, even if you look at the http response from zipkin in the batch sender, you have less context about the caller that created one of the spans, unless in your zipkin span you are logging CLIENT_ADDR(ca). Still debuggable, but a bit more tricky vs linting at the point of propagation.

h-kurniawan commented 7 years ago

@adriancole, @bvillanueva-mdsol just adhoc-ed with malformed trace id and we get 400 BadRequest back from Zipkin.

Failed to send spans to Zipkin server (HTTP status code returned: BadRequest). Exception message: The remote server returned an error: (400) Bad Request., response from server: Cannot decode spans due to IllegalArgumentException(47318d050e42707x should be a 1 to 32 character lower-hex string with no prefix reading List<Span> from json: [{"traceId":"47318d050e42707x","name":"get","id":"452b4857980c7431","annotations":[{"endpoint":{"ipv4":"192.168.56.1","port":12008,"serviceName":"localhost"},"value":"sr","timestamp":1479284551122624},{"endpoint":{"ipv4":"192.168.56.1","port":12008,"serviceName":"localhost"},"value":"ss","timestamp":1479284551132626}],"binaryAnnotations":[{"endpoint":{"ipv4":"192.168.56.1","port":12008,"serviceName":"localhost"},"key":"http.path","value":"/help_center"}]}])

This actually throws an exception on ZipkinTracerModule which being catched and logged to presumably SumoLogic. Since we are logging these errors so generating trace id should not be a problem.

bvillanueva-mdsol commented 7 years ago

@adriancole @hkurniawan-mdsol @kenyamat @jcarres-mdsol Will go for regenerating new traceid if client encounters malformed traceid. Ill take out NRTM (Not ready for merge) tag if I am finish adhocing it on two services. Thanks!

codefromthecrypt commented 7 years ago

good luck and thanks for all the effort here!

bvillanueva-mdsol commented 7 years ago

Adhoc-ed the changes successfully. Checked that server will accept 128 bit as well as 64 bit. and the client propagates the traceid (128 or 64 bit)

@jcarres-mdsol @kenyamat @hkurniawan-mdsol This is ready for merge. Please merge if it is ok. Thanks!

@adriancole Thanks for your input in this PR.

kenyamat commented 7 years ago

👍 👍 👍

codefromthecrypt commented 7 years ago

great! let me know which version this is released as so I can update the tracking doc on 128-bit support

bvillanueva-mdsol commented 7 years ago

@adriancole will let you know. I'll be submitting another PR to solve the problem on "X-B3-Sampled" to accept "1" as true and "0" as false. Currently library only supports "true" and "false".

Thanks!

bvillanueva-mdsol commented 7 years ago

@adriancole

We released a new zipkin dotnet client nuget. Please check out release page https://github.com/mdsol/Medidata.ZipkinTracerModule/releases/tag/v3.2.0 and nuget page https://www.nuget.org/packages/Medidata.ZipkinTracer.Core/3.2.0

Thanks, Brent

codefromthecrypt commented 7 years ago

fantastic. thanks!

On Mon, Nov 21, 2016 at 1:23 PM, Brent Villanueva notifications@github.com wrote:

@adriancole https://github.com/adriancole

We released a new zipkin dotnet client nuget. Please check out release page https://github.com/mdsol/Medidata.ZipkinTracerModule/releases/tag/v3.2.0 and nuget page https://www.nuget.org/packages/Medidata.ZipkinTracer.Core/3.2.0

Thanks, Brent

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mdsol/Medidata.ZipkinTracerModule/pull/100#issuecomment-261846447, or mute the thread https://github.com/notifications/unsubscribe-auth/AAD612aTdDYh1UApbhKZGYcu-L8GzkHJks5rASrugaJpZM4Kmz1u .

codefromthecrypt commented 7 years ago

@jcarres-mdsol do you think you'll have cycles for similar support for zipkin-ruby?