spring-cloud / spring-cloud-sleuth

Distributed tracing for spring cloud
https://spring.io/projects/spring-cloud-sleuth
Apache License 2.0
1.78k stars 782 forks source link

Add support for Brave autoconfiguration #711

Closed marcingrzejszczak closed 6 years ago

marcingrzejszczak commented 7 years ago

If Sleuth integrates with Brave then we would start supporting quite a few options that currently Sleuth doesn't support.

We may open issues for some stickier points:

The migration guide is drafted here - https://gist.github.com/marcingrzejszczak/d3c15a0c11dda71970e42c513c9c0e09

shakuzen commented 7 years ago

Would the semantics of this be: if Brave is on the classpath, use it as the tracer implementation; if not, use the existing homegrown Sleuth tracer?

codefromthecrypt commented 7 years ago

right now, folks like @devinsba @jonathan-lo @jorgheymans and @ImFlog have or contribute to brave instrumentation, and this can work for example in classic spring environments or spring boot. Examples are grpc and kafka (not stream). There are also frameworks harder for us to support directly, such as dubbo.

In the case of http, if we had the ability to use brave types, we could obviate a lot of templated patterns, such as http sampling and span data policy.

So, to get highest value, there's probably a question of whether to implement a brave shim (which implies a decent amount of brave to implement it w/o brave), or whether to make sleuth a layer on brave (with the dependency locking implications of that).

There are some shortcuts we could take with a layer, which might reduce the amount of code needed, but it will still largely be copy/paste of code that is automatically fixed in brave.

So, it is a directional decision, something to ponder. Any folks have steering input? cc also @dsyer

jorgheymans commented 7 years ago

IMO the layer makes the most sense in the spring-boot philosophy. Make it easy to quickly get up and running doing tracing using brave/zipkin, provide trace-enable *Template classes and annotations etc, And at the same time facilitate a way to access the full power of brave for more advanced scenarios. This is how most spring-boot integrations work, from my perspective at least.

devinsba commented 7 years ago

Personally I've used both quite a bit over the last 2 years. I would love if sleuth was a layer on top of brave to enable all of the existing brave instrumentation.

For us, sleuth is an easy first step toward zipkin, we get log correlation and all of the built in goodies without needing to set up a reporter. If this was then layered on top of brave then it would allow those teams using things with brave instrumentation (grpc, dynamodb, etc..) to get these things as well.

jonathan-lo commented 7 years ago

Our use case was fairly similar - we'd started with Sleuth, which was fantastic as a starting point, but found that we'd needed additional customisation ability from brave.

ImFlog commented 7 years ago

I also am really interested in this. We are using Sleuth everywhere for simplicity and easy standard tracing. But when we want some custom features it seems odd to switch to Brave. This would improve our possibilities while letting standard use cases to enjoy the built in tracing.

ghost commented 7 years ago

This is something that we could definitely use.

codefromthecrypt commented 7 years ago

here is an interim solution for sending v2 spans and reusing the same reporter and senders brave uses https://github.com/spring-cloud/spring-cloud-sleuth/pull/720

codefromthecrypt commented 7 years ago

added this as it would be easier to setup senders with autoconfiguration. Ex rabbitmq transport for zipkin decorating on top of normal rabbit autoconfig https://github.com/openzipkin/zipkin-reporter-java/issues/71

codefromthecrypt commented 6 years ago

@marcingrzejszczak and I will be attacking this imminently, but it will involve significant effort here and in brave, so probably a month or two until completely done.

codefromthecrypt commented 6 years ago

working approach is to remove sleuth autoconfiguration and introduce brave ones until samples pass

thekalinga commented 6 years ago

Not sure whether this was suggested before or not.

Why not take the approach of Spring Metrics & isolate this library from zipkin.

This way the library can be augmented to support Zipkin,OpenTracing (or) a new standard, natively instead of supporting just OpenZipkin

codefromthecrypt commented 6 years ago

It has been suggested before but not comparing micrometer with OT (flimsy comparison as micrometer is more like census than OT). Suffice to say the suggestion is often sortof off the cuff and hypothetical based on merits mentioned in OT marketing vs actual system interop.

Regardless, the goal isnt to add a hard dependency on OT 0.31.RC2 or any other number of the variants. To the degree one of the many versions of OT are needed it can be via the bridge.

There are already numerous means to export data to different systems via zipkin api (ex jaeger datadog kamon google amazon etc). Other systems can be added without requiring a hard dep on one of the several versions of OT. "Just zipkin" is a half truth at best.

Anyway, by focusing on what the actual value we are trying to provide is, we can better employ the limited hands available to maintaining things. OT has not proven to reduce maintenance or expand options in a meaningful way vs the path already in progress.

On 8 Jan 2018 8:43 pm, "Ashok Koyi" notifications@github.com wrote:

Not sure whether this was suggested before or not.

Why not take the approach of Sprint metris https://micrometer.io/ that isolates this from zipkin.

This way this library can do everything it does to OpenTracing aswell natively instead of just OpenZipkin

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/spring-cloud/spring-cloud-sleuth/issues/711#issuecomment-355956837, or mute the thread https://github.com/notifications/unsubscribe-auth/AAD61yw_zODjqitSJFPEJeJcYT6VM-LQks5tIg2HgaJpZM4PiyAF .

thekalinga commented 6 years ago

Thanks for the clarification. I'll wait things to settle down in this whole space then (from spring point of view).

I just came across this library that is more or less doing everything you see to be doing in this project for open tracing

https://github.com/opentracing-contrib/java-spring-cloud

Keeping it here just for reference

codefromthecrypt commented 6 years ago

Yeah that library copy pasted a lot of code from sleuth when beginning. You can tell as even log statements are copied verbatum

On 9 Jan 2018 7:27 pm, "Ashok Koyi" notifications@github.com wrote:

Thanks for the clarification. I'll wait things to settle down in this whole space then.

I just came across this library that is more or less doing everything you see to be doing in this project for open tracing

https://github.com/opentracing-contrib/java-spring-cloud

Keeping it here just for reference

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/spring-cloud/spring-cloud-sleuth/issues/711#issuecomment-356257820, or mute the thread https://github.com/notifications/unsubscribe-auth/AAD613a32VB8-DYiklXgeRmx3MMTZ_3nks5tI00FgaJpZM4PiyAF .

thekalinga commented 6 years ago

There seems to be demand for open tracing (for me, logs being inline with the trace is a good add). Probably will wait and watch on how zipkin & opentracing playout

codefromthecrypt commented 6 years ago

Thanks for all the feedback. Note that most use log correlation rather than trying to replace their logging backend with zipkin. This is a primary feature of sleuth and brave. Have you used either yet?

On 9 Jan 2018 8:25 pm, "Ashok Koyi" notifications@github.com wrote:

There seems to be demand for open tracing (for me, logs being inline with the trace is a significant tracer). Probably will wait and watch on how zipkin & opentracing diverge/converge

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/spring-cloud/spring-cloud-sleuth/issues/711#issuecomment-356269877, or mute the thread https://github.com/notifications/unsubscribe-auth/AAD61yDvIQy_dY9Pr0qugN6vRlZ3nDaKks5tI1q1gaJpZM4PiyAF .

thekalinga commented 6 years ago

@adriancole Agreed.

No. I have not used the setup completely in production as of now. I have to integrate observability in my product (which is why I'm scouting for the right combo)

Here is my view based on the documentation I have read

  1. Log aggregation (& visualisation in UI, lets say Kibana) is helpful for detailed analysis & cannot be replaced with tracing system. As traces are only samples, where as logs are not.
  2. Having said that, Having inline logs that accompany trace will help contextualise the issue much quicker, so long as it does not hurt the performance in any meaningful way. Again, this choice can be made based on the performance requirements of the system with opentracing, but not with zipkin
  3. One thing I forgot to mention such as adaptive sampling from Jaeger (WIP as of today) seems to be a much more compelling reason. I believe Jaeger supports zipkin format aswell, not sure if zipkin API supports this or not

Thanks again for engaging with me in helping me understand the space better. Please correct me if I've gone off rails in my understanding

codefromthecrypt commented 6 years ago

Both brave and sleuth do log correlation independently of trace sampling.

With regards to adaptive sampling, zipkin has had an adaptive sampler in the past 'zipkin-zookeeper' in fact it is still built. However most dont agree on a coordination mechanism for the adaptive part (eg zookeeper). Jaeger relies on proprietary agents to coordinate iiuc. This obviously only works when you can deploy and manage host agents.

Regardless the sampling mechanism whether parameterized or based solely on trace id is pluggable. This means that whatever mechanism you choose, there's likely a way to attach it. There is no reason to require an adaptive sampler as a core project, iotw, as it can plugin via interfaces like other things.

Bear in mind that the zipkin ecosystem isnt new and folks have been around the block. There are different ideas and staffing levels which influences tech. For example, uber chose to fork zipkin internally for a year and later open source that work as jaeger. They could afford to create and manage proprietary tech. Most in zipkin ecosystem are lightly staffed so cannot take on the responsibility of logging infra just to do tracing.

Some who have been doing both have other stories to work on besides adaptive sampling. For example, we are currently working on a "firehose mode" which would absorb 100% of the traces obviating adaptive sampling entirely except for overload conditions. This allows folks to do post factum decisions and/or support things like a 1hr window where those on call are likely to have a trace.

Others like census do a local 100% mode with a query endpoint. This acts like stats endpoints in other services and allows things like viewing in-flight transactions or late lazy collection even if at worst manual.

Tldr is that there are many working besides the staff on jaeger etc. If you want to consider all options best to look around as OT is far from the only game in town even if it is likely the best marketed one.

On 9 Jan 2018 8:52 pm, "Ashok Koyi" notifications@github.com wrote:

@adriancole https://github.com/adriancole Agreed.

No. I have not used the setup completely in production as of now. I have to integrate observability in my product (which is why I'm scouting for the right combo)

Here is my view based on the documentation I have read

  1. Log aggregation (& visualisation in UI, lets say Kibana) is helpful for detailed analysis & cannot be replaced with tracing system. As traces are only samples, where as logs are not.
  2. Having said that, Having inline logs that accompany trace will help contextualise the issue much quicker, so long as it does not hurt the performance significantly. Again, this choice can be made based on the performance requirements of the system
  3. One thing I forgot to mention such as adaptive sampling from Jaeger (WIP as of today) seems to be a much more compelling reason. I believe Jaeger supports zipkin format aswell, so no qualms with zipkin here

Thanks again for your engaging with me in helping me understand the space better

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/spring-cloud/spring-cloud-sleuth/issues/711#issuecomment-356275606, or mute the thread https://github.com/notifications/unsubscribe-auth/AAD6164JSXquKPC2MTj_dgC8FaLnTmEyks5tI2EQgaJpZM4PiyAF .

thekalinga commented 6 years ago

Thanks for the detailed explanation.

Having agents in kuberntes is a trivial task. Not sure about cloud foundry though