openzipkin / zipkin

Zipkin is a distributed tracing system
https://zipkin.io/
Apache License 2.0
16.97k stars 3.09k forks

Service Topology view for a single trace #2622

Open connectwithnara opened 5 years ago

connectwithnara commented 5 years ago

The waterfall diagram in the Zipkin UI is useful for understanding the latency distribution of a trace. However, the view does not make it easy to see which services were involved in that trace. It would be useful to have another tab, 'Topology', that renders the topology of the services involved in the trace.

At Netflix we trace 100% of the requests that are enabled for failure injection. Users would find it useful to understand which services were involved in a failure-injected request.

codefromthecrypt commented 5 years ago

Thanks for raising this, Nara. I'm positive it has been mentioned for at least a year (including by me!), but never formally in an issue.

So the idea is to take the trace we already have (in fact it is already assembled into a tree internally), then run a linker to aggregate the service dependencies. At that point, it is only changing the data sent to the dependencies page, possibly with labeling of the trace ID, similar to our "show trace" functionality.
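To make the linking step concrete, here is a minimal sketch of aggregating one trace into service links. It is a simplification and not the actual Zipkin dependency linker: it assumes Zipkin v2-style spans (`id`, `parentId`, `localEndpoint.serviceName`), looks only at direct parent/child spans, and ignores span kind, remote endpoints, and errors. The function name `linkTrace` and the sample data are hypothetical.

```javascript
// Simplified sketch, NOT the real Zipkin DependencyLinker.
// Assumes each span has `id`, an optional `parentId`, and
// `localEndpoint.serviceName`, as in the Zipkin v2 span model.
function linkTrace(spans) {
  // Index spans by ID so each child can find its parent.
  const byId = new Map(spans.map((span) => [span.id, span]));
  const links = new Map();

  for (const span of spans) {
    const parentSpan = byId.get(span.parentId);
    if (!parentSpan) continue; // root span, or parent missing from the trace

    const parent = parentSpan.localEndpoint && parentSpan.localEndpoint.serviceName;
    const child = span.localEndpoint && span.localEndpoint.serviceName;
    // Skip spans without a service name and calls within one service.
    if (!parent || !child || parent === child) continue;

    // Aggregate into (parent, child, callCount), one entry per service pair.
    const key = JSON.stringify([parent, child]);
    const link = links.get(key) || { parent, child, callCount: 0 };
    link.callCount += 1;
    links.set(key, link);
  }
  return [...links.values()];
}

// Hypothetical trace: frontend calls backend twice, backend calls db once.
const exampleTrace = [
  { id: 'a', localEndpoint: { serviceName: 'frontend' } },
  { id: 'b', parentId: 'a', localEndpoint: { serviceName: 'backend' } },
  { id: 'c', parentId: 'a', localEndpoint: { serviceName: 'backend' } },
  { id: 'd', parentId: 'b', localEndpoint: { serviceName: 'db' } },
];
const exampleLinks = linkTrace(exampleTrace);
```

The result has the same shape as the (parent, child, callCount) links the dependencies page already consumes, which is why only the data source would need to change.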

I'd be happy to port the dependency linker logic to javascript. @tacigar @zeagord could one of you take a stab at UI/UX on this?

cc also @bulicekj as you may have similar requests in haystack.

codefromthecrypt commented 5 years ago

Naver Pinpoint also has this feature.

codefromthecrypt commented 5 years ago

Some notes from discussion with @tacigar

I have a service graph now, but I cannot tell whether there is a problem with the behavior of one node. If I want to understand the behavior of a single machine in a service, I cannot do this today. This is because the edges are aggregates (parent, child, count) based on service name, not on any other information in the endpoint, such as IP address.

This can make certain problems difficult to understand, such as if one node in a service is running the wrong code, or if a new version of code only deployed to one node is causing a problem.

So, if I have a way to classify by another means, I can identify this type of behavior. Take IP address, for example: ((parent, ip), (child, ip), count). The dependency link has one more qualifier than before, in this example the IP address.

There are now many more nodes, because what was before just service -> service is now (service, ip) -> (service, ip). Inside one trace, this could be fine because there are probably not that many combinations.

IP is just an example; the user may want to explore by a site tag such as cluster or department, or by something else. The difference between this and the normal dependency graph is that we generate the links in JavaScript. This means we can aggregate by anything: not just service, but also a custom tag (like http.route).
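A minimal sketch of what "aggregate by anything" could look like: the linker takes a key function, so the same loop can group by service, by (service, ip), or by a tag. This is an illustration under assumptions, not the code from the eventual port. It assumes Zipkin v2-style spans (`localEndpoint.serviceName`, `localEndpoint.ipv4`, `tags`), and the names `linkBy`, `byService`, `byServiceAndIp`, and `byTag` are hypothetical.

```javascript
// Sketch only: groups parent/child spans by an arbitrary key function.
// `keyOf(span)` returns a string naming the node the span belongs to,
// or undefined when the span cannot be classified.
function linkBy(spans, keyOf) {
  const byId = new Map(spans.map((span) => [span.id, span]));
  const links = new Map();

  for (const span of spans) {
    const parentSpan = byId.get(span.parentId);
    if (!parentSpan) continue; // root span, or parent not in this trace

    const parent = keyOf(parentSpan);
    const child = keyOf(span);
    // Skip unclassifiable spans and calls that stay inside one node.
    if (parent == null || child == null || parent === child) continue;

    const id = JSON.stringify([parent, child]);
    const link = links.get(id) || { parent, child, callCount: 0 };
    link.callCount += 1;
    links.set(id, link);
  }
  return [...links.values()];
}

// Hypothetical key functions over Zipkin v2-style span fields.
const byService = (span) => (span.localEndpoint || {}).serviceName;
const byServiceAndIp = (span) => {
  const endpoint = span.localEndpoint || {};
  return endpoint.serviceName && endpoint.ipv4
    ? `${endpoint.serviceName}@${endpoint.ipv4}`
    : undefined;
};
const byTag = (name) => (span) => (span.tags || {})[name];

// Hypothetical trace: one frontend node calling two backend nodes.
const sampleSpans = [
  { id: '1', localEndpoint: { serviceName: 'frontend', ipv4: '10.0.0.1' } },
  { id: '2', parentId: '1', localEndpoint: { serviceName: 'backend', ipv4: '10.0.0.2' } },
  { id: '3', parentId: '1', localEndpoint: { serviceName: 'backend', ipv4: '10.0.0.3' } },
  { id: '4', parentId: '1', localEndpoint: { serviceName: 'backend', ipv4: '10.0.0.2' } },
];
const serviceLinks = linkBy(sampleSpans, byService);
const nodeLinks = linkBy(sampleSpans, byServiceAndIp);
```

With `byService` the three calls collapse into a single frontend -> backend edge, while `byServiceAndIp` splits them per backend machine, which is the per-node view described above. A call such as `linkBy(sampleSpans, byTag('http.route'))` would group by a tag instead.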

RestfulBlue commented 5 years ago

I also hope such functionality will appear. For example, Jaeger also has that feature: https://miro.medium.com/max/2625/1*W6OGeCA1unSqQfPIZq7VGg.png

codefromthecrypt commented 5 years ago

https://github.com/openzipkin/zipkin/pull/2731 is a first step, as it ports the basic dependency linker used in Spark jobs to JavaScript.