openzipkin / zipkin

Zipkin is a distributed tracing system
https://zipkin.io/
Apache License 2.0

Correct skew when "cs" "cr" is the parent of "sr" "ss" #1492

Open fedj opened 7 years ago

fedj commented 7 years ago

In order to tackle #1480 on the clock skew part, I think that we first need to list possible cases:

Did I miss any use case?

codefromthecrypt commented 7 years ago
  • CS-SR-SS-CR with same end point (not sure it exists but it could be earlier version of local component)

I'd clarify this as loopback. In some cases, a node makes a remote call to itself, which can actually be a latency bug!

Did I miss any use case?

Pretty good list! I'd guess that in some cases the skew correction will be a no-op on one side or the other.

fedj commented 7 years ago

  • CS-CR (non-instrumented server)
  • CS-SR-SS-CR with same end point (loopback call)
  • SR-SS (root span)
  • CS (async span without response with non-instrumented server)

No skew since we are on the same machine

  • CS-SR (async span without response?)
  • CS-SR-SS-CR with different end points (RPC span, with possibility of skew)
  • CS-CR with SR-SS child (the case we're trying to solve here)

Possible clock skew since clocks will most probably be desynchronized
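To make the two groups above concrete, here is a minimal sketch in Python. It is not Zipkin code: the function name `skew_possible` and the dict-of-endpoints input are hypothetical, chosen only for illustration. It assumes each annotation ("cs", "sr", "ss", "cr") is mapped to the name of the endpoint that logged it, and that for the "CS-CR with SR-SS child" case the caller has merged the annotations of the parent and child spans first.

```python
from typing import Dict


def skew_possible(annotations: Dict[str, str]) -> bool:
    """Return True when client and server annotations come from different endpoints.

    `annotations` maps an annotation value ("cs", "sr", "ss", "cr") to the
    endpoint name that recorded it (hypothetical input shape).
    """
    client = {annotations[k] for k in ("cs", "cr") if k in annotations}
    server = {annotations[k] for k in ("sr", "ss") if k in annotations}
    # Only one side reported: there is no second clock to disagree with.
    if not client or not server:
        return False
    # Same endpoint on both sides (loopback): both sides share one clock.
    return client.isdisjoint(server)


# Same machine, or only one side instrumented: no skew to correct.
print(skew_possible({"cs": "a", "cr": "a"}))
print(skew_possible({"cs": "a", "sr": "a", "ss": "a", "cr": "a"}))
# Different endpoints: clocks may disagree.
print(skew_possible({"cs": "a", "sr": "b", "ss": "b", "cr": "a"}))
```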

We now have two choices:

I tend to prefer the first solution.

fedj commented 7 years ago

After a discussion with @adriancole, it seems that flattening the tree can lose data (e.g. which span id should we use when flattening data?). The solution would be to adapt the clock skew algorithm (or at least how we look at the tree). We also need to adapt it in the dependency linker.

mpetazzoni commented 6 years ago

@adriancole, @fedj, any progress made on dealing with clock skew when spans come from different hosts? Even when using ntpd, the clocks are very close, but evidently not close enough:

[image: clock-skew]

What's the best way to deal with this when deploying distributed tracing?

codefromthecrypt commented 6 years ago

@adriancole, @fedj, any progress made on dealing with clock skew when spans come from different hosts? Even when using ntpd, the clocks are very close, but evidently not close enough:

It appears as if you might not be using shared RPC spans. https://github.com/openzipkin/zipkin/issues/1480 tracks some issues around this.

The current algorithm can correct obvious skew, where the child occurs before the parent. Your trace is not that case: the child does not start before the parent, it is just shifted far to the right.
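As a rough illustration of what "obvious skew" means here, the sketch below estimates the server's clock offset from the four RPC timestamps. It is an assumption-laden example, not Zipkin's implementation: the function name `estimate_obvious_skew` is hypothetical, timestamps are taken to be integers (for example epoch microseconds), and network latency is assumed to be split evenly between request and response.

```python
from typing import Optional


def estimate_obvious_skew(cs: int, sr: int, ss: int, cr: int) -> Optional[int]:
    """Estimate how far the server clock is ahead of the client clock.

    Returns None when no obvious skew is visible. A negative result means
    the server clock is behind the client clock.
    """
    # Only the obvious case is handled: the server receive appears to
    # happen before the client send.
    if sr >= cs:
        return None
    client_duration = cr - cs
    server_duration = ss - sr
    # The even-latency assumption breaks if the server side appears to
    # take longer than the client side.
    if client_duration < server_duration:
        return None
    one_way_latency = (client_duration - server_duration) // 2
    return sr - one_way_latency - cs


# Child appears to start before its parent: skew is detected.
print(estimate_obvious_skew(cs=100, sr=50, ss=130, cr=200))
# Child is merely far to the right of its parent: nothing is detected.
print(estimate_obvious_skew(cs=100, sr=300, ss=380, cr=200))
```

Subtracting the returned value from the server timestamps in the first call gives sr=110 and ss=190, which sit inside the client's 100 to 200 window. The second call shows why a child that is only shifted right is left untouched by this kind of check.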

One thing that can definitely be corrected is a child that starts after its parent completed, since that cannot happen. This would be better, as it would at least shift the child left to the end of getId. We could assume that, when clock skew is present, the RPC child should be contained within its parent (or at least that we should attempt to contain it).
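One possible reading of the containment idea, sketched in Python. This is only an interpretation under stated assumptions, not an agreed design: the function name `contain_child` is hypothetical, timestamps and durations are integers in the same unit, and a child is moved only when it starts after its parent has completed.

```python
def contain_child(parent_start: int, parent_duration: int,
                  child_start: int, child_duration: int) -> int:
    """Return a corrected start timestamp for the child span."""
    parent_end = parent_start + parent_duration
    # A child that starts before the parent completed is left alone.
    if child_start <= parent_end:
        return child_start
    # Shift left so the child ends where the parent ends, but never
    # earlier than the parent's own start (the child may be too long to fit).
    return max(parent_start, parent_end - child_duration)


# Parent runs 100..200; child reported at 300 for 80 units is moved to 120..200.
print(contain_child(parent_start=100, parent_duration=100,
                    child_start=300, child_duration=80))
```

Aligning the child's end with the parent's end is the smallest shift that achieves containment. Other choices, such as centering the child in the parent, would fit the same description.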

Open to other ideas, too.

codefromthecrypt commented 6 years ago

The TODO here is to check whether there are any cases missing from our unit tests. The tests are all in JavaScript now.