Open fedj opened 7 years ago
- CS-SR-SS-CR with same end point (not sure it exists but it could be earlier version of local component)
I'd clarify this as loopback. In some cases, a node makes a remote call to itself, which can actually be a latency bug!
Did I miss any use case?
Pretty good list! I'd guess that in some cases there will be a no-op related to skew on one side or the other.
- CS-CR (non-instrumented server)
- CS-SR-SS-CR with same end point (loopback call)
- SR-SS (root span)
- CS (async span without response with non-instrumented server)
No skew since we are on the same machine
- CS-SR (async span without response?)
- CS-SR-SS-CR with different end points (RPC span, with possibility of skew)
- CS-CR with SR-SS child (the case we're trying to solve here)
Possible clock skew since clocks will most probably be desynchronized
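To make the split between the two groups concrete, here is a minimal sketch of how a single span could be classified. It is not Zipkin's code: the function name `skew_possible` and the representation of a span as a mapping from annotation value to the endpoint that recorded it are assumptions made for illustration. The rule it encodes is only the one implied by the lists: skew can matter when client-side and server-side annotations were recorded on different endpoints.

```python
from typing import Dict

# Core annotation values used in the lists above.
CLIENT_SIDE = {"cs", "cr"}
SERVER_SIDE = {"sr", "ss"}


def skew_possible(annotations: Dict[str, str]) -> bool:
    """Return True if the span mixes clocks from different endpoints.

    `annotations` maps an annotation value ("cs", "sr", "ss", "cr") to a
    hypothetical endpoint identifier for the host that recorded it.
    """
    client_endpoints = {ep for a, ep in annotations.items() if a in CLIENT_SIDE}
    server_endpoints = {ep for a, ep in annotations.items() if a in SERVER_SIDE}
    # Only one side reported (CS-CR, SR-SS, CS alone): a single clock, no skew.
    if not client_endpoints or not server_endpoints:
        return False
    # Both sides reported: skew is possible only if more than one host is involved.
    return len(client_endpoints | server_endpoints) > 1


# Loopback call: both sides on the same endpoint, so no skew.
print(skew_possible({"cs": "a", "sr": "a", "ss": "a", "cr": "a"}))
# RPC span across two hosts: skew is possible.
print(skew_possible({"cs": "a", "sr": "b", "ss": "b", "cr": "a"}))
```

The last case in the list (CS-CR with an SR-SS child) spans two separate spans, so a per-span check like this one would not catch it; it would need to look at the parent and child together.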
We now have two choices:
I tend to prefer the first solution.
After a discussion with @adriancole, it seems that flattening the tree can lose data (e.g. which span id should we use when flattening data?). The solution would be to adapt the clock skew algorithm (or at least how we look at the tree). We also need to adapt it in the dependency linker.
@adriancole, @fedj, any progress made on dealing with clock skew when spans come from different hosts? Even when using ntpd, the clocks are very close, but evidently not close enough:
What's the best way to deal with this when deploying distributed tracing?
It appears as if you might not be using shared RPC spans. https://github.com/openzipkin/zipkin/issues/1480 tracks some issues around this.
The current algorithm can correct obvious skew, where the child occurs before the parent. This case is not that one: the child is just far to the right.
What can definitely be corrected is a child that starts after the parent completed, since that cannot happen. This would be better, as it would at least shift the child left to the end of getId. We could assume that, when clock skew is present, the RPC child should be contained in its parent (or at least that we should attempt to contain it).
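As a rough illustration of the containment idea, the sketch below clamps a child's start so that the child fits inside its parent whenever it can. This is not the existing skew algorithm; the function name `contain_child` and its parameters are hypothetical, and all values are assumed to be in the same time unit.

```python
def contain_child(parent_start: int, parent_duration: int,
                  child_start: int, child_duration: int) -> int:
    """Return an adjusted start time for the child span.

    The child is moved the minimum amount needed to sit inside the parent.
    If the child is longer than the parent it cannot be contained, so its
    start is simply aligned with the parent's start.
    """
    if child_duration >= parent_duration:
        return parent_start
    # Latest start that still lets the child finish before the parent does.
    latest_start = parent_start + parent_duration - child_duration
    # Clamp into [parent_start, latest_start].
    return min(max(child_start, parent_start), latest_start)


# Parent runs 0..100; child of length 40 reported far to the right at 250.
print(contain_child(0, 100, 250, 40))  # 60: shifted left so it ends with the parent
# A child already inside its parent is left untouched.
print(contain_child(0, 100, 10, 40))   # 10
```

Clamping moves the child as little as possible, which keeps whatever ordering information the reported timestamps still carry. Whether to apply it only when the parent and child come from different endpoints is a separate decision that this sketch leaves open.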
open to other ideas, too.
The TODO here is to check whether there are any missing cases that aren't in our unit tests yet. They are all JavaScript now.
In order to tackle #1480 on the clock skew part, I think that we first need to list possible cases:
Did I miss any use case?