open-telemetry / oteps

OpenTelemetry Enhancement Proposals
https://opentelemetry.io
Apache License 2.0

Proposal: Reduce clock-skew issues in mobile and other client-side trace sources #154

Open bryce-b opened 3 years ago

bryce-b commented 3 years ago

I'm creating this ticket per discussion in the OpenTelemetry maintainers' meeting 05/10/2021

Clock-skew will always be a problem with distributed tracing, but the degree of skew that occurs on unmanaged devices (by 'unmanaged' I mean devices outside of the software provider's control) is untenable.

[Screenshot: Screen Shot 2021-04-26 at 10 44 06 AM]

This screenshot shows the degree of clock skew between a mobile device and a backend server while tracing a synchronous request. The mobile device is using an automatically synced system clock, but the skew could be much, much worse, since the clock can be set at the whim of the phone's owner (think days, months, or years of skew).

I'd like to brainstorm some solutions to this problem. Some possible solutions could be:

Oberon00 commented 3 years ago

(Clocks are hard. Except for Linux's clock_gettime(CLOCK_BOOTTIME), which may not be available in the target runtime/language, I do not know of any other clock implementation that goes in lockstep with the epoch time. Especially on client systems, the typical monotonic clocks stop when the CPU is suspended (e.g. with a closed notebook lid, but I imagine it occurs even more often on battery-driven mobile devices). The realtime clock, on the other hand, is subject to being changed by the user on a whim.)
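To make the clock distinction concrete, here is a minimal Python sketch, assuming a runtime that exposes these clocks through the standard `time` module. The helper `elapsed_clock_ns` and the class `AnchoredClock` are hypothetical names for illustration, not part of any OpenTelemetry SDK. The idea is to read the wall clock once and then derive later timestamps from a clock the user cannot set, so a mid-session clock change does not affect them. It does nothing about the wall clock being wrong at the moment it is read.

```python
import time


def elapsed_clock_ns() -> int:
    """Nanoseconds from a clock that the user cannot set.

    Prefers CLOCK_BOOTTIME (Linux only; keeps counting during suspend) and
    falls back to time.monotonic_ns(), which may stop while suspended.
    """
    clock_id = getattr(time, "CLOCK_BOOTTIME", None)
    if clock_id is not None and hasattr(time, "clock_gettime_ns"):
        try:
            return time.clock_gettime_ns(clock_id)
        except OSError:
            pass  # clock not supported by this kernel; use the fallback
    return time.monotonic_ns()


class AnchoredClock:
    """Hypothetical helper: wall-clock anchor plus elapsed non-settable time."""

    def __init__(self) -> None:
        # Read both clocks once; only the elapsed clock is consulted afterwards.
        self._wall_anchor_ns = time.time_ns()
        self._elapsed_anchor_ns = elapsed_clock_ns()

    def now_ns(self) -> int:
        """Epoch nanoseconds derived from the anchor, not from the live wall clock."""
        return self._wall_anchor_ns + (elapsed_clock_ns() - self._elapsed_anchor_ns)
```

Timestamps from `AnchoredClock().now_ns()` never go backwards within a session, but they inherit whatever error the wall clock had when the anchor was taken.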

Oberon00 commented 3 years ago

Without having delved deeper into the topic, I don't think it is feasible to get sub-second synchronization across distributed systems with anything short of full-fledged NTP (which takes a few minutes to sync precisely). For a precision on the order of a few seconds, it may be enough to send the "current" time with each request, so the receiver can calculate the offset between the current time of the sender and its own current time.
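As a rough sketch of that idea in Python: the receiver subtracts the sender's reported "current" time from its own receive time and applies the difference to the sender's span timestamps. The names `Span`, `estimate_offset_ns`, and `shift_span` are made up for illustration and do not come from any OpenTelemetry API. The estimate ignores network transit time, so it is only as precise as the request latency allows.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical, simplified span with client-reported epoch timestamps."""
    name: str
    start_ns: int
    end_ns: int


def estimate_offset_ns(client_sent_at_ns: int, server_received_at_ns: int) -> int:
    """Offset to add to client timestamps to express them in server time.

    Network latency is not accounted for, so the result is off by roughly
    the one-way transit time of the request.
    """
    return server_received_at_ns - client_sent_at_ns


def shift_span(span: Span, offset_ns: int) -> Span:
    """Return a copy of the span moved onto the server's timeline."""
    return Span(span.name, span.start_ns + offset_ns, span.end_ns + offset_ns)
```

For example, if the client's clock runs three days ahead, `estimate_offset_ns` returns roughly minus three days, and `shift_span` moves the client's spans back next to the server spans while preserving their durations.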

iNikem commented 3 years ago

it may be enough to send the "current" time with each request, so the receiver can calculate the offset between the current time of the sender and its own current time.

This is more or less what we did in Plumbr

ivomagi commented 3 years ago

There is a blog post that explains, conceptually, how clock skew was handled back in the day: https://plumbr.io/blog/monitoring/time-in-distributed-systems

t2t2 commented 6 months ago

In case this issue becomes active again, here is an archive.org link for the blog post above: https://web.archive.org/web/20210123103641/https://plumbr.io/blog/monitoring/time-in-distributed-systems