trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Verify clock skewness between nodes #3730

Open findepi opened 4 years ago

findepi commented 4 years ago

Clock skew between nodes can lead to subtle and hard-to-diagnose problems (e.g. https://github.com/prestosql/presto/issues/3594). We should monitor for skew and report it when detected.

findepi commented 4 years ago

Since we already monitor workers from the coordinator, each worker could report its current time in the monitoring response. The coordinator could then check that the reported time is BETWEEN (time before request - grace) AND (time after request + grace), with a very small "grace".
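A minimal sketch of that window check, to make the proposal concrete. The class and method names (`ClockSkewCheck`, `isWithinWindow`) are hypothetical and not part of Trino; the sketch assumes the coordinator records its own clock immediately before and after the monitoring request.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not an existing Trino class.
final class ClockSkewCheck {
    private ClockSkewCheck() {}

    // Returns true when the worker-reported time falls inside
    // [beforeRequest - grace, afterRequest + grace], both bounds inclusive.
    static boolean isWithinWindow(Instant beforeRequest, Instant afterRequest, Instant reported, Duration grace) {
        Instant earliest = beforeRequest.minus(grace);
        Instant latest = afterRequest.plus(grace);
        return !reported.isBefore(earliest) && !reported.isAfter(latest);
    }

    public static void main(String[] args) {
        Instant before = Instant.parse("2024-01-01T00:00:00.000Z");
        Instant after = Instant.parse("2024-01-01T00:00:00.020Z");
        Duration grace = Duration.ofMillis(50); // illustrative value only

        // Reported time inside the request window.
        System.out.println(isWithinWindow(before, after, Instant.parse("2024-01-01T00:00:00.010Z"), grace));
        // Reported time 800 ms after the request started, well past the window.
        System.out.println(isWithinWindow(before, after, Instant.parse("2024-01-01T00:00:00.800Z"), grace));
    }
}
```

The request round-trip time widens the window on its own, so "grace" only has to cover measurement noise rather than network latency.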

@electrum wdyt?

dain commented 4 years ago

I think this can be handled in the HeartbeatFailureDetector. This already monitors every worker, and can disable workers that do not meet a requirement.

electrum commented 3 years ago

If there is significant clock skew outside the JWT window on either the coordinator or worker, the coordinator would not be able to talk to the worker, since the worker would reject the coordinator's JWT. We could possibly change the response for this case to be structured, such that the coordinator would know the specific reason and clock skew value.

The HTTP Date header, or perhaps a custom header with a UNIX timestamp representation, might also be useful for this.
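As a rough illustration of the Date header idea, the sketch below estimates skew by comparing the header value against the midpoint of the local request window. `DateHeaderSkew` and `estimateSkew` are hypothetical names, not Trino code. Note that the standard HTTP Date format has one-second resolution, so this approach cannot reliably detect sub-second skew; a custom header carrying a finer-grained timestamp would be needed for that.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not an existing Trino class.
final class DateHeaderSkew {
    private DateHeaderSkew() {}

    // Estimates remote-minus-local clock offset from an HTTP Date header value.
    // A positive result means the remote clock appears to be ahead.
    static Duration estimateSkew(String dateHeader, Instant beforeRequest, Instant afterRequest) {
        Instant remote = ZonedDateTime.parse(dateHeader, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant();
        // Use the midpoint of the request window as the best local estimate
        // of when the remote side generated the header.
        Instant midpoint = beforeRequest.plus(Duration.between(beforeRequest, afterRequest).dividedBy(2));
        return Duration.between(midpoint, remote);
    }

    public static void main(String[] args) {
        Instant before = Instant.parse("2024-09-04T23:40:26.000Z");
        Instant after = Instant.parse("2024-09-04T23:40:26.000Z");
        System.out.println(estimateSkew("Wed, 04 Sep 2024 23:40:31 GMT", before, after));
    }
}
```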

lozbrown commented 1 month ago

Hi

Is there a way to increase clock skew tolerance?

Last night I lost all my workers because of 0.76 seconds of clock skew. While I will look into the causes of the skew, a 0 millisecond tolerance for clock skew seems a little harsh.

2024-09-04T23:40:26.794499133Z stderr F io.jsonwebtoken.ExpiredJwtException: JWT expired 766 milliseconds ago at 2024-09-04T23:40:26.000Z. Current time: 2024-09-04T23:40:26.766Z. Allowed clock skew: 0 milliseconds.
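To illustrate the arithmetic behind that log line, here is a minimal sketch of an expiry check with an allowed-skew tolerance, using the timestamps from the exception message. `ExpiryWithSkew` and `isExpired` are hypothetical names; this is an assumption about how such a tolerance typically works, not the JWT library's actual implementation or a Trino configuration option.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper illustrating the tolerance arithmetic only.
final class ExpiryWithSkew {
    private ExpiryWithSkew() {}

    // A token counts as expired only once the local clock is past
    // expiration + allowedSkew.
    static boolean isExpired(Instant expiration, Instant now, Duration allowedSkew) {
        return now.isAfter(expiration.plus(allowedSkew));
    }

    public static void main(String[] args) {
        Instant expiration = Instant.parse("2024-09-04T23:40:26.000Z");
        Instant now = Instant.parse("2024-09-04T23:40:26.766Z");

        // With zero tolerance, being 766 ms late is enough to reject the token.
        System.out.println(isExpired(expiration, now, Duration.ZERO));
        // With a one-second tolerance (illustrative value), the same token is accepted.
        System.out.println(isExpired(expiration, now, Duration.ofSeconds(1)));
    }
}
```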