Open mlichvar opened 4 years ago
TFD_TIMER_CANCEL_ON_SET is borked on some older kernels. That was fixed in the kernel, and it's not really possible to work around in userspace.
That said, we probably should add a packet ratelimiter as safety net, just in case.
Looks remotely related to #17461
I'm not sure what issue the kernel had with TFD_TIMER_CANCEL_ON_SET, but with current kernels there still seems to be the problem that it's not possible to specify an infinite timeout. If the absolute timer is set to the maximum 32-bit time_t value (Jan 19 03:14:07 UTC 2038), the timer will expire when the system clock passes that time, even though the application is interested only in the cancellation due to a clock step. The kernel can go up to the year 2262. Applications need to be able to handle this case correctly.
With the latest glibc it seems clock_gettime() returns with EOVERFLOW when the current time doesn't fit into time_t, so even if they don't, hopefully they will abort instead of creating new timers that immediately expire.
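To make the two outcomes concrete, here is a minimal sketch of an absolute CLOCK_REALTIME timer armed with TFD_TIMER_CANCEL_ON_SET. It is not timesyncd's code: the names clock_watch_arm, clock_watch_check and the enum are made up for illustration, and the fallback #define is only there for older headers. The point is that read() reports a clock step as ECANCELED, while a deadline that has already passed shows up as an ordinary expiration, so a caller that re-arms with the same deadline after an expiration would loop.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

#ifndef TFD_TIMER_CANCEL_ON_SET
#define TFD_TIMER_CANCEL_ON_SET (1 << 1)   /* fallback for older libc headers */
#endif

/* Hypothetical result type for this sketch. */
enum clock_watch_event {
    WATCH_IDLE,       /* nothing happened yet */
    WATCH_CLOCK_SET,  /* CLOCK_REALTIME was stepped, timer was cancelled */
    WATCH_EXPIRED,    /* the clock reached (or is already past) the deadline */
    WATCH_ERROR
};

/* Arm a non-blocking absolute timer at deadline_sec (seconds since the epoch).
 * Returns the timer fd, or -1 on failure. */
static int clock_watch_arm(time_t deadline_sec) {
    int fd = timerfd_create(CLOCK_REALTIME, TFD_NONBLOCK | TFD_CLOEXEC);
    if (fd < 0)
        return -1;

    struct itimerspec its = { .it_value = { .tv_sec = deadline_sec, .tv_nsec = 0 } };
    if (timerfd_settime(fd, TFD_TIMER_ABSTIME | TFD_TIMER_CANCEL_ON_SET, &its, NULL) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Classify what the timer fd currently reports. */
static enum clock_watch_event clock_watch_check(int fd) {
    uint64_t expirations;
    ssize_t n = read(fd, &expirations, sizeof expirations);

    if (n == (ssize_t) sizeof expirations)
        return WATCH_EXPIRED;      /* must not be treated as a clock step */
    if (n < 0 && errno == ECANCELED)
        return WATCH_CLOCK_SET;
    if (n < 0 && errno == EAGAIN)
        return WATCH_IDLE;
    return WATCH_ERROR;
}
```

A caller would arm the timer once, for example with clock_watch_arm((time_t) INT32_MAX), and only re-arm after WATCH_CLOCK_SET. How WATCH_EXPIRED should be handled is a design decision the sketch leaves open.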
Anyway, this is just one specific cause of timesyncd sending requests at a high rate. The rate limiting in timesyncd should be general enough to handle them all.
The volunteer pool.ntp.org service in the Philippines is currently seriously degraded by what appears to be another case of timesyncd blasting requests at line rate. [1] The reason it appears to be timesyncd is that the transmit timestamp fraction is always less than 1 billion.
Please prioritize putting a guardrail in place to prevent queries more often than every few seconds. I suggest a small prime number so there is a fingerprint that will help identify the software involved.
For the benefit of all of digital civilization that increasingly depends on pool.ntp.org for myriad infrastructure, it is critical that all NTP client software take measures to ensure line-rate flooding of NTP packets is as unlikely as possible. Thank you for your efforts to keep timesyncd out of future headlines ;)
[1] https://community.ntppool.org/t/certain-servers-are-not-replying/3238/23?u=davehart
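For anyone who wants to check a capture for the fingerprint mentioned above, here is a rough sketch. It assumes the standard 48-byte NTP header, where the transmit timestamp fraction is the big-endian 32-bit field at byte offset 44. The function names are hypothetical. Since a genuine 32-bit fraction falls below 1 billion only about 23% of the time (10^9 / 2^32), a single packet proves nothing; the signal is a long run of packets from one source that all pass the check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NTP_HEADER_LEN          48  /* NTP header without extension fields */
#define NTP_TX_FRACTION_OFFSET  44  /* fraction part of the transmit timestamp */

/* Extract the transmit timestamp fraction from a raw NTP packet.
 * Returns false if the buffer is too short to hold an NTP header. */
static bool ntp_tx_fraction(const uint8_t *pkt, size_t len, uint32_t *fraction) {
    if (pkt == NULL || fraction == NULL || len < NTP_HEADER_LEN)
        return false;

    const uint8_t *p = pkt + NTP_TX_FRACTION_OFFSET;
    *fraction = ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
                ((uint32_t) p[2] << 8)  |  (uint32_t) p[3];
    return true;
}

/* The fingerprint from the report: fraction strictly below 1 billion. */
static bool fraction_below_1e9(uint32_t fraction) {
    return fraction < 1000000000u;
}
```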
systemd version the issue has been seen with
Used distribution
Linux kernel version used (uname -a)
Unexpected behaviour you saw
Steps to reproduce the problem
I'm running a few public NTP servers in pool.ntp.org. Recently, I had a broken client that was sending ~2000 requests per second for three weeks continuously, consuming almost as much traffic as all other clients combined. From the packet capture it looked like timesyncd. I wasn't able to contact the owner of the address, so I'm not sure what exactly was happening there. This is just my best guess.
I think the problem is that timesyncd sends a new request whenever it detects the system clock was set, using a timer specified with the TFD_TIMER_CANCEL_ON_SET flag. Normally there shouldn't be anything else except timesyncd touching the clock, but it's unreasonable to expect that to always be true. There can be another application, e.g. a different NTP/PTP client, or maybe just a test suite that is running some clock-related tests. In the case I saw I suspect it was a 32-bit system where the system clock somehow got out of the 32-bit range (aka Y2038), i.e. the timer expired and caused an endless stream of events on the descriptor. I can reproduce this on a Debian i386 machine.

Please add a separate safety mechanism to limit the rate of NTP requests, one that doesn't use timers. For example, there could be a check with each request that the tv_sec value of the CLOCK_MONOTONIC clock has changed. If something is messing with the CLOCK_REALTIME clock, there will be at most one request per second.