systemd / systemd

The systemd System and Service Manager
https://systemd.io
GNU General Public License v2.0

timesyncd doesn't limit rate of requests #17470

Open mlichvar opened 4 years ago

mlichvar commented 4 years ago

systemd version the issue has been seen with

246.6

Used distribution

Fedora

Linux kernel version used (uname -a)

5.8.16-300.fc33.x86_64

Unexpected behaviour you saw

timesyncd sending NTP requests at a very high rate

Steps to reproduce the problem

  1. set NTP in /etc/systemd/timesyncd.conf to 127.0.0.1 to avoid flooding public servers
  2. systemctl restart systemd-timesyncd
  3. start tcpdump -i lo port 123 in another terminal
  4. while true; do date -s '+0 sec'; done

I'm running a few public NTP servers in pool.ntp.org. Recently, I had a broken client that was sending ~2000 requests per second for three weeks continuously, consuming almost as much traffic as all other clients combined. From the packet capture it looked like timesyncd. I wasn't able to contact the owner of the address, so I'm not sure what exactly was happening there. This is just my best guess.

I think the problem is that timesyncd sends a new request whenever it detects the system clock was set, using a timer specified with the TFD_TIMER_CANCEL_ON_SET flag. Normally there shouldn't be anything else except timesyncd touching the clock, but it's unreasonable to expect that to always be true. There can be another application, e.g. a different NTP/PTP client, or maybe just a test suite that is running some clock-related tests. In the case I saw, I suspect it was a 32-bit system where the system clock somehow got out of the 32-bit range (aka Y2038), i.e. the timer expired and caused an endless stream of events on the descriptor. I can reproduce this on a Debian i386 machine.
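For readers unfamiliar with the mechanism, here is a minimal sketch of how a clock-change notification can be built on timerfd with TFD_TIMER_CANCEL_ON_SET. It is not timesyncd's actual code; the helper names `arm_clock_change_fd` and `clock_step_pending` and the return-value convention are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdint.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: arm an absolute CLOCK_REALTIME timer at deadline_sec
 * that is cancelled when the clock is set. Returns the fd, or -1 on error. */
int arm_clock_change_fd(time_t deadline_sec) {
    struct itimerspec its = { .it_value = { .tv_sec = deadline_sec, .tv_nsec = 0 } };
    int fd = timerfd_create(CLOCK_REALTIME, TFD_NONBLOCK | TFD_CLOEXEC);

    if (fd < 0)
        return -1;
    if (timerfd_settime(fd, TFD_TIMER_ABSTIME | TFD_TIMER_CANCEL_ON_SET, &its, NULL) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: 1 = clock was set, 0 = nothing happened yet,
 * 2 = the absolute deadline itself was reached, -1 = other error. */
int clock_step_pending(int fd) {
    uint64_t expirations;
    ssize_t n = read(fd, &expirations, sizeof expirations);

    if (n == (ssize_t) sizeof expirations)
        return 2;               /* plain expiry, not a clock step */
    if (n < 0 && errno == ECANCELED)
        return 1;               /* CLOCK_REALTIME changed discontinuously */
    if (n < 0 && errno == EAGAIN)
        return 0;               /* timer still pending */
    return -1;
}
```

The case that matters for this report is return value 2: if the deadline is already in the past when the timer is armed, every re-arm expires immediately, and a caller that treats any wakeup as "the clock was set" would loop.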

Please add a separate safety mechanism to limit the rate of NTP requests, one that doesn't use timers. For example, there could be a check with each request that the tv_sec value of the CLOCK_MONOTONIC clock has changed. If something is messing with the CLOCK_REALTIME clock, there will be at most one request per second.
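A rough sketch of the check proposed above, assuming a small state struct kept next to the socket. The names `request_limiter`, `limiter_allow_at` and `limiter_allow` are hypothetical and do not exist in systemd.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <time.h>

/* Hypothetical state: the CLOCK_MONOTONIC second of the last request. */
struct request_limiter {
    bool have_last;
    time_t last_sec;
};

/* Allow a request only if the monotonic second differs from the one
 * in which the previous request was allowed. */
bool limiter_allow_at(struct request_limiter *l, time_t now_sec) {
    if (l->have_last && now_sec == l->last_sec)
        return false;
    l->have_last = true;
    l->last_sec = now_sec;
    return true;
}

/* Same check against the real clock. CLOCK_MONOTONIC is not affected by
 * settimeofday()/clock_settime(), so stepping CLOCK_REALTIME cannot reset it. */
bool limiter_allow(struct request_limiter *l) {
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC, &ts) < 0)
        return false;           /* fail closed: no clock, no packet */
    return limiter_allow_at(l, ts.tv_sec);
}
```

The send path would call `limiter_allow()` right before transmitting and drop the request when it returns false. No timer is involved, so a misbehaving timer cannot bypass it.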

poettering commented 4 years ago

TFD_TIMER_CANCEL_ON_SET is borked on some older kernels. That was fixed in the kernel, and it's not really possible to work around in userspace.

That said, we probably should add a packet ratelimiter as safety net, just in case.

Looks remotely related to #17461

mlichvar commented 4 years ago

I'm not sure what issue the kernel had with TFD_TIMER_CANCEL_ON_SET, but with current kernels there still seems to be the problem that it's not possible to specify an infinite timeout. If the absolute timer is set to the maximum 32-bit time_t value (Jan 19 03:14:07 UTC 2038), the timer will expire when the system clock passes that time, even though the application is interested only in the cancellation due to a clock step. The kernel can go up to the year 2262. Applications need to be able to handle this case correctly.
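The two dates can be checked with a little arithmetic. The sketch below assumes that the 2262 limit corresponds to a signed 64-bit count of nanoseconds since 1970, which is an interpretation and is not stated in this thread; the helper names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Largest value a signed 32-bit time_t can hold. */
int64_t max_time32_sec(void) {
    return INT32_MAX;
}

/* Whole seconds in the largest signed 64-bit nanosecond count (assumption). */
int64_t max_ns64_sec(void) {
    return INT64_MAX / 1000000000;
}

/* Break `sec` (seconds since the Unix epoch) down into UTC fields.
 * Returns 0 on success, -1 if the value does not fit this platform's
 * time_t (e.g. a 32-bit time_t) or cannot be converted. */
int utc_from_epoch(int64_t sec, struct tm *out) {
    time_t t = (time_t) sec;

    if ((int64_t) t != sec)
        return -1;
    return gmtime_r(&t, out) ? 0 : -1;
}
```

On a platform with a 64-bit time_t, `utc_from_epoch(max_time32_sec(), &tm)` yields 2038-01-19 03:14:07 UTC, matching the date above, and `utc_from_epoch(max_ns64_sec(), &tm)` lands on 2262-04-11.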

With the latest glibc it seems clock_gettime() returns with EOVERFLOW when the current time doesn't fit into time_t, so even if they don't, hopefully they will abort instead of creating new timers that immediately expire.

Anyway, this is just one specific cause of timesyncd sending requests at a high rate. The rate limiting in timesyncd should be general enough to handle them all.

hart-NTP commented 5 months ago

The volunteer pool.ntp.org service in the Philippines is currently seriously degraded by what appears to be another case of timesyncd sending requests at line rate. [1] The reason it appears to be timesyncd is that the transmit timestamp fraction is always less than 1 billion.
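To illustrate why that fingerprint is distinctive: the fraction field of an NTP timestamp is 32 bits in units of 2^-32 s, so a properly scaled fraction falls below 1,000,000,000 only about 23% of the time (10^9 / 2^32). One possible explanation, which is an assumption here and not confirmed in this thread, is that the client puts raw nanoseconds into the field. A sketch of the check a server operator might run over captured packets, with hypothetical function names:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Convert nanoseconds (0..999999999) to an NTP fraction in units of 2^-32 s. */
uint32_t nsec_to_ntp_fraction(uint32_t nsec) {
    return (uint32_t) (((uint64_t) nsec << 32) / 1000000000u);
}

/* Heuristic: true if every captured fraction (in host byte order) is below
 * one billion. With n properly scaled samples the chance of this happening
 * by accident is roughly 0.233^n. */
bool all_fractions_below_1e9(const uint32_t *frac, size_t n) {
    if (n == 0)
        return false;
    for (size_t i = 0; i < n; i++)
        if (frac[i] >= 1000000000u)
            return false;
    return true;
}
```

With a few dozen packets from the same source, a true result makes accidental agreement very unlikely, but it identifies a behaviour and not a specific program.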

Please prioritize putting a guardrail in place to prevent queries more often than every few seconds. I suggest a small prime number so there is a fingerprint that will help identify the software involved.

For the benefit of all of digital civilization that increasingly depends on pool.ntp.org for myriad infrastructure, it is critical that all NTP client software take measures to ensure line-rate flooding of NTP packets is as unlikely as possible. Thank you for your efforts to keep timesyncd out of future headlines ;)

[1] https://community.ntppool.org/t/certain-servers-are-not-replying/3238/23?u=davehart