vcos_semaphore_wait_timeout() and clock adjusting

mdevaev commented 3 years ago

Describe the bug vcos_semaphore_wait_timeout() uses CLOCK_REALTIME and sem_timedwait(). If the time was adjusted during vcos_semaphore_wait_timeout() (via NTP for example) then it will either wait longer than the specified timeout (if the clock is moved back), or it will not wait for it (if forward). When using sem_timedwait() this is a known problem (1, 2). For sem_timedwait() on Linux, this is the expected behavior (although in QNX there is a sem_timedwait_monotonic() that uses CLOCK_MONOTONIC) since it accepts an absolute timestamp. For vcos_semaphore_wait_timeout(), I think this behavior is incorrect, because the timeout value must be set to a relative value.

As a fix, I could use a check on my side: compare monotonic timestamps before and after calling vcos_semaphore_wait_timeout() and run this again if the timeout was not reached. But this solution will not help if the clock has been moved to the past and the wait may be increased.

To reproduce The problem is very rare and I don't have a case for reproducing it. But I think my analysis seems to point to the problem fairly accurately. I encountered this bug when using ustreamer. It uses a semaphore when encoding via OMX.

Expected behaviour The timeout should not be affected by the clock adjusting.

Actual behaviour Adjusting the clock affects the timeout.

mdevaev commented 3 years ago

Another possible workaround is busyloop (pseudo-c):

long double deadline_ts = get_now_monotonic() + timeout;
VCOS_STATUS_T sem_status;

while (true) {
    sem_status = vcos_semaphore_trywait(sem);
    if (sem_status != VCOS_EAGAIN || get_now_monotonic() > deadline_ts) {
        return sem_status;
    }
    usleep(1000);
}

The obvious problem is usleep() and a large number of unnecessary calls of vcos_semaphore_trywait() (i.e. sem_trywait()).

mtlynch commented 3 years ago

There is a consistent repro for this bug during boot. The Pi doesn't have a built-in clock, so when you shut it off, its clock will lag behind real-world time until the Pi syncs with NTP during boot.

If you run an application that calls vcos_semaphore_wait_timeout during boot, there's a race condition that makes it easy to trigger this issue. For example, imagine the following sequence:

Raspberry Pi powers on and begins booting
Pi OS launches FooApp as a systemd service during boot
FooApp calls vcos_semaphore_wait_timeout with a 30 second timeout
NTP service syncs time, advancing the system clock forward by 2 minutes
FooApp instantly gets back VCOS_EAGAIN from vcos_semaphore_wait_timeout even though 30 seconds have not yet elapsed

This bug can also trigger at any other time that the Pi adjusts its clock forward or backwards, but it's easiest to trigger during boot, as there's generally a significant time jump forward for the time that the Pi has been powered off.

raspberrypi / userland

vcos_semaphore_wait_timeout() and clock adjusting #658