r-lib / later

Schedule an R function or formula to run after a specified period of time.
https://r-lib.github.io/later
Other
137 stars 27 forks source link

Callbacks can execute in wrong order #150

Closed wch closed 3 years ago

wch commented 3 years ago

Originally from https://github.com/rstudio/shiny/issues/3429. I've been able to reproduce this problem on Docker for Mac, using versions 2.5.0.0 and above. Older versions seem fine. Also, I can't seem to reproduce it when running R directly on the Mac, or R on Linux, or R in Docker on Linux.

When the C API is used and tasks are scheduled very quickly, sometimes they do not execute in the correct order.

I've posted example code at https://gist.github.com/wch/d96b3c66644282a312c5df22433d46cd

wch commented 3 years ago

Looking at the results from the example code above, this is what happens in each test run:

Each test run generates some data. The expected values would be something like this:

   idx value
1    1     1
2    2     2
3    3     3
4    4     4
5    5     5
6    6     6
...

idx represents the order in which a callback is run. value represents the order in which a callback was scheduled. So each row should be the same.

Here are some actual results, with the "good" (matching) rows removed, and only the mismatched rows displayed:

     idx value
1  91172 91696
2  91173 91697
3  91174 91698
4  91175 91699
5  91176 91700
6  91177 91701
7  91178 91702
8  91179 91172
9  91180 91173
10 91181 91703
11 91182 91174
12 91183 91704
13 91184 91705
14 91185 91706
15 91186 91707
...

I suspect (but don't know for sure) that these mixups happen because of poor time resolution in the Docker container running in the VM, and because there's a bug with the callbackId for the Callback objects, either when generating them or when comparing them (when the when values for two Callback objects are identical, the callbackId is used to break ties and decide which one executes first.) https://github.com/r-lib/later/blob/9335226596f85de890e6f8c610d126a855d7ad01/src/callback_registry.h#L26-L33

wch commented 3 years ago

I made a branch which recorded the timestamps of Callback objects as they are created, into a vector<Timestamp>. https://github.com/r-lib/later/compare/check-order

What I found is that the timestamps increase monotonically, except at the locations where the callback executes in the wrong order. For example, notice how these values increase until index 9:

> times[486410:486430] - times[1]
 [1] 14.90152 14.90182 14.90186 14.90188 14.90193 14.90196 14.90197 14.90200
 [9] 14.86805 14.86815 14.86816 14.86821 14.86836 14.86837 14.86857 14.86867
[17] 14.86868 14.86871 14.86875 14.86882 14.86882

So this means that the problem is that, when the Timestamp objects are created, they don't always have increasing values.

This is the code that's run when they're created. It boils down to a call to gettimeofday(), and then a conversion from a timeval to a timespec: https://github.com/r-lib/later/blob/9096d0c3e7f44a0fb5d3991724978afbc1bf36c8/src/timestamp_unix.cpp#L7-L21

I wonder if this is due to a bug in the Docker VM.

wch commented 3 years ago

It looks like the decreasing gettimeofday() results may be due to time drift correction in the VM.

Here are some SO questions where the user experienced a similar problem with VirtualBox: https://stackoverflow.com/questions/53081264/gettimeofday-returns-inaccurate-result https://stackoverflow.com/questions/4801122/how-to-stop-time-from-running-backwards-on-linux/4801176

The second one links to an issue in the VirtualBox tracker: https://www.virtualbox.org/ticket/7915 And a blog post titled "gettimeofday() should never be used to measure time": https://blog.habets.se/2010/09/gettimeofday-should-never-be-used-to-measure-time.html

wch commented 3 years ago

Here's an example adapted from one of the SO links above, which demonstrates that gettimeofday() is nonmonotonic -- the clock can go backwards. It basically calls gettimeofday() repeatedly and checks if the time is less than the previous time; if so, it prints out some information about it. With the code as written, it runs for 20 seconds and then returns.

docker run --rm -ti rocker/shiny /bin/bash
R
install.packages(c('cpp11', 'later', 'brio', 'callr', 'cli', 'decor', 'desc',
  'tibble', 'vctrs'))

library(cpp11)
cpp_source(
  code = '
#include <unistd.h>
#include <stdio.h>
#include <sys/time.h>

// Returns time in milliseconds
unsigned long GetTickCount() {
  struct timeval start, end;
  unsigned long myTickCount;
  gettimeofday(&start, NULL);
  myTickCount = start.tv_sec*1000 + start.tv_usec/1000;
  return myTickCount;
}

[[cpp11::register]]
void time_test() {
  int i = 0;
  unsigned long start = GetTickCount();
  unsigned long a = start;
  unsigned long b;
  // Run for 20 seconds
  while(a - start < 20000) {
    a = GetTickCount();
    // Sleep for 0.1ms
    usleep(100);
    b = GetTickCount();
    if (a > b) {
      printf(
        "iteration: %ld, a-start: %ld a: %ld, b:%ld, b-a=%ld\\n",
        i, a-start, a, b, b-a
      );
    }
    i++;
  }
}
')

time_test()
#> iteration: 93320, a-start: 12493 a: 1624722553189, b:1624722553168, b-a=-21

The result above is what happens in Docker 2.5.0.0 on Mac. After about 12.5 seconds, it recorded an instance where the clock went backward by 21 milliseconds.

I think the solution may be to use clock_gettime(CLOCK_MONOTONIC, ...).

jcheng5 commented 3 years ago

Great find!! Maybe CLOCK_BOOTTIME on Linux? https://stackoverflow.com/a/3527632

wch commented 3 years ago

The Docker blog has a post about how they handle time drift on Mac: https://www.docker.com/blog/addressing-time-drift-in-docker-desktop-for-mac/

As long as CLOCK_BOOTTIME is monotonic, it should fix the issue.