jepsen-io / jepsen

A framework for distributed systems verification, with fault injection
Docker clock changes cause test failures #568

Open deriamis opened 1 year ago

deriamis commented 1 year ago

We (MongoDB) have seen behavior in recent versions of Jepsen that causes test failures in about 60% of runs due to unexpected clock skew. Specifically, the clock is left incorrect between runs, which causes subsequent apt update commands in the containers to fail certificate validation. Our Jepsen tests run with clock skew disabled, so we aren't sure why the clock is being changed, but the recent change that makes test node containers privileged and grants them ALL capabilities appears to be what makes it possible.

Interestingly, this only seems to be a problem on test hosts that have an NTP client running. When we run the tests on our virtual workstations, which do not have an NTP client running, the tests succeed. It seems that the clock skew in the test node containers is somehow racing with the NTP client, which causes the observed failures. However, as stated above, we have not yet been able to determine why the clock skew occurs in the first place.
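
One possible way to narrow down when the clock changes is to watch for wall-clock steps directly. The sketch below is not part of Jepsen; the function names and the 0.5 second threshold are assumptions chosen for illustration. It samples the wall clock and the monotonic clock together and reports any interval in which the two advanced by different amounts, which is what a manual clock set or an NTP step correction looks like. Since Docker containers share the host kernel's wall clock, running it on the host during a test run should be enough to timestamp each jump.

```python
import time


def sample():
    """Return a (wall_clock, monotonic) pair taken back to back."""
    return (time.time(), time.monotonic())


def detect_step(prev, cur, threshold=0.5):
    """Return the wall-clock step in seconds between two samples, or None.

    A step is the difference between how far the wall clock advanced and
    how far the monotonic clock advanced over the same interval. Gradual
    NTP slewing stays far below the (assumed) threshold; a clock set or
    an NTP step correction does not.
    """
    wall_delta = cur[0] - prev[0]
    mono_delta = cur[1] - prev[1]
    step = wall_delta - mono_delta
    return step if abs(step) > threshold else None


def watch(interval=1.0):
    """Print a line whenever the wall clock jumps. Runs until interrupted."""
    prev = sample()
    while True:
        time.sleep(interval)
        cur = sample()
        step = detect_step(prev, cur)
        if step is not None:
            print(f"wall clock stepped by {step:+.3f}s at {time.ctime(cur[0])}")
        prev = cur


if __name__ == "__main__":
    watch()
```

Correlating the reported timestamps with the Jepsen test log and the NTP client's log may show whether a test node or the NTP client moved the clock first.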