We're getting some really bad intermittency from a pair of tests in
the periodic job enqueuer [1] [2] that need fixing ASAP since they're
failing at least one matrix entry in almost every build.
Looking at the tests, they add a pair of jobs after starting the client,
each of which operates on a 500 ms interval, and one with RunOnStart.
The test case waits for the initial RunOnStart job to appear, then
waits again for the next job that wasn't RunOnStart.
I think what's happening is that since 500ms is fairly long for a
computer, it's not crazy long, so it's actually possible for the tests
in CI to be operating slowly enough that by the time the enqueuer gets
around to its first run, it actually has been 500ms already, and
therefore both jobs have been enqueued, even the non-RunOnStart one.
Another thing that I realize looking closer at these test cases is that
because of the way they're structured, they're slower than hell. Each
one requires a wait of 500 ms because it's waiting for the
non-RunOnStart job, so running them at high -count iterations is a
total non-starter.
Here, I'm suggesting that we simplify things by:
Increase period by 10x from 500ms to 5s.
Don't bother waiting for the non-RunOnStart job to be enqueued.
The fact that the RunOnStart job has been confirmed is a pretty good
chance that both jobs were indeed added properly, so I don't think we
have to wait for the other one too. This resolves the intermittency
problem, but also makes each test case probably on the order of 10,000
times faster.
@bgentry Ran the test matrix 4x times and although I got a different intermittent error (will try to fix that separately), this pull seems to fix its target.
We're getting some really bad intermittency from a pair of tests in the periodic job enqueuer [1] [2] that need fixing ASAP since they're failing at least one matrix entry in almost every build.
Looking at the tests, they add a pair of jobs after starting the client, each of which operates on a 500 ms interval, and one with
RunOnStart
. The test case waits for the initialRunOnStart
job to appear, then waits again for the next job that wasn'tRunOnStart
.I think what's happening is that since 500ms is fairly long for a computer, it's not crazy long, so it's actually possible for the tests in CI to be operating slowly enough that by the time the enqueuer gets around to its first run, it actually has been 500ms already, and therefore both jobs have been enqueued, even the non-
RunOnStart
one.Another thing that I realize looking closer at these test cases is that because of the way they're structured, they're slower than hell. Each one requires a wait of 500 ms because it's waiting for the non-
RunOnStart
job, so running them at high-count
iterations is a total non-starter.Here, I'm suggesting that we simplify things by:
RunOnStart
job to be enqueued.The fact that the
RunOnStart
job has been confirmed is a pretty good chance that both jobs were indeed added properly, so I don't think we have to wait for the other one too. This resolves the intermittency problem, but also makes each test case probably on the order of 10,000 times faster.Fixes #409.
[1] https://github.com/riverqueue/river/actions/runs/9770889034/job/26972789182?pr=413 [2] https://github.com/riverqueue/river/actions/runs/9770889034/job/26972789334?pr=413