sr-gi closed this 1 year ago
I saw some instances in the tests that wait on `API_DELAY` instead of `POLLING_TIME`.
Funny tho, the instance linked above actually waits zero seconds, yet the test succeeds every time (without ever waiting for the polling time duration).
Oh yeah, I corrected that on the other PR and forgot about it here, let me see if I can patch it without breaking it 😅
More on why the instances waiting on `API_DELAY` never failed in the CI:
```rust
// Init the manager
let task = tokio::spawn(async move {
    RetryManager::new(
        wt_client_clone,
        rx,
        MAX_ELAPSED_TIME,
        LONG_AUTO_RETRY_DELAY,
        MAX_INTERVAL_TIME,
    )
    .manage_retry()
    .await
});

// Send the data
tx.send((tower_id, RevocationData::Fresh(appointment.locator)))
    .unwrap();

// Wait for the elapsed time and check how the tower status changed
tokio::time::sleep(Duration::from_secs((API_DELAY / 2.0) as u64)).await;

// Check
assert!(wt_client
    .lock()
    .unwrap()
    .get_retrier_status(&tower_id)
    .unwrap()
    .is_running());
```
It looks like `tokio::spawn` always starts the async future late enough after `tx.send` is performed that the retry manager always hits the `Ok(...)` branch when `try_recv()`ing from the channel.

To make the test fail, you can try adding a little sleep before `tx.send` (e.g. `tokio::time::sleep(Duration::from_secs_f64(0.01)).await;`). This delays `tx.send` enough that the async future starts first, fails the `try_recv()`, goes to the `Err(...)` branch, and waits `POLLING_TIME`. We then wait `(API_DELAY / 2.0) as u64` seconds, which is actually zero, and the assertion fails since `POLLING_TIME` isn't over yet.
We can reduce the number of cases where we sleep `POLLING_TIME` if we always do `tx.send` before `tokio::spawn`ing the retry manager task. If we send before spawning, we can be sure the spawned task won't wait `POLLING_TIME` before processing our sent data. Not really important for correctness, but it will lower the testing time by a few seconds or so.
> We can reduce the number of cases where we sleep `POLLING_TIME` if we always do `tx.send` before `tokio::spawn`ing the retry manager task. If we send before spawning we will be sure that the spawned task won't wait `POLLING_TIME` before processing our sent data. Not really important for correctness, but will only lower the testing time by a few seconds or so.
Yes! That's it!
Looks like I can finally make the timings make proper sense.
Awesome catch 🎉 🎉 🎉
Addressed the comments and slightly reduced `MAX_RUN_TIME` (from 0.5 to 0.2).
Tested it in a loop for both OSX and low-resource Ubuntu without issues.
I tried to repro the failed test in the CI and wasn't able to (in a loop for about 12h). But I was able to repro it when I lowered `MAX_RUN_TIME` to `0.01` in just a couple of runs, so it should happen a lot when `MAX_RUN_TIME` is decreased.
The only possible logical explanation I can think of is that the CI boxes spend so much time in `self.run`, more than 0.2s for some reason, and I think the reason is that so many threads run at once (each test function's thread, plus the threads spawned from inside the test functions/code) that the `self.run` thread sometimes starves and doesn't finish quickly. Might also be network issues, but I'm not sure those can actually take that much time without timing out.

I'm not totally sure whether the two explanations are actually the reasons for this failure (or even make much sense 😕), but I am sure the test is logically correct at this point, and experimentation says the same.
> Might also be network issues but not sure if that can actually take that much time without timing out.
```rust
#[tokio::main]
async fn main() {
    let start = tokio::time::Instant::now();
    let res = reqwest::Client::new()
        .get(format!("{}", "http://unreachable.tower"))
        .send()
        .await;
    println!("Response: {:?}", res);
    println!("Took: {:?}", tokio::time::Instant::now() - start);
}
```
Looks like it's possible for a request to take that much time before getting a DNS error (at least on my slow network), so network issues might be one valid reason.

Note that it takes that much time (0.34s) only on the first run; subsequent runs finish in less than 50ms. Probably the OS caches DNS results or something.
The polling time for the Retrier was hardcoded to 1; let's at least use a constant for that.

Also, `retrier::tests::test_manage_retry_while_idle` was randomly failing (on Ubuntu) when checking whether the Retrier was idle after giving up on a retry. This is due to the time of running a round not being taken into account.

Turns out this is super brittle, so I'm going to modify the timers as little as possible for now.

Many timers have been modified to make actual sense.

Supersedes #182 (Close #182)
Edit: @mariocynicys found why this was brittle in https://github.com/talaia-labs/rust-teos/pull/184#issuecomment-1405635662, looks like it isn't anymore 😄