talaia-labs / rust-teos

The Eye of Satoshi - Lightning Watchtower
https://talaia.watch
MIT License

Defines Retrier polling waiting time as a constant and fixes tests #184

Closed · sr-gi closed this 1 year ago

sr-gi commented 1 year ago

The polling time for the Retrier was hardcoded to 1; let's at least use a constant for that.
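
For context, a minimal sketch of the shape of this change (the constant's value matches the hardcoded 1 mentioned above; the surrounding program is an assumption for illustration, not the actual diff):

use std::time::Duration;

// Assumed value: the previously hardcoded 1-second wait, now behind a named constant.
const POLLING_TIME: u64 = 1;

#[tokio::main]
async fn main() {
    // Sleep for one polling round using the constant instead of a magic number.
    tokio::time::sleep(Duration::from_secs(POLLING_TIME)).await;
    println!("polled once after {}s", POLLING_TIME);
}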

Also, retrier::tests::test_manage_retry_while_idle was failing intermittently (on Ubuntu) when checking whether the Retrier was idle after giving up on a retry. This was due to the time it takes to run a round not being taken into account.

This turned out to be super brittle, so I initially intended to modify the timers as little as possible. In the end, many timers have been modified so they actually make sense.

Supersedes #182 (Closes #182)

Edit:

@mariocynicys found out why this was brittle in https://github.com/talaia-labs/rust-teos/pull/184#issuecomment-1405635662; looks like it isn't brittle anymore 😄

sr-gi commented 1 year ago

I saw some instances in the tests that wait on API_DELAY instead of POLLING_TIME.

Funny though, the instance linked above actually waits zero seconds, yet the test succeeds every time (without ever waiting for the polling time duration).
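
(To illustrate why that wait is zero seconds: the sleep casts the result of a float division to u64, so it truncates. The value of API_DELAY below is a hypothetical stand-in; any value under 2.0 behaves the same way.)

fn main() {
    // Hypothetical value for illustration; any API_DELAY below 2.0 yields a 0-second wait here.
    const API_DELAY: f64 = 0.5;
    let wait_secs = (API_DELAY / 2.0) as u64; // 0.25 truncates to 0
    assert_eq!(wait_secs, 0);
    println!("Duration::from_secs({}) -> no wait at all", wait_secs);
}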

Oh yeah, I corrected that on the other PR and forgot about it here, let me see if I can patch it without breaking it 😅

mariocynicys commented 1 year ago

More on why instances with API_DELAY never failed in the CI:

// Init the manager
let task = tokio::spawn(async move {
    RetryManager::new(
        wt_client_clone,
        rx,
        MAX_ELAPSED_TIME,
        LONG_AUTO_RETRY_DELAY,
        MAX_INTERVAL_TIME,
    )
    .manage_retry()
    .await
});
// Send the data
tx.send((tower_id, RevocationData::Fresh(appointment.locator)))
    .unwrap();

// Wait for the elapsed time and check how the tower status changed
tokio::time::sleep(Duration::from_secs((API_DELAY / 2.0) as u64)).await;
// Check
assert!(wt_client
    .lock()
    .unwrap()
    .get_retrier_status(&tower_id)
    .unwrap()
    .is_running());

It looks like tokio::spawn always starts the async future late enough after tx.send is performed that the retry manager always hits the Ok(...) branch when try_recv()ing from the channel.
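
A minimal, self-contained sketch of that race (the names and the loop are assumptions standing in for the real RetryManager, which is more involved):

use std::time::Duration;
use tokio::sync::mpsc;

// Assumed polling wait, matching the 1-second value discussed above.
const POLLING_TIME: u64 = 1;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel::<&str>();

    // Spawned before the send, as in the test: whether this task sees the data
    // immediately depends on whether its first try_recv() runs before or after tx.send().
    let task = tokio::spawn(async move {
        loop {
            match rx.try_recv() {
                // The send already happened: process the data right away (Ok branch).
                Ok(data) => {
                    println!("processing {}", data);
                    break;
                }
                // The send hasn't happened yet: sleep a full POLLING_TIME first (Err branch).
                Err(_) => tokio::time::sleep(Duration::from_secs(POLLING_TIME)).await,
            }
        }
    });

    tx.send("revocation data").unwrap();
    task.await.unwrap();
}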

To make the test fail, you can try adding a little sleep before tx.send (e.g. tokio::time::sleep(Duration::from_secs_f64(0.01)).await;). This delays tx.send enough that the async future starts first, fails the try_recv(), goes to the Err(...) branch, and waits POLLING_TIME. We then wait (API_DELAY / 2.0) as u64 seconds, which is actually zero, and the assertion fails since POLLING_TIME isn't over yet.

We can reduce the number of cases where we sleep POLLING_TIME if we always do tx.send before tokio::spawning the retry manager task. If we send before spawning, we can be sure that the spawned task won't wait POLLING_TIME before processing our sent data. Not really important for correctness, but it should lower the testing time by a few seconds or so.
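
A sketch of that suggested ordering, reusing the names from the test snippet quoted above (so not self-contained on its own): the send happens first, and only then is the manager task spawned, so its first try_recv() already finds the data.

// Queue the data first...
tx.send((tower_id, RevocationData::Fresh(appointment.locator)))
    .unwrap();

// ...then spawn the manager, so its first try_recv() never hits the Err branch
// and never sleeps POLLING_TIME before processing what we just sent.
let task = tokio::spawn(async move {
    RetryManager::new(
        wt_client_clone,
        rx,
        MAX_ELAPSED_TIME,
        LONG_AUTO_RETRY_DELAY,
        MAX_INTERVAL_TIME,
    )
    .manage_retry()
    .await
});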

sr-gi commented 1 year ago

We can reduce the number of cases where we sleep POLLING_TIME if we always do tx.send before tokio::spawning the retry manager task. If we send before spawning, we can be sure that the spawned task won't wait POLLING_TIME before processing our sent data. Not really important for correctness, but it should lower the testing time by a few seconds or so.

Yes! That's it!

Looks like I can finally make the timings make proper sense.

Awesome catch 🎉 🎉 🎉

sr-gi commented 1 year ago

Addressed the comments and slightly reduced MAX_RUN_TIME (from 0.5 to 0.2).

Tested it in a loop for both OSX and low-resource Ubuntu without issues.

mariocynicys commented 1 year ago

I tried to repro the failed test in the CI and wasn't able to (in a loop for about 12h). But I was able to repro it when I lowered MAX_RUN_TIME to 0.01, within just a couple of runs. So it should happen a lot when MAX_RUN_TIME is decreased.

The only possible logical explanation I can think of is that the CI boxes spend a lot of time in self.run, more than 0.2s for some reason. I think the reason is that so many threads run at once (each test function's thread, plus the threads spawned from inside the test functions/code) that the self.run thread sometimes starves and doesn't finish quickly. It might also be network issues, but I'm not sure those can actually take that much time without timing out.

I'm not totally sure whether these two explanations are actually the reasons for this failure (or even make much sense :confused:), but I am sure the test is logically correct at this point, and experimentation says the same.

mariocynicys commented 1 year ago

It might also be network issues, but I'm not sure those can actually take that much time without timing out.

#[tokio::main]
async fn main() {
    let start = tokio::time::Instant::now();
    let res = reqwest::Client::new()
        .get("http://unreachable.tower")
        .send()
        .await;
    println!("Response: {:?}", res);
    println!("Took: {:?}", tokio::time::Instant::now() - start);
}

[screenshot of the program's output: the request errors out with a DNS error after ~0.34s]

Looks like it's possible for a request to take that much time before getting a DNS error (at least on my slow network). So network issues might be one valid reason.

Note that it only takes that long (0.34s) on the first run; subsequent runs finish in less than 50ms. Probably the OS caches DNS results or something.