Closed sr-gi closed 1 year ago
Tested this in a loop using docker with the following setup:
docker run -it --cpus="1.0" --memory="1g" ubuntu
And I was able to consistently reproduce the error within a few seconds. After slightly increasing the delay I was unable to reproduce it again (after minutes of running). Just for the record, this is how I run it:
#[test]
fn loop_test_manage_retry_while_idle() {
loop {
test_manage_retry_while_idle()
}
}
cargo test retrier::tests::loop_test_manage_retry_while_idle -- --exact
I just understood where the issue was.
So the max_elapsed_time
doesn't actually stop self.run
right away when the max time has passed. What it does is that it waits for self.run
to finish and then doesn't allow a next chance since we exceeded max_elapsed_time
, which if self.run
kept running longer than expected for some reason (low machine spec? OS related reason?) and didn't fail quickly, we might exceed max_elapsed_time
by a little bit.
retry_notify(
ExponentialBackoff {
max_elapsed_time: Some(Duration::from_secs(max_elapsed_time_secs as u64)),
max_interval: Duration::from_secs(max_interval_time_secs as u64),
..ExponentialBackoff::default()
},
|| async { self.run().await },
|err, _| {
log::warn!("Retry error happened with {}. {}", self.tower_id, err);
},
).await;
I just understood where the issue was.
So the
max_elapsed_time
doesn't actually stopself.run
right away when the max time has passed. What it does is that it waits forself.run
to finish and then doesn't allow a next chance since we exceededmax_elapsed_time
, which ifself.run
kept running longer than expected for some reason (low machine spec? OS related reason?) and didn't fail quickly, we might exceedmax_elapsed_time
by a little bit.retry_notify( ExponentialBackoff { max_elapsed_time: Some(Duration::from_secs(max_elapsed_time_secs as u64)), max_interval: Duration::from_secs(max_interval_time_secs as u64), ..ExponentialBackoff::default() }, || async { self.run().await }, |err, _| { log::warn!("Retry error happened with {}. {}", self.tower_id, err); }, ).await;
Does this mean that, in the worse case, we need to wait for about max_elapsed_time + max_interval
? i.e:
max_elapsed_time
minus a fraction of a second has passed and we initiate a new round that lasts max_interval
Does this mean that, in the worse case, we need to wait for about max_elapsed_time + max_interval?
Not max_interval
, but the time it takes to run self.run
which is not strictly defined, but should be fairly small.
max_interval
relates to the backoff wait time between one failing self.run
and another:
run1 (takes as much time as it takes until it fails)
wait period of min(max_interval, exponential_backoff)
start run2 ...
Does this mean that, in the worse case, we need to wait for about max_elapsed_time + max_interval?
Not
max_interval
, but the time it takes to runself.run
which is not strictly defined, but should be fairly small.
max_interval
relates to the backoff wait time between one failingself.run
and another:
run1 (takes as much time as it takes until it fails)
wait period of min(max_interval, exponential_backoff)
start run2 ...
Right, that makes sense. I'll go over all the fix waits and see if there is any other to refine that can be squashed here too.
Rebased to redefine some of the waits and define a wait time for the Retrier, instead of having it hardcoded.
Tested this in a loop for both OSX and ubuntu (low resources) and seems to work fine.
nvm about this one @mariocynicys, we will go with #184 for now since it's less invasive
retrier::tests::test_manage_retry_while_idle
was randomly failing (for Ubuntu) when checking whether the Retrier was idle after giving up on a retry.Some of the tests success is based on waiting time (given we need the retriers to either succeed or fail). Re-define the waits so they are more accurate. Unfortunately, there is no way of perfectly doing this, but the current setup seems not to fail