talaia-labs / rust-teos

The Eye of Satoshi - Lightning Watchtower
https://talaia.watch
MIT License
128 stars 62 forks source link

Redefine waiting times in Retrier #182

Closed sr-gi closed 1 year ago

sr-gi commented 1 year ago

retrier::tests::test_manage_retry_while_idle was randomly failing (for Ubuntu) when checking whether the Retrier was idle after giving up on a retry.

Some of the tests success is based on waiting time (given we need the retriers to either succeed or fail). Re-define the waits so they are more accurate. Unfortunately, there is no way of perfectly doing this, but the current setup seems not to fail

sr-gi commented 1 year ago

Tested this in a loop using docker with the following setup:

docker run -it --cpus="1.0" --memory="1g" ubuntu

And I was able to consistently reproduce the error within a few seconds. After slightly increasing the delay I was unable to reproduce it again (after minutes of running). Just for the record, this is how I run it:

#[test]
fn loop_test_manage_retry_while_idle() {
    loop {
        test_manage_retry_while_idle()
    }
}
cargo test retrier::tests::loop_test_manage_retry_while_idle -- --exact
mariocynicys commented 1 year ago

I just understood where the issue was.

So the max_elapsed_time doesn't actually stop self.run right away when the max time has passed. What it does is that it waits for self.run to finish and then doesn't allow a next chance since we exceeded max_elapsed_time, which if self.run kept running longer than expected for some reason (low machine spec? OS related reason?) and didn't fail quickly, we might exceed max_elapsed_time by a little bit.

retry_notify(
    ExponentialBackoff {
        max_elapsed_time: Some(Duration::from_secs(max_elapsed_time_secs as u64)),
        max_interval: Duration::from_secs(max_interval_time_secs as u64),
        ..ExponentialBackoff::default()
    },
    || async { self.run().await },
    |err, _| {
        log::warn!("Retry error happened with {}. {}", self.tower_id, err);
    },
).await;
sr-gi commented 1 year ago

I just understood where the issue was.

So the max_elapsed_time doesn't actually stop self.run right away when the max time has passed. What it does is that it waits for self.run to finish and then doesn't allow a next chance since we exceeded max_elapsed_time, which if self.run kept running longer than expected for some reason (low machine spec? OS related reason?) and didn't fail quickly, we might exceed max_elapsed_time by a little bit.

retry_notify(
    ExponentialBackoff {
        max_elapsed_time: Some(Duration::from_secs(max_elapsed_time_secs as u64)),
        max_interval: Duration::from_secs(max_interval_time_secs as u64),
        ..ExponentialBackoff::default()
    },
    || async { self.run().await },
    |err, _| {
        log::warn!("Retry error happened with {}. {}", self.tower_id, err);
    },
).await;

Does this mean that, in the worse case, we need to wait for about max_elapsed_time + max_interval? i.e:

max_elapsed_time minus a fraction of a second has passed and we initiate a new round that lasts max_interval

mariocynicys commented 1 year ago

Does this mean that, in the worse case, we need to wait for about max_elapsed_time + max_interval?

Not max_interval, but the time it takes to run self.run which is not strictly defined, but should be fairly small.

max_interval relates to the backoff wait time between one failing self.run and another:

sr-gi commented 1 year ago

Does this mean that, in the worse case, we need to wait for about max_elapsed_time + max_interval?

Not max_interval, but the time it takes to run self.run which is not strictly defined, but should be fairly small.

max_interval relates to the backoff wait time between one failing self.run and another:

  • run1 (takes as much time as it takes until it fails)
  • wait period of min(max_interval, exponential_backoff)
  • start run2 ...

Right, that makes sense. I'll go over all the fix waits and see if there is any other to refine that can be squashed here too.

sr-gi commented 1 year ago

Rebased to redefine some of the waits and define a wait time for the Retrier, instead of having it hardcoded.

Tested this in a loop for both OSX and ubuntu (low resources) and seems to work fine.

sr-gi commented 1 year ago

nvm about this one @mariocynicys, we will go with #184 for now since it's less invasive