talaia-labs / rust-teos

The Eye of Satoshi - Lightning Watchtower
https://talaia.watch
MIT License
128 stars 62 forks source link

Adds auto-retry logic to watchtower-plugin #168

Closed sr-gi closed 1 year ago

sr-gi commented 1 year ago

Current State

The current version of the watchtower-plugin for CLN will try to re-send data to towers that are not reachable for various reasons using an exponential backoff strategy. However, once the strategy is given up on, the tower will be flagged as unreachable and not further retried (unless the user manually triggers it through the retrytower rpc).

Improvements

This PR adds auto-retry logic, so once a specific unreachable tower is given up on, the retrier is not removed but remains idle and it's launched again after a specific delay (controlled by the watchtower-auto-retry-delay). From this point on, retriers will only be removed from the RetryManager is they fail due to a permanent error, but not due to a transient one (in which case they will be auto-retried).

In order to achieve this, the retriers now return more meaningful errors (instead of strings) that are used to distinguish why it had to give up retrying, if so.

The way data is shared between the WTClient and the RetryManager has also been updated in order to achieve this. Beforehand, data were sent one by one in (tower_id, locator) tuples, now, it is sent in (tower_id, RevocationData) tuples, in order to distinguish between retries/bootstraps and sending data because of a channel update.

Bug fixes

This also fixes a bug related to re-registering, where data regarding pending and invalid appointments was being swept from memory after re-registering.

sr-gi commented 1 year ago

Fixed the typos, will work on the fundamental issues next

sr-gi commented 1 year ago

Added the changes requested on discord so the state of the retriers can be accessed by the WTClient.

I've slightly adapted the code so it also takes advantage of this to, for instance, send the retry notification to the RetryManager without the need to include any associated data to it.

This should be ready modulo discussing the register errors severity (and any found issue or bug ofc).

sr-gi commented 1 year ago

Pushed a fix of the race condition that was going on with the Retrier. @mariocynicys if you're happy with all the changes I'll squash the fixing commits and give this a final look before merging.

mariocynicys commented 1 year ago

Pushed a fix of the race condition that was going on with the Retrier. @mariocynicys if you're happy with all the changes I'll squash the fixing commits and give this a final look before merging.

Cool, there is only one thing missing tho. sub and register errors being permanent or tempo?

My take on this is (note: i might be a bit biased to the temporary/idle side xD), sub error go tempo/idle and register errors can be either tempo or permanent:

If we go by this, we would need to introduce some sort of logic saying whether each individual variant in the RetryError enum is temporary or fatal. We can then add this as a followup but keep [sub => tempo & register => either permanent or tempo (have no strong choice)] for now.

Edit: Sorry, I completely missed 3c2b292. It implements what I wrote in this comment already.

sr-gi commented 1 year ago

LGTM

Rebased. This should be good for a final diff and check.