refactor: isolate faulty channels and retry channel task on faults

luketchang commented 2 years ago

High Level Changes:

Allows kathy, relayer, and processor to isolate failures at the channel level and retry the failed channel's task, instead of crashing the whole agent when one channel fails
Note that isolating channel failures is not relevant to the updater (updater only touches home)
This behavior did not seem desirable for the watcher

Code Changes

Agent::run no longer borrows &self and instead takes an agent-specific <Agent>Channel struct that holds all the data needed to run one home <> replica channel
Agent::run_many builds an <Agent>Channel struct and hands it off to an Agent::run task; if the run task errors out, run_many logs the error and restarts the task instead of returning the error to the top level
Watcher and updater ignore this pattern, as they must override Agent::run_all
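
For reviewers, the retry pattern described above can be sketched roughly as follows. This is a simplified, synchronous illustration using only the standard library, not the code in this PR: `ExampleChannel`, `run_with_retry`, the `max_attempts` bound, and this `run_many` signature are all hypothetical names chosen for the example, and OS threads stand in for the agents' async tasks.

```rust
use std::fmt::Debug;
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for an agent-specific `<Agent>Channel` struct:
/// the data one home <> replica task needs, owned rather than borrowed.
#[derive(Clone, Debug)]
pub struct ExampleChannel {
    pub home: String,
    pub replica: String,
}

/// Runs one channel task, restarting it on error instead of propagating
/// the error upward. Returns the number of faults observed.
/// Assumption: `max_attempts` bounds the loop so the example terminates.
pub fn run_with_retry<C, E, F>(channel: C, max_attempts: u32, mut run: F) -> u32
where
    C: Clone,
    E: Debug,
    F: FnMut(C) -> Result<(), E>,
{
    let mut faults = 0;
    while faults < max_attempts {
        match run(channel.clone()) {
            Ok(()) => break,
            Err(err) => {
                faults += 1;
                eprintln!("channel task failed (fault #{faults}): {err:?}; restarting");
            }
        }
    }
    faults
}

/// Builds one channel per replica and supervises each on its own thread,
/// so a faulty channel is retried without affecting the others.
/// Returns `(replica, fault_count)` pairs in the order of `replicas`.
pub fn run_many<F>(
    home: &str,
    replicas: &[&str],
    max_attempts: u32,
    run: F,
) -> Vec<(String, u32)>
where
    F: Fn(ExampleChannel) -> Result<(), String> + Send + Sync + 'static,
{
    let run = Arc::new(run);
    let handles: Vec<_> = replicas
        .iter()
        .map(|replica| {
            let channel = ExampleChannel {
                home: home.to_owned(),
                replica: (*replica).to_owned(),
            };
            let run = Arc::clone(&run);
            thread::spawn(move || {
                let name = channel.replica.clone();
                let faults = run_with_retry(channel, max_attempts, |c| (*run)(c));
                (name, faults)
            })
        })
        .collect();

    handles
        .into_iter()
        .map(|handle| handle.join().expect("channel thread panicked"))
        .collect()
}
```

Because each task owns its channel struct instead of borrowing `&self`, it can be restarted independently of the agent and of the other channels.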

TODO:

[ ] add unit tests to mock faulty RPC
[x] add exponential backoff for retries
[x] metric to track number of channel faults
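
As a rough illustration of the exponential backoff item, the delay before each restart could be computed along these lines. The helper name `backoff_delay` and its `base` and `max` parameters are assumptions made for this sketch and are not taken from the PR.

```rust
use std::time::Duration;

/// Hypothetical helper: delay before the `attempt`-th restart (0-based),
/// doubling from `base` on each attempt and capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // 2^attempt, saturating so large attempt counts cannot overflow.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    // If the multiplication overflows, fall back to the cap.
    base.checked_mul(factor).unwrap_or(max).min(max)
}
```

A supervising loop would sleep for `backoff_delay(faults, base, max)` before restarting a failed channel task, so a persistently faulty RPC is not hammered with immediate retries.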

Closes #161

luketchang commented 2 years ago

@arnaud036 @yourbuddyconner let's run this PR in dev before merging

nomad-xyz / nomad-monorepo