Hermes cannot `start` if it encounters problems with refreshing a client

adizere commented 2 years ago

Crate

ibc-relayer ibc-relayer-cli

Summary of Bug

This is a bug reported by a relayer operator.

The symptom reported is as follows

when a client is "too old", hermes enter in an infinite loop and don't care of the other chains

Logs below

Feb 11 14:27:16 hermes[28227]: 2022-02-11T14:27:16.117113Z  INFO ThreadId(95) refresh{client=07-tendermint-502 src_chain=dig-1 dst_chain=cosmoshub-4}: [dig-1 -> cosmoshub-4:07-tendermint-502] client requires refresh
Feb 11 14:27:16 hermes[28227]: 2022-02-11T14:27:16.120718Z  WARN ThreadId(95) refresh{client=07-tendermint-502 src_chain=dig-1 dst_chain=cosmoshub-4}: task encountered ignorable error: error raised while updating client on chain dig-1: failed building header with error: Light client error for RPC address dig-1: rpc error: response error: Internal error: height 728068 is not available, lowest height is 753967 (code: -32603)
Feb 11 14:27:16 hermes[28227]: 2022-02-11T14:27:16.713544Z  INFO ThreadId(01) Hermes has started
Feb 11 14:27:17 hermes[28227]: 2022-02-11T14:27:17.121817Z  INFO ThreadId(95) refresh{client=07-tendermint-502 src_chain=dig-1 dst_chain=cosmoshub-4}: [dig-1 -> cosmoshub-4:07-tendermint-502] client requires refresh
Feb 11 14:27:17 hermes[28227]: 2022-02-11T14:27:17.125273Z  WARN ThreadId(95) refresh{client=07-tendermint-502 src_chain=dig-1 dst_chain=cosmoshub-4}: task encountered ignorable error: error raised while updating client on chain dig-1: failed building header with error: Light client error for RPC address dig-1: rpc error: response error: Internal error: height 728068 is not available, lowest height is 753967 (code: -32603)
Feb 11 14:27:18 hermes[28227]: 2022-02-11T14:27:18.126185Z  INFO ThreadId(95) refresh{client=07-tendermint-502 src_chain=dig-1 dst_chain=cosmoshub-4}: [dig-1 -> cosmoshub-4:07-tendermint-502] client requires refresh
Feb 11 14:27:18 hermes[28227]: 2022-02-11T14:27:18.129670Z  WARN ThreadId(95) refresh{client=07-tendermint-502 src_chain=dig-1 dst_chain=cosmoshub-4}: task encountered ignorable error: error raised while updating client on chain dig-1: failed building header with error: Light client error for RPC address dig-1: rpc error: response error: Internal error: height 728068 is not available, lowest height is 753967 (code: -32603)

I get an infinite loop of the "refresh" message after this.

The fix was

I have removed the DIG chain from our config and it's back in operation

Version

$ hermes version
hermes 0.11.0+288a1ab4

Steps to Reproduce

Unclear. A suggestion below:

The first step we need to do is to get a better understanding of the problem, to do so we need to find the exact RPC call that Hermes makes which triggers this error:

Feb 11 14:27:17 hermes[28227]: 2022-02-11T14:27:17.121817Z INFO ThreadId(95) refresh{client=07-tendermint-502 src_chain=dig-1 dst_chain=cosmoshub-4}: [dig-1 -> cosmoshub-4:07-tendermint-502] client requires refresh Feb 11 14:27:17 hermes[28227]: 2022-02-11T14:27:17.125273Z WARN ThreadId(95) refresh{client=07-tendermint-502 src_chain=dig-1 dst_chain=cosmoshub-4}: task encountered ignorable error: error raised while updating client on chain dig-1: failed building header with error: Light client error for RPC address dig-1: rpc error: response error: Internal error: height 728068 is not available, lowest height is 753967 (code: -32603)
It is likely that we can instrument the relayer logic for the client refresh task to achieve the following:
- insert a return Err(..) instead of the RPC call (from step 1) that triggers the error, to mimic the conditions we see in production
- this should make the refresh task fail, and consequently will make the Hermes start task hang on that failure, thus reproducing the problem

Acceptance Criteria

[ ] Hermes start can proceed despite client refresh task failures

For Admin Use

[X] Not duplicate issue
[X] Appropriate labels applied
[X] Appropriate milestone (priority) applied
[ ] Appropriate contributors tagged
[ ] Contributor assigned/self-assigned

ancazamfir commented 2 years ago

It's not clear to me what the problem is at this point. From the error messages it seems that the block for the trusted height required for the client refresh cannot be found, maybe block pruning was enabled.

I get an infinite loop of the "refresh" message after this.

This happens every second until the client expires (this will go on for ~1/3rd of trusting period). The refresh is retried every second for TaskError::Ignore error (which is the case here). However, the relayer is responsive, other workers can proceed, packets are relayed, etc. I tested this with current master, maybe things were different beginning of the year. True, it is an annoying log message (warn!)and it will go on for 3 days for the typical 14 day trusting period (3wk unbonding).

For this we could:

figure out the pruning window W (in number of blocks) and refresh at either 1/3rd of the trusting period duration (before client expires) or when W blocks have been created since latest trusted (before the latest trusted block is pruned). Finding W is not straightforward as it's a combination of many parameters (evidence in genesis, app and block pruning params in app.toml, etc), some of which we cannot access. We could add another chain parameter to the hermes config.
dig out the reason from error and consider this a fatal error (TaskError::Fatal) and stop trying spawning the refresh
make the message a debug :)

adizere commented 2 years ago

However, the relayer is responsive, other workers can proceed, packets are relayed, etc.

This is great to hear. The biggest issue signalled here was that Hermes couldn't start; glad that's not the case anymore, thank you for the investigation Anca! Removing the "high" priority label.

dig out the reason from error and consider this a fatal error (TaskError::Fatal) and stop trying spawning the refresh

This seems like the most appropriate way to improve logs. At the same time, we have operators who explicitly told us to implement "stubborn" retrying, so not sure we should proceed with the fix at all. Maybe we could consider the error fatal & stop retrying but only after the first 100 retries or something like that?

informalsystems / hermes