Open adizere opened 2 years ago
It's not clear to me what the problem is at this point. From the error messages it seems that the block for the trusted height required for the client refresh cannot be found, maybe block pruning was enabled.
I get an infinite loop of the "refresh" message after this.
This happens every second until the client expires (this will go on for ~1/3rd of the trusting period). The refresh is retried every second for a TaskError::Ignore error (which is the case here).
However, the relayer is responsive, other workers can proceed, packets are relayed, etc. I tested this with current master; maybe things were different at the beginning of the year. True, it is an annoying log message (warn!) and it will go on for 3 days for the typical 14 day trusting period (3wk unbonding).
For this we could:
dig out the reason from error and consider this a fatal error (TaskError::Fatal) and stop trying spawning the refresh
However, the relayer is responsive, other workers can proceed, packets are relayed, etc.
This is great to hear. The biggest issue signalled here was that Hermes couldn't start; glad that's not the case anymore, thank you for the investigation Anca! Removing the "high" priority label.
dig out the reason from error and consider this a fatal error (TaskError::Fatal) and stop trying spawning the refresh
This seems like the most appropriate way to improve logs. At the same time, we have operators who explicitly told us to implement "stubborn" retrying, so not sure we should proceed with the fix at all. Maybe we could consider the error fatal & stop retrying but only after the first 100 retries or something like that?
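The "fatal only after N retries" compromise could look roughly like the sketch below. It is not Hermes code: `TaskError` here is a simplified local stand-in modelling only the two variants discussed in this thread, and `RetryBudget`, `observe` and `max_retries` are hypothetical names introduced for illustration.

```rust
/// Simplified stand-in for the relayer's task error type.
/// Assumption: only the two variants mentioned in this thread are modelled.
#[derive(Debug, PartialEq)]
pub enum TaskError<E> {
    /// The failure is logged and the task is retried on the next tick.
    Ignore(E),
    /// The task stops and is not spawned again.
    Fatal(E),
}

/// Hypothetical helper: tolerates a bounded number of consecutive
/// ignorable failures, then escalates the next one to `Fatal`.
pub struct RetryBudget {
    max_retries: u32,
    consecutive_failures: u32,
}

impl RetryBudget {
    pub fn new(max_retries: u32) -> Self {
        RetryBudget {
            max_retries,
            consecutive_failures: 0,
        }
    }

    /// Passes the result of one refresh attempt through the budget.
    /// A success resets the counter; an `Ignore` failure is escalated to
    /// `Fatal` once the initial attempt and `max_retries` retries have
    /// all failed in a row.
    pub fn observe<E>(&mut self, result: Result<(), TaskError<E>>) -> Result<(), TaskError<E>> {
        match result {
            Ok(()) => {
                self.consecutive_failures = 0;
                Ok(())
            }
            Err(TaskError::Ignore(e)) => {
                self.consecutive_failures += 1;
                if self.consecutive_failures > self.max_retries {
                    Err(TaskError::Fatal(e))
                } else {
                    Err(TaskError::Ignore(e))
                }
            }
            // Already fatal: pass through unchanged.
            Err(fatal) => Err(fatal),
        }
    }
}
```

With `RetryBudget::new(100)` the refresh would stay "stubborn" for the initial attempt plus 100 retries (well under two minutes at one attempt per second) and only then give up, which bounds the log noise without removing retrying altogether. Whether the counter should reset on success, as it does here, is a design choice the thread leaves open.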
Crate
ibc-relayer
ibc-relayer-cli
Summary of Bug
This is a bug reported by a relayer operator.
The symptom reported is as follows
Logs below
The fix was
Version
Steps to Reproduce
Unclear. A suggestion below:
The first step is to get a better understanding of the problem. To do so, we need to find the exact RPC call that Hermes makes which triggers this error:
It is likely that we can instrument the relayer logic for the client refresh task to achieve the following:
return Err(..) instead of the RPC call (from step 1) that triggers the error, to mimic the conditions we see in production
start task hang on that failure, thus reproducing the problem
Acceptance Criteria
start can proceed despite client refresh task failures
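The fault-injection idea from Steps to Reproduce, together with the acceptance criterion, can be sketched as follows. None of these names come from Hermes: `HeaderSource`, `AlwaysFailing`, `refresh_client` and this `start` function are hypothetical, and the single trait method stands in for whichever RPC call turns out to trigger the error.

```rust
use std::thread::{self, JoinHandle};

/// Hypothetical abstraction over the one RPC call the refresh depends on.
pub trait HeaderSource {
    fn fetch_trusted_header(&self, height: u64) -> Result<String, String>;
}

/// Fault-injecting source: always returns `Err(..)` instead of performing
/// the RPC call, mimicking a node that no longer has the requested block.
pub struct AlwaysFailing;

impl HeaderSource for AlwaysFailing {
    fn fetch_trusted_header(&self, height: u64) -> Result<String, String> {
        Err(format!("block at height {} not found (injected)", height))
    }
}

/// Hypothetical refresh step: fails whenever the header cannot be fetched.
pub fn refresh_client(source: &dyn HeaderSource, trusted_height: u64) -> Result<(), String> {
    let _header = source.fetch_trusted_header(trusted_height)?;
    Ok(())
}

/// Hypothetical startup path: the refresh runs on its own thread, so a
/// failing refresh cannot block the caller. The handle is returned so the
/// outcome can be inspected later.
pub fn start<S>(source: S, trusted_height: u64) -> JoinHandle<Result<(), String>>
where
    S: HeaderSource + Send + 'static,
{
    thread::spawn(move || refresh_client(&source, trusted_height))
}
```

A test built along these lines would call `start` with `AlwaysFailing`, check that the call returns immediately, and then join the handle to confirm the refresh failed with the injected error rather than hanging.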