hyperlane-xyz / hyperlane-monorepo

The home for Hyperlane core contracts, sdk packages, and other infrastructure
https://hyperlane.xyz

QuorumProvider - get_quorum_number should be more resilient to provider errors #1398

Open tkporter opened 1 year ago

tkporter commented 1 year ago

Noticed an annoying issue: one of the providers in our quorum set was consistently down. When the relayer tried to send a transaction, it would fetch the gas price using eth_gasPrice, which uses get_quorum_number behind the scenes. Because get_quorum_number doesn't eagerly resolve once it has received a quorum of responses, it waited until the broken inner provider returned either Ok or Err. The problem is that the inner providers are themselves retrying providers, so the eth_gasPrice call would wait while the inner provider tried eth_gasPrice, got an error, slept a little, tried again, got another error, slept again, and so on. This resulted in roughly 30s of sleeping before the relayer could actually fire off its transaction. We should avoid this somehow!

Our original motivation for not eagerly returning was to be able to weed out outliers for e.g. eth_blockNumber. In that case, we'd want to return the lowest of the highest numbers returned, i.e. the lowest value among a quorum's worth of the highest responses. This way we weed out very dramatic outliers and know that a quorum of providers is aware of a certain block height.
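To make the selection rule concrete, here is a minimal sketch, assuming responses are plain u64 block numbers and `quorum` is the number of providers that must agree. `quorum_number` is a hypothetical helper written for illustration, not the actual get_quorum_number implementation.

```rust
/// Returns the lowest of the `quorum` highest responses, or `None` if fewer
/// than `quorum` responses are available. (Hypothetical helper, for illustration.)
pub fn quorum_number(responses: &[u64], quorum: usize) -> Option<u64> {
    if quorum == 0 || responses.len() < quorum {
        return None;
    }
    let mut sorted = responses.to_vec();
    // Sort descending so the first `quorum` entries are the highest values.
    sorted.sort_unstable_by(|a, b| b.cmp(a));
    // The last of those entries is a height that at least `quorum`
    // providers have reached.
    Some(sorted[quorum - 1])
}
```

With responses `[1_000_000, 101, 100]` and a quorum of 2, this yields `Some(101)`: the wild outlier is ignored, and at least two providers are known to be at height 101 or above.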

Some ideas:

  1. Add a timeout within get_quorum_number.
  2. Once we have a quorum of successful responses, perform a quick check, e.g. is the std dev of all responses <= X? If so, return. Otherwise, wait until we get more responses.
  3. Any other ideas?
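
A rough sketch of how ideas 1 and 2 might combine, under some loud assumptions: provider responses arrive on a std `mpsc` channel as `Result<u64, String>` (the real providers are async, so this only models the control flow), and `can_return_early`, `collect_responses`, and `max_std_dev` are hypothetical names, not part of the existing QuorumProvider.

```rust
use std::sync::mpsc::Receiver;
use std::time::{Duration, Instant};

/// Population standard deviation of the responses; `None` if empty.
pub fn std_dev(values: &[u64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let n = values.len() as f64;
    let mean = values.iter().map(|&v| v as f64).sum::<f64>() / n;
    let variance = values
        .iter()
        .map(|&v| {
            let d = v as f64 - mean;
            d * d
        })
        .sum::<f64>()
        / n;
    Some(variance.sqrt())
}

/// Idea 2: a quorum of successes whose spread is small enough.
pub fn can_return_early(successes: &[u64], quorum: usize, max_std_dev: f64) -> bool {
    successes.len() >= quorum
        && std_dev(successes).map_or(false, |sd| sd <= max_std_dev)
}

/// Collects successful responses until the early-return check passes,
/// the timeout (idea 1) elapses, or every sender has hung up.
pub fn collect_responses(
    rx: &Receiver<Result<u64, String>>,
    quorum: usize,
    max_std_dev: f64,
    timeout: Duration,
) -> Vec<u64> {
    let deadline = Instant::now() + timeout;
    let mut successes = Vec::new();
    loop {
        if can_return_early(&successes, quorum, max_std_dev) {
            break;
        }
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(Ok(value)) => successes.push(value),
            // A provider error: ignore it and keep waiting for the others.
            Ok(Err(_)) => {}
            // Timed out or all providers finished: stop waiting.
            Err(_) => break,
        }
    }
    successes
}
```

The caller would still need to decide what to do if the returned set is smaller than the quorum, e.g. surface an error rather than a number. A slow or retrying provider then costs at most `timeout`, not the full retry backoff.
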
nambrot commented 6 months ago

@tkporter @daniel-savu does this still apply?

tkporter commented 6 months ago

It does still mostly apply

tkporter commented 6 months ago

just definitely low prio