hyperlane-xyz / hyperlane-monorepo

The home for Hyperlane core contracts, sdk packages, and other infrastructure
https://hyperlane.xyz

Relayer queue metrics should only be updated when an operation changes to a different stage/task #4068

Open tkporter opened 4 weeks ago

tkporter commented 4 weeks ago

Problem

We have 3 different tasks in the op_submitter:

  1. prepare
  2. submit
  3. confirm

Each task has a queue of operations. Whenever we pop from this queue to look for some operations that may be ready for work, we update the hyperlane_submitter_queue_length metric for the given queue name.

Imo the metric is most useful as an indication of how many operations are in the given task/stage, and not how many are specifically in the queue. It's an implementation detail that an operation in a particular task can either be in a queue or having work done on it.

Now that we have big batch sizes, this causes a problem. We update the queue length metric as part of pop_many (https://github.com/hyperlane-xyz/hyperlane-monorepo/blob/main/rust/agents/relayer/src/msg/op_submitter.rs#L204) and only push the not-ready operations back at https://github.com/hyperlane-xyz/hyperlane-monorepo/blob/main/rust/agents/relayer/src/msg/op_submitter.rs#L243. So we very frequently pop off around 32 operations and take a while to add them back to the queue, because we wait for the ready ones to prepare.

It results in these jagged metrics and makes alert conditions a bit more funky

[image: screenshot of the queue length metrics]

The upper part of that screenshot shows the jaggedness in a post-32 world. The lower part is the neutron context, where we were doing batches of 4 until around 4:10pm, when I moved to 32 and the jaggedness became more extreme.

I think this was fine when we were just popping 1 at a time, but I wonder if we should change it to avoid the jaggedness. Imo messages that have been popped off are still in that queue's stage, so the metrics are confusing.
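To make the effect concrete, here is a minimal sketch of the behavior described above. It is not the actual op_submitter code: `MeteredQueue`, its `gauge` field, and `gauge_trace` are hypothetical stand-ins, and the gauge is just a plain integer rather than a real Prometheus metric. It only models "update the length metric on every pop and push".

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for one stage's queue plus its length gauge.
struct MeteredQueue {
    ops: VecDeque<u32>,
    /// Last value written to the (simulated) queue length gauge.
    gauge: usize,
}

impl MeteredQueue {
    fn new(ops: impl IntoIterator<Item = u32>) -> Self {
        let ops: VecDeque<u32> = ops.into_iter().collect();
        let gauge = ops.len();
        Self { ops, gauge }
    }

    /// Pops up to `limit` operations and updates the gauge right away,
    /// which is the behavior this issue describes.
    fn pop_many(&mut self, limit: usize) -> Vec<u32> {
        let n = limit.min(self.ops.len());
        let batch: Vec<u32> = self.ops.drain(..n).collect();
        self.gauge = self.ops.len();
        batch
    }

    /// Pushes a not-ready operation back and updates the gauge.
    fn push(&mut self, op: u32) {
        self.ops.push_back(op);
        self.gauge = self.ops.len();
    }
}

/// Returns the gauge readings before the pop, while the batch is being
/// worked on, and after every popped operation is pushed back.
fn gauge_trace(total: u32, batch_size: usize) -> [usize; 3] {
    let mut queue = MeteredQueue::new(0..total);
    let before = queue.gauge;
    let batch = queue.pop_many(batch_size);
    let during = queue.gauge;
    for op in batch {
        queue.push(op);
    }
    let after = queue.gauge;
    [before, during, after]
}
```

With 40 operations in the stage, `gauge_trace(40, 4)` dips from 40 to 36 and back, while `gauge_trace(40, 32)` dips from 40 to 8 and back, even though all 40 operations are still in the same stage the whole time. The bigger the batch, the deeper the dip.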

Solution

Instead of updating the metric when popping from the queue, update the metric when a message changes (moves to a different stage, or gets a new status label). The end goal is less jagged metrics. We can still use the confirm queue / operations confirmed metric as an indication of messages being worked on.
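A rough sketch of what that could look like, under the assumption that each stage keeps a simple counter that only changes on a stage transition. `Stage`, `StageMetrics`, and `simulate_batch` are hypothetical names for illustration, not existing relayer types, and status labels are left out to keep it short.

```rust
use std::collections::HashMap;

/// The three op_submitter tasks from this issue, modeled as stages.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Stage {
    Prepare,
    Submit,
    Confirm,
}

/// Hypothetical per-stage gauge: counts operations in a stage, whether
/// they are sitting in the queue or currently being worked on.
#[derive(Default)]
struct StageMetrics {
    counts: HashMap<Stage, i64>,
}

impl StageMetrics {
    /// A new operation enters a stage.
    fn enter(&mut self, stage: Stage) {
        *self.counts.entry(stage).or_insert(0) += 1;
    }

    /// An operation moves between stages; the only place gauges change.
    fn transition(&mut self, from: Stage, to: Stage) {
        if from != to {
            *self.counts.entry(from).or_insert(0) -= 1;
            *self.counts.entry(to).or_insert(0) += 1;
        }
    }

    fn get(&self, stage: Stage) -> i64 {
        self.counts.get(&stage).copied().unwrap_or(0)
    }
}

/// Simulates one prepare pass: `total` operations are in the prepare
/// stage, a batch is popped, and `ready` of them move on to submit.
/// Popping and re-pushing not-ready operations touches no metric.
fn simulate_batch(total: u32, ready: u32) -> (i64, i64, i64) {
    let mut metrics = StageMetrics::default();
    for _ in 0..total {
        metrics.enter(Stage::Prepare);
    }
    for _ in 0..ready.min(total) {
        metrics.transition(Stage::Prepare, Stage::Submit);
    }
    (
        metrics.get(Stage::Prepare),
        metrics.get(Stage::Submit),
        metrics.get(Stage::Confirm),
    )
}
```

Here `simulate_batch(40, 5)` ends with 35 operations in prepare and 5 in submit, and the prepare count never dips while the batch is being worked on, regardless of batch size.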

tkporter commented 4 weeks ago

Thread with some context https://discord.com/channels/935678348330434570/1242873947293356042/1255192067584557086