Problem
We have 3 different tasks in the op_submitter:
Each task has a queue of operations. Whenever we pop from this queue to look for some operations that may be ready for work, we update the `hyperlane_submitter_queue_length` metric for the given queue name.
Imo the metric is most useful as an indication of how many operations are in the given task/stage, and not how many are specifically in the queue. It's an implementation detail that an operation in a particular task can either be in a queue or having work done on it.
Now that we have big batch sizes, this matters more: because we update the queue length metric as a part of `pop_many` (https://github.com/hyperlane-xyz/hyperlane-monorepo/blob/main/rust/agents/relayer/src/msg/op_submitter.rs#L204), and only push the not-ready ones back at https://github.com/hyperlane-xyz/hyperlane-monorepo/blob/main/rust/agents/relayer/src/msg/op_submitter.rs#L243, we end up really frequently popping off like 32 of these and taking a while till we add them back to the queue, bc we wait for the ready ones to prepare.
It results in these jagged metrics and makes alert conditions a bit more funky
The upper part in that screenshot shows the jaggedness in a post-32 world; the lower one was the neutron context, where we were doing batches of 4 until like 4:10pm, when I moved to 32 and the jaggedness is more extreme
I think this was fine when we were just popping 1 at a time, but I wonder if we should change it to avoid this. Imo messages that have been popped off are still in the queue stage, so the metrics are confusing
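
To make the effect concrete, here is a toy model of it, not the actual `op_submitter` code. `ToyQueue`, `reported_len` (standing in for the gauge) and `reported_vs_in_stage` are made-up names, and it assumes the gauge is set to whatever is left in the queue at pop time:

```rust
use std::collections::VecDeque;

/// Toy stand-in for one op_submitter queue (hypothetical, for illustration).
struct ToyQueue {
    ops: VecDeque<u32>,
    /// Plays the role of the `hyperlane_submitter_queue_length` gauge.
    reported_len: usize,
}

impl ToyQueue {
    fn new(count: u32) -> Self {
        let ops: VecDeque<u32> = (0..count).collect();
        let reported_len = ops.len();
        Self { ops, reported_len }
    }

    /// Pop up to `batch` operations.
    /// Assumption: the gauge is refreshed here, from what is left in the queue.
    fn pop_many(&mut self, batch: usize) -> Vec<u32> {
        let n = batch.min(self.ops.len());
        let popped: Vec<u32> = self.ops.drain(..n).collect();
        self.reported_len = self.ops.len();
        popped
    }
}

/// Returns (gauge value, operations actually in the stage) while one batch
/// is popped off and has not been pushed back yet.
fn reported_vs_in_stage(total: u32, batch: usize) -> (usize, usize) {
    let mut q = ToyQueue::new(total);
    let in_flight = q.pop_many(batch);
    (q.reported_len, q.ops.len() + in_flight.len())
}
```

With 40 operations in a stage, a batch of 4 makes the gauge read 36 while the batch is in flight, and a batch of 32 makes it read 8, even though all 40 are still in that stage both times. That gap is roughly what shows up as the bigger dips after the move to 32.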
Solution
Instead of updating the metric when popping from the queue, update the metric when a message changes (moves to a new stage, or gets a new status label). The end goal is less jagged metrics. We can still use the confirm queue / operations confirmed metric as an indication of messages being worked on
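
A rough sketch of what that could look like, assuming one counter per (queue name, status label) pair. `StageGauges` and its methods are hypothetical names and not existing relayer code; a real version would presumably wrap the existing metric rather than a `HashMap`:

```rust
use std::collections::HashMap;

/// Hypothetical per-stage counters keyed by (queue name, status label).
#[derive(Default)]
struct StageGauges {
    counts: HashMap<(String, String), i64>,
}

impl StageGauges {
    fn key(queue: &str, status: &str) -> (String, String) {
        (queue.to_string(), status.to_string())
    }

    /// An operation enters a stage with the given status label.
    fn enter(&mut self, queue: &str, status: &str) {
        *self.counts.entry(Self::key(queue, status)).or_insert(0) += 1;
    }

    /// An operation leaves a stage for good (or is about to be re-labelled).
    fn leave(&mut self, queue: &str, status: &str) {
        *self.counts.entry(Self::key(queue, status)).or_insert(0) -= 1;
    }

    /// Stage change or new status label: decrement the old pair, increment the new one.
    fn transition(&mut self, from: (&str, &str), to: (&str, &str)) {
        self.leave(from.0, from.1);
        self.enter(to.0, to.1);
    }

    /// Current count for a (queue name, status label) pair; 0 if never seen.
    fn get(&self, queue: &str, status: &str) -> i64 {
        self.counts
            .get(&Self::key(queue, status))
            .copied()
            .unwrap_or(0)
    }
}
```

Popping a batch off the queue never touches these counters, so the reported value only moves when an operation actually changes stage or status, regardless of batch size.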