stacks-network / stacks-core

The Stacks blockchain implementation
https://docs.stacks.co
GNU General Public License v3.0

Prometheus metrics not updating when quickly downloading blocks #2987

Open CharlieC3 opened 2 years ago

CharlieC3 commented 2 years ago

Describe the bug

When spinning up a stacks-node a few hundred blocks behind the tip, the Prometheus metrics report the initial block height at boot, but this figure is rarely updated until the stacks-node has finished syncing to the tip. This makes the node appear stalled or delayed for an hour or more when it has actually been downloading blocks and progressing along the blockchain the whole time.

For example, the Prometheus metrics pulled from one of our stacks-nodes showed it stuck at block 43941 for over an hour before suddenly jumping to the tip height. However, our logs show the stacks-node was steadily appending blocks during this "delayed" period, progressing incrementally over that hour.

Steps To Reproduce

  1. Spin up a stacks-node
  2. Compare the block height recorded via Prometheus metrics vs logs

Expected behavior

The Prometheus metrics should update the block height figure soon after each new block has been added.
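
For illustration only, a rough sketch of the expected behavior using the Rust prometheus crate; the gauge name and the hook function below are placeholders, not the node's actual instrumentation.

use lazy_static::lazy_static;
use prometheus::{register_int_gauge, IntGauge};

lazy_static! {
    // Placeholder metric name; the real node exports its own names.
    static ref STACKS_TIP_HEIGHT: IntGauge = register_int_gauge!(
        "stacks_node_stacks_tip_height",
        "Height of the node's current Stacks chain tip"
    )
    .unwrap();
}

// Hypothetical hook: if this runs once per appended block rather than once
// per download batch, the exported height tracks sync progress in near real
// time instead of jumping at the end.
fn on_block_appended(new_tip_height: u64) {
    STACKS_TIP_HEIGHT.set(new_tip_height as i64);
}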

kantai commented 2 years ago

This would be the offending line:

https://github.com/blockstack/stacks-blockchain/blob/8d1646047b41be7097783f3c2362f9c76c39df9e/src/chainstate/coordinator/mod.rs#L702

wileyj commented 2 years ago

noting for myself - this metric also seems to be off; not sure why, since it was working previously:

stacks_node_active_miners_total

CharlieC3 commented 2 years ago

If possible I'd also like the stacks_node_mempool_outstanding_txs metric to be re-assessed.

Currently it increments or decrements the number of outstanding TXs for a stacks-node, but it assumes the node starts with 0 TXs in its mempool. In practice, a node often starts with dozens, hundreds, or thousands of TXs already in its mempool. As a result, it's pretty common to see this metric dip well into negative values for a while after bootup.

It's also very common to see this metric report numbers 6x higher than what the API's mempool endpoint shows, which makes me think the decrement logic either isn't always triggered when it should be, or isn't present in all code paths.

Ideally this metric would accurately represent the number of TXs in a stacks-node's mempool.
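
To make the failure mode concrete, here's a sketch of how a purely relative gauge behaves; the hook functions are placeholders, not the node's actual code.

use lazy_static::lazy_static;
use prometheus::{register_int_gauge, IntGauge};

lazy_static! {
    static ref MEMPOOL_OUTSTANDING_TXS: IntGauge = register_int_gauge!(
        "stacks_node_mempool_outstanding_txs",
        "Relative count of outstanding mempool transactions"
    )
    .unwrap();
}

// Placeholder hooks: the gauge starts at 0 on every boot, so if the node
// restarts with a non-empty mempool, removals of pre-existing TXs get
// counted while their admissions never were, and the gauge goes negative.
// Conversely, any removal path that skips on_tx_removed() makes it drift high.
fn on_tx_admitted() {
    MEMPOOL_OUTSTANDING_TXS.inc();
}

fn on_tx_removed() {
    MEMPOOL_OUTSTANDING_TXS.dec();
}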

wileyj commented 2 years ago

the number of neighbors is sometimes not being returned. @CharlieC3, these are the 2 metrics you were mentioning, correct? edit - i wonder if this is related to the boot-from-snapshot metrics weirdness

stacks_node_neighbors_outbound
stacks_node_neighbors_inbound

CharlieC3 commented 2 years ago

@wileyj We're actually querying the /v2/neighbors endpoint directly and converting that JSON data to a prometheus metric because it's treated as an availability test. It's possible this is not an issue with the stacks-node's reporting. I think we can exclude these two metrics from this issue, at least for now until we have clearer evidence that it's a reporting issue.

wileyj commented 2 years ago

@wileyj We're actually querying the /v2/neighbors endpoint directly and converting that JSON data to a prometheus metric because it's treated as an availability test. It's possible this is not an issue with the stacks-node's reporting. I think we can exclude these two metrics from this issue, at least for now until we have clearer evidence that it's a reporting issue.

got it, thanks - i think i'll track these values anyway while testing to see if anything strange is reported (at the very least, it'll show us it's reproducible)

wileyj commented 2 years ago

stacks_node_mempool_outstanding_txs

there doesn't seem to be an easy way to get this value on startup (greg has a binary he wrote, but it's not a quick exec). another option i'll try here is to simply reset the gauge on startup

CharlieC3 commented 2 years ago

@wileyj I think the gauge currently does get reset on startup, but that causes the aforementioned problem where the gauge can dip into negative values if some mempool TXs leave soon after the node is started.

I'm surprised this number of TXs can't be fetched from the mempool database. Why isn't something like

select count(txid) from mempool;

a valid count of mempool TXs?

kantai commented 2 years ago

Why isn't something like

select count(txid) from mempool;

a valid count of mempool TXs?

The mempool database keeps transactions around even after they've been "confirmed". The reason for this is to deal with forks/reorgs (which happen pretty frequently, like 10% of blocks): a transaction that is confirmed in one block may not be confirmed in the other fork, so miners should include the transaction in that fork as well. Because of this, the mempool keeps those transactions around until they've become sufficiently old.

CharlieC3 commented 2 years ago

@kantai I see - Is there a key @wileyj can filter on in that table to determine the count of TXs for the node's mempool? Does the node itself have a way of tracking the number of TXs in its mempool?

kantai commented 2 years ago

No, the node doesn't have a way of tracking the number of transactions in its mempool in a strict sense. That's why that prometheus metric that attempts to track pending transactions behaves in the way that it currently does.

CharlieC3 commented 2 years ago

Thanks @kantai.

It sounds like there's no way the stacks_node_mempool_outstanding_txs metric can easily be resumed after a restart without making larger changes. Getting accurate access to this metric would be valuable for Hiro, although I'm not sure how big a change it would entail, so it's likely out of scope for this issue. It seems an additional attribute would have to be added to each TX tracked in the mempool database.
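
For illustration, a rough sketch of what that might look like; the confirmed_height column and query here are hypothetical, not the actual mempool schema.

use rusqlite::{Connection, Result};

// Hypothetical attribute: mark each row once its TX is confirmed on the
// canonical fork, instead of relying on age alone.
fn add_confirmed_flag(conn: &Connection) -> Result<()> {
    conn.execute_batch("ALTER TABLE mempool ADD COLUMN confirmed_height INTEGER;")
}

// Count only rows not yet marked confirmed; the gauge could be set to this
// on startup (and periodically) instead of assuming it starts at 0.
fn outstanding_tx_count(conn: &Connection) -> Result<i64> {
    conn.query_row(
        "SELECT COUNT(txid) FROM mempool WHERE confirmed_height IS NULL",
        [],
        |row| row.get(0),
    )
}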

wileyj commented 2 years ago

@CharlieC3 https://github.com/stacks-network/stacks-blockchain/pull/3033

not addressed in this PR is the mempool metric - that's going to be a heavier lift so we can look at that later

jcnelson commented 2 years ago

I see #3033 got merged, so I'm going to close this out. If there's a need for a separate mempool metric, please open an issue for that instead.

CharlieC3 commented 1 year ago

Reopening because we're still seeing this issue in the latest 2.05 release.