**Open** · CharlieC3 opened this issue 2 years ago
This would be the offending line:
For my own reference - this metric also seems to be off, though I'm not sure why, since it was working previously:

stacks_node_active_miners_total
If possible I'd also like the stacks_node_mempool_outstanding_txs metric to be re-assessed.
Currently it increments or decrements the number of outstanding TXs for a stacks-node. However, it assumes that a node's mempool is empty when the node starts. In practice, a node may start with dozens, hundreds, or thousands of TXs in its mempool. As a result, it's pretty common to see this metric dip well into negative numbers for a while after bootup.
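The drift described above can be sketched in a few lines. This is a toy simulation, not the node's actual metrics code, and the starting counts (100 pre-existing TXs, 30 confirmed shortly after boot) are made-up numbers:

```python
# Sketch: why a delta-based gauge goes negative after a restart.
gauge = 0     # node initializes the gauge to 0, assuming an empty mempool
actual = 100  # hypothetical: TXs already sitting in the mempool at boot

# 30 of the pre-existing TXs get mined shortly after boot;
# each one triggers a decrement on the gauge:
for _ in range(30):
    gauge -= 1
    actual -= 1

print(gauge)   # -30: the metric dips negative
print(actual)  # 70: the real mempool size
```

Because the gauge only tracks deltas since boot, it can never recover the TXs it never saw arrive.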
It's also very common to see this metric report numbers 6X higher than what the API's mempool endpoint shows, which makes me think the decrement logic either isn't always being triggered when it should be, or isn't present in all code paths.
Ideally this metric would accurately represent the number of TXs in a stacks-node's mempool.
Number of neighbors is sometimes not being returned. @CharlieC3, these are the two metrics you were mentioning, correct?

stacks_node_neighbors_outbound
stacks_node_neighbors_inbound

Edit: I wonder if this is related to the boot-from-snapshot metrics weirdness.
@wileyj We're actually querying the /v2/neighbors endpoint directly and converting that JSON data to a Prometheus metric, because it's treated as an availability test. It's possible this is not an issue with the stacks-node's reporting. I think we can exclude these two metrics from this issue, at least for now, until we have clearer evidence that it's a reporting issue.
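A minimal sketch of the conversion described above, assuming the /v2/neighbors endpoint returns JSON with "inbound" and "outbound" arrays of peers (the sample payload below is invented):

```python
def neighbor_gauges(payload: dict) -> dict:
    """Convert a /v2/neighbors JSON payload into Prometheus-style
    gauge values by counting the peer entries in each direction."""
    return {
        "stacks_node_neighbors_inbound": len(payload.get("inbound", [])),
        "stacks_node_neighbors_outbound": len(payload.get("outbound", [])),
    }

# Invented sample payload with 1 inbound and 2 outbound peers:
sample = {"inbound": [{"ip": "10.0.0.1"}], "outbound": [{}, {}]}
print(neighbor_gauges(sample))
```

Since the counts are derived from a live HTTP response rather than the node's internal counters, a failed or partial response would show up here as a missing scrape, not as a wrong value from the node.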
> @wileyj We're actually querying the /v2/neighbors endpoint directly and converting that JSON data to a prometheus metric because it's treated as an availability test. It's possible this is not an issue with the stacks-node's reporting. I think we can exclude these two metrics from this issue, at least for now until we have clearer evidence that it's a reporting issue.
Got it, thanks. I think I'll track these values anyway while testing, to see if anything strange is reported (at the very least, it'll show us whether it's reproducible).
stacks_node_mempool_outstanding_txs

There doesn't seem to be an easy way to get this value on startup (greg has a binary he wrote, but it's not a quick exec). Another option I'll try here is to simply reset the gauge on startup.
@wileyj I think the gauge currently does get reset on startup, but that causes the aforementioned problem where the gauge can dip into negative numbers if some mempool TXs leave soon after the node is started.
I'm surprised this number of TXs can't be fetched from the mempool database. Why isn't something like
select count(txid) from mempool;
a valid count of mempool TXs?
> Why isn't something like `select count(txid) from mempool;` a valid count of mempool TXs?
The mempool database keeps transactions around even after they've been "confirmed". The reason for this is to deal with forks/reorgs (which happen pretty frequently, like 10% of blocks): a transaction that is confirmed in one block may not be confirmed in the other fork, so miners should include the transaction in that fork as well. Because of this, the mempool keeps those transactions around until they've become sufficiently old.
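The retention behavior described above can be illustrated with a toy SQLite table. The `confirmed_height` column is hypothetical - it is not the actual stacks-node mempool schema - but it shows why a bare `count(txid)` over-counts once confirmed TXs are retained for reorg handling:

```python
import sqlite3

# Toy mempool table: confirmed TXs stay in the table (confirmed_height
# set) so they can be re-included if the confirming block is orphaned.
db = sqlite3.connect(":memory:")
db.execute("create table mempool (txid text, confirmed_height integer)")
db.executemany(
    "insert into mempool values (?, ?)",
    [("tx1", None), ("tx2", None), ("tx3", 4100)],  # tx3 already mined
)

# Naive count includes the retained, already-confirmed tx3:
naive = db.execute("select count(txid) from mempool").fetchone()[0]
# Counting only unconfirmed rows gives the "real" pending size:
pending = db.execute(
    "select count(txid) from mempool where confirmed_height is null"
).fetchone()[0]
print(naive, pending)  # 3 2
```

Whether a filter like this is possible in practice depends on the actual mempool schema tracking confirmation state per TX, which is the open question in this thread.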
@kantai I see - Is there a key @wileyj can filter on in that table to determine the count of TXs for the node's mempool? Does the node itself have a way of tracking the number of TXs in its mempool?
No, the node doesn't have a way of tracking the number of transactions in its mempool in a strict sense. That's why that prometheus metric that attempts to track pending transactions behaves in the way that it currently does.
Thanks @kantai.
It's sounding like there's no way the stacks_node_mempool_outstanding_txs metric can easily be resumed after a restart without making larger changes. Accurate access to this metric would be valuable for Hiro, but I'm not sure how big a change it would entail, so it's likely out of scope for this issue. It seems an additional attribute would have to be added to each TX tracked in the mempool database.
@CharlieC3 https://github.com/stacks-network/stacks-blockchain/pull/3033
not addressed in this PR is the mempool metric - that's going to be a heavier lift so we can look at that later
I see #3033 got merged, so I'm going to close this out. If there's a need for a separate mempool metric, please open an issue for that instead.
Reopening because we're still seeing this issue in the latest 2.05 release.
**Describe the bug**

When spinning up a stacks-node a few hundred blocks behind the tip, the Prometheus metrics report the initial block height at boot, but this figure is rarely updated until the stacks-node has finished syncing to the tip. This makes the node appear stalled or delayed for an hour or more when it has actually been downloading blocks and progressing along the blockchain the whole time.

For example, the Prometheus metrics pulled from the stacks-node below show the node stuck at block 43941 for over an hour, then suddenly jumping up to the tip height. However, our logs show the stacks-node was steadily appending blocks during this "delayed" period, progressing incrementally over the 1+ hour.

![](https://cdn.discordapp.com/attachments/816365424349282334/928774006784540702/Screen_Shot_2022-01-06_at_5.16.03_PM.png)

**Steps To Reproduce**

**Expected behavior**

The Prometheus metrics should update the block height figure soon after each new block has been added.
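The expected behavior can be sketched as follows. The `set_block_height` hook and the starting height of 43941 are illustrative stand-ins (the real node would call a Prometheus gauge's `set()`), not the node's actual code:

```python
# Sketch: update the height gauge on every block append, rather than
# only once sync completes.
heights_reported = []

def set_block_height(h: int) -> None:
    # Stand-in for a Prometheus gauge .set(h); here we just record calls.
    heights_reported.append(h)

def append_block(height: int) -> None:
    # ... block validation and storage would happen here ...
    set_block_height(height)  # report the new tip immediately

# Simulate appending five blocks during a catch-up sync:
for h in range(43941, 43946):
    append_block(h)

print(heights_reported[-1])  # 43945: metric tracks each append
```

With this shape, a scrape taken mid-sync would see the height advance block by block instead of sitting at the boot-time value.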