hashicorp / raft-boltdb

Raft backend implementation using BoltDB
Mozilla Public License 2.0

BoltDB metrics are mostly counters not timers or gauges #36

Open banks opened 12 months ago

banks commented 12 months ago

We currently emit BoltDB timers like TxStats.WriteTime as samples, as if each value were a single transaction's elapsed commit time.

However, bbolt internally just records a counter of all the time spent writing transactions. By recording it as a summary we get something kinda meaningless. The best we can do is take the max of this and then treat it like a counter with something like irate to get the time-per-second spent writing data to disk. But that's confusing and wasteful. We should just emit it as a counter and update the docs to match.

We document it in Vault and Consul docs as "time spent writing in milliseconds" which is likely to confuse anyone who tries to interpret this data!

Slightly less wrong - we document and record several metrics like raft.boltdb.txstats.pageAlloc as gauges even though they have the same semantics as counters. We do at least correctly note that they are a count of all allocs since process start here, but it's confusing to record that as a gauge when the only useful thing to do with it is treat it like a counter and look at the rate at which it increases over time! Consider the doc difference between number of spills (counter) and this.
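To make the proposal concrete, here is a rough sketch of emitting these values as counters: keep the previous cumulative snapshot, subtract it from the current one, and increment a counter by the difference. Everything in it is an assumption for illustration rather than the actual raft-boltdb code: `txSnapshot`, `counterDelta`, `diff`, `memCounters` and `emit` are hypothetical names, the in-memory sink stands in for a real metrics library, and the `writeTime` metric name is assumed by analogy with `raft.boltdb.txstats.pageAlloc`.

```go
package main

import (
	"fmt"
	"time"
)

// txSnapshot is a hypothetical stand-in for the cumulative totals that
// bbolt reports. Both fields only grow while the process is running.
type txSnapshot struct {
	WriteTime time.Duration // total time spent writing since process start
	PageAlloc int64         // cumulative page allocation total since process start
}

// counterDelta is the increase between two snapshots, in the units we
// would hand to a counter.
type counterDelta struct {
	WriteTimeMs float32
	PageAlloc   float32
}

// diff returns how much each cumulative total grew between prev and cur.
func diff(prev, cur txSnapshot) counterDelta {
	elapsed := cur.WriteTime - prev.WriteTime
	return counterDelta{
		WriteTimeMs: float32(float64(elapsed) / float64(time.Millisecond)),
		PageAlloc:   float32(cur.PageAlloc - prev.PageAlloc),
	}
}

// memCounters is a toy in-memory sink (an assumption, not a real metrics
// client). It accumulates increments the way a counter would.
type memCounters map[string]float32

// IncrCounter adds val to the named counter.
func (m memCounters) IncrCounter(name string, val float32) {
	m[name] += val
}

// emit records the deltas as counter increments rather than as samples
// or gauges. The writeTime metric name is assumed for illustration.
func emit(sink memCounters, d counterDelta) {
	sink.IncrCounter("raft.boltdb.txstats.writeTime", d.WriteTimeMs)
	sink.IncrCounter("raft.boltdb.txstats.pageAlloc", d.PageAlloc)
}

func main() {
	prev := txSnapshot{WriteTime: 100 * time.Millisecond, PageAlloc: 4096}
	cur := txSnapshot{WriteTime: 130 * time.Millisecond, PageAlloc: 12288}

	sink := memCounters{}
	emit(sink, diff(prev, cur))

	fmt.Println("writeTime ms added:", sink["raft.boltdb.txstats.writeTime"])
	fmt.Println("pageAlloc added:", sink["raft.boltdb.txstats.pageAlloc"])
}
```

With counters like these, a rate function on the monitoring side gives time-per-second spent writing, or allocations per second, directly, without first taking the max of a summary.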