Add an aggregate metric for the theoretical write capacity

mkeeler commented 2 years ago

The new metric emits a sample of the number of logs per second that BoltDB could store if every log write operation looked like the one currently being measured, that is, if every operation had the same number of logs per batch and its transaction Commit took the same amount of time. While no two operations will be identical, the average of the emitted samples (or the summary) should give a good picture of what Consul could handle with the types of write operations currently being performed.

This value is expected to fluctuate with changes in the size of the logs flowing through Consul and in how many logs get batched into one storage operation.

If someone wanted to monitor this, I think they would want to know when the actual write rate exceeds 75% of this metric's value. That could be due to an increased number of writes or to a degradation in disk performance that causes similar writes to slow down. Regardless of the cause, if you are getting close to the limit or see a drastic change in the metric, it could indicate another issue that requires investigation.
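The monitoring rule described above could be sketched roughly as follows. The function names, the sample values, and the way the actual write rate is obtained are all hypothetical; in practice this check would more likely live in an alerting system than in Go code.

```go
package main

import "fmt"

// mean returns the arithmetic mean of the samples, or 0 for an empty slice.
func mean(samples []float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sum := 0.0
	for _, s := range samples {
		sum += s
	}
	return sum / float64(len(samples))
}

// nearCapacity is a hypothetical check: it reports whether the observed
// write rate (logs/s) exceeds the given fraction of the average
// theoretical write capacity computed from recent samples.
func nearCapacity(actualRate float64, capacitySamples []float64, fraction float64) bool {
	avg := mean(capacitySamples)
	if avg == 0 {
		return false // no capacity data yet, so nothing to compare against
	}
	return actualRate > fraction*avg
}

func main() {
	// Made-up capacity samples (logs/s) purely for illustration.
	samples := []float64{1000, 2000, 3000}

	for _, rate := range []float64{1400, 1600} {
		fmt.Printf("actual rate %.0f logs/s, over 75%% of capacity: %v\n",
			rate, nearCapacity(rate, samples, 0.75))
	}
}
```

Averaging the samples before comparing follows the suggestion in the description that a single sample is less informative than the aggregate.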

acpana commented 2 years ago

[nit] Any chance we could add some comments around what the metric writeCapacity is and how folks should interact with it? (the PR description is great imo for this purpose)

hashicorp-cla commented 2 years ago

All committers have signed the CLA.

mkeeler commented 2 years ago

I added some info in the README about how to interpret metrics.

hashicorp / raft-boltdb

Add an aggregate metric for the theoretical write capacity #28