pokt-network / pocket-core

Official implementation of the Pocket Network Protocol
http://www.pokt.network
MIT License

[Telemetry] Mainnet Monitoring - Rebase Health Module #1511

Closed jessicadaugherty closed 1 year ago

jessicadaugherty commented 1 year ago

Objective

Following the 0.9.1.1 Chain Halt Post-Mortem, there are action items for monitoring mainnet (and ideally testnet). We need better coverage there to increase our chances of catching bugs and errors before releases reach production, and to help us triage issues during a release or production crisis.

Due to the need to process state dumps and aggregate by network actors, we need a dedicated exporter available rather than making enhancements to an existing exporter.
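
As a rough illustration of what "aggregate by network actors" could involve, here is a minimal Go sketch. The dump layout, the `Actor` fields, and the `AggregateByActorType` helper are all assumptions made for this example; they are not the actual pocket-core state dump schema or the health module's API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Actor is a hypothetical, simplified record from a state dump.
// Real dumps will have a different shape and larger token amounts.
type Actor struct {
	Address string `json:"address"`
	Type    string `json:"type"` // e.g. "node" or "application"
	Staked  uint64 `json:"staked_tokens"`
	Jailed  bool   `json:"jailed"`
}

// ActorSummary holds the aggregate values for one actor type.
type ActorSummary struct {
	Count       int
	TotalStaked uint64
	Jailed      int
}

// AggregateByActorType decodes a JSON array of actors and groups
// counts, staked totals, and jailed counts by actor type.
func AggregateByActorType(dump []byte) (map[string]ActorSummary, error) {
	var actors []Actor
	if err := json.Unmarshal(dump, &actors); err != nil {
		return nil, fmt.Errorf("decode state dump: %w", err)
	}
	out := make(map[string]ActorSummary)
	for _, a := range actors {
		s := out[a.Type]
		s.Count++
		s.TotalStaked += a.Staked
		if a.Jailed {
			s.Jailed++
		}
		out[a.Type] = s
	}
	return out, nil
}

func main() {
	dump := []byte(`[
		{"address": "a1", "type": "node", "staked_tokens": 10},
		{"address": "a2", "type": "node", "staked_tokens": 5, "jailed": true},
		{"address": "a3", "type": "application", "staked_tokens": 7}
	]`)
	summary, err := AggregateByActorType(dump)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for actorType, s := range summary {
		fmt.Printf("%s: count=%d staked=%d jailed=%d\n",
			actorType, s.Count, s.TotalStaked, s.Jailed)
	}
}
```

Because this kind of pass walks the whole dump, it is likely easier to run in a separate process than to bolt onto an exporter that only scrapes lightweight counters.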

Origin Document

A health module was designed and built but remained inactive that includes:

Consensus, State and Transaction metrics are the most relevant when triaging crises like a chain halt, while data size and lifecycle metrics help us observe the performance of state size and transitions through the state.

Goals

Deliverable

Non-goals / Non-deliverables

General issue deliverables

Testing Methodology


Creator: @jessicadaugherty

jessicadaugherty commented 1 year ago

Questions

  1. If we need min. 2 state dumps for diffs, how do we handle that with this exporter, or do we always need data from another node runner?
  2. At what frequency is this queried: on an as-needed basis or an ongoing basis?
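
To make question 1 concrete, here is a minimal Go sketch of what diffing two state dumps might look like. It assumes each dump has already been reduced to a map of address to staked amount; the `Snapshot` and `Diff` types and the `DiffSnapshots` function are hypothetical names for this example, not part of the health module.

```go
package main

import (
	"fmt"
	"sort"
)

// Snapshot is a hypothetical reduced view of one state dump:
// actor address -> staked amount.
type Snapshot map[string]uint64

// Diff lists the addresses that differ between two snapshots.
type Diff struct {
	Added   []string // present only in the newer snapshot
	Removed []string // present only in the older snapshot
	Changed []string // present in both, with different values
}

// DiffSnapshots compares an older and a newer snapshot.
// Results are sorted so the output is deterministic.
func DiffSnapshots(prev, curr Snapshot) Diff {
	var d Diff
	for addr, v := range curr {
		old, ok := prev[addr]
		switch {
		case !ok:
			d.Added = append(d.Added, addr)
		case old != v:
			d.Changed = append(d.Changed, addr)
		}
	}
	for addr := range prev {
		if _, ok := curr[addr]; !ok {
			d.Removed = append(d.Removed, addr)
		}
	}
	sort.Strings(d.Added)
	sort.Strings(d.Removed)
	sort.Strings(d.Changed)
	return d
}

func main() {
	prev := Snapshot{"a": 1, "b": 2, "c": 3}
	curr := Snapshot{"b": 2, "c": 4, "d": 5}
	d := DiffSnapshots(prev, curr)
	fmt.Println("added:", d.Added)
	fmt.Println("removed:", d.Removed)
	fmt.Println("changed:", d.Changed)
}
```

Whether both snapshots come from the same node at two heights or from two different node runners is exactly the open question; the comparison itself is the same either way.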

Olshansk commented 1 year ago

@Gustavobelfort Making a "public note to self" for us to sync on this offline.

  1. Look at the branch related to this PR: https://github.com/pokt-network/pocket/issues/360
  2. Assess the impact and feasibility of rebasing this on top of master
  3. Understand the Go structures being exported
  4. Retrieve the data as JSON (similar to Tendermint endpoints)
  5. Design the Prometheus exporter (TBD)
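
Steps 4 and 5 could be sketched roughly as follows: decode a JSON payload, then render it in the Prometheus text exposition format. This is only a sketch under assumptions. The JSON field names, the `HealthSnapshot` type, and the `pocket_health_*` metric names are invented for illustration, and it uses only the Go standard library rather than any particular client library.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// HealthSnapshot is a hypothetical JSON payload returned by the
// health module; the real exported structures may differ.
type HealthSnapshot struct {
	Height  int64 `json:"height"`
	TxCount int64 `json:"tx_count"`
}

// writeGauge appends one gauge in Prometheus text exposition format.
func writeGauge(b *strings.Builder, name, help string, v int64) {
	fmt.Fprintf(b, "# HELP %s %s\n# TYPE %s gauge\n%s %d\n",
		name, help, name, name, v)
}

// RenderPrometheus decodes the JSON payload and renders it as
// Prometheus text. Metric names here are placeholders.
func RenderPrometheus(raw []byte) (string, error) {
	var s HealthSnapshot
	if err := json.Unmarshal(raw, &s); err != nil {
		return "", fmt.Errorf("decode health snapshot: %w", err)
	}
	var b strings.Builder
	writeGauge(&b, "pocket_health_block_height",
		"Latest block height in the snapshot.", s.Height)
	writeGauge(&b, "pocket_health_block_tx_count",
		"Transactions in the latest block.", s.TxCount)
	return b.String(), nil
}

func main() {
	out, err := RenderPrometheus([]byte(`{"height": 42, "tx_count": 3}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

In a real exporter this text would typically be served over HTTP on a `/metrics` path for Prometheus to scrape; that wiring is left out here to keep the sketch short.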

jessicadaugherty commented 1 year ago

Refactored Deliverables

@Gustavobelfort @Olshansk

Gustavobelfort commented 1 year ago

Currently the state of the health module is as follows:

Problems encountered in the codebase

Working properly

Conclusion

After discussing with @iajrz, we don't think the module is ready to be merged into the codebase; some unwanted side effects in pocket-core might pop up if we decide to do so.

Ideally, we should first review the metric requirements of the health module so we can better design what it should return. We can then reuse the pieces of the code that work and, for the pieces that do not, either trim them out or fix them based on the design doc created beforehand. Only then should we decide about merging.

Olshansk commented 1 year ago

Closing this out as the work is no longer relevant. We can reference this PR if we ever choose to pick it up again.