[EPIC]: improve instrumentation to reduce manual testing

p0mvn commented 1 year ago

Background

As an engineer, I want to be able to have the most vulnerable and inaccessible code areas instrumented so that I do not need to spend as much time QAing each release manually.

Currently, we seldom use the telemetry infrastructure at our disposal.

The only places where we utilize telemetry are:

Instead, we rely heavily on manual testing before a release. This is a time sink: to test a PR that we suspect may be buggy, someone needs to reproduce the scenario on either localosmosis or testnet.

Ideally, we should have our codebase instrumented with various telemetry counters, gauges, and histograms. The most common risk areas should be identified and instrumented to maximize observability and reduce the need for manual testing.

For example, in addition to the existing minted_tokens gauge in x/mint, we might want gauges that track the proportion of minted tokens distributed to each module. If commit A causes a certain metric to go out of pre-defined bounds, we should get alerted. Such automation would instill more confidence in our releases and reduce the need for manual testing.
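To make the bounds idea concrete, here is a minimal, stdlib-only sketch. Everything in it is an assumption for illustration: the module names, the `Bounds` type, and the `Proportions` and `OutOfBounds` helpers are hypothetical and do not exist in x/mint. In the real codebase the values would presumably be reported through the SDK telemetry package rather than held in a map.

```go
package main

import (
	"fmt"
	"sort"
)

// Bounds is a hypothetical pre-defined acceptable range for a gauge value.
type Bounds struct {
	Min, Max float64
}

// Proportions returns each module's share of the total minted amount.
// An empty map is returned when nothing was minted.
func Proportions(minted map[string]float64) map[string]float64 {
	total := 0.0
	for _, amt := range minted {
		total += amt
	}
	out := make(map[string]float64, len(minted))
	if total == 0 {
		return out
	}
	for module, amt := range minted {
		out[module] = amt / total
	}
	return out
}

// OutOfBounds lists, in sorted order, the modules whose gauge is missing
// or falls outside its expected range. These are the ones to alert on.
func OutOfBounds(gauges map[string]float64, expected map[string]Bounds) []string {
	var alerts []string
	for module, b := range expected {
		v, ok := gauges[module]
		if !ok || v < b.Min || v > b.Max {
			alerts = append(alerts, module)
		}
	}
	sort.Strings(alerts)
	return alerts
}

func main() {
	// Illustrative per-module amounts from a single mint distribution.
	gauges := Proportions(map[string]float64{
		"staking":         500,
		"pool_incentives": 250,
		"community_pool":  250,
	})
	// Illustrative expected ranges; real bounds would come from chain params.
	expected := map[string]Bounds{
		"staking":         {Min: 0.2, Max: 0.3},
		"pool_incentives": {Min: 0.2, Max: 0.3},
	}
	fmt.Println("out of bounds:", OutOfBounds(gauges, expected))
}
```

A check like this could run against the metrics scraped from a long-running node, so that a commit that skews the distribution is flagged without anyone reproducing it by hand.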

Other Requirements and Blockers

Telemetry for QA purposes can only be useful when we have a long-running burn node with a connection to a specific commit.

Therefore, we need #1014 to be complete to unlock the full potential of the change proposed here.

From conversations with @niccoloraspa, progress is being made on that front.

Suggested Design

In this epic, I would like to focus particularly on instrumentation.

We should review every metric type and when that type is useful: https://tomgregory.com/the-four-types-of-prometheus-metrics/
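As a rough reminder of how the metric types differ, here is a small stdlib-only sketch of three of them (counter, gauge, histogram). The types below are hypothetical stand-ins written for this issue, not the Prometheus client or the SDK telemetry API. The histogram uses cumulative upper-bound buckets, which is how Prometheus-style histograms are usually described.

```go
package main

import "fmt"

// Counter only ever increases, e.g. the number of swaps processed.
type Counter struct{ value float64 }

// Add increases the counter; negative deltas are ignored because
// a counter must never decrease.
func (c *Counter) Add(delta float64) {
	if delta < 0 {
		return
	}
	c.value += delta
}

// Gauge is a point-in-time value that can go up or down,
// e.g. the amount minted in the latest epoch.
type Gauge struct{ value float64 }

// Set overwrites the gauge with the latest observation.
func (g *Gauge) Set(v float64) { g.value = v }

// Histogram counts observations into cumulative buckets:
// counts[i] is the number of observations <= upperBounds[i].
type Histogram struct {
	upperBounds []float64 // must be sorted ascending
	counts      []uint64
	count       uint64
	sum         float64
}

// NewHistogram creates a histogram with the given bucket upper bounds.
func NewHistogram(upperBounds ...float64) *Histogram {
	return &Histogram{
		upperBounds: upperBounds,
		counts:      make([]uint64, len(upperBounds)),
	}
}

// Observe records one value in every bucket whose bound it fits under.
func (h *Histogram) Observe(v float64) {
	h.count++
	h.sum += v
	for i, ub := range h.upperBounds {
		if v <= ub {
			h.counts[i]++
		}
	}
}

func main() {
	swaps := &Counter{}
	swaps.Add(1)

	minted := &Gauge{}
	minted.Set(1234.5)

	// Illustrative latency buckets in milliseconds.
	latency := NewHistogram(10, 100)
	for _, ms := range []float64{5, 50, 500} {
		latency.Observe(ms)
	}
	fmt.Println(swaps.value, minted.value, latency.counts, latency.count, latency.sum)
}
```

Roughly: a counter suits "how many times did this code path run", a gauge suits "what is this value right now", and a histogram suits "how are these values distributed".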

Then, we should analyze our codebase and draw on our knowledge of past release pain points to identify the most vulnerable and inaccessible areas that would benefit from observability, and instrument them with metrics.

Acceptance Criteria

most vulnerable and inaccessible parts of our codebase are instrumented with metrics
the need for manual testing is reduced

ValarDragon commented 1 year ago

Why don't we make a hook to go from event emissions to telemetry within the SDK?

If the goal is flagging code path + counter + additional metadata, it feels like that should do it (unless I'm missing the context you're thinking of this helping in).
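A minimal sketch of what such a hook might look like, using only the standard library. The `Event`, `Sink`, `MemorySink`, and `EventManager` types here are simplified, hypothetical stand-ins and not the SDK's actual event manager or telemetry API. The only idea being illustrated is that every emitted event bumps a counter keyed by its type.

```go
package main

import "fmt"

// Event is a simplified stand-in for an SDK event:
// a type plus key/value attributes.
type Event struct {
	Type       string
	Attributes map[string]string
}

// Sink is a hypothetical telemetry backend.
type Sink interface {
	IncrCounter(name string, labels map[string]string)
}

// MemorySink counts increments per metric name; labels are ignored here.
type MemorySink struct{ Counts map[string]int }

// IncrCounter bumps the in-memory count for the given metric name.
func (s *MemorySink) IncrCounter(name string, _ map[string]string) {
	s.Counts[name]++
}

// EventManager collects events and forwards each one to an optional hook.
type EventManager struct {
	events []Event
	hook   func(Event)
}

// EmitEvent records the event and, if a hook is set, passes it along.
func (m *EventManager) EmitEvent(e Event) {
	m.events = append(m.events, e)
	if m.hook != nil {
		m.hook(e)
	}
}

// TelemetryHook returns a hook that increments one counter per event type,
// passing the event attributes through as labels.
func TelemetryHook(sink Sink) func(Event) {
	return func(e Event) {
		sink.IncrCounter("event_"+e.Type, e.Attributes)
	}
}

func main() {
	sink := &MemorySink{Counts: map[string]int{}}
	em := &EventManager{hook: TelemetryHook(sink)}

	// Illustrative event types and attributes.
	em.EmitEvent(Event{Type: "token_swapped", Attributes: map[string]string{"pool_id": "1"}})
	em.EmitEvent(Event{Type: "token_swapped", Attributes: map[string]string{"pool_id": "2"}})
	em.EmitEvent(Event{Type: "pool_joined"})

	fmt.Println(sink.Counts)
}
```

This only covers code paths that already emit events, and it yields counters rather than gauges or histograms.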

p0mvn commented 1 year ago

From my understanding, the semantics of events and metrics are different, and they serve different purposes. Events are primarily consumed by integrators and returned in ABCI responses, while telemetry can be emitted from anywhere based on observability needs.

What I'm suggesting is to analyze our codebase to identify areas where instrumenting it with metrics/telemetry would make sense. In some of these areas, we might not necessarily want to emit events.
