near / nearcore

Reference client for NEAR Protocol

Document observability principles #7806

Open matklad opened 2 years ago

matklad commented 2 years ago

I think this discussion makes it clear we need to figure out what we want our o11y to look like in the end. For every style of output we have, there should be a strong motivating case. Otherwise we will eventually end up with a third, fourth, and tenth largely-equivalent introspection mechanism. Lack of focus is never a great state to be in.


My personal opinion is that high-throughput logs are unusable in isolation. By their nature, they are difficult to scan, visualize, contextualize and interpret, and adding more messages addresses at most one of those problems. All that considered, targeting no more than ½ message per second would actually make the logs objectively better.

What we do have is opentelemetry, which deals with all of these problems (and more) elegantly. Our integration of it is also quite advanced by now. We should just require that any deployment includes the otel sidecar and the configuration necessary to send all the collected data somewhere.

OTOH, if the desire is to make the thing operationally super straightforward to deploy, then we should tone down our otel efforts and look into making logs usable instead (which, among other things, would mean adding the missing context and introducing tooling to solve the problems currently solved by the grafana+otel stack). Doing both feels like a waste of time and resources.

Originally posted by @nagisa in https://github.com/near/nearcore/issues/7762#issuecomment-1268242353

matklad commented 2 years ago

I think this discussion makes it clear we need to figure out what we want our o11y to look like in the end.

+1, I think we should formulate and write down the specific goals and principles we use to decide what to log. I think we have the high-level style details (prefer key-values over string formatting) nailed down in style.md, but it would be great to have guidance on when to use Info vs Debug, etc. This should include things like "we target x INFO messages per second on stderr and y events per second sent over otel".
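For illustration, a minimal sketch of what that could look like with the `tracing` macros we already use (the target and field names below are made up, not taken from nearcore):

```rust
use tracing::{debug, info};

fn on_block_applied(height: u64, tx_count: usize, apply_time_ms: u64) {
    // Key-value fields keep the event machine-queryable; the message stays short.
    // INFO: low-rate signal an operator should see on stderr.
    info!(target: "chain", height, tx_count, "applied block");
    // DEBUG: higher-rate detail that only matters during an investigation.
    debug!(target: "chain", height, apply_time_ms, "block apply timing");
}
```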

We now have https://near.github.io/nearcore/ deployed, so adding an "Observability" chapter to the practices section there seems like a good, specific next action.

Specifically, I'd love to capture answers to the following questions, which I don't know yet:

nagisa commented 2 years ago

This is probably as good a place as any to start a discussion and reach consensus in the first place.

How and what to instrument/measure

I heard a concerning thing recently: that dashboards are something people look at to figure out whether anything is wrong, rather than to find out what has gone wrong in response to some stimulus. That is an inefficient use of anybody’s time. Computers were built to do that for us!

My proposal is that we must first establish “SLA”s (for lack of a better term) that we want to see maintained. Examples off the top of my head:

Then, for each SLA we can think about what kind of data we need to verify it is being upheld. Use that data to set up alerts for SLA violations. These alerts don’t necessarily need to wake up somebody at night – it really depends on the severity. But we should still keep a log of all violations and review those periodically.

Only with those pieces in place can we make informed guesses as to what sort of data would be useful for “post-mortem” investigation of the violations. If the time/gas ratio was exceeding expectations, what could be the cause? Undercharging? Storage? Compilation? Execution? Instrument all of these compounding parts. This data can be visualized, but I wouldn’t spend much time preemptively guessing which visualizations will be useful. Dashboards constructed that way tend to produce overwhelmingly low signal-to-noise ratios in my experience.
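To make that concrete, here is a sketch (all identifiers are hypothetical, not actual nearcore code) of instrumenting the compounding parts so that a time/gas violation can be attributed to a specific stage rather than only observed in aggregate:

```rust
use tracing::{info, info_span};

// Hypothetical stand-ins for the stages that compound into the time/gas ratio.
fn compile_contract() {}
fn read_storage() {}
fn execute_contract() -> u64 { 42 /* gas burnt, illustrative */ }

fn apply_receipt() {
    let start = std::time::Instant::now();
    // One span per compounding stage, so a post-mortem can tell compilation,
    // storage and execution apart.
    info_span!("contract_compile").in_scope(compile_contract);
    info_span!("storage_reads").in_scope(read_storage);
    let gas_burnt = info_span!("contract_execute").in_scope(execute_contract);
    // The (wall-clock time, gas burnt) pair is what an SLA alert would check.
    info!(gas_burnt, time_us = start.elapsed().as_micros() as u64, "receipt applied");
}
```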

Unfortunately, from what I can tell we’ve been approaching this from the opposite direction (generate data first, then figure out what to do with it). This leads to us collecting data we don’t use at all, data that turns out not to quite represent what’s truly important or necessary, and fewer opportunities for automation.

Granularity of the data

There are very few places in neard where gathering granular data (spans, events) is going to have a meaningful performance impact. As a rule of thumb, any operation that occurs at a rate of less than 10Hz is infrequent enough that instrumenting it with spans and events is an obvious choice.
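For example (the function and its fields are hypothetical), an operation in that frequency range can be covered with a single attribute:

```rust
use tracing::instrument;

// Runs roughly once per block, i.e. well under 10Hz, so a span per call is cheap.
#[instrument(level = "debug", skip(chunk_data))]
fn process_chunk(shard_id: u64, chunk_data: &[u8]) {
    // Events recorded here are automatically attached to the enclosing span.
    tracing::debug!(len = chunk_data.len(), "processing chunk");
}
```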

For the more frequent operations, metrics are a good alternative. They have a lower overhead on both execution throughput and data storage, but they have an unfortunate impact on the code in various ways: setting up and maintaining metrics is reasonably code-heavy, and they are comparatively less flexible in following changes to the code structure than instrumentation (events/spans).
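A sketch of the metrics route for comparison (the metric name and plain `prometheus` registration below are illustrative, not how nearcore actually registers its metrics):

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter, IntCounter};

// Registered once; each increment afterwards is a cheap atomic add, which is
// why metrics suit hot paths that run far more often than 10Hz.
static TRIE_NODE_READS: Lazy<IntCounter> = Lazy::new(|| {
    register_int_counter!("demo_trie_node_reads_total", "Number of trie nodes read")
        .expect("metric can be registered")
});

fn read_trie_node() {
    TRIE_NODE_READS.inc();
    // ... the hot-path work itself ...
}
```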

In the past we have strongly preferred metrics in nearcore primarily because they were the only kind of data ready for visualization. Today we have Grafana Tempo and similar technologies that can query and visualize span/event data well too, so this is no longer a good reason to prefer one over the other.

Except maybe for local deployments – deploying a local prometheus instance is trivial, whereas Grafana Tempo is not straightforward to set up.

Logs

In my experience tracing largely replaces logs in their utility. The only real benefit logs have is that they are trivial to consume in a single, isolated situation: you can read them with a text editor, grep them, or page through them with a log viewer. This all falls apart once there is more than a single binary/service/machine involved (not to mention that people shouldn’t be sshing into production in the first place). Log aggregation is a thing, but because logs are unstructured, the ability to query and interpret the data automatically is severely limited.

I’d like to argue that logs should be limited by default to the most severe issues and to information that is specific to a concrete deployment of an instance. Any incident mitigation workflow that inherently involves somebody accessing a specific machine and looking at its logs is going to be that much slower. And error-prone (yes, I did once spend half an hour looking at logs on the wrong machine).
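One way to express that default (a sketch using `tracing_subscriber`; the filter directives are placeholders, not a proposed policy):

```rust
use tracing_subscriber::EnvFilter;

fn init_logging() {
    // Default to warnings plus the node's own INFO-level "signs of life";
    // anything more verbose is an explicit, per-deployment opt-in via RUST_LOG.
    let filter = EnvFilter::try_from_default_env()
        .unwrap_or_else(|_| EnvFilter::new("warn,neard=info"));
    tracing_subscriber::fmt()
        .with_env_filter(filter)
        .with_writer(std::io::stderr)
        .init();
}
```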

For everything else besides the most basic signs of life (and lack of trouble), figure out how to deploy the full telemetry pipeline.

Development

We want to see output during development too. Debuggers are still an unfamiliar workflow to many, often for good reasons, so being able to introspect the system without deploying a full-fat Grafana stack locally is still very important. Opentelemetry is probably not super onerous to deploy, but we could have scripts/containers/etc. that set everything up for you.
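For reference, the Rust-side wiring for such a setup is only a handful of lines with `tracing-opentelemetry` and `opentelemetry-otlp` (a sketch only: exact APIs differ between crate versions, and the endpoint below is a placeholder for a local collector, not nearcore's actual configuration):

```rust
use opentelemetry_otlp::WithExportConfig;
use tracing_subscriber::layer::SubscriberExt;
use tracing_subscriber::util::SubscriberInitExt;

fn init_telemetry() -> anyhow::Result<()> {
    // Ship spans over OTLP to a collector sidecar running next to the node.
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint("http://127.0.0.1:4317"),
        )
        .install_batch(opentelemetry::runtime::Tokio)?;

    tracing_subscriber::registry()
        // Terse, human-readable output on stderr for basic signs of life.
        .with(tracing_subscriber::fmt::layer().with_writer(std::io::stderr))
        // Rich span/event data goes to the collector instead of the terminal.
        .with(tracing_opentelemetry::layer().with_tracer(tracer))
        .init();
    Ok(())
}
```

The collector/Tempo/Grafana side is exactly the part that a ready-made script or container image could provide.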

nagisa commented 2 years ago

NB: I intend to keep adding additional comments here with my thoughts as examples come up. But I do think that we should eventually set ourselves a deadline of some sort and maybe have some discussions that would drive us towards reaching a conclusion.

cc @nikurt you may also have some opinions given that you’ve been the person spearheading observability efforts here.