littlemanco / the-golden-path.net

A template for writing a new tool or service.

Telemetry #13

Open andrewhowdencom opened 4 years ago

andrewhowdencom commented 4 years ago

Use OpenTelemetry. Logs are not really for production diagnostics.


Where OpenTelemetry spans are marked as error, be sure to annotate the span with some log line that indicates why it was an error.

Additionally, review upstream guidance such as:

andrewhowdencom commented 4 years ago

Relevant:

http://www.brendangregg.com/usemethod.html

andrewhowdencom commented 4 years ago

Line numbers for Go programs:

andrewhowdencom commented 4 years ago

https://golang.org/pkg/net/http/pprof/

andrewhowdencom commented 4 years ago

Tracing:

https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/resource/semantic_conventions/k8s.md

Should include:

application.version=""
deployment.id="uuid/hash" # A deployment also includes its configuration.
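
A minimal sketch of how the two attributes above could be produced, assuming the deployment ID is a truncated SHA-256 over the version plus the configuration (the hashing scheme and the `deploymentID` / `resourceAttributes` helpers are hypothetical, not something this issue settles on):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// deploymentID derives a stable identifier from the application version and
// its configuration, so that a config-only change yields a new ID.
// The hashing scheme is an assumption made for illustration.
func deploymentID(version string, config map[string]string) string {
	keys := make([]string, 0, len(config))
	for k := range config {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random; sort for determinism

	h := sha256.New()
	fmt.Fprintf(h, "version=%q\n", version)
	for _, k := range keys {
		fmt.Fprintf(h, "%q=%q\n", k, config[k])
	}
	return hex.EncodeToString(h.Sum(nil))[:12]
}

// resourceAttributes returns the two attributes named in the notes above.
func resourceAttributes(version string, config map[string]string) map[string]string {
	return map[string]string{
		"application.version": version,
		"deployment.id":       deploymentID(version, config),
	}
}

func main() {
	attrs := resourceAttributes("1.4.2", map[string]string{"LOG_LEVEL": "info"})
	fmt.Println(attrs["application.version"], attrs["deployment.id"])
}
```

The same hash could double as a config bundle label on time series, since it changes whenever either the version or the configuration changes.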

Given:

There should be

Sample problems might be:

Given a feature flag:

Given an A/B Test:


Given:

There should be:

(Or so forth. Connection semantics.)


Success criteria


Can signals (SIGTERM, SIGINT) be modelled as RPCs? At any rate, we need to figure out whether to log them or trace them. Probably trace.


Given a connection pool, need to model:

Things like:

andrewhowdencom commented 4 years ago

We should make clients a first-class primitive within the server scope. This catches changes that we anticipate but clients do not, such as API breakages.

Something like a "client telemetry library" would be interesting (or just adoption of a standard format in which metrics are surfaced to the service).

This has been done through the peer.* and span.kind tags.

andrewhowdencom commented 4 years ago

Should monitor responses from dependencies. For example, if this service depends on Google Maps (or similar), monitor the response rate for Google Maps.

This is true for all clients, unless there's a shared overlap they can consume.
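
One way this could look in Go is a wrapper around `http.RoundTripper` that counts outcomes per upstream host. This is only a sketch: `countingTransport`, the key format, and the `maps.example` host are made up for illustration, and a real version would export the counts as metrics rather than keep them in a map.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// countingTransport wraps an http.RoundTripper and counts outcomes per
// upstream host. The type and key format are hypothetical.
type countingTransport struct {
	next   http.RoundTripper
	mu     sync.Mutex
	counts map[string]int // key: "<host> <outcome>", e.g. "maps.example 5xx"
}

func newCountingTransport(next http.RoundTripper) *countingTransport {
	return &countingTransport{next: next, counts: map[string]int{}}
}

func (t *countingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := t.next.RoundTrip(req)
	outcome := "error" // transport failure: no response at all
	if err == nil {
		outcome = fmt.Sprintf("%dxx", resp.StatusCode/100)
	}
	t.mu.Lock()
	t.counts[req.URL.Host+" "+outcome]++
	t.mu.Unlock()
	return resp, err
}

// snapshot returns a copy of the current counts.
func (t *countingTransport) snapshot() map[string]int {
	t.mu.Lock()
	defer t.mu.Unlock()
	out := make(map[string]int, len(t.counts))
	for k, v := range t.counts {
		out[k] = v
	}
	return out
}

// roundTripFunc adapts a function to http.RoundTripper; used here as a fake
// upstream so the example needs no network access.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// demo sends three requests through the wrapper and returns the counts.
func demo() map[string]int {
	fake := roundTripFunc(func(r *http.Request) (*http.Response, error) {
		code := http.StatusOK
		if r.URL.Path == "/fail" {
			code = http.StatusServiceUnavailable
		}
		return &http.Response{StatusCode: code, Body: http.NoBody, Request: r}, nil
	})
	t := newCountingTransport(fake)
	client := &http.Client{Transport: t}
	for _, p := range []string{"/ok", "/ok", "/fail"} {
		if resp, err := client.Get("http://maps.example" + p); err == nil {
			resp.Body.Close()
		}
	}
	return t.snapshot()
}

func main() {
	fmt.Println(demo())
}
```

Wrapping the transport rather than each call site means every client of the dependency is counted, including ones added later.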

andrewhowdencom commented 4 years ago

Need language runtime specifics; things like:

andrewhowdencom commented 4 years ago

Discoverability

In Repo

It'd be interesting to see if it's possible to create additional exporters for metrics/tracing/logs that describe when a given signal is used (especially metrics, which are otherwise hard to discover) and what it means.

Rendering it to markdown and adding it to TELEMETRY.md (for example) would be dope.

Cross Repo

A URI that resolves (e.g. tags.dt.o.littleman.co/${TAG_NAME}) would solve the problem of how to describe this stuff.

dt → distributed tracing, o → observability, tags → ... well, tags

Short TLDs are a bit ugly though.

andrewhowdencom commented 4 years ago

Time series should have a label with the config bundle hash.

andrewhowdencom commented 4 years ago

Go's profiling endpoints (or other mechanism to generate profiles from running applications) should be enabled, somehow. (Perhaps via signal? Unsure)

Related: https://github.com/tam7t/sigprof
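
For the HTTP route (as opposed to the signal route above), a rough sketch is to register the `net/http/pprof` handlers on a dedicated mux and serve that on a separate, non-public listener. The `newDebugMux` name and the listen address in the comment are assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newDebugMux registers the pprof handlers on a dedicated mux so they can be
// served on a separate, non-public listener.
// Note: importing net/http/pprof also registers the same handlers on
// http.DefaultServeMux as a side effect, so avoid exposing DefaultServeMux
// publicly when using this package.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index) // also serves named profiles such as heap
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// indexStatus exercises the mux in-process and returns the HTTP status of the
// pprof index page.
func indexStatus() int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/debug/pprof/", nil)
	newDebugMux().ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("pprof index status:", indexStatus())
	// In a real service (the address is an assumption):
	//   go http.ListenAndServe("127.0.0.1:6060", newDebugMux())
}
```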

andrewhowdencom commented 4 years ago

Need to reason through how to deal with A/B tests. Are they just "application configuration"?

What is unique here is that a single deployment can behave in several different ways, depending on which customer (or other split mechanism) a request falls under. We need to know under which condition the error / degradation was encountered.
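
One possible shape for this, sketched with the standard library only: wrap errors with the experiment and variant that were active, so whatever reports the error can attach them as attributes. The `variantError` type, its helpers, and the experiment name are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// variantError records which experiment variant was active when an error
// occurred. The type is hypothetical.
type variantError struct {
	experiment string
	variant    string
	err        error
}

func (e *variantError) Error() string {
	return fmt.Sprintf("%v (experiment=%s variant=%s)", e.err, e.experiment, e.variant)
}

// Unwrap keeps errors.Is and errors.As working on the underlying error.
func (e *variantError) Unwrap() error { return e.err }

// withVariant annotates err with the active experiment and variant.
func withVariant(err error, experiment, variant string) error {
	if err == nil {
		return nil
	}
	return &variantError{experiment: experiment, variant: variant, err: err}
}

// variantOf recovers the experiment and variant from anywhere in the error
// chain, if present.
func variantOf(err error) (experiment, variant string, ok bool) {
	var ve *variantError
	if errors.As(err, &ve) {
		return ve.experiment, ve.variant, true
	}
	return "", "", false
}

// errUpstreamTimeout is a stand-in failure for the example.
var errUpstreamTimeout = errors.New("upstream timeout")

func main() {
	err := fmt.Errorf("handling request: %w",
		withVariant(errUpstreamTimeout, "checkout-flow", "B"))
	exp, v, _ := variantOf(err)
	fmt.Println(exp, v, errors.Is(err, errUpstreamTimeout))
}
```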

andrewhowdencom commented 4 years ago

Semantic Conventions should be in a library somewhere, and reused as constants.
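
A small sketch of what that library could contain, using only attribute names already mentioned in this thread. The constant names and the `unknownKeys` helper are assumptions; the helper is just there to show how shared constants make typos detectable.

```go
package main

import (
	"fmt"
	"sort"
)

// Attribute keys defined once and reused everywhere, instead of being retyped
// as string literals at each call site.
const (
	AttrApplicationVersion = "application.version"
	AttrDeploymentID       = "deployment.id"
	AttrSpanKind           = "span.kind"
)

// knownKeys is the set of keys this (hypothetical) library defines.
var knownKeys = map[string]bool{
	AttrApplicationVersion: true,
	AttrDeploymentID:       true,
	AttrSpanKind:           true,
}

// unknownKeys returns, in sorted order, the attribute keys that are not
// defined as constants above. Useful in tests to catch misspelled keys.
func unknownKeys(attrs map[string]string) []string {
	var out []string
	for k := range attrs {
		if !knownKeys[k] {
			out = append(out, k)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	attrs := map[string]string{
		AttrApplicationVersion: "1.4.2",
		"aplication.version":   "1.4.2", // typo that a constant would have prevented
	}
	fmt.Println(unknownKeys(attrs))
}
```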

andrewhowdencom commented 4 years ago

Meaningful availability

https://youtu.be/7TY8RaolprI

andrewhowdencom commented 4 years ago

Golden Signals:

https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals