Tracing:

Should include:

- application.version=""
- deployment.id="uuid/hash" # A deployment also includes its configuration.
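
As a sketch of how those two could be attached with the OpenTelemetry Go SDK, assuming the attribute names proposed above (they are not the official semantic conventions) and placeholder values:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// Attach the version and deployment identifier once, at the resource
	// level, so every span the process emits carries them.
	res := resource.NewSchemaless(
		attribute.String("application.version", "1.4.2"),    // placeholder
		attribute.String("deployment.id", "deploy-4b8c2f0e"), // placeholder
	)

	tp := sdktrace.NewTracerProvider(sdktrace.WithResource(res))
	defer tp.Shutdown(context.Background())
}
```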
Given:

- span.kind=client
- span.kind=server

There should be:
Sample problems might be:

- Given a feature flag:
- Given an A/B Test:
Given:

- span.kind=client

There should be:

- net.ip.address
- net.ip.src.port
- net.ip.dst.port
- net.transport.protocol
- net.session.tls.version

(Or so forth. Connection semantics.)
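
A rough sketch of what that could look like on a client span with the OpenTelemetry Go API; the attribute names are the ones proposed above rather than established conventions, and the values are hard-coded placeholders:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// callUpstream opens a client span carrying the connection-level attributes
// listed above.
func callUpstream(ctx context.Context) {
	_, span := otel.Tracer("checkout").Start(ctx, "payments.Authorize",
		trace.WithSpanKind(trace.SpanKindClient))
	defer span.End()

	span.SetAttributes(
		attribute.String("net.ip.address", "10.0.0.12"),
		attribute.Int("net.ip.src.port", 52814),
		attribute.Int("net.ip.dst.port", 443),
		attribute.String("net.transport.protocol", "tcp"),
		attribute.String("net.session.tls.version", "1.3"),
	)
}

func main() { callUpstream(context.Background()) }
```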
Success criteria
Can signals (SIGTERM, SIGINT) be modelled as RPCs? At any rate, we need to figure out whether to log them or trace them. Probably trace them.
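
A minimal sketch of the "trace them" option: catch the signal and emit a short span for it. The span name os.signal and the attribute os.signal.name are made-up names here, not a convention:

```go
package main

import (
	"context"
	"os"
	"os/signal"
	"syscall"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)

	sig := <-sigs // block until a signal arrives

	// Record the signal as a short span so it shows up in the trace backend
	// next to ordinary requests, rather than only in a log line.
	_, span := otel.Tracer("process").Start(context.Background(), "os.signal")
	span.SetAttributes(attribute.String("os.signal.name", sig.String()))
	span.End()
}
```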
Given a connection pool, need to model:
Given an alert, it should be possible to collect depth-first snapshots, such as sampled osquery snapshots. This helps keep collection costs down.
Things like:
Given application exit (whether due to a panic, a signal, or similar), cancel all open traces and mark them as errors.
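
Roughly, assuming OpenTelemetry in Go, that could mean wrapping the unit of work so a panic still produces a finished, error-marked span:

```go
package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/codes"
)

// run wraps the unit of work so that a panic still yields a finished,
// error-marked span instead of a trace that silently disappears.
func run(ctx context.Context) {
	ctx, span := otel.Tracer("worker").Start(ctx, "worker.run")
	defer func() {
		if r := recover(); r != nil {
			span.SetStatus(codes.Error, fmt.Sprintf("panic: %v", r))
			span.End()
			panic(r) // re-panic once the span has been closed out
		}
		span.End()
	}()

	doWork(ctx)
}

func doWork(ctx context.Context) {} // hypothetical work; may panic

func main() { run(context.Background()) }
```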
We should make clients a first-class primitive within the server scope. This catches changes we anticipate but clients do not -- things like API breakages.
Something like a "client telemetry library" would be interesting (or simply adopting a standard format in which metrics are surfaced to the service).
This has been done through the peer.* and span.kind tags.
Should monitor responses from dependencies. For example, if this service depends on Google Maps (or similar), monitor the response rate for Google Maps.
This is true for all clients, unless there's a shared overlap they can consume.
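
One way to do this, sketched with a hypothetical Prometheus counter (the metric name and the google-maps dependency label are just examples):

```go
package main

import (
	"net/http"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// One time series per (dependency, code) pair, so alerts can fire on the
// error rate of an upstream dependency, not only on our own endpoints.
var dependencyResponses = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "dependency_responses_total",
	Help: "Responses received from upstream dependencies, by status code.",
}, []string{"dependency", "code"})

func callDependency(client *http.Client, name, url string) (*http.Response, error) {
	resp, err := client.Get(url)
	if err != nil {
		dependencyResponses.WithLabelValues(name, "error").Inc()
		return nil, err
	}
	dependencyResponses.WithLabelValues(name, strconv.Itoa(resp.StatusCode)).Inc()
	return resp, nil
}

func main() {
	resp, err := callDependency(http.DefaultClient, "google-maps", "https://maps.googleapis.com/")
	if err == nil {
		resp.Body.Close()
	}
}
```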
Need language runtime specifics; things like:
It'd be interesting to see whether it's possible to create additional exporters for metrics/tracing/logs that describe when a given thing is used (especially metrics, which are otherwise hard to discover) and what it means.
Rendering it to markdown and adding it to TELEMETRY.md (for example) would be dope.
A URI that resolves (e.g. tags.dt.o.littleman.co/${TAG_NAME}) addresses the question of how to describe this stuff:

- dt → distributed tracing
- o → observability
- tags → ... well, tags

Short TLDs are a bit ugly, though.
Time series should carry a label with the configuration bundle hash.
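
For example, a sketch using Prometheus const labels; the label name config_bundle_hash and the config path are assumptions:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"os"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Hash the configuration bundle the process was started with.
	raw, _ := os.ReadFile("/etc/app/config.yaml") // hypothetical bundle path
	sum := sha256.Sum256(raw)

	requests := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Requests served by this process.",
		// Every time series emitted by this process carries the hash.
		ConstLabels: prometheus.Labels{"config_bundle_hash": hex.EncodeToString(sum[:8])},
	})
	prometheus.MustRegister(requests)
	requests.Inc()
}
```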
Go's profiling endpoints (or another mechanism for generating profiles from running applications) should be enabled, somehow. (Perhaps via a signal? Unsure.)
Related: https://github.com/tam7t/sigprof
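
sigprof triggers profiles on a signal; a hand-rolled equivalent might look roughly like this (SIGUSR1, the output path, and the 30-second window are arbitrary choices, not anything prescribed):

```go
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
	"time"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)

	go func() {
		for range sigs {
			f, err := os.Create("/tmp/cpu.pprof")
			if err != nil {
				continue
			}
			// Capture a 30-second CPU profile each time SIGUSR1 arrives.
			if err := pprof.StartCPUProfile(f); err == nil {
				time.Sleep(30 * time.Second)
				pprof.StopCPUProfile()
			}
			f.Close()
		}
	}()

	select {} // stand-in for the application's real work
}
```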
Need to reason through how to deal with A/B tests. Are they just "application configuration"?
Uniquely, there might be multiple different behaviors of an application depending on which customer (or other split mechanism) is in play. We need to know under which condition the error or degradation was encountered.
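
One option, sketched with hypothetical attribute names, is to tag the active span with the experiment and variant so a later error or latency regression can be sliced by split:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// annotateExperiment records which experiment arm served this request.
// The attribute names are illustrative, not an established convention.
func annotateExperiment(ctx context.Context, experiment, variant string) {
	span := trace.SpanFromContext(ctx)
	span.SetAttributes(
		attribute.String("experiment.name", experiment),
		attribute.String("experiment.variant", variant),
	)
}

func main() {
	annotateExperiment(context.Background(), "checkout-redesign", "control")
}
```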
Semantic Conventions should be in a library somewhere, and reused as constants.
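
OpenTelemetry already ships generated semconv packages that expose the conventions as constants; a small sketch (the exact version in the import path may differ):

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func main() {
	_, span := otel.Tracer("example").Start(context.Background(), "startup")
	defer span.End()

	// Constants instead of hand-typed "service.name" / "service.version" strings.
	span.SetAttributes(
		semconv.ServiceNameKey.String("checkout"),
		semconv.ServiceVersionKey.String("1.4.2"),
	)
}
```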
Meaningful availability
Use OpenTelemetry. Logs are not really for production diagnostics.
Where OpenTelemetry spans are marked as errors, be sure to annotate the span with a log line or event that indicates why it was an error.
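
With the OpenTelemetry Go API that is roughly RecordError plus SetStatus, so the "why" travels with the span:

```go
package main

import (
	"context"
	"errors"

	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

// recordFailure marks the current span as failed and records why, so the
// trace shows more than an unexplained red span.
func recordFailure(ctx context.Context, err error) {
	span := trace.SpanFromContext(ctx)
	span.RecordError(err)                    // attaches an exception event with the message
	span.SetStatus(codes.Error, err.Error()) // marks the span itself as errored
}

func main() {
	recordFailure(context.Background(), errors.New("upstream returned 503"))
}
```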
Additionally, review upstream guidance such as: