Tracing:

Should include:

- application.version=""
- deployment.id="uuid/hash" # A deployment also includes its configuration.
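
As a sketch of how those two could be attached with the OpenTelemetry Go SDK, assuming the attribute names proposed above (they are not the official semantic conventions) and placeholder values:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// Attach the version and deployment identifier once, at the resource
	// level, so every span the process emits carries them.
	res := resource.NewSchemaless(
		attribute.String("application.version", "1.4.2"),    // placeholder
		attribute.String("deployment.id", "deploy-4b8c2f0e"), // placeholder
	)

	tp := sdktrace.NewTracerProvider(sdktrace.WithResource(res))
	defer tp.Shutdown(context.Background())
}
```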
Given:

- span.kind=client
- span.kind=server

There should be:
Sample problems might be:

- Given a feature flag:
- Given an A/B Test:
Given:

- span.kind=client

There should be:

- net.ip.address
- net.ip.src.port
- net.ip.dst.port
- net.transport.protocol
- net.session.tls.version

(Or so forth. Connection semantics.)
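
A rough sketch of what that could look like on a client span with the OpenTelemetry Go API; the attribute names are the ones proposed above rather than established conventions, and the values are hard-coded placeholders:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// callUpstream opens a client span carrying the connection-level attributes
// listed above.
func callUpstream(ctx context.Context) {
	_, span := otel.Tracer("checkout").Start(ctx, "payments.Authorize",
		trace.WithSpanKind(trace.SpanKindClient))
	defer span.End()

	span.SetAttributes(
		attribute.String("net.ip.address", "10.0.0.12"),
		attribute.Int("net.ip.src.port", 52814),
		attribute.Int("net.ip.dst.port", 443),
		attribute.String("net.transport.protocol", "tcp"),
		attribute.String("net.session.tls.version", "1.3"),
	)
}

func main() { callUpstream(context.Background()) }
```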
Success criteria
Can signals (SIGTERM, SIGINT) be modelled as RPCs? At any rate, we need to figure out whether to log them or trace them. Probably trace them.
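
A minimal sketch of the "trace them" option: catch the signal and emit a short span for it. The span name os.signal and the attribute os.signal.name are made-up names here, not a convention:

```go
package main

import (
	"context"
	"os"
	"os/signal"
	"syscall"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)

	sig := <-sigs // block until a signal arrives

	// Record the signal as a short span so it shows up in the trace backend
	// next to ordinary requests, rather than only in a log line.
	_, span := otel.Tracer("process").Start(context.Background(), "os.signal")
	span.SetAttributes(attribute.String("os.signal.name", sig.String()))
	span.End()
}
```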
Given a connection pool, need to model:
Given an alert, it should be possible to collect depth-first snapshots, such as sampled osquery snapshots. This helps keep collection costs down.
Things like:
Given application exit (whether due to a panic, a signal, or similar), cancel all open traces and mark them as errors.
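
Roughly, assuming OpenTelemetry in Go, that could mean wrapping the unit of work so a panic still produces a finished, error-marked span:

```go
package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/codes"
)

// run wraps the unit of work so that a panic still yields a finished,
// error-marked span instead of a trace that silently disappears.
func run(ctx context.Context) {
	ctx, span := otel.Tracer("worker").Start(ctx, "worker.run")
	defer func() {
		if r := recover(); r != nil {
			span.SetStatus(codes.Error, fmt.Sprintf("panic: %v", r))
			span.End()
			panic(r) // re-panic once the span has been closed out
		}
		span.End()
	}()

	doWork(ctx)
}

func doWork(ctx context.Context) {} // hypothetical work; may panic

func main() { run(context.Background()) }
```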
We should make clients a first-class primitive within the server scope. This catches changes we anticipate but clients do not -- things like API breakages.
Something like a "client telemetry library" would be interesting (or simply adopting a standard format in which metrics are surfaced to the service).
This has been done through the peer.* and span.kind tags.
Should monitor responses from dependencies. For example, if this service depends on Google Maps (or similar), monitor the response rate for Google Maps.
This is true for all clients, unless there's a shared overlap they can consume.
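
One way to do this, sketched with a hypothetical Prometheus counter (the metric name and the google-maps dependency label are just examples):

```go
package main

import (
	"net/http"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// One time series per (dependency, code) pair, so alerts can fire on the
// error rate of an upstream dependency, not only on our own endpoints.
var dependencyResponses = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "dependency_responses_total",
	Help: "Responses received from upstream dependencies, by status code.",
}, []string{"dependency", "code"})

func callDependency(client *http.Client, name, url string) (*http.Response, error) {
	resp, err := client.Get(url)
	if err != nil {
		dependencyResponses.WithLabelValues(name, "error").Inc()
		return nil, err
	}
	dependencyResponses.WithLabelValues(name, strconv.Itoa(resp.StatusCode)).Inc()
	return resp, nil
}

func main() {
	resp, err := callDependency(http.DefaultClient, "google-maps", "https://maps.googleapis.com/")
	if err == nil {
		resp.Body.Close()
	}
}
```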
Need language runtime specifics; things like:
It'd be interesting to see whether it's possible to create additional exporters for metrics/tracing/logs that describe when a given thing is used (especially metrics, which are otherwise hard to discover) and what it means.
Rendering it to markdown and adding it to TELEMETRY.md (for example) would be dope.
A URI that resolves (e.g. tags.dt.o.littleman.co/${TAG_NAME}) addresses the question of how to describe this stuff:

- dt → distributed tracing
- o → observability
- tags → ... well, tags

Short TLDs are a bit ugly, though.
Time series should carry a label with the configuration bundle hash.
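
For example, a sketch using Prometheus const labels; the label name config_bundle_hash and the config path are assumptions:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"os"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Hash the configuration bundle the process was started with.
	raw, _ := os.ReadFile("/etc/app/config.yaml") // hypothetical bundle path
	sum := sha256.Sum256(raw)

	requests := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Requests served by this process.",
		// Every time series emitted by this process carries the hash.
		ConstLabels: prometheus.Labels{"config_bundle_hash": hex.EncodeToString(sum[:8])},
	})
	prometheus.MustRegister(requests)
	requests.Inc()
}
```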
Go's profiling endpoints (or another mechanism for generating profiles from running applications) should be enabled, somehow. (Perhaps via a signal? Unsure.)
Related: https://github.com/tam7t/sigprof
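
sigprof triggers profiles on a signal; a hand-rolled equivalent might look roughly like this (SIGUSR1, the output path, and the 30-second window are arbitrary choices, not anything prescribed):

```go
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
	"time"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)

	go func() {
		for range sigs {
			f, err := os.Create("/tmp/cpu.pprof")
			if err != nil {
				continue
			}
			// Capture a 30-second CPU profile each time SIGUSR1 arrives.
			if err := pprof.StartCPUProfile(f); err == nil {
				time.Sleep(30 * time.Second)
				pprof.StopCPUProfile()
			}
			f.Close()
		}
	}()

	select {} // stand-in for the application's real work
}
```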
Need to reason through how to deal with A/B tests. Are they just "application configuration"?
Uniquely, there might be multiple different behaviors of an application depending on which customer (or other split mechanism) is in play. We need to know under which condition the error or degradation was encountered.
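
One option, sketched with hypothetical attribute names, is to tag the active span with the experiment and variant so a later error or latency regression can be sliced by split:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// annotateExperiment records which experiment arm served this request.
// The attribute names are illustrative, not an established convention.
func annotateExperiment(ctx context.Context, experiment, variant string) {
	span := trace.SpanFromContext(ctx)
	span.SetAttributes(
		attribute.String("experiment.name", experiment),
		attribute.String("experiment.variant", variant),
	)
}

func main() {
	annotateExperiment(context.Background(), "checkout-redesign", "control")
}
```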
Semantic Conventions should be in a library somewhere, and reused as constants.
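
OpenTelemetry already ships generated semconv packages that expose the conventions as constants; a small sketch (the exact version in the import path may differ):

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func main() {
	_, span := otel.Tracer("example").Start(context.Background(), "startup")
	defer span.End()

	// Constants instead of hand-typed "service.name" / "service.version" strings.
	span.SetAttributes(
		semconv.ServiceNameKey.String("checkout"),
		semconv.ServiceVersionKey.String("1.4.2"),
	)
}
```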
Meaningful availability
Use OpenTelemetry. Logs are not really for production diagnostics.
Where OpenTelemetry spans are marked as errors, be sure to annotate the span with a log line or event that indicates why it was an error.
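
With the OpenTelemetry Go API that is roughly RecordError plus SetStatus, so the "why" travels with the span:

```go
package main

import (
	"context"
	"errors"

	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

// recordFailure marks the current span as failed and records why, so the
// trace shows more than an unexplained red span.
func recordFailure(ctx context.Context, err error) {
	span := trace.SpanFromContext(ctx)
	span.RecordError(err)                    // attaches an exception event with the message
	span.SetStatus(codes.Error, err.Error()) // marks the span itself as errored
}

func main() {
	recordFailure(context.Background(), errors.New("upstream returned 503"))
}
```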
Additionally, review upstream guidance such as: