Open andrewhowdencom opened 4 years ago
Client libraries should be able to detect connections to dead upstreams (SQL and so forth) and terminate them. Keepalives by default, for example.
Service expressing degradation
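A minimal sketch of what "keepalives by default" could look like in a Go client, using only the standard library. The function names (`newUpstreamDialer`, `tunePool`) and every duration are illustrative assumptions, not recommendations from this issue.

```go
package main

import (
	"database/sql"
	"net"
	"time"
)

// newUpstreamDialer returns a dialer with an explicit TCP keepalive period,
// so that a peer which silently disappears is eventually detected.
// The durations are placeholder values.
func newUpstreamDialer() *net.Dialer {
	return &net.Dialer{
		Timeout:   5 * time.Second,  // give up on connects that hang
		KeepAlive: 30 * time.Second, // interval between keepalive probes
	}
}

// tunePool bounds how long pooled SQL connections may live or sit idle,
// so connections to a dead or replaced upstream get recycled.
func tunePool(db *sql.DB) {
	db.SetConnMaxLifetime(30 * time.Minute)
	db.SetConnMaxIdleTime(5 * time.Minute)
}
```

The dialer covers the transport level, while the pool settings cover the case where a connection looks healthy but points at an upstream that has since been replaced.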
SLOs (expressed on the wire as "x-violated-slo: foo,bar,baz")
Services might degrade gracefully. Clients should not know this (that's what makes it graceful), but telemetry should.
Given this, responses could be marked as "in violation of SLO" (i.e. degraded). This would be a header (`x-violated-slo`) combined with a comma-separated list of keys that name the relevant SLOs (`correctness,latency`).
This could also just be appended to trace information. Unsure of the name (`violated-slo` is nasty) but :man_shrugging:
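As a rough sketch, setting and reading such a header in Go might look like the following. The helper names are hypothetical, and the comma-separated value format is an assumption based on the `foo,bar,baz` example above.

```go
package main

import (
	"net/http"
	"strings"
)

// violatedSLOHeader is the header name floated in this issue; it may change.
const violatedSLOHeader = "x-violated-slo"

// markViolatedSLOs records which SLOs a response violated.
// With no violations the header is left unset.
func markViolatedSLOs(h http.Header, violated ...string) {
	if len(violated) == 0 {
		return
	}
	h.Set(violatedSLOHeader, strings.Join(violated, ","))
}

// parseViolatedSLOs is the telemetry side: it returns the violated SLO keys,
// or nil if the response was not marked as degraded.
func parseViolatedSLOs(h http.Header) []string {
	v := h.Get(violatedSLOHeader)
	if v == "" {
		return nil
	}
	parts := strings.Split(v, ",")
	for i := range parts {
		parts[i] = strings.TrimSpace(parts[i])
	}
	return parts
}

// exampleRoundTrip marks a response as degraded and reads the marker back.
func exampleRoundTrip() []string {
	h := http.Header{}
	markViolatedSLOs(h, "correctness", "latency")
	return parseViolatedSLOs(h)
}
```

Only telemetry would call `parseViolatedSLOs`; ordinary clients would ignore the header, which keeps the degradation graceful.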
Graceful termination:
https://gist.github.com/ivan3bx/b0f14449803ce5b0aa72afaa1dfc75e1
Services should terminate gracefully: stop accepting new work, finish in-flight requests, then exit.
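A minimal sketch of graceful termination for a plain `net/http` server, assuming the process is stopped with SIGINT or SIGTERM. The helper names and the grace period are illustrative and are not taken from the linked gist.

```go
package main

import (
	"context"
	"errors"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// defaultGrace is a placeholder for how long in-flight requests may drain.
const defaultGrace = 10 * time.Second

// signalContext returns a context that is cancelled on SIGINT or SIGTERM.
func signalContext() (context.Context, context.CancelFunc) {
	return signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
}

// newServer builds a trivial server; the handler is a stand-in.
func newServer(addr string) *http.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})
	return &http.Server{Addr: addr, Handler: mux}
}

// serveGracefully serves until ctx is cancelled, then stops accepting new
// connections and waits up to grace for in-flight requests to finish.
func serveGracefully(ctx context.Context, srv *http.Server, grace time.Duration) error {
	errCh := make(chan error, 1)
	go func() { errCh <- srv.ListenAndServe() }()

	select {
	case err := <-errCh:
		return err // the server failed before any shutdown was requested
	case <-ctx.Done():
	}

	shutdownCtx, cancel := context.WithTimeout(context.Background(), grace)
	defer cancel()
	if err := srv.Shutdown(shutdownCtx); err != nil {
		return err // the grace period ran out with requests still in flight
	}
	// ListenAndServe reports ErrServerClosed after a clean shutdown.
	if err := <-errCh; !errors.Is(err, http.ErrServerClosed) {
		return err
	}
	return nil
}
```

A `main` function would call `signalContext()` once and pass the resulting context to `serveGracefully(ctx, newServer(":8080"), defaultGrace)`; the address is a placeholder.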
Retries w. Exponential Backoff + Jitter (within gRPC):
Apparently this is built into gRPC. See spec definition at:
https://github.com/grpc/proposal/blob/master/A6-client-retries.md
And an example (apparently) at:
https://github.com/grpc/grpc-go/tree/master/examples/features/retry
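For intuition, here is a standalone sketch of the backoff calculation as I read it from the A6 proposal: the n-th retry waits a random duration between 0 and `min(initialBackoff * backoffMultiplier^(n-1), maxBackoff)`. The Go type and function names are hypothetical and are not the grpc-go API; in gRPC itself the policy is supplied through the service config rather than written by hand.

```go
package main

import (
	"math"
	"math/rand"
	"time"
)

// retryPolicy mirrors the backoff-related fields of the A6 retry policy.
// This is a hypothetical local type, not one exported by grpc-go.
type retryPolicy struct {
	InitialBackoff    time.Duration
	MaxBackoff        time.Duration
	BackoffMultiplier float64
}

// defaultPolicy uses illustrative values only.
var defaultPolicy = retryPolicy{
	InitialBackoff:    100 * time.Millisecond,
	MaxBackoff:        time.Second,
	BackoffMultiplier: 2,
}

// backoffCeiling returns the upper bound of the delay before retry number
// attempt (1-based): exponential growth, capped at MaxBackoff.
func backoffCeiling(p retryPolicy, attempt int) time.Duration {
	ceiling := float64(p.InitialBackoff) * math.Pow(p.BackoffMultiplier, float64(attempt-1))
	if ceiling > float64(p.MaxBackoff) {
		return p.MaxBackoff
	}
	return time.Duration(ceiling)
}

// jitteredBackoff picks a uniformly random delay in [0, ceiling], which
// spreads retries from many clients out in time.
func jitteredBackoff(p retryPolicy, attempt int) time.Duration {
	ceiling := backoffCeiling(p, attempt)
	if ceiling <= 0 {
		return 0
	}
	return time.Duration(rand.Int63n(int64(ceiling) + 1))
}
```

With these example values the ceilings for the first retries are 100ms, 200ms, 400ms, 800ms, and then 1s from the fifth retry onward.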
Fallbacks:
Not sure yet.
Bulkheading
Network APIs that deliberately misbehave a little, but only in ways that clients are designed to handle (that is, permanent chaos testing: a maximum lifetime for connections, for example)
for gRPC:
Default rate limits: not to protect service health, but rather to protect cash / investment and to discourage "abusing" the service.
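One common way to implement a default rate limit is a token bucket. The sketch below is a standard-library-only illustration; the type, its capacity, and its refill rate are assumptions, and a real service might prefer an existing limiter library.

```go
package main

import (
	"math"
	"sync"
	"time"
)

// tokenBucket allows bursts of up to capacity requests and refills at a
// steady rate. All names and numbers here are illustrative.
type tokenBucket struct {
	mu       sync.Mutex
	capacity float64
	refill   float64 // tokens added per second
	tokens   float64
	last     time.Time
}

// newTokenBucket starts with a full bucket at time now.
func newTokenBucket(capacity, refillPerSecond float64, now time.Time) *tokenBucket {
	return &tokenBucket{capacity: capacity, refill: refillPerSecond, tokens: capacity, last: now}
}

// allow reports whether one request may proceed at time now.
// The clock is passed in so the behaviour is easy to test.
func (b *tokenBucket) allow(now time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens = math.Min(b.capacity, b.tokens+elapsed*b.refill)
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// exampleBurst sends 8 requests at once, waits 2 seconds, then sends 8 more,
// and returns how many requests were allowed in each burst.
func exampleBurst() (firstBurst, afterPause int) {
	start := time.Unix(0, 0)
	b := newTokenBucket(5, 1, start)
	for i := 0; i < 8; i++ {
		if b.allow(start) {
			firstBurst++
		}
	}
	later := start.Add(2 * time.Second)
	for i := 0; i < 8; i++ {
		if b.allow(later) {
			afterPause++
		}
	}
	return firstBurst, afterPause
}
```

With a capacity of 5 and a refill of one token per second, `exampleBurst` allows 5 requests in the first burst and 2 after the pause.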
Timeouts — allow the operation to be completely disabled with a timeout of 0. It should be considered an "instant timeout".
Timeouts should also be able to be adjusted dynamically and quickly. (all configurations should be)
Services should try to "fail fast" proactively (especially on a per-replica basis), as this maximizes the time left for retries.
Additionally, SLOs for the parent service should factor in retries for the client service.
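The timeout behaviour described above could be sketched as follows: the timeout is held in an atomic value so it can be changed at runtime, and a timeout of 0 expires immediately, which disables the operation. The type and function names are hypothetical, and `atomic.Int64` assumes Go 1.19 or later.

```go
package main

import (
	"context"
	"errors"
	"sync/atomic"
	"time"
)

// dynamicTimeout holds a timeout that can be changed while the service runs.
// The zero value is a timeout of 0, i.e. the operation is disabled.
type dynamicTimeout struct{ nanos atomic.Int64 }

// Set changes the timeout for all subsequent calls.
func (t *dynamicTimeout) Set(d time.Duration) { t.nanos.Store(int64(d)) }

// call runs op under the current timeout. A timeout of 0 yields a context
// that has already expired, so op is never started: an instant timeout.
func (t *dynamicTimeout) call(ctx context.Context, op func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, time.Duration(t.nanos.Load()))
	defer cancel()
	if err := ctx.Err(); err != nil {
		return err // fail fast without doing any work
	}
	return op(ctx)
}

// isInstantTimeout reports whether err is a deadline expiry.
func isInstantTimeout(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

// exampleTimeouts calls a no-op operation with the timeout at 0, then again
// after raising it to one second.
func exampleTimeouts() (disabledErr, enabledErr error) {
	var t dynamicTimeout
	op := func(ctx context.Context) error { return nil }
	disabledErr = t.call(context.Background(), op)
	t.Set(time.Second)
	enabledErr = t.call(context.Background(), op)
	return disabledErr, enabledErr
}
```

How `Set` gets called (config watcher, admin endpoint, feature flag) is left open here; the point is only that the change takes effect on the next call without a restart.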
Permanent chaos. Database failovers, injection of latency, blackholing single nodes, SIGKILLing random nodes (including data nodes), deleting replicas, changing addresses (such as DNS)
Rate limit the number of entities within $SERVICE by $IDENTITY. Use cases are especially interesting for customer-oriented actions (i.e. the number of addresses a given customer might have), but also for service-to-service calls.
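A minimal in-memory sketch of an entity limit per identity, for example capping how many addresses one customer may create. The names are hypothetical, and a real service would presumably enforce this in its datastore rather than in process memory.

```go
package main

import (
	"errors"
	"sync"
)

// errQuotaExceeded is returned when an identity already owns the maximum
// number of entities.
var errQuotaExceeded = errors.New("entity quota exceeded")

// entityQuota counts entities per identity and enforces a shared limit.
type entityQuota struct {
	mu     sync.Mutex
	limit  int
	counts map[string]int
}

// newEntityQuota creates a quota allowing limit entities per identity.
func newEntityQuota(limit int) *entityQuota {
	return &entityQuota{limit: limit, counts: make(map[string]int)}
}

// add registers one new entity for identity, or fails if the limit is reached.
func (q *entityQuota) add(identity string) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.counts[identity] >= q.limit {
		return errQuotaExceeded
	}
	q.counts[identity]++
	return nil
}

// remove releases one entity for identity, never going below zero.
func (q *entityQuota) remove(identity string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.counts[identity] > 0 {
		q.counts[identity]--
	}
}
```

The same structure works for service-to-service limits if the identity is a calling service's name instead of a customer ID.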
Fallback strategies
When creating a service it is important that we express not only when we are healthy, but also when we are not healthy. That allows users to proactively switch their own services into a failure-handling mode.
This includes things like: