Resilience - GitHub issues

andrewhowdencom commented 4 years ago

Fallback strategies

When creating a service it is important that we express not only when we are healthy, but also when we are not. That allows users to proactively switch their own services into a failure-handling mode.

This includes things like:

retry strategies
transaction IDs (for example)
timeouts
circuit breakers
bucketed rate limiting
traffic classes (batch versus synchronous)

andrewhowdencom commented 4 years ago

https://github.com/Netflix/concurrency-limits#overview

andrewhowdencom commented 4 years ago

https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/

andrewhowdencom commented 4 years ago

Client libraries should be able to detect connections to dead upstreams (SQL and so forth) and terminate them. Keepalives should be enabled by default, for example.

andrewhowdencom commented 4 years ago

Service expressing degradation

SLOs (expressed on the wire as "x-violated-slo: foo,bar,baz")

Services might degrade gracefully. Clients should not know this (that's what makes it graceful), but telemetry should.

Given this, responses could be marked as "in violation of SLO" (i.e. degraded). This would be a header (x-violated-slo) carrying a comma-separated list of keys that name the relevant SLOs (for example: correctness,latency).

This could also just be appended to trace information. Unsure of the name (violated-slo is nasty) but :man_shrugging:

andrewhowdencom commented 4 years ago

Graceful termination:

https://gist.github.com/ivan3bx/b0f14449803ce5b0aa72afaa1dfc75e1

Services should terminate gracefully: stop accepting new requests, let in-flight requests finish, then exit.

andrewhowdencom commented 4 years ago

Retries w. Exponential Backoff + Jitter (within gRPC):

Apparently this is built in to gRPC. See spec definition at:

https://github.com/grpc/proposal/blob/master/A6-client-retries.md

And an example (apparently) at:

https://github.com/grpc/grpc-go/tree/master/examples/features/retry

Fallbacks:

Not sure yet.

andrewhowdencom commented 4 years ago

Bulkheading

andrewhowdencom commented 4 years ago

Network APIs should misbehave slightly by design, in ways that clients are built to handle (that is, permanent chaos testing). A maximum lifetime for connections is one example.

for gRPC:

MAX_CONNECTION_AGE (30m)¹
MAX_CONNECTION_AGE_GRACE (5m)¹
Client side Keepalive¹
Use wait for ready¹

https://www.youtube.com/watch?v=Naonb2XD_2Q

andrewhowdencom commented 4 years ago

Default rate limits: not to protect service health, but to protect cash and investment, and to discourage "abusing" the service.

andrewhowdencom commented 4 years ago

Timeouts — allow the operation to be completely disabled with a timeout of 0. It should be considered an "instant timeout".

Timeouts should also be able to be adjusted dynamically and quickly. (all configurations should be)

andrewhowdencom commented 4 years ago

Services should try to proactively "fail fast" (especially on a per-replica basis), as this maximizes the time left for retries.

Additionally, the SLOs for a parent service should factor in the retries it makes against the services it calls.

andrewhowdencom commented 4 years ago

Permanent chaos. Database failovers, injection of latency, blackholing single nodes, SIGKILLing random nodes (including data nodes), deleting replicas, changing addresses (such as DNS)

andrewhowdencom commented 3 years ago

Rate limit the number of entities within $SERVICE by $IDENTITY. Use cases are especially interesting regarding customer-oriented actions (e.g. the number of addresses a given customer might have), but also apply to service-to-service calls.

littlemanco / the-golden-path.net

Resilience #15