littlemanco / the-golden-path.net

A template for writing a new tool or service.
Resilience #15

Open andrewhowdencom opened 4 years ago

andrewhowdencom commented 4 years ago

Fallback strategies

When creating a service, it is important that we express not only when we are healthy but also when we are not. That allows users to proactively switch their own services into a failure-handling mode.

This includes things like:

andrewhowdencom commented 4 years ago

https://github.com/Netflix/concurrency-limits#overview

See also,

https://github.com/platinummonkey/go-concurrency-limits

andrewhowdencom commented 4 years ago

https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/

andrewhowdencom commented 4 years ago

Client libraries should be able to detect connections to dead upstreams (SQL and so forth) and terminate them, for example by enabling keep-alives by default.

andrewhowdencom commented 4 years ago

Service expressing degradation

SLOs (expressed on the wire as "x-violated-slo: foo,bar,baz")

Services might degrade gracefully. Clients should not know this (that's what makes it graceful), but telemetry should.

Given this, responses could be marked as "in violation of SLO" (i.e. degraded). This would be a header (x-violated-slo) carrying a comma-separated list of keys that name the relevant SLOs (for example correctness,latency).

This could also just be appended to trace information. Unsure of the name (violated-slo is nasty) but :man_shrugging:

andrewhowdencom commented 4 years ago

Graceful termination:

https://gist.github.com/ivan3bx/b0f14449803ce5b0aa72afaa1dfc75e1

The service should terminate gracefully: stop accepting new requests, let in-flight requests finish, then exit.

andrewhowdencom commented 4 years ago

Retries w. Exponential Backoff + Jitter (within gRPC):

Apparently this is built into gRPC. See the spec definition at:

https://github.com/grpc/proposal/blob/master/A6-client-retries.md

And an example (apparently) at:

https://github.com/grpc/grpc-go/tree/master/examples/features/retry
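As a sketch of what that looks like, the retry policy is expressed as a JSON service config following the `retryPolicy` shape in the A6 proposal. The service name `example.Echo` is a placeholder and the numbers are illustrative; `validConfig` is a hypothetical helper that only checks the JSON is well formed.

```go
package main

import "encoding/json"

// retryServiceConfig follows the retryPolicy shape described in gRFC A6.
// "example.Echo" is a placeholder service name; the values are illustrative.
const retryServiceConfig = `{
  "methodConfig": [{
    "name": [{"service": "example.Echo"}],
    "retryPolicy": {
      "maxAttempts": 4,
      "initialBackoff": "0.1s",
      "maxBackoff": "1s",
      "backoffMultiplier": 2.0,
      "retryableStatusCodes": ["UNAVAILABLE"]
    }
  }]
}`

// validConfig reports whether the config is well-formed JSON, which catches
// typos before the string is handed to the gRPC client.
func validConfig() bool {
	return json.Valid([]byte(retryServiceConfig))
}
```

With grpc-go, the string would be passed as a dial option via `grpc.WithDefaultServiceConfig(retryServiceConfig)`, which is how the linked example wires it up. Per the proposal, the delay before each retry is chosen randomly between zero and the current exponential backoff, which provides the jitter.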

Fallbacks:

Not sure yet.

andrewhowdencom commented 4 years ago

Bulkheading
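A minimal sketch of a bulkhead in Go, assuming the simplest form: a fixed number of concurrent slots per dependency, with callers rejected (rather than queued) when the slots are taken. `Bulkhead`, `NewBulkhead` and `ErrBulkheadFull` are names made up for this sketch.

```go
package main

import "errors"

// ErrBulkheadFull is returned when every slot is already in use.
var ErrBulkheadFull = errors.New("bulkhead full")

// Bulkhead caps the number of concurrent calls to one dependency so that a
// slow dependency cannot consume every goroutine or connection in the service.
type Bulkhead struct {
	slots chan struct{}
}

// NewBulkhead returns a bulkhead that allows n concurrent calls.
func NewBulkhead(n int) *Bulkhead {
	return &Bulkhead{slots: make(chan struct{}, n)}
}

// Do runs fn if a slot is free and rejects the call immediately otherwise.
func (b *Bulkhead) Do(fn func() error) error {
	select {
	case b.slots <- struct{}{}:
		defer func() { <-b.slots }() // release the slot when fn returns
		return fn()
	default:
		return ErrBulkheadFull
	}
}
```

Each upstream dependency would get its own `Bulkhead`, so exhausting the slots for one does not affect calls to the others.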

andrewhowdencom commented 4 years ago

Network APIs that are deliberately a little hostile, in ways that clients are designed to handle (that is, permanent chaos testing). Enforcing a maximum lifetime on connections is one example.

for gRPC:

  1. https://www.youtube.com/watch?v=Naonb2XD_2Q

andrewhowdencom commented 4 years ago

Default rate limits: not to protect service health, but to protect cash / investment and to discourage "abusing" the service.

andrewhowdencom commented 4 years ago

Timeouts: allow an operation to be completely disabled by setting its timeout to 0, which should be treated as an "instant timeout".

Timeouts should also be adjustable dynamically and quickly (as all configuration should be).

andrewhowdencom commented 4 years ago

Services should try to proactively "fail fast" (especially on a per-replica basis), as this maximizes the time left for retries.

Additionally, the SLOs for a parent service should factor in retries for the client service.

andrewhowdencom commented 4 years ago

Permanent chaos. Database failovers, injection of latency, blackholing single nodes, SIGKILLing random nodes (including data nodes), deleting replicas, changing addresses (such as DNS)

andrewhowdencom commented 3 years ago

Rate limit the number of entities within $SERVICE by $IDENTITY. The use cases are especially interesting for customer-oriented actions (e.g. the number of addresses a given customer might have), but also apply service to service.
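A small in-memory sketch of such a limit in Go, assuming a single flat limit for every identity. `Quota`, `Reserve`, `Release` and `ErrQuotaExceeded` are hypothetical names; a real service would more likely enforce this in its datastore.

```go
package main

import (
	"errors"
	"sync"
)

// ErrQuotaExceeded is returned when an identity already owns the maximum
// number of entities.
var ErrQuotaExceeded = errors.New("entity quota exceeded")

// Quota caps how many entities each identity may own.
type Quota struct {
	mu    sync.Mutex
	limit int
	count map[string]int
}

// NewQuota returns a quota allowing limit entities per identity.
func NewQuota(limit int) *Quota {
	return &Quota{limit: limit, count: make(map[string]int)}
}

// Reserve claims one slot for identity, or fails if it is at its limit.
func (q *Quota) Reserve(identity string) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.count[identity] >= q.limit {
		return ErrQuotaExceeded
	}
	q.count[identity]++
	return nil
}

// Release frees a slot when an entity is deleted.
func (q *Quota) Release(identity string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.count[identity] > 0 {
		q.count[identity]--
	}
}
```

For the address example, `Reserve(customerID)` would be called before creating an address and `Release(customerID)` after deleting one.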