quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
https://quickwit.io

Added a circuit breaker layer #5134

fulmicoton closed 1 week ago

fulmicoton commented 3 weeks ago

Adds a circuit breaker layer.

The piece that estimates whether the next request is likely to fail is extremely simplistic for the moment. It simply counts the number of errors that happened in a given time window, without taking successes into account.
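For illustration only, here is a minimal sketch of that kind of estimator. It is not the code from this PR: `ErrorWindow`, `record_error`, and the threshold and window parameters are hypothetical names, and the fixed (non-sliding) window is an assumption. The caller passes the current time in explicitly so the behavior is easy to test.

```rust
use std::time::{Duration, Instant};

/// Hypothetical estimator: counts errors inside a fixed time window and
/// ignores successes entirely.
pub struct ErrorWindow {
    max_errors: u32,
    window: Duration,
    window_start: Option<Instant>,
    num_errors: u32,
}

impl ErrorWindow {
    pub fn new(max_errors: u32, window: Duration) -> Self {
        Self {
            max_errors,
            window,
            window_start: None,
            num_errors: 0,
        }
    }

    /// Records an error observed at `now`.
    /// Returns true once more than `max_errors` errors fell in the current window.
    pub fn record_error(&mut self, now: Instant) -> bool {
        match self.window_start {
            // Still inside the current window: keep counting.
            Some(start) if now.duration_since(start) < self.window => {
                self.num_errors += 1;
            }
            // No window yet, or the previous one expired: start a new one.
            _ => {
                self.window_start = Some(now);
                self.num_errors = 1;
            }
        }
        self.num_errors > self.max_errors
    }
}
```

With `max_errors = 2` and a one-second window, the third error within the same second would report that the threshold is exceeded, and an error arriving several seconds later would start a fresh window.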

The reason is that, for the moment, we want to use it for persist requests when the WAL is full. On airmail, the aggressive retry logic of the client was causing a massive gRPC storm on the faulty indexer node, taking all of its CPU and preventing it from getting out of that state.

In this case, the error estimation logic can be very simple: a full WAL guarantees that no further persist request will succeed for a little while.
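To make the idea concrete, the sketch below shows a breaker that opens as soon as it sees an error the caller marks as "tripping" (for example, a full WAL) and then rejects calls locally for a cooldown period instead of forwarding them. This is an assumption-laden illustration, not the layer added in this PR: `CircuitBreaker`, `CallError`, `call`, `is_tripping`, and the single-error trip rule are all hypothetical.

```rust
use std::time::{Duration, Instant};

/// Error returned by the hypothetical breaker.
#[derive(Debug, PartialEq)]
pub enum CallError<E> {
    /// Rejected locally: the inner operation was not attempted.
    CircuitOpen,
    /// The inner operation ran and failed.
    Inner(E),
}

/// Hypothetical breaker: opens for `cooldown` after a tripping error.
pub struct CircuitBreaker {
    cooldown: Duration,
    open_until: Option<Instant>,
}

impl CircuitBreaker {
    pub fn new(cooldown: Duration) -> Self {
        Self {
            cooldown,
            open_until: None,
        }
    }

    /// Runs `op` unless the circuit is open at `now`.
    /// `is_tripping` decides which errors open the circuit.
    pub fn call<T, E>(
        &mut self,
        now: Instant,
        op: impl FnOnce() -> Result<T, E>,
        is_tripping: impl Fn(&E) -> bool,
    ) -> Result<T, CallError<E>> {
        if let Some(until) = self.open_until {
            if now < until {
                // Fail fast so the struggling node receives no traffic.
                return Err(CallError::CircuitOpen);
            }
            // Cooldown elapsed: close the circuit and try again.
            self.open_until = None;
        }
        match op() {
            Ok(value) => Ok(value),
            Err(err) => {
                if is_tripping(&err) {
                    self.open_until = Some(now + self.cooldown);
                }
                Err(CallError::Inner(err))
            }
        }
    }
}
```

The point of the design is that, while the circuit is open, retries cost the caller almost nothing and never reach the faulty node, which gives it room to recover.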

Tested

github-actions[bot] commented 3 weeks ago

On SSD:

Average search latency is 1.01x that of the reference (lower is better).
Ref run id: 2185, ref commit: 7547ac38a021723ad4ef1df54583c146ad31a74b

On GCS:

Average search latency is 1.06x that of the reference (lower is better).
Ref run id: 2186, ref commit: 7547ac38a021723ad4ef1df54583c146ad31a74b