twitter / finagle

A fault tolerant, protocol-agnostic RPC system
https://twitter.github.io/finagle
Apache License 2.0

How do you calculate your retry budget? #946

Open rafatbiin opened 1 year ago

rafatbiin commented 1 year ago

I was reading https://finagle.github.io/blog/2016/02/08/retry-budgets/ and came across the default values of minRetriesPerSec and percentCanRetry. As I understand it, these numbers can vary from service to service. How do you calculate them with the following objectives in mind?

  1. Your retry budget should be relaxed enough that it does not block retries in a normal scenario.
  2. Your retry budget should be strict enough that it safeguards against a retry storm.

csaltos commented 5 months ago

The values depend on your case: the size of your servers, the number of connections, and many other factors. Normally you start with conservative numbers, then test the performance of your system and tune accordingly.
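
As a rough starting point before load testing, the two objectives in the question can be checked with back-of-envelope arithmetic. The sketch below assumes the documented meaning of the parameters (percentCanRetry is the fraction of requests that may be retried, in addition to a flat minRetriesPerSec allowance). The object and method names are hypothetical helpers, not part of Finagle, and the result is only a steady-state approximation.

```scala
// Hypothetical helper, not a Finagle API: approximate steady-state limits
// implied by a retry budget, based on the documented parameter meanings.
object RetryBudgetEstimate {
  // Approximate ceiling on retries per second:
  // a flat allowance plus a fraction of the request rate.
  def maxRetriesPerSec(
    requestsPerSec: Double,
    minRetriesPerSec: Int,
    percentCanRetry: Double
  ): Double =
    minRetriesPerSec + percentCanRetry * requestsPerSec

  // Worst-case load multiplier on the backend if every allowed retry is used.
  def amplification(
    requestsPerSec: Double,
    minRetriesPerSec: Int,
    percentCanRetry: Double
  ): Double =
    (requestsPerSec + maxRetriesPerSec(requestsPerSec, minRetriesPerSec, percentCanRetry)) /
      requestsPerSec
}
```

With the defaults (minRetriesPerSec = 10, percentCanRetry = 0.2), a client sending 1000 requests per second could retry roughly 210 times per second, so the backend would see at most about 1.21x the normal load. At 10 requests per second the flat allowance dominates and the multiplier is about 2.2x. For objective 1, the normal failure rate should sit comfortably below percentCanRetry; for objective 2, the multiplier should be a load the backend can still absorb.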

As a reference, this is the configuration we are using at my company:

import com.twitter.conversions.DurationOps._
import com.twitter.finagle.Backoff
import com.twitter.finagle.Http
import com.twitter.finagle.ServiceFactory
import com.twitter.finagle.http
import com.twitter.finagle.service.ReqRep
import com.twitter.finagle.service.ResponseClass
import com.twitter.finagle.service.ResponseClassifier
import com.twitter.finagle.service.RetryBudget
import com.twitter.util.Duration
import com.twitter.util.Future
import com.twitter.util.Return
import com.twitter.util.StorageUnit
import com.twitter.util.Timer

val host = "test.com"
val url = "https://test.com/test1"
val totalRequestTimeout = 5.seconds
val referenceTimeout =
    Duration.fromMilliseconds(
      Math.max(1L, totalRequestTimeout.inMillis / 5L - 100L)
    )
val initialRequestTimeout =
    Duration.fromMilliseconds(referenceTimeout.inMillis * 2L)
val retryRequestTimeout =
    Duration.fromMilliseconds(referenceTimeout.inMillis * 3L)
val maxResponseSizeInBytes = 10000000
val clientFactory = Http.client
      .withRequestTimeout(initialRequestTimeout)
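      // RetryBudget() uses the library defaults: ttl = 10.seconds, minRetriesPerSec = 10, percentCanRetry = 0.2
      .withRetryBudget(RetryBudget())
      // `backoff` (the maximum backoff Duration) is not defined in this snippet
      .withRetryBackoff(Backoff.exponentialJittered(1.second, backoff))
      // `customResponseClassifierOnErrors` is an application-specific classifier, also not shown here
      .withResponseClassifier(
        customResponseClassifierOnErrors orElse http.service.HttpResponseClassifier.ServerErrorsAsFailures
      )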
      .withMaxResponseSize(
          StorageUnit.fromBytes(maxResponseSizeInBytes)
        )
      .withTls(host)
// newService returns a Service[http.Request, http.Response] that can be applied to a request directly
val client = clientFactory.newService("test.com:443")
val requestBuilder = http
      .RequestBuilder()
      .url(url)
      .addHeader(
        http.Fields.UserAgent,
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:50.0) Gecko/20100101 Firefox/50.0"
      ).addHeader(http.Fields.Host, host)

val request = requestBuilder.buildGet()
// Applying the service yields a Future[http.Response]
val response = client(request)
response
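
The configuration above relies on the default budget. To tune the two parameters from the question, the budget can be constructed explicitly. This is a minimal sketch: the numbers are illustrative assumptions rather than recommendations, and `budget` and `tunedClient` are hypothetical names.

```scala
import com.twitter.conversions.DurationOps._
import com.twitter.finagle.Http
import com.twitter.finagle.service.RetryBudget

// Illustrative values only (assumptions): start conservative, then adjust after load testing.
val budget: RetryBudget = RetryBudget(
  ttl = 10.seconds,      // how long a request counts toward the budget
  minRetriesPerSec = 5,  // flat allowance, matters most for low-traffic clients
  percentCanRetry = 0.1  // up to 10% of requests may be retried on top of the flat allowance
)

// Same client builder as above, with the explicit budget instead of RetryBudget()
val tunedClient = Http.client.withRetryBudget(budget)
```

Lowering percentCanRetry makes the budget stricter against retry storms, while minRetriesPerSec keeps low-volume clients from being starved of retries.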