status-im / status-go

The Status module that consumes go-ethereum
https://status.im
Mozilla Public License 2.0

RPC client circuit breaking logic improvements #5854

Open friofry opened 2 months ago

friofry commented 2 months ago

tl;dr

Problems with the current implementation:

Table of contents

Providers configuration

We transform WalletSecretsConfig (set on the client side) into params.Networks via setRPCs() in default_networks.go:

func setRPCs(networks []params.Network, request *requests.WalletSecretsConfig) []params.Network {
    var networksWithRPC []params.Network
    for _, n := range networks {
        if request.InfuraToken != "" {
            if strings.Contains(n.RPCURL, "infura") {
                n.RPCURL += request.InfuraToken
            }
            if strings.Contains(n.FallbackURL, "infura") {
                n.FallbackURL += request.InfuraToken
            }
        }
        // ... the same token-appending logic is repeated for grove and ganache

        networksWithRPC = append(networksWithRPC, n)
    }
    return networksWithRPC
}
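For illustration, a hypothetical call (defaultNetworks and the token value are made up). Because the token is appended with +=, this only works if each default URL ends exactly where the token belongs:

cfg := &requests.WalletSecretsConfig{InfuraToken: "abc123"}
networks := setRPCs(defaultNetworks, cfg)
// e.g. "https://mainnet.infura.io/v3/" becomes "https://mainnet.infura.io/v3/abc123"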

Here are some of the issues:

Suggestion

  1. Provider URLs should be moved out of Network into a separate ProviderConfig, with a one-to-many relationship (Network → *ProviderConfig).
  2. Allow different API tokens for different chain IDs.
  3. There should be an explicit priority for each provider (or, if priorities are equal, pick one at random).
  4. We should use all available providers in the circuit breaker when making an RPC call.

This will make the logic more predictable and, as a bonus, make it easier to implement a user-settings UI for custom tokens and providers (e.g. for users whose country DNS-blocks our default providers).
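A minimal sketch of what suggestions 1-3 could look like (all names here, ProviderConfig, providersForChain, etc., are hypothetical, not existing status-go types; needs import "sort"):

type ProviderConfig struct {
    ChainID   uint64 `json:"chainId"`
    URL       string `json:"url"`
    AuthToken string `json:"authToken"` // per-chain API token (suggestion 2)
    Priority  int    `json:"priority"`  // lower value = tried first (suggestion 3)
    Enabled   bool   `json:"enabled"`
}

// providersForChain returns the enabled providers for one chain, ordered by
// priority; entries with equal priority keep their relative order and could
// be shuffled instead to spread the load.
func providersForChain(all []ProviderConfig, chainID uint64) []ProviderConfig {
    var out []ProviderConfig
    for _, p := range all {
        if p.ChainID == chainID && p.Enabled {
            out = append(out, p)
        }
    }
    sort.SliceStable(out, func(i, j int) bool {
        return out[i].Priority < out[j].Priority
    })
    return out
}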

Network type

The Network type has 6 RPC URLs:

    DefaultRPCURL          string          `json:"defaultRpcUrl"`      // proxy rpc url
    DefaultFallbackURL     string          `json:"defaultFallbackURL"` // proxy fallback url
    RPCURL                 string          `json:"rpcUrl"`
    OriginalRPCURL         string          `json:"originalRpcUrl"`
    FallbackURL            string          `json:"fallbackURL"`
    OriginalFallbackURL    string          `json:"originalFallbackURL"`

Currently we use 4 of them, performing RPC attempts in the following order via ethclient.Client, wrapped in a circuit_breaker command with 4 functors (or fewer).
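As a sketch of that order (an assumption based on the field comments: proxy first, then the direct URLs; empty URLs are skipped):

func orderedRPCURLs(n params.Network) []string {
    var urls []string
    // Assumed order: proxy URL, proxy fallback, then the direct RPC and fallback URLs.
    for _, u := range []string{n.DefaultRPCURL, n.DefaultFallbackURL, n.RPCURL, n.FallbackURL} {
        if u != "" {
            urls = append(urls, u)
        }
    }
    return urls // one circuit_breaker functor per non-empty URL, at most 4
}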

Here are some issues:

Suggestion:

networks.go

func (nm *Manager) Find(chainID uint64) *params.Network {
    networks, err := newNetworksQuery().filterChainID(chainID).exec(nm.db)
    if len(networks) != 1 || err != nil {
        return nil
    }
    setDefaultRPCURL(networks, nm.configuredNetworks)
    return networks[0]
}

Suggestion: remove setDefaultRPCURL from these methods
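That is, something like this sketch, where Find stays a pure query and resolving provider URLs becomes an explicit, separate step:

func (nm *Manager) Find(chainID uint64) *params.Network {
    networks, err := newNetworksQuery().filterChainID(chainID).exec(nm.db)
    if err != nil || len(networks) != 1 {
        return nil
    }
    // No setDefaultRPCURL side effect here; callers that need provider
    // URLs resolve them explicitly.
    return networks[0]
}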

RPC client

Currently we use both an RPS limiter in chain/client.go, to make sure we are not hitting provider rate limits:

func (c *ClientWithFallback) makeCall(ctx context.Context, ethClients []*EthClient, f func(client *EthClient) (interface{}, error)) (interface{}, error) {
    if c.commonLimiter != nil {
        if allow, err := c.commonLimiter.Allow(c.tag); !allow {
            return nil, fmt.Errorf("tag=%s, %w", c.tag, err)
        }

        if allow, err := c.commonLimiter.Allow(c.groupTag); !allow {
            return nil, fmt.Errorf("groupTag=%s, %w", c.groupTag, err)
        }
    }

    c.LastCheckedAt = time.Now().Unix()

    cmd := circuitbreaker.NewCommand(ctx, nil)
    for _, provider := range ethClients {
        provider := provider
        cmd.Add(circuitbreaker.NewFunctor(func() ([]interface{}, error) {
            err := provider.limiter.WaitForRequestsAvailability(1)
            ...

and circuit_breaker.go, to try multiple providers before giving up:

func (cb *CircuitBreaker) Execute(cmd *Command) CommandResult {
    if cmd == nil || cmd.IsEmpty() {
        return CommandResult{err: fmt.Errorf("command is nil or empty")}
    }

    var result CommandResult
    ctx := cmd.ctx
    if ctx == nil {
        ctx = context.Background()
    }

    for i, f := range cmd.functors {
        if cmd.cancel {
            break
        }

        var err error
        // if last command, execute without circuit
        if i == len(cmd.functors)-1 {
            res, execErr := f.exec()
            err = execErr
            if err == nil {
                result = CommandResult{res: res}
            }
        } else {
            circuitName := f.circuitName
            if cb.circuitNameHandler != nil {
                circuitName = cb.circuitNameHandler(circuitName)
            }

            if hystrix.GetCircuitSettings()[circuitName] == nil {
                hystrix.ConfigureCommand(circuitName, hystrix.CommandConfig{
                    Timeout:                cb.config.Timeout,
                    MaxConcurrentRequests:  cb.config.MaxConcurrentRequests,
                    RequestVolumeThreshold: cb.config.RequestVolumeThreshold,
                    SleepWindow:            cb.config.SleepWindow,
                    ErrorPercentThreshold:  cb.config.ErrorPercentThreshold,
                })
            }

            err = hystrix.DoC(ctx, circuitName, func(ctx context.Context) error {

So first we pass a functor to hystrix.DoC(), which checks provider.limiter.WaitForRequestsAvailability(1) and may return ErrRequestsOverLimit without ever calling the RPC provider; that failure may still increase the circuit breaker's error percentage.
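One possible direction, as a self-contained sketch (not the repo's code; rate.Limiter from golang.org/x/time/rate stands in for the repo's request limiter): check the local budget before handing a functor to the circuit breaker, so a local rejection never counts as a provider failure:

// provider pairs an RPC endpoint with its local rate limiter.
type provider struct {
    name    string
    limiter *rate.Limiter
}

// buildCalls returns one call closure per provider that still has local RPS
// budget. Providers over their limit are skipped up front, so a local
// rejection never shows up as a circuit-breaker error.
func buildCalls(providers []provider, call func(p provider) error) []func() error {
    var fns []func() error
    for _, p := range providers {
        p := p // capture the loop variable (pre-Go 1.22)
        if !p.limiter.Allow() {
            continue // over the local limit: skip instead of failing inside the circuit
        }
        fns = append(fns, func() error { return call(p) })
    }
    return fns
}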

hystrix records the response status of each service call; once the error rate reaches the threshold, the breaker opens, and the fallback logic executes until the breaker status changes back to closed.

Here is the config we're using:

    cbConfig := circuitbreaker.Config{
        Timeout:               20000,  // how long to wait for a command to complete, in milliseconds
        MaxConcurrentRequests: 100,    // how many commands of the same type can run at the same time
        SleepWindow:           300000, // how long, in milliseconds, to wait after a circuit opens before testing for recovery
        ErrorPercentThreshold: 25,     // circuits open once the rolling measure of errors exceeds this percent of requests
    }
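Concretely, with these values: once at least 25% of the requests in hystrix's rolling window (10 seconds by default) fail, the circuit opens and every call to that provider fails fast for the full 5-minute SleepWindow, after which a single probe request decides whether the circuit closes again.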

Here are some things that can be improved:

Suggestions:

Reporting provider state

func (c *ClientWithFallback) toggleConnectionState(err error) {
    connected := true
    if err != nil {
        if !isNotFoundError(err) && !isVMError(err) && !errors.Is(err, ErrRequestsOverLimit) && !errors.Is(err, context.Canceled) {
            log.Warn("Error not in chain call", "error", err, "chain", c.ChainID)
            connected = false
        } else {
            log.Warn("Error in chain call", "error", err)
        }
    }
    c.SetIsConnected(connected)
}
        // ...the resulting state change is then broadcast to clients:
        feed.Send(walletevent.Event{
            Type:     EventBlockchainStatusChanged,
            Accounts: []common.Address{},
            Message:  string(encodedMessage),
            At:       time.Now().Unix(),
            ChainID:  chainID,
        })

When a provider is not "connected", the client gets an EventBlockchainStatusChanged event with a "down" message. Mobile just logs the event (previously there was a toast). Desktop has more complex logic and keeps track of the last success timestamp:

proc getStateValue(message: string): connection_status_backend.StateValue =
    if message == "down":
      return connection_status_backend.StateValue.Disconnected
    elif message == "up":
      return connection_status_backend.StateValue.Connected
    else:
      return connection_status_backend.StateValue.Unknown

  proc getChainStatusTable(message: string): ConnectionStatusNotification =
    result = initCustomStatusNotification()

    let chainStatusTable = parseJson(message)
    if chainStatusTable.kind != JNull:
      for k, v in chainStatusTable.pairs:
        result[k] = connection_status_backend.initConnectionState(
          value = getStateValue(v.getStr)
        )

  proc getChainIdsDown(self: Service, chainStatusTable: ConnectionStatusNotification): (bool, bool, seq[int], int) =
    var allKnown: bool = true
    var allDown: bool = true
    var chaindIdsDown: seq[int] = @[]
    var lastSuccessAt: int = connection_status_backend.INVALID_TIMESTAMP # latest successful connection among the down chains

    let allChainIds = self.networkService.getCurrentNetworksChainIds()
    for id in allChainIds:
      if chainStatusTable.hasKey($id) and chainStatusTable[$id].value != connection_status_backend.StateValue.Unknown:
        if chainStatusTable[$id].value == connection_status_backend.StateValue.Connected:
          allDown = false
        else:
          chaindIdsDown.add(id)
          lastSuccessAt = max(lastSuccessAt, chainStatusTable[$id].lastSuccessAt)
      else:
        allKnown = false

    return (allKnown, allDown, chaindIdsDown, lastSuccessAt)

Errors are concatenated:

func accumulateCommandError(result CommandResult, circuitName string, err error) CommandResult {
    // Accumulate errors
    if result.err != nil {
        result.err = fmt.Errorf("%w, %s.error: %w", result.err, circuitName, err)
    } else {
        result.err = fmt.Errorf("%s.error: %w", circuitName, err)
    }
    return result
}
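A possible alternative (a sketch, not from the issue): since Go 1.20, errors.Join would keep the accumulated provider errors individually matchable with errors.Is / errors.As instead of flattening them into one formatted string:

func accumulateCommandError(result CommandResult, circuitName string, err error) CommandResult {
    // errors.Join preserves each wrapped error for errors.Is / errors.As;
    // fmt.Errorf("%w") keeps the circuit name attached to its error.
    result.err = errors.Join(result.err, fmt.Errorf("%s: %w", circuitName, err))
    return result
}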

Here are some issues:

Suggestions:

Links:

alaibe commented 2 months ago

It's not clear why we need OriginalRPCURL and OriginalFallbackURL in this datatype. Presumably the user needs to be able to revert to the original RPC URL after changing it.

It would be more reliable to introduce the RPS-limiting logic on the status proxy side. I would even say most of it should be moved to the proxy itself.

From a status-go point of view we should only have the RPC URL and the original one (maybe in constants). The proxy should be the one handling the fallback, etc.

friofry commented 2 months ago

@dlipicar what do you think of the proposed changes?

friofry commented 2 months ago

It's not clear why we need OriginalRPCURL and OriginalFallbackURL in this datatype. Presumably the user needs to be able to revert to the original RPC URL after changing it.

It would be more reliable to introduce the RPS-limiting logic on the status proxy side. I would even say most of it should be moved to the proxy itself.

From a status-go point of view we should only have the RPC URL and the original one (maybe in constants). The proxy should be the one handling the fallback, etc.

Removing both rps_limiter and circuit_breaker and moving the provider logic to the proxy sounds better. (Basically, this is what the proxy design pattern means.)

But in this case we should probably still:

alaibe commented 2 months ago

Agree with your 2 points. And "allow users to override the web3 client with their own URLs/credentials" is already the case on desktop. That would simplify the UI :)