status-im / status-go

The Status module that consumes go-ethereum
https://status.im
Mozilla Public License 2.0

RPC client circuit breaking logic improvements #5854

Open friofry opened 2 months ago

friofry commented 2 months ago

tl;dr

Problems with the current implementation:

Table of contents

Providers configuration

We transform WalletSecretsConfig (set on the client side) into params.Networks via setRPCs() in default_networks.go:

func setRPCs(networks []params.Network, request *requests.WalletSecretsConfig) []params.Network {
    var networksWithRPC []params.Network
    for _, n := range networks {
        if request.InfuraToken != "" {
            if strings.Contains(n.RPCURL, "infura") {
                n.RPCURL += request.InfuraToken
            }
            if strings.Contains(n.FallbackURL, "infura") {
                n.FallbackURL += request.InfuraToken
            }
        }
        // ... the same token-appending logic is repeated for grove and ganache

        networksWithRPC = append(networksWithRPC, n)
    }
    return networksWithRPC
}
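For illustration, a hypothetical call (defaultNetworks and the token value are made up). Because the token is appended with +=, this only works if each default URL ends exactly where the token belongs:

cfg := &requests.WalletSecretsConfig{InfuraToken: "abc123"}
networks := setRPCs(defaultNetworks, cfg)
// e.g. "https://mainnet.infura.io/v3/" becomes "https://mainnet.infura.io/v3/abc123"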

Here are some of the issues:

Suggestion

  1. Provider URLs should be moved out of Network into a separate ProviderConfig, with a one-to-many relationship (Network → *ProviderConfig).
  2. Allow different API tokens for different chain IDs.
  3. There should be an explicit priority for each provider (or, if priorities are equal, pick one at random).
  4. We should use all available providers in the circuit breaker when making an RPC call.

This will make the logic more predictable and, as a bonus, make it easier to implement a user-settings UI for custom tokens and providers (e.g. for users whose country DNS-blocks our default providers).
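A minimal sketch of what suggestions 1-3 could look like (all names here, ProviderConfig, providersForChain, etc., are hypothetical, not existing status-go types; needs import "sort"):

type ProviderConfig struct {
    ChainID   uint64 `json:"chainId"`
    URL       string `json:"url"`
    AuthToken string `json:"authToken"` // per-chain API token (suggestion 2)
    Priority  int    `json:"priority"`  // lower value = tried first (suggestion 3)
    Enabled   bool   `json:"enabled"`
}

// providersForChain returns the enabled providers for one chain, ordered by
// priority; entries with equal priority keep their relative order and could
// be shuffled instead to spread the load.
func providersForChain(all []ProviderConfig, chainID uint64) []ProviderConfig {
    var out []ProviderConfig
    for _, p := range all {
        if p.ChainID == chainID && p.Enabled {
            out = append(out, p)
        }
    }
    sort.SliceStable(out, func(i, j int) bool {
        return out[i].Priority < out[j].Priority
    })
    return out
}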

Network type

The Network type has 6 RPC URLs:

    DefaultRPCURL          string          `json:"defaultRpcUrl"`      // proxy rpc url
    DefaultFallbackURL     string          `json:"defaultFallbackURL"` // proxy fallback url
    RPCURL                 string          `json:"rpcUrl"`
    OriginalRPCURL         string          `json:"originalRpcUrl"`
    FallbackURL            string          `json:"fallbackURL"`
    OriginalFallbackURL    string          `json:"originalFallbackURL"`

Currently we use 4 of them, performing RPC attempts in the following order via ethclient.Client, wrapped in a circuit_breaker command with 4 functors (or fewer).
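As a sketch of that order (an assumption based on the field comments: proxy first, then the direct URLs; empty URLs are skipped):

func orderedRPCURLs(n params.Network) []string {
    var urls []string
    // Assumed order: proxy URL, proxy fallback, then the direct RPC and fallback URLs.
    for _, u := range []string{n.DefaultRPCURL, n.DefaultFallbackURL, n.RPCURL, n.FallbackURL} {
        if u != "" {
            urls = append(urls, u)
        }
    }
    return urls // one circuit_breaker functor per non-empty URL, at most 4
}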

Here are some issues:

Suggestion:

networks.go

func (nm *Manager) Find(chainID uint64) *params.Network {
    networks, err := newNetworksQuery().filterChainID(chainID).exec(nm.db)
    if len(networks) != 1 || err != nil {
        return nil
    }
    setDefaultRPCURL(networks, nm.configuredNetworks)
    return networks[0]
}

Suggestion: remove setDefaultRPCURL from these methods
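That is, something like this sketch, where Find stays a pure query and resolving provider URLs becomes an explicit, separate step:

func (nm *Manager) Find(chainID uint64) *params.Network {
    networks, err := newNetworksQuery().filterChainID(chainID).exec(nm.db)
    if err != nil || len(networks) != 1 {
        return nil
    }
    // No setDefaultRPCURL side effect here; callers that need provider
    // URLs resolve them explicitly.
    return networks[0]
}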

RPC client

Currently we use both an RPS limiter in chain/client.go, to make sure we are not hitting provider rate limits:

func (c *ClientWithFallback) makeCall(ctx context.Context, ethClients []*EthClient, f func(client *EthClient) (interface{}, error)) (interface{}, error) {
    if c.commonLimiter != nil {
        if allow, err := c.commonLimiter.Allow(c.tag); !allow {
            return nil, fmt.Errorf("tag=%s, %w", c.tag, err)
        }

        if allow, err := c.commonLimiter.Allow(c.groupTag); !allow {
            return nil, fmt.Errorf("groupTag=%s, %w", c.groupTag, err)
        }
    }

    c.LastCheckedAt = time.Now().Unix()

    cmd := circuitbreaker.NewCommand(ctx, nil)
    for _, provider := range ethClients {
        provider := provider
        cmd.Add(circuitbreaker.NewFunctor(func() ([]interface{}, error) {
            err := provider.limiter.WaitForRequestsAvailability(1)
            ...

and circuit_breaker.go, to try multiple providers before giving up:

func (cb *CircuitBreaker) Execute(cmd *Command) CommandResult {
    if cmd == nil || cmd.IsEmpty() {
        return CommandResult{err: fmt.Errorf("command is nil or empty")}
    }

    var result CommandResult
    ctx := cmd.ctx
    if ctx == nil {
        ctx = context.Background()
    }

    for i, f := range cmd.functors {
        if cmd.cancel {
            break
        }

        var err error
        // if last command, execute without circuit
        if i == len(cmd.functors)-1 {
            res, execErr := f.exec()
            err = execErr
            if err == nil {
                result = CommandResult{res: res}
            }
        } else {
            circuitName := f.circuitName
            if cb.circuitNameHandler != nil {
                circuitName = cb.circuitNameHandler(circuitName)
            }

            if hystrix.GetCircuitSettings()[circuitName] == nil {
                hystrix.ConfigureCommand(circuitName, hystrix.CommandConfig{
                    Timeout:                cb.config.Timeout,
                    MaxConcurrentRequests:  cb.config.MaxConcurrentRequests,
                    RequestVolumeThreshold: cb.config.RequestVolumeThreshold,
                    SleepWindow:            cb.config.SleepWindow,
                    ErrorPercentThreshold:  cb.config.ErrorPercentThreshold,
                })
            }

            err = hystrix.DoC(ctx, circuitName, func(ctx context.Context) error {

So first we pass a functor to hystrix.DoC(), which checks provider.limiter.WaitForRequestsAvailability(1) and may return ErrRequestsOverLimit without ever calling the RPC provider; that failure may still increase the circuit breaker's error percentage.
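One possible direction, as a self-contained sketch (not the repo's code; rate.Limiter from golang.org/x/time/rate stands in for the repo's request limiter): check the local budget before handing a functor to the circuit breaker, so a local rejection never counts as a provider failure:

// provider pairs an RPC endpoint with its local rate limiter.
type provider struct {
    name    string
    limiter *rate.Limiter
}

// buildCalls returns one call closure per provider that still has local RPS
// budget. Providers over their limit are skipped up front, so a local
// rejection never shows up as a circuit-breaker error.
func buildCalls(providers []provider, call func(p provider) error) []func() error {
    var fns []func() error
    for _, p := range providers {
        p := p // capture the loop variable (pre-Go 1.22)
        if !p.limiter.Allow() {
            continue // over the local limit: skip instead of failing inside the circuit
        }
        fns = append(fns, func() error { return call(p) })
    }
    return fns
}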

hystrix records the response status of each service call; once the error rate reaches the threshold, the breaker opens, and the fallback logic executes until the breaker status changes back to closed.

Here is the config we're using:

    cbConfig := circuitbreaker.Config{
        Timeout:               20000,  // how long to wait for a command to complete, in milliseconds
        MaxConcurrentRequests: 100,    // how many commands of the same type can run at the same time
        SleepWindow:           300000, // how long, in milliseconds, to wait after a circuit opens before testing for recovery
        ErrorPercentThreshold: 25,     // circuits open once the rolling measure of errors exceeds this percent of requests
    }
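Concretely, with these values: once at least 25% of the requests in hystrix's rolling window (10 seconds by default) fail, the circuit opens and every call to that provider fails fast for the full 5-minute SleepWindow, after which a single probe request decides whether the circuit closes again.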

Here are some things that can be improved:

Suggestions:

Reporting provider state

func (c *ClientWithFallback) toggleConnectionState(err error) {
    connected := true
    if err != nil {
        if !isNotFoundError(err) && !isVMError(err) && !errors.Is(err, ErrRequestsOverLimit) && !errors.Is(err, context.Canceled) {
            log.Warn("Error not in chain call", "error", err, "chain", c.ChainID)
            connected = false
        } else {
            log.Warn("Error in chain call", "error", err)
        }
    }
    c.SetIsConnected(connected)
}
        // ...the resulting state change is then broadcast to clients:
        feed.Send(walletevent.Event{
            Type:     EventBlockchainStatusChanged,
            Accounts: []common.Address{},
            Message:  string(encodedMessage),
            At:       time.Now().Unix(),
            ChainID:  chainID,
        })

When a provider is not "connected", the client gets an EventBlockchainStatusChanged event with a "down" message. Mobile just logs the event (previously there was a toast). Desktop has more complex logic and keeps track of the last success timestamp:

proc getStateValue(message: string): connection_status_backend.StateValue =
    if message == "down":
      return connection_status_backend.StateValue.Disconnected
    elif message == "up":
      return connection_status_backend.StateValue.Connected
    else:
      return connection_status_backend.StateValue.Unknown

  proc getChainStatusTable(message: string): ConnectionStatusNotification =
    result = initCustomStatusNotification()

    let chainStatusTable = parseJson(message)
    if chainStatusTable.kind != JNull:
      for k, v in chainStatusTable.pairs:
        result[k] = connection_status_backend.initConnectionState(
          value = getStateValue(v.getStr)
        )

  proc getChainIdsDown(self: Service, chainStatusTable: ConnectionStatusNotification): (bool, bool, seq[int], int) =
    var allKnown: bool = true
    var allDown: bool = true
    var chaindIdsDown: seq[int] = @[]
    var lastSuccessAt: int = connection_status_backend.INVALID_TIMESTAMP # latest successful connection among the down chains

    let allChainIds = self.networkService.getCurrentNetworksChainIds()
    for id in allChainIds:
      if chainStatusTable.hasKey($id) and chainStatusTable[$id].value != connection_status_backend.StateValue.Unknown:
        if chainStatusTable[$id].value == connection_status_backend.StateValue.Connected:
          allDown = false
        else:
          chaindIdsDown.add(id)
          lastSuccessAt = max(lastSuccessAt, chainStatusTable[$id].lastSuccessAt)
      else:
        allKnown = false

    return (allKnown, allDown, chaindIdsDown, lastSuccessAt)

Errors are concatenated:

func accumulateCommandError(result CommandResult, circuitName string, err error) CommandResult {
    // Accumulate errors
    if result.err != nil {
        result.err = fmt.Errorf("%w, %s.error: %w", result.err, circuitName, err)
    } else {
        result.err = fmt.Errorf("%s.error: %w", circuitName, err)
    }
    return result
}
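A possible alternative (a sketch, not from the issue): since Go 1.20, errors.Join would keep the accumulated provider errors individually matchable with errors.Is / errors.As instead of flattening them into one formatted string:

func accumulateCommandError(result CommandResult, circuitName string, err error) CommandResult {
    // errors.Join preserves each wrapped error for errors.Is / errors.As;
    // fmt.Errorf("%w") keeps the circuit name attached to its error.
    result.err = errors.Join(result.err, fmt.Errorf("%s: %w", circuitName, err))
    return result
}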

Here are some issues:

Suggestions:

Links:

alaibe commented 2 months ago

It's not clear why we need OriginalRPCURL and OriginalFallbackURL in this datatype. Presumably the user needs to be able to revert to the original RPC URL after changing it.

It would be more reliable to introduce the RPS-limiting logic on the status proxy side. I would even say most of it should be moved to the proxy itself.

From a status-go point of view we should only have the RPC URL and the original one (maybe in constants). The proxy should be the one handling the fallback, etc.

friofry commented 2 months ago

@dlipicar what do you think of the proposed changes?

friofry commented 2 months ago

It's not clear why we need OriginalRPCURL and OriginalFallbackURL in this datatype. Presumably the user needs to be able to revert to the original RPC URL after changing it.

It would be more reliable to introduce the RPS-limiting logic on the status proxy side. I would even say most of it should be moved to the proxy itself.

From a status-go point of view we should only have the RPC URL and the original one (maybe in constants). The proxy should be the one handling the fallback, etc.

Removing both rps_limiter and circuit_breaker and moving the provider logic to the proxy sounds better. (Basically, this is what the proxy design pattern means.)

But in this case we should probably still:

alaibe commented 2 months ago

Agree with your 2 points. And "allow users to override the web3 client with their own URLs/credentials" is already the case on desktop. That would simplify the UI :)