Also, in case the failout feature needs to become (D2) service-level in the future, we may want to consider leaving the implementation extensible and avoiding "cluster-specific" naming/assumptions? (For example, the keys in this map may include service names in the future.)
Based on our current infra, we are unlikely to need D2 service-level failout. Products are deployed at the cluster level, and it's unlikely that we would fail out a single service on its own today. If we do need to support that in the future, I believe we will need additional changes anyway, as the config properties/parsing will be very different. Not sure what our timeline is for next-gen D2; we may not even get to service-level failout before next-gen D2. I feel it's better to keep it simple for the moment.
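For illustration only, the cluster-keyed shape under discussion might look roughly like the sketch below. The names (`FailoutConfigSketch`, `peerClusters`, etc.) are assumptions for this sketch, not the actual `FailoutConfig` schema.

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; the real FailoutConfig schema may differ.
public final class FailoutConfigShapeSketch {

  /** Stand-in for a single cluster's failout settings. */
  record FailoutConfigSketch(boolean failedOut, List<String> peerClusters) {}

  public static void main(String[] args) {
    // Today the map is keyed by cluster name (products are deployed per cluster);
    // a future service-level failout could instead key by service name.
    Map<String, FailoutConfigSketch> failoutByCluster = Map.of(
        "MyCluster", new FailoutConfigSketch(true, List.of("MyCluster-peer")));

    failoutByCluster.forEach((cluster, cfg) ->
        System.out.println(cluster + " failedOut=" + cfg.failedOut()
            + " peers=" + cfg.peerClusters()));
  }
}
```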
Sure. Just wanted to mention it to give it a thought. Thanks.
In order to make our infrastructure more resilient, we'd like to have the ability to fail out clusters individually so that we can mitigate more concurrent cluster-level issues.
- A `FailoutConfig` entry will be added to `ClusterStoreProperties` so that we can leverage existing ZooKeeper watches to propagate failout signals to all clients watching the cluster.
- A `LoadBalancerClusterListener` will be registered with `SimpleLoadBalancerState` by `FailoutConfigProvider` to watch for failout property changes.
- `FailedoutClusterManager` handles registering watches on the peer clusters and warming up connections to them (a rough sketch of this flow follows below).
- The failout configs are exposed through `ClusterInfoProvider` so that `D2Client`s can read them and perform reroutes.
- A `D2ClientDelegator`, `FailoutClient`, is created to handle re-routing requests for failed-out clusters (also sketched below).
- The redirection is handled by a `FailoutRedirectStrategy`, which needs to be provided via `D2ClientConfig`.
- The change has been tested end to end against a sandbox service.
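To make the listener-to-manager flow concrete, here is a minimal sketch. The types below (`ClusterListener`, `ClusterProperties`, `FailedoutClusterManagerSketch`) are stand-ins invented for the sketch; the actual `LoadBalancerClusterListener` and `FailedoutClusterManager` APIs in the change may look quite different.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical sketch (not the rest.li API): a cluster listener feeds failout
// updates to a per-cluster manager that prepares the peer clusters.
public final class FailoutListenerSketch {

  /** Stand-in for the failout portion of a cluster's stored properties. */
  record ClusterProperties(String clusterName, Set<String> peerClusters, boolean failedOut) {}

  /** Stand-in for a LoadBalancerClusterListener-style callback. */
  interface ClusterListener {
    void onClusterChanged(ClusterProperties properties);
  }

  /** Per-cluster manager: reacts to failout flips by preparing the peer clusters. */
  static final class FailedoutClusterManagerSketch implements ClusterListener {
    private final Map<String, Boolean> warmedUpPeers = new ConcurrentHashMap<>();
    private final Consumer<String> warmUp; // e.g. open connections / register a watch

    FailedoutClusterManagerSketch(Consumer<String> warmUp) {
      this.warmUp = warmUp;
    }

    @Override
    public void onClusterChanged(ClusterProperties properties) {
      if (!properties.failedOut()) {
        return; // nothing to do until the cluster is failed out
      }
      for (String peer : properties.peerClusters()) {
        // Warm up each peer once so redirected traffic does not pay a cold-start cost.
        warmedUpPeers.computeIfAbsent(peer, p -> { warmUp.accept(p); return true; });
      }
    }
  }

  public static void main(String[] args) {
    ClusterListener manager =
        new FailedoutClusterManagerSketch(peer -> System.out.println("warming up " + peer));

    // Simulate the ZooKeeper-driven property update that flips the failout flag.
    manager.onClusterChanged(
        new ClusterProperties("MyCluster", Set.of("MyCluster-peer"), true));
  }
}
```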
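Similarly, a minimal sketch of the delegating-client flow: a wrapper checks the failout state and rewrites the request URI to a peer cluster before delegating. `SimpleClient`, `FailoutState`, and `redirect` are illustrative stand-ins, not the actual `FailoutClient` / `FailoutRedirectStrategy` API; in particular, real D2 resolves the cluster from the service in the URI rather than treating the authority as a cluster name.

```java
import java.net.URI;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch (not the rest.li API): illustrates the delegation +
// redirect flow described in the list above.
public final class FailoutClientSketch {

  /** Minimal stand-in for a D2-style client that resolves d2:// URIs. */
  interface SimpleClient {
    String get(URI uri);
  }

  /** Stand-in for the per-cluster failout state kept by the config provider. */
  static final class FailoutState {
    private final Map<String, String> failedOutClusterToPeer = new ConcurrentHashMap<>();

    void failOut(String cluster, String peerCluster) {
      failedOutClusterToPeer.put(cluster, peerCluster);
    }

    Optional<String> peerFor(String cluster) {
      return Optional.ofNullable(failedOutClusterToPeer.get(cluster));
    }
  }

  /** Stand-in redirect strategy: rewrites the URI authority to the peer cluster. */
  static URI redirect(URI original, String peerCluster) {
    return URI.create(original.getScheme() + "://" + peerCluster + original.getRawPath());
  }

  /** Delegator that reroutes requests for failed-out clusters before delegating. */
  static SimpleClient failoutAware(SimpleClient delegate, FailoutState state) {
    return uri -> {
      // Simplification: real D2 resolves the cluster from the service in the URI.
      String cluster = uri.getAuthority();
      URI target = state.peerFor(cluster).map(peer -> redirect(uri, peer)).orElse(uri);
      return delegate.get(target);
    };
  }

  public static void main(String[] args) {
    FailoutState state = new FailoutState();
    state.failOut("MyCluster", "MyCluster-peer");

    SimpleClient underlying = uri -> "served " + uri; // pretend backend
    SimpleClient client = failoutAware(underlying, state);

    // Requests to the failed-out cluster are transparently rerouted to the peer.
    System.out.println(client.get(URI.create("d2://MyCluster/greetings/1")));
    // -> served d2://MyCluster-peer/greetings/1
  }
}
```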