microsoft / reverse-proxy

A toolkit for developing high-performance HTTP reverse proxy applications.
https://microsoft.github.io/reverse-proxy
MIT License
8.6k stars 843 forks source link

IProxyStateLookup and ClusterState.DestinationsState.AvailableDestinations - behavior change after upgrade from 1.1.1 -> 2.1.0 #2602

Closed 4865783a5d closed 2 months ago

4865783a5d commented 2 months ago

Describe the bug

We are using IProxyStateLookup in a Asp.Net Core Healthcheck to check if a specific cluster is healthy using the cluster id as query param in a Azure environment with Azure Front Door and YARP as regional proxy. After upgrading from .NET 6 and YARP 1.1.1 to .NET 8 and YARP 2.1.0, AvailableDestinations contains an unhealthy destination, where in the previous version, that would not be the case. We checked breaking changes by going through the release notes and couldn't find anything.

To Reproduce

The configuration uses a simple json-server

"Configuration": {
  "Routes": {
    "json-server": {
      "ClusterId": "json-server",
      "Match": {
        "Hosts": [ "localhost" ]
      }
    }
  },
  "Clusters": {
    "json-server": {
      "HealthCheck": {
        "Active": {
          "Enabled": "true",
          "Interval": "00:00:05",
          "Timeout": "00:00:03",
          "Policy": "ConsecutiveFailures",
          "Path": "/health"
        }
      },
      "Destinations": {
        "local": {
          "Address": "http://localhost:3000"
        }
      },
      "Metadata": {
        "ConsecutiveFailuresHealthPolicy.Threshold": "2"
      }
    }
  }
}

When calling the exposed healthcheck with http://localhost:7014/health?clusterId=json-server we get the following results (The cluster is not running at that time):

With YARP 2.1.0 and .NET8 image

With YARP 1.1.0 and .NET6 image

Further technical details

Relevant health check code in IHealthcheck service

if (_httpContextAccessor.HttpContext == null)
    throw new InvalidOperationException("HttpContext is null.");

var httpRequestFeature = _httpContextAccessor.HttpContext.GetHttpRequestFeature();
var clusters = _proxyStateLookup.GetClusters();

if (!clusters.Any())
    return Task.FromResult(new HealthCheckResult(HealthStatus.Unhealthy, HealthProbeErrorDescriber.NoConfig));

if (TryGetClusterId(httpRequestFeature.QueryString, out var clusterId))
{
    if (!_proxyStateLookup.TryGetCluster(clusterId, out var clusterState))
    {
        return Task.FromResult(new HealthCheckResult(HealthStatus.Unhealthy,
            HealthProbeErrorDescriber.FormatClusterNotFound(clusterId)));
    }

    return Task.FromResult(clusterState.DestinationsState.AvailableDestinations.Count > 0
        ? HealthCheckResult.Healthy()
        : new(HealthStatus.Unhealthy, HealthProbeErrorDescriber.FormatUnhealthyCluster(clusterId)));
}
4865783a5d commented 2 months ago

After reviewing the YARP source and re-checking release notes, I found that https://github.com/microsoft/reverse-proxy/pull/2171 is the reason we're seeing this behavior.

Relevant code in HealthyOrPanicDestinationsPolicy:

 public override IReadOnlyList<DestinationState> GetAvailalableDestinations(ClusterConfig config, IReadOnlyList<DestinationState> allDestinations)
 {
     var availableDestinations = base.GetAvailalableDestinations(config, allDestinations);
     return availableDestinations.Count > 0 ? availableDestinations : allDestinations;
 }

While understandable, it has quite an impact on the operating side of things. We typically want to be informed if a cluster is unhealthy for the sake of alerting / monitoring. I personally would have classified that as breaking change :) - as in, it broke our monitoring.

We fixed the issue with a IProxyConfigFilter to set the 1.1.1 default policy on all cluster configs.

MihaZupan commented 2 months ago

Sorry about that, you can see that the PR https://github.com/microsoft/reverse-proxy/pull/2171 was marked as a breaking change, but we forgot to call out breaking changes in the 2.1.0 preview 1 release notes.

I've updated the release notes for that release now.