Request to have metrics emitted on health check failures

timmydo commented 1 year ago

What should we add or change to make your life better?

Requesting metrics in some of the default classes such as: https://github.com/microsoft/reverse-proxy/blob/04b1fe8d8cd0f25b814805c2fb91c68fb26c7f29/src/ReverseProxy/Health/DestinationHealthUpdater.cs#L43

https://github.com/microsoft/reverse-proxy/blob/04b1fe8d8cd0f25b814805c2fb91c68fb26c7f29/src/ReverseProxy/Health/ActiveHealthCheckMonitor.cs#L197C31-L197C31

https://learn.microsoft.com/en-us/dotnet/core/diagnostics/metrics-collection

Specifically, today, I'm looking for a way to have alerts go off when our health checks fail without reimplementing YARP's default implementations. More generally it would be nice if YARP had metrics to complement the Log statements.

Why is this important to you?

We are able to set up alerts on metrics but not on individual log messages. Metrics can alert the on-call engineers to look at the logs for further information.

timmydo commented 1 year ago

here is a quick list I put together by searching for Log. and removing some that didn't look relevant:

src\ReverseProxy\Configuration\ConfigProvider\ConfigurationConfigProvider.cs:
   70:             Log.LoadData(_logger);

   88:                 Log.ConfigurationDataConversionFailed(_logger, ex);

  110:                 Log.ErrorSignalingChange(_logger, ex);

src\ReverseProxy\Forwarder\ForwarderHttpClientFactory.cs:
  40:             Log.ClientReused(_logger, context.ClusterId);

  60:         Log.ClientCreated(_logger, context.ClusterId);

src\ReverseProxy\Forwarder\ForwarderMiddleware.cs:
  50:             Log.NoAvailableDestinations(_logger, cluster.ClusterId);

  62:             Log.MultipleDestinationsAvailable(_logger, cluster.ClusterId);

  79:             ForwarderTelemetry.Log.ForwarderInvoke(cluster.ClusterId, route.Config.RouteId, destination.DestinationId);

src\ReverseProxy\Forwarder\HttpForwarder.cs:
  150:                     Log.NotProxying(_logger, context.Response.StatusCode);

  154:                 Log.Proxying(_logger, destinationRequest, isStreamingRequest);

  171:                         Log.RetryingWebSocketDowngradeNoConnect(_logger);

  179:                         Log.RetryingWebSocketDowngradeNoHttp2(_logger);

  214:             Log.ResponseReceived(_logger, destinationResponse);

  490:                 Log.InvalidSecWebSocketKeyHeader(_logger, key);

  922:         Log.ErrorProxying(_logger, error, ex);

src\ReverseProxy\Health\ActiveHealthCheckMonitor.cs:
   61:                 Log.ExplicitActiveCheckOfAllClustersHealthFailed(_logger, ex);

  116:         Log.StartingActiveHealthProbingOnCluster(_logger, cluster.ClusterId);

  142:             Log.ActiveHealthProbingFailedOnCluster(_logger, cluster.ClusterId, ex);

  156:                 Log.ErrorOccuredDuringActiveHealthProbingShutdownOnCluster(_logger, cluster.ClusterId, ex);
  158  
  159:             Log.StoppedActiveHealthProbingOnCluster(_logger, cluster.ClusterId);

  176:             Log.ActiveHealthProbeConstructionFailedOnCluster(_logger, destination.DestinationId, cluster.ClusterId, ex);
  177  

  187:             Log.SendingHealthProbeToEndpointOfDestination(_logger, request.RequestUri, destination.DestinationId, cluster.ClusterId);
  189:             Log.DestinationProbingCompleted(_logger, destination.DestinationId, cluster.ClusterId, (int)response.StatusCode);

  197:             Log.DestinationProbingFailed(_logger, destination.DestinationId, cluster.ClusterId, ex);

src\ReverseProxy\Health\DestinationHealthUpdater.cs:
   43:                     Log.ActiveDestinationHealthStateIsSetToUnhealthy(_logger, destination.DestinationId, cluster.ClusterId);

   47:                     Log.ActiveDestinationHealthStateIsSet(_logger, destination.DestinationId, cluster.ClusterId, newHealth);

   85:             Log.UnhealthyDestinationIsScheduledForReactivation(_logger, destination.DestinationId, reactivationPeriod);

  100:             Log.PassiveDestinationHealthResetToUnkownState(_logger, destination.DestinationId);

src\ReverseProxy\LoadBalancing\LoadBalancingMiddleware.cs:
  59:             Log.NoAvailableDestinations(_logger, proxyFeature.Cluster.Config.ClusterId);

src\ReverseProxy\Model\ProxyPipelineInitializerMiddleware.cs:
  40:             Log.NoClusterFound(_logger, route.Config.RouteId);

src\ReverseProxy\WebSocketsTelemetry\WebSocketsTelemetryMiddleware.cs:
  56:                 WebSocketsTelemetry.Log.WebSocketClosed(

  80:                 WebSocketsTelemetry.Log.WebSocketClosed(

adityamandaleeka commented 1 year ago

@timmydo Thanks for reporting this. We believe there's value in making some of these metrics, but perhaps not all of them (e.g. error conditions that should really be addressed immediately).

If you could provide a list of the ones that are the highest priority to add metrics for, that would be great.

adityamandaleeka commented 1 year ago

Triage: perhaps another interesting metric might be a catch-all "this request couldn't be proxied" metric.

timmydo commented 1 year ago

Counters for these files above look to be higher priority, with higher priority being to logs that would be error or warning:

src\ReverseProxy\Forwarder\ForwarderMiddleware.cs
src\ReverseProxy\Health\DestinationHealthUpdater.cs
src\ReverseProxy\Health\ActiveHealthCheckMonitor.cs
src\ReverseProxy\LoadBalancing\LoadBalancingMiddleware.cs

having counters for successful requests is useful also to validate whether no errors is a data-missing scenario or no errors. Also, you can measure the reliability.

microsoft / reverse-proxy

Request to have metrics emitted on health check failures #2255

What should we add or change to make your life better?

Why is this important to you?