microsoft / azure-container-apps

Roadmap and issues for Azure Container Apps
MIT License

Feature Request: Publish health probe results as Azure Monitor metrics #541

Open maskati opened 1 year ago

maskati commented 1 year ago

Is your feature request related to a problem? Please describe.
Container Apps startup/liveness/readiness health probe results are not published as Container App metrics, where they could be used for alerting or even external ingress routing decisions.

Describe the solution you'd like.
Publish Container App health probe information as metrics. Ideally, provide separate metrics for each probe type and include Replica and Revision dimensions. For example, App Service publishes its own health check results as a Health Check Status metric.

Describe alternatives you've considered.
Infer a failed liveness probe from a non-zero "Replica Restart Count" metric. If the replica restarts continuously due to a non-transient issue, this does not give an accurate indication of the current health state. It also provides no way to infer readiness, since a failed readiness probe does not cause a restart.

Readiness could be monitored by an external probe, but this adds complexity. In addition, unlike internal health probes, external probes will move the app from idle to active usage metering, so high frequency polling would also have a cost impact.

mortenf1984 commented 1 year ago

Is there any information on this feature request? I cannot find any place to actually see the status of the health probes I set up, and getting them as a metric would be great.

ranat5 commented 6 months ago

Hi, has there been any update on this enhancement request?

maskati commented 6 months ago

Not sure if this is showing availability based on probe state, but Diagnose and solve problems -> Availability and Performance -> Availability view shows a graph of "availability". Have not investigated further.

This seems to be sourced from a Microsoft.App/containerapps/detectors resource named cappContainerAppAvailabilityDetector. You can list other detectors with List Detectors and get the data with Get Detector, but it's not exactly in an easy-to-consume format.

There seem to be a bunch of detectors emitting various data. It would be nice to be able to consume at least some of these as standard Azure Monitor metrics. I am getting the following detectors:

id                                        name                                             category                       description
--                                        ----                                             --------                       -----------
AutoScalingErrors                         Auto Scaling Errors                              Availability and Performance   This detector shows you auto scaling (KEDA) errors occurred to your container app.
BilledQuantityWithAppAliveCount           Usage Quantity With Replicas                     Configuration and Management   Detector to show the usage quantities per metric with replica count
BilledQuantityWithAppAliveCountWestEurope West Europe Billing Issue                        Configuration and Management   Detector to show the underbilling usage quantities per metric with replica count for West Europe
cappcertificates                          SSL and Domains                                  SSL and Domains                This detector shows SSL and custom domain related issues for your Container App and Container App Environment.
cappconfigandmanagement                   Configuration and Management                     Configuration and Management   This detector shows configuration and management issues of your container app.
cappContainerAppAvailabilityDetector      Container App Availability Metrics               Availability and Performance   Analyze App and Platform availability and monitor the requests and failures to your container app
cappContainerAppAvailabilityMetrics       Container App Down                               Availability and Performance   Analyze App and Platform availability and monitor the requests and failures to your container app
cappcontainerappclustercreation           Container App Env Creation Error Detector        Container Apps Environment     This detector shows you some known issues regarding cluster creation
cappcontainerappcpu                       Container App CPU Usage                          Availability and Performance   This detector shows the Container App CPU usage.
cappcontainerappmemory                    Container App Memory Usage                       Availability and Performance   This detector shows the memory usage of the specified container app.
cappcontainerappnetworkusage              Container App Network Inbound and Outbound Usage Availability and Performance   This detector shows the Container App Network Inbound and Outbound usages.
cappcontainerapprevisions                 Revisions                                                                       List of revisions of the container app
cappdeploymentFailures                    Deployment Failures                              Deployment                     This detector checks for deployment failures.
clustersubnet                             Cluster Subnet                                   Container Apps Environment     Looks for issues with cluster subnet configuration
ContainerAppEnvironmentEvents             Container Environment Events                     Container Apps Environment
ContainerAppsRevisionComparsion           Container Apps Revisions Comparison              Configuration and Management   Track the differences between two seperate revisisons
containerenvinsights                      Container Environment Insights                   Container Environment Insights This detector shows various insights of your container environment.
DaprInsights                              Dapr Components Insights                         Dapr Component Insights        This detector shows various insights of Dapr Components for your Container App Environment.
EasyAuthConfigurationErrors               EasyAuth Configuration Errors                    Configuration and Management   This detector shows you errors of EasyAuth occurred to your container app.
snatusage                                 SNAT Connection and Port Allocation              Availability and Performance   Checks SNAT connection counts and port allocation per host for any cluster outbound IPs.

floriankoch commented 6 months ago

Any progress on this?
For example, to detect issues like https://github.com/microsoft/azure-container-apps/issues/1025, the readiness states for the containers would help.

loadaverage commented 1 month ago

I find it perplexing that such a basic and vital metric is not provided. What is even more surprising is that the HealthProbeStatus for the Ingress/Load Balancer shows an average value of 66.7 across all my Container Apps, despite there being no errors related to health probes in the Container Apps/System Logs. The HealthProbeStatus metric displays a flat line with a minimum value of 0, a maximum value of 100, and an average value of 66.7 (surprise).

Can someone explain how this is possible? How can 100+0 give 66.7? (Why not use the blackbox_exporter approach with simple 0 and 1 logic?)

And why are there no metrics from the Load Balancer that is attached to every CA Environment?

maskati commented 1 month ago

@loadaverage note that the load balancer is an internal managed resource and its health probe status does not directly correlate with the health of your environment or apps.
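
On the 66.7 question: the Average aggregation is computed over all samples in the time window, not over the minimum and maximum alone. Purely as an illustration (the 2:1 split below is an assumption, not something confirmed for this load balancer), if two of three samples report 100 and one reports 0:

```latex
\[
\text{avg} = \frac{100 + 100 + 0}{3} \approx 66.7,
\qquad \min = 0, \qquad \max = 100
\]
```

A split like that, if it stayed constant over time, would also show up as a flat line.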

loadaverage commented 1 month ago

> @loadaverage note that the load balancer is an internal managed resource and its health probe status does not directly correlate with the health of your environment or apps.

I was also thinking about that; however, I don't understand why this metric is published at all, because it can't be used for anything from the customer's perspective. There are some useful LB metrics (SYN Count, Byte Count, etc.) though.

My initial idea was to get any metric from the network layer (Request Timeout, Request Failed, Connection Refused, etc.) that can be used in alerts. For now I have only Restart Count from the CA metrics and an external thingy to ping the health endpoint of my CA.

maskati commented 1 month ago

Those are just standard Azure load balancer metrics which are not specific to Container Apps. Container Apps just happens to use Azure Load Balancer as part of the backend infrastructure, and you can view the metrics as you can with any Azure load balancer. But the health probe status does not really tell you anything about the health of your Container App environment or apps, since Microsoft manages the backend pool nodes for you. We really need Microsoft to publish health metrics on the Container Apps app level, which would indicate the health as defined by the app's probes.

johschmidt42 commented 3 weeks ago

Since we currently don't get the health probe results as Azure Monitor metrics, what are the alternatives to set up alerting on application health with Azure Container Apps?

I'm interested in what others do at the moment.

loadaverage commented 3 weeks ago

> Since we currently don't get the health probe results as Azure Monitor metrics, what are the alternatives to set up alerting on application health with Azure Container Apps?
>
> * Via a custom metric or log whenever the health endpoint is called?
>
> * Using Application Insights Availability Tests? In a private network scenario, this would require limiting the ingress traffic of a container app to a VNet and probably using Azure Monitor Private Link Scope. But if authentication is enabled, one would need to use custom availability tests with e.g. Azure Functions.
>
> I'm interested in what others do at the moment.

I'm using:

I believe that should be enough to cover most cases for user-facing endpoints. For private endpoints, restart count and 5xx should be a good foundation for alert rules.

johschmidt42 commented 3 weeks ago

Thanks for the feedback @loadaverage. What's your alert definition for the Replica Restart Count if I may ask?

loadaverage commented 2 weeks ago

> Thanks for the feedback @loadaverage. What's your alert definition for the Replica Restart Count if I may ask?

I'm using: >= 2 within a 5-minute window, because I have relatively high "Period seconds" and "Initial delay seconds" values on some of my Container Apps.

This is the exact Terraform configuration:

resource "azurerm_monitor_metric_alert" "replica_restart_count" {
  name                = "replica-restart-count"
  enabled             = true
  resource_group_name = azurerm_resource_group.shared.name
  scopes              = [module.myapp.container_app_id]
  description         = "Replica Restart Count is greater than or equal to specified threshold"
  auto_mitigate       = true
  # 0: Critical, 1: Error, 2: Warning, 3: Informational, 4: Verbose
  severity = 0
  # Lookback period: PT1M, PT5M, PT15M, PT30M, PT1H, PT6H, PT12H, P1D
  window_size = "PT5M"
  # Check every: PT1M, PT5M, PT15M, PT30M, PT1H
  frequency = "PT1M"
  tags      = local.myapp_tags

  criteria {
    metric_namespace = "Microsoft.App/containerApps"
    metric_name      = "RestartCount"
    aggregation      = "Maximum"
    operator         = "GreaterThanOrEqual"
    threshold        = 2
  }

  action {
    action_group_id = azurerm_monitor_action_group.default.id
  }
}
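
For the 5xx side mentioned earlier in the thread, a companion rule could follow the same pattern. This is only a sketch, not a tested configuration: it assumes the Container Apps `Requests` metric exposes a `statusCodeCategory` dimension with a `5xx` value, and the resource name, severity, and threshold are placeholders to tune.

```hcl
# Hypothetical companion alert: fires when 5xx responses exceed a threshold.
resource "azurerm_monitor_metric_alert" "requests_5xx" {
  name                = "requests-5xx" # placeholder name
  resource_group_name = azurerm_resource_group.shared.name
  scopes              = [module.myapp.container_app_id]
  description         = "5xx responses exceed the specified threshold"
  severity            = 1      # placeholder; 0 is Critical
  window_size         = "PT5M" # lookback period
  frequency           = "PT1M" # evaluation interval

  criteria {
    metric_namespace = "Microsoft.App/containerApps"
    metric_name      = "Requests" # assumed metric name
    aggregation      = "Total"
    operator         = "GreaterThan"
    threshold        = 10 # placeholder; tune to your traffic

    # Count only server-side errors (dimension name and value are assumptions).
    dimension {
      name     = "statusCodeCategory"
      operator = "Include"
      values   = ["5xx"]
    }
  }

  action {
    action_group_id = azurerm_monitor_action_group.default.id
  }
}
```

Keep in mind that a replica failing its readiness probe may simply be taken out of rotation without producing any 5xx responses, so a rule like this complements the probe metrics requested here rather than replacing them.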