Open maskati opened 1 year ago
Is there any information on this feature request? I cannot find any place to actually see the status of the health probes I set up, and getting them as a metric would be great.
Hi, has there been any update on this enhancement request?
Not sure if this is showing availability based on probe state, but Diagnose and solve problems -> Availability and Performance -> Availability view shows a graph of "availability". Have not investigated further.
This seems to be sourced from a `Microsoft.App/containerapps/detectors` resource named `cappContainerAppAvailabilityDetector`. You can list other detectors with List Detectors and get the data with Get Detector, but it's not exactly in an easy-to-consume format.
There seem to be a bunch of detectors emitting various data. It would be nice to be able to consume at least some of these as standard Azure Monitor metrics. I am getting the following detectors:
| id | name | category | description |
| -- | ---- | -------- | ----------- |
| AutoScalingErrors | Auto Scaling Errors | Availability and Performance | This detector shows you auto scaling (KEDA) errors occurred to your container app. |
| BilledQuantityWithAppAliveCount | Usage Quantity With Replicas | Configuration and Management | Detector to show the usage quantities per metric with replica count |
| BilledQuantityWithAppAliveCountWestEurope | West Europe Billing Issue | Configuration and Management | Detector to show the underbilling usage quantities per metric with replica count for West Europe |
| cappcertificates | SSL and Domains | SSL and Domains | This detector shows SSL and custom domain related issues for your Container App and Container App Environment. |
| cappconfigandmanagement | Configuration and Management | Configuration and Management | This detector shows configuration and management issues of your container app. |
| cappContainerAppAvailabilityDetector | Container App Availability Metrics | Availability and Performance | Analyze App and Platform availability and monitor the requests and failures to your container app |
| cappContainerAppAvailabilityMetrics | Container App Down | Availability and Performance | Analyze App and Platform availability and monitor the requests and failures to your container app |
| cappcontainerappclustercreation | Container App Env Creation Error Detector | Container Apps Environment | This detector shows you some known issues regarding cluster creation |
| cappcontainerappcpu | Container App CPU Usage | Availability and Performance | This detector shows the Container App CPU usage. |
| cappcontainerappmemory | Container App Memory Usage | Availability and Performance | This detector shows the memory usage of the specified container app. |
| cappcontainerappnetworkusage | Container App Network Inbound and Outbound Usage | Availability and Performance | This detector shows the Container App Network Inbound and Outbound usages. |
| cappcontainerapprevisions | Revisions | | List of revisions of the container app |
| cappdeploymentFailures | Deployment Failures | Deployment | This detector checks for deployment failures. |
| clustersubnet | Cluster Subnet | Container Apps Environment | Looks for issues with cluster subnet configuration |
| ContainerAppEnvironmentEvents | Container Environment Events | Container Apps Environment | |
| ContainerAppsRevisionComparsion | Container Apps Revisions Comparison | Configuration and Management | Track the differences between two seperate revisisons |
| containerenvinsights | Container Environment Insights | Container Environment Insights | This detector shows various insights of your container environment. |
| DaprInsights | Dapr Components Insights | Dapr Component Insights | This detector shows various insights of Dapr Components for your Container App Environment. |
| EasyAuthConfigurationErrors | EasyAuth Configuration Errors | Configuration and Management | This detector shows you errors of EasyAuth occurred to your container app. |
| snatusage | SNAT Connection and Port Allocation | Availability and Performance | Checks SNAT connection counts and port allocation per host for any cluster outbound IPs. |
Any progress on this?
For example to detect issues like this https://github.com/microsoft/azure-container-apps/issues/1025 , the readiness states for the containers would help
I find it perplexing that such a basic and vital metric is not provided. What is even more surprising is that the `HealthProbeStatus` for the Ingress/Load Balancer shows an average value of 66.7 across all my Container Apps, despite there being no errors related to health probes in the Container Apps/System Logs. The `HealthProbeStatus` metric displays a flat line with a minimum value of 0, a maximum value of 100, and an average value of 66.7 (surprise).
Can someone explain how this is possible? How can 100 and 0 give 66.7? (Why not use the blackbox_exporter approach with simple 0 and 1 logic?)
And why are there no metrics from the Load Balancer that is attached to every CA Environment?
@loadaverage note that the load balancer is an internal managed resource and its health probe status does not directly correlate with the health of your environment or apps.
I was also thinking about that; however, I don't understand why this metric is published at all, because it can't be used for anything from a customer's perspective. There are some useful LB metrics (SYN Count, Byte Count, etc.) though.
My initial idea was to get any metric from the network layer (Request Timeout, Request Failed, Connection Refused, etc.) that can be used in alerts. For now I have only Restart Count from the CA metrics and an external thingy to ping the health endpoint of my CA.
Those are just standard Azure load balancer metrics which are not specific to Container Apps. Container Apps just happens to use Azure Load Balancer as part of the backend infrastructure, and you can view the metrics as you can with any Azure load balancer. But the health probe status does not really tell you anything about the health of your Container App environment or apps, since Microsoft manages the backend pool nodes for you. We really need Microsoft to publish health metrics on the Container Apps app level, which would indicate the health as defined by the app's probes.
Since we currently don't get the health probe results as Azure Monitor metrics, what are the alternatives to set up alerting on application health with Azure Container Apps?
* Via a custom metric or log whenever the health endpoint is called?
* Using Application Insights Availability Tests? In a private network scenario, this would require limiting the ingress traffic of a container app to a VNet and probably using Azure Monitor Private Link Scope. But if authentication is enabled, one would need to use custom availability tests with e.g. Azure Functions.

I'm interested in what others do at the moment.
I'm using:
I believe that should be enough to cover most cases for user-facing endpoints. For private endpoints, `restart count` and `5xx` should be a good foundation for alert rules.
Thanks for the feedback @loadaverage. What's your alert definition for the `Replica Restart Count`, if I may ask?
I'm using: `>= 2` within a 5-minute window, because I have relatively high `Period seconds` and `Initial delay seconds` on some of my Container Apps.
This is the exact Terraform configuration:
```hcl
resource "azurerm_monitor_metric_alert" "replica_restart_count" {
  name                = "replica-restart-count"
  enabled             = true
  resource_group_name = azurerm_resource_group.shared.name
  scopes              = [module.myapp.container_app_id]
  description         = "Replica Restart Count is greater than or equal to specified threshold"
  auto_mitigate       = true

  # 0: Critical, 1: Error, 2: Warning, 3: Informational, 4: Verbose
  severity = 0

  # Lookback period: PT1M, PT5M, PT15M, PT30M, PT1H, PT6H, PT12H, P1D
  window_size = "PT5M"

  # Check every: PT1M, PT5M, PT15M, PT30M, PT1H
  frequency = "PT1M"

  tags = local.myapp_tags

  criteria {
    metric_namespace = "Microsoft.App/containerApps"
    metric_name      = "RestartCount"
    aggregation      = "Maximum"
    operator         = "GreaterThanOrEqual"
    threshold        = 2
  }

  action {
    action_group_id = azurerm_monitor_action_group.default.id
  }
}
```
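For the `5xx` half of the approach mentioned earlier in the thread, a companion rule could follow the same pattern. This is only a sketch: it assumes the Container Apps `Requests` metric exposes a `statusCodeCategory` dimension with a `5xx` value, and the resource name, severity, and threshold of 5 are illustrative placeholders rather than values taken from the thread. It reuses the resource group, scope, and action group references from the configuration above.

```hcl
# Hypothetical companion alert: fires when 5xx responses exceed a threshold.
resource "azurerm_monitor_metric_alert" "requests_5xx" {
  name                = "requests-5xx"
  enabled             = true
  resource_group_name = azurerm_resource_group.shared.name
  scopes              = [module.myapp.container_app_id]
  description         = "Count of 5xx responses exceeds the specified threshold"
  auto_mitigate       = true
  severity            = 1
  window_size         = "PT5M"
  frequency           = "PT1M"

  criteria {
    metric_namespace = "Microsoft.App/containerApps"
    metric_name      = "Requests" # assumed metric name
    aggregation      = "Total"
    operator         = "GreaterThan"
    threshold        = 5 # illustrative value, tune per app

    # Assumed dimension name and value; only 5xx responses are counted.
    dimension {
      name     = "statusCodeCategory"
      operator = "Include"
      values   = ["5xx"]
    }
  }

  action {
    action_group_id = azurerm_monitor_action_group.default.id
  }
}
```

Note that this only sees requests that reach the ingress, so like the restart-count rule it is a proxy for probe state and not a replacement for a real health probe metric.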
Is your feature request related to a problem? Please describe.
Container Apps startup/liveness/readiness health probes are not published to Container App metrics, where they could be used for alerting or even external ingress routing decisions.
Describe the solution you'd like.
Publish Container App health probe information as metrics. Ideally provide separate metrics for each probe type and include Replica and Revision dimensions. For example App Service publishes its own health check results as a Health Check Status metric.
Describe alternatives you've considered.
Infer failed liveness probe based on non-zero "Replica Restart Count" metric. If the replica continuously restarts due to a non-transient issue, this does not give an accurate indication of the current health state. It also does not provide a means to infer readiness, since this does not cause a restart.
Readiness could be monitored by an external probe, but this adds complexity. In addition, unlike internal health probes, external probes will move the app from idle to active usage metering, so high frequency polling would also have a cost impact.
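For reference, the external-probe alternative might look roughly like the following Application Insights standard web test. This is a sketch under assumptions: `azurerm_application_insights.example`, the test name, the `/healthz` URL, and the chosen test location are all hypothetical, and the app must be reachable from the public test location. The 300-second frequency is picked to limit the idle-to-active metering impact described above.

```hcl
# Hypothetical external readiness probe via an Application Insights standard web test.
resource "azurerm_application_insights_standard_web_test" "readiness" {
  name                    = "myapp-readiness"
  resource_group_name     = azurerm_resource_group.shared.name
  location                = azurerm_resource_group.shared.location
  application_insights_id = azurerm_application_insights.example.id # assumed to exist elsewhere
  geo_locations           = ["emea-nl-ams-azr"]                     # illustrative test location

  enabled   = true
  frequency = 300 # seconds between probes; lower values mean more active time billed
  timeout   = 30

  request {
    url = "https://myapp.example.com/healthz" # placeholder health endpoint
  }

  validation_rules {
    expected_status_code = 200
  }
}
```

This still does not report per-replica or per-revision readiness, which is why a native metric with Replica and Revision dimensions is the preferred solution.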