prometheus / cloudwatch_exporter

Metrics exporter for Amazon AWS CloudWatch
Apache License 2.0

[metrics]: Not able to go into a firing state when VPN tunnel is down for VPN-tunnel and ALB, CLB, & NLB #639

Open rajualap opened 7 months ago

rajualap commented 7 months ago

Hi,

I have created a 'cloudwatch-exporter.yml' file to fetch CloudWatch metrics for RDS, Lambda, VPN tunnels, ALB, CLB, and NLB. The RDS and Lambda metrics show up in Prometheus, and when there is an issue with RDS or Lambda, the alert rules go into a firing state and generate alerts. However, we are not receiving alerts for VPN tunnels or for ALB, CLB, and NLB. Can you please help identify the reason? The 'cloudwatch-exporter.yml' file and the alert rules are below.

cloudwatch-exporter.yml file:

region: ap-south-1
metrics:
  - aws_namespace: AWS/RDS
    aws_metric_name: BurstBalance
    aws_dimensions: [DBInstanceIdentifier]
    aws_statistics: [Average]

  - aws_namespace: AWS/RDS
    aws_metric_name: FreeableMemory
    aws_dimensions: [DBInstanceIdentifier]
    aws_statistics: [Average]

  - aws_namespace: AWS/RDS
    aws_metric_name: CPUUtilization
    aws_dimensions: [DBInstanceIdentifier]
    aws_statistics: [Average]

  - aws_namespace: AWS/RDS
    aws_metric_name: DatabaseConnections
    aws_dimensions: [DBInstanceIdentifier]
    aws_statistics: [Average]

  - aws_namespace: AWS/Lambda
    aws_metric_name: Duration
    aws_dimensions: [FunctionName]
    aws_statistics: [Average]

  - aws_namespace: AWS/Lambda
    aws_metric_name: Errors
    aws_dimensions: [FunctionName]
    aws_statistics: [Sum]

  - aws_namespace: AWS/Lambda
    aws_metric_name: Invocations
    aws_dimensions: [FunctionName]
    aws_statistics: [Sum]

  - aws_namespace: AWS/ElasticLoadBalancing
    aws_metric_name: UnHealthyHostCount
    aws_dimensions: [LoadBalancerName]
    aws_statistics: [Average]

  - aws_namespace: AWS/ElasticLoadBalancing
    aws_metric_name: RequestCount
    aws_dimensions: [LoadBalancerName]
    aws_statistics: [Sum]

  - aws_namespace: AWS/VPN
    aws_metric_name: TunnelState
    aws_dimensions: [VpnId]
    aws_statistics: [Average]

####################################

Prometheus VPN tunnel alert rules file:

groups:
  - name: VPNAlerts
    rules:
      # Alert if the average VPN tunnel state is less than 1 (indicating down) for 5 minutes
      - alert: VPNDownCritical
        expr: aws_vpn_tunnel_state_average < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          LABELS: '{{ $labels }}'
          VALUE: '{{ $value }}'
          summary: 'VPN Tunnel Down Critical'
          description: 'At least one VPN tunnel is down.'

      # Alert if the average VPN tunnel state is less than 1 for 1 minute
      - alert: VPNDownWarning
        expr: aws_vpn_tunnel_state_average < 1
        for: 1m
        labels:
          severity: warning
        annotations:
          LABELS: '{{ $labels }}'
          VALUE: '{{ $value }}'
          summary: 'VPN Tunnel Down Warning'
          description: 'At least one VPN tunnel is down.'

      # Alert if there are changes in VPN tunnel state indicating flapping for 5 minutes
      - alert: VPNFlapping
        expr: changes(aws_vpn_tunnel_state_average[5m]) > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          LABELS: '{{ $labels }}'
          VALUE: '{{ $value }}'
          summary: 'VPN Tunnel Flapping'
          description: 'At least one VPN tunnel is experiencing flapping.'

CloudWatch metrics:

[screenshot: image010]

matthiasr commented 5 months ago

What does aws_vpn_tunnel_state_average look like in the /metrics endpoint? What does it look like in the Prometheus graph and table views?

It seems that you are using the default delay_seconds and set_timestamp. With those defaults, each sample carries its original CloudWatch timestamp, which lies several minutes in the past, so an instant query evaluated at "now", which is what your rules use, does not see it. See the documentation for details.
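For illustration, here is a minimal sketch of how the VPN entry in cloudwatch-exporter.yml might override that default. set_timestamp is a documented exporter option; choosing to disable it here is an assumption for the sake of the example, not something confirmed as the right fix for this setup.

```yaml
region: ap-south-1
metrics:
  - aws_namespace: AWS/VPN
    aws_metric_name: TunnelState
    aws_dimensions: [VpnId]
    aws_statistics: [Average]
    # Assumption: drop the original CloudWatch timestamp so that Prometheus
    # stamps the sample at scrape time and instant queries can see it.
    # The value itself is still as old as delay_seconds allows.
    set_timestamp: false
```

The trade-off is that the sample then appears to be current even though it describes the tunnel state from several minutes earlier, so keeping the default and widening the query window, as suggested next, may be preferable.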

Try min_over_time(aws_vpn_tunnel_state_average[15m]) < 1 and changes(aws_vpn_tunnel_state_average[30m]) > 1 to look back further.
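Applied to the rules file from the original post, those two expressions could look roughly like this. It is a sketch only: the alert names, labels, and for durations are carried over from the post, and the 15m and 30m windows are the ones suggested above, not tuned values.

```yaml
groups:
  - name: VPNAlerts
    rules:
      # Fires if the tunnel state dropped below 1 at any point in the last 15 minutes.
      - alert: VPNDownCritical
        expr: min_over_time(aws_vpn_tunnel_state_average[15m]) < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'VPN Tunnel Down Critical'
          description: 'At least one VPN tunnel is down.'

      # Fires if the tunnel state changed more than once in the last 30 minutes.
      - alert: VPNFlapping
        expr: changes(aws_vpn_tunnel_state_average[30m]) > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'VPN Tunnel Flapping'
          description: 'At least one VPN tunnel is experiencing flapping.'
```

Because min_over_time looks at the whole window, an alert written this way also takes up to 15 minutes longer to resolve after the tunnel recovers.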