ministryofjustice / modernisation-platform

A place for the core work of the Modernisation Platform • This repository is defined and managed in Terraform
https://user-guide.modernisation-platform.service.justice.gov.uk
MIT License

NAT Alarms for Core network services account #7724

Open markgov opened 1 month ago

markgov commented 1 month ago

User Story

As a Modernisation Platform engineer, I want to monitor NAT traffic and packet drops, so that we can get ahead of any networking issues.

Value / Purpose

As part of issue New CloudWatch alarms #7450, we identified that only the core shared service account needs the NAT alarms. This issue covers the creation of these alarms. An example of the alarm code is as follows:

resource "aws_cloudwatch_log_metric_filter" "NATGatewayErrorPortAllocation" {
  name           = var.error_port_allocation_metric_filter_name
  pattern        = "{ $.eventSource = \"ec2.amazonaws.com\" && $.eventName = \"CreateNatGateway\" && $.errorCode = \"*\" && $.errorMessage = \"*Port Allocation*\" }"
  log_group_name = "cloudtrail"

  metric_transformation {
    name      = "ErrorPortAllocation"
    namespace = "NAT/Gateway"
    value     = "1"
  }
}

resource "aws_cloudwatch_metric_alarm" "ErrorPortAllocation" {
  alarm_name        = var.error_port_allocation_alarm_name
  alarm_description = "This alarm detects when the NAT Gateway is unable to allocate ports to new connections."
  alarm_actions     = [aws_sns_topic.securityhub-alarms.arn]

  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "1"
  metric_name         = "ErrorPortAllocation"
  namespace           = "NAT/Gateway"
  period              = "300"
  statistic           = "Sum"
  threshold           = "0"
  treat_missing_data  = "notBreaching"

  tags = var.tags
}

# NAT PacketsDropCount alarm
resource "aws_cloudwatch_metric_alarm" "nat_packets_drop_count_all" {
  alarm_name          = var.nat_packets_drop_count_all_alarm_name
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  threshold           = "100" # Adjust this threshold as needed
  alarm_description   = "NAT Gateways are dropping packets. This might indicate an issue with one or more NAT Gateways."

  metric_query {
    id          = "e1"
    expression  = "SUM(METRICS())"
    label       = "Total Dropped Packets"
    return_data = "true"
  }

  metric_query {
    id = "m1"
    metric {
      metric_name = "PacketsDropCount"
      namespace   = "AWS/NATGateway"
      period      = 60
      stat        = "Sum"
    }
  }

  alarm_actions = [aws_sns_topic.securityhub-alarms.arn]
  tags          = var.tags
}

Useful Contacts

No response

Additional Information

These alarms only need to monitor the egress VPCs (live_data and non_live_data).
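One way to scope the packet-drop alarm to the egress VPCs is to create one alarm per NAT gateway and set the `NatGatewayId` dimension, since CloudWatch publishes the `AWS/NATGateway` metrics per gateway. The following is a minimal sketch under stated assumptions, not the code from the eventual PR: the variables `egress_nat_gateway_ids` and `alarm_topic_arn` and the alarm naming scheme are hypothetical and introduced only for illustration.

```hcl
# Hypothetical input: NAT gateway IDs in the live_data and non_live_data
# egress VPCs, keyed by a readable name (assumption, not from the issue).
variable "egress_nat_gateway_ids" {
  type        = map(string)
  description = "Map of readable name => NAT gateway ID for the egress VPCs"
}

# Hypothetical input: ARN of the SNS topic that receives alarm notifications.
variable "alarm_topic_arn" {
  type        = string
  description = "SNS topic ARN used for alarm actions"
}

# One alarm per egress NAT gateway.
resource "aws_cloudwatch_metric_alarm" "nat_packets_drop_count" {
  for_each = var.egress_nat_gateway_ids

  alarm_name          = "nat-packets-drop-count-${each.key}"
  alarm_description   = "NAT gateway ${each.value} is dropping packets."
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  threshold           = 100 # same starting value as the example in this issue; adjust as needed
  treat_missing_data  = "notBreaching"

  namespace   = "AWS/NATGateway"
  metric_name = "PacketsDropCount"
  statistic   = "Sum"
  period      = 60

  # Scope the metric to a single gateway.
  dimensions = {
    NatGatewayId = each.value
  }

  alarm_actions = [var.alarm_topic_arn]
}
```

The same pattern would apply to the built-in `ErrorPortAllocation` metric in the `AWS/NATGateway` namespace, by changing `metric_name` and the threshold.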

Definition of Done

markgov commented 4 days ago

Branch created and new code added for the new alarms. PR to go up shortly.

markgov commented 4 days ago

PR created: https://github.com/ministryofjustice/modernisation-platform/pull/7969

markgov commented 4 days ago

The new alarms have been created and data is now showing. This is all in the core-networking account. This concludes the requirements of the issue, which is now ready for review.

SteveLinden commented 3 days ago

Possibly a little late now, but doesn't this need to be amended to be a value (rather than a string), as with the previous one? threshold = "0"

Do you want to amend the code above to show threshold = 100 rather than the quoted value?
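For illustration, the first alarm from the issue description could be written with unquoted numeric values as sketched below. Terraform converts a quoted `"0"` to a number automatically, so both forms are accepted; the unquoted form simply matches the attribute types and keeps the two alarms consistent. The variable and SNS topic names are taken from the issue description and are assumed to be defined elsewhere in the module.

```hcl
resource "aws_cloudwatch_metric_alarm" "ErrorPortAllocation" {
  alarm_name        = var.error_port_allocation_alarm_name
  alarm_description = "This alarm detects when the NAT Gateway is unable to allocate ports to new connections."
  alarm_actions     = [aws_sns_topic.securityhub-alarms.arn]

  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1   # number, not "1"
  metric_name         = "ErrorPortAllocation"
  namespace           = "NAT/Gateway"
  period              = 300 # seconds
  statistic           = "Sum"
  threshold           = 0   # number, not "0"
  treat_missing_data  = "notBreaching"

  tags = var.tags
}
```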

markgov commented 3 days ago

Yep, I did.