ministryofjustice / modernisation-platform

A place for the core work of the Modernisation Platform • This repository is defined and managed in Terraform
https://user-guide.modernisation-platform.service.justice.gov.uk
MIT License

Spike: Can we monitor and alert on IP usage #8024

Open davidkelliott opened 4 weeks ago

davidkelliott commented 4 weeks ago

User Story

As an MP engineer, I want to be alerted if we start to run out of IP addresses in our shared core-vpc subnets, so that I can take action to prevent hitting limits.

Value / Purpose

This will address risk 26. IP addresses are the only shared resource that could potentially be abused, so by monitoring the addresses in use we ensure that we find out before it's too late if we start to reach the IP limit or if one application uses them excessively.

Useful Contacts

No response

Additional Information

Could we make use of the VPC IP address utilisation metrics - https://aws.amazon.com/about-aws/whats-new/2023/08/amazon-vpc-ip-address-utilization-metrics-aws-resources/

Or a script?
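If the script route were taken, the core calculation could be sketched as below. This is only an illustration, not an agreed design: it assumes IPv4 subnets and input shaped like the entries returned by EC2 DescribeSubnets (`SubnetId`, `CidrBlock`, `AvailableIpAddressCount`). The function names and the 0.8 default threshold are hypothetical.

```python
import ipaddress

# AWS reserves the first four addresses and the last address in every subnet.
AWS_RESERVED_PER_SUBNET = 5


def subnet_ip_usage(cidr: str, available: int) -> float:
    """Return the fraction (0.0 to 1.0) of usable IPv4 addresses that are in use."""
    network = ipaddress.ip_network(cidr)
    usable = network.num_addresses - AWS_RESERVED_PER_SUBNET
    if usable <= 0:
        raise ValueError(f"{cidr} has no usable addresses")
    if not 0 <= available <= usable:
        raise ValueError(f"available={available} is out of range for {cidr}")
    return (usable - available) / usable


def subnets_over_threshold(subnets, threshold=0.8):
    """Return the IDs of subnets whose usage is at or above the threshold.

    `subnets` is an iterable of dicts shaped like DescribeSubnets entries.
    The 0.8 default is an illustrative value, not an agreed alerting level.
    """
    return [
        s["SubnetId"]
        for s in subnets
        if subnet_ip_usage(s["CidrBlock"], s["AvailableIpAddressCount"]) >= threshold
    ]
```

In practice the input would come from a DescribeSubnets call in each core-vpc account, and the returned IDs would feed whatever alerting route is chosen.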

Definition of Done

dms1981 commented 1 week ago

We can monitor the remaining IP addresses in VPCs and subnets by using the VpcIPUsage and SubnetIPUsage metrics, via VPC IPAM in the organisational-security account.

Breaching alarms can then be reported through an SNS topic into AWS Chatbot or PagerDuty as appropriate.

We'd need to select a VPC IPAM scope in which to create pools, then RAM share those pools out to the relevant OU - probably Modernisation Platform Core - where the share can be accepted and the pools associated with our core-vpc-production VPCs.

CloudWatch Metric alarms would need to be configured in the organisational-security account.
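As a rough sketch of what one of those alarms might look like, the helper below builds the keyword arguments for CloudWatch's PutMetricAlarm (for example boto3's `put_metric_alarm`). Several things here are assumptions to check against the IPAM documentation: the `AWS/IPAM` namespace, the dimension names passed in, the unit of the threshold, and the period. The helper name and the `Maximum` statistic are illustrative choices only.

```python
def ip_usage_alarm_params(alarm_name, metric_name, dimensions, threshold, sns_topic_arn):
    """Build PutMetricAlarm keyword arguments for an IP usage metric.

    `dimensions` is a plain dict of dimension name to value; the correct
    names must be taken from the IPAM CloudWatch documentation.
    """
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/IPAM",  # assumed namespace for IPAM resource metrics
        "MetricName": metric_name,  # e.g. "VpcIPUsage" or "SubnetIPUsage"
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Statistic": "Maximum",
        "Period": 300,  # seconds; assumed, adjust to the metric's publishing interval
        "EvaluationPeriods": 1,
        "Threshold": threshold,  # unit (fraction or percentage) must be confirmed
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],  # SNS topic feeding Chatbot or PagerDuty
    }
```

The resulting dict could be passed as `cloudwatch.put_metric_alarm(**params)` from the organisational-security account, or used as a reference for the equivalent Terraform `aws_cloudwatch_metric_alarm` resource.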

dms1981 commented 1 week ago

https://aws.amazon.com/about-aws/whats-new/2023/08/amazon-vpc-ip-address-utilization-metrics-aws-resources/

https://docs.aws.amazon.com/vpc/latest/ipam/cloudwatch-ipam-res-util.html

https://aws.amazon.com/blogs/networking-and-content-delivery/amazon-vpc-ip-address-manager-best-practices/

dms1981 commented 1 week ago

From some light reading of the best practices document I would suggest two VPC IPAM pools associated with the private scope: modernisation-platform-development-test and modernisation-platform-preproduction-production.

As there is no subdivision in the Modernisation Platform Core OU between dev-test and preprod-prod, if we're only interested in monitoring IP address utilisation in our core-vpc-* accounts, then RAM sharing with the relevant core-vpc-* accounts is the most constrained method. However, if we're also interested in monitoring IP utilisation in our other core-* VPCs, we may as well share directly with the OU.

VPCs would then be associated with the relevant VPC IPAM pool.

Maintaining two separate pools would allow us to be more selective in how we report our alerts - while impending address exhaustion in a dev/test VPC would be disruptive, it wouldn't be potentially service-breaking in the way the same event in a preprod/prod VPC would be.
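To illustrate how the two pools could drive different alerting behaviour, here is a minimal sketch. The pool names come from the suggestion above; the thresholds and the mapping of dev/test to Chatbot and preprod/prod to PagerDuty are hypothetical values chosen for the example, not agreed settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolAlertPolicy:
    threshold: float  # usage fraction at which an alert fires (hypothetical)
    destination: str  # where the alert is sent (hypothetical mapping)


POLICIES = {
    "modernisation-platform-development-test": PoolAlertPolicy(0.9, "chatbot"),
    "modernisation-platform-preproduction-production": PoolAlertPolicy(0.8, "pagerduty"),
}


def route_alert(pool: str, usage: float):
    """Return the alert destination for a pool, or None if usage is below its threshold."""
    policy = POLICIES[pool]
    return policy.destination if usage >= policy.threshold else None
```

With these example values, 85% usage would page for a preprod/prod VPC but raise nothing for dev/test, which only alerts (to chat) from 90%.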

dms1981 commented 6 hours ago

I looked into this further with AWS Support. We found a parsing error around region/Region that was affecting our ability to view all resources in a region; it has been reported back to the IPAM team. Beyond that, though, I think this one needs some rethinking in light of how metrics are passed in from IPAM to CloudWatch.

You can see the documentation here: https://docs.aws.amazon.com/vpc/latest/ipam/cloudwatch-ipam-res-util.html

An alternative that I'm exploring is enabling Network Address Usage metric settings in VPCs, and tracking the NetworkAddressUsage metric as described here: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-cloudwatch.html