truemark / autoalarm

Tag based alarm generation automation
BSD 3-Clause "New" or "Revised" License

AutoAlarm Project README

Overview

AutoAlarm is an AWS Lambda-based automation tool designed to dynamically manage CloudWatch alarms for ALBs, EC2 Instances, OpenSearch, SQS, and Target Groups based on instance states and specific tag values. The project uses AWS SDK for JavaScript v3, the AWS CDK for infrastructure deployment, and is integrated with AWS Lambda and CloudWatch for automated cloud observability.

Architecture

The AutoAlarm project is designed to be deployed with minimal configuration, creating all necessary AWS resources for full functionality. Upon deployment, the project automatically provisions the following components:

This architecture ensures that AutoAlarm can monitor and manage resources out-of-the-box, including ALBs, EC2 instances, OpenSearch domains, SQS queues, and Target Groups. The system is fully event-driven, dynamically responding to state and tag changes across these resources.

Special Considerations:

Deployment Process

Prerequisites

Before you begin, ensure you have the following:

  1. AWS CLI: Installed and configured with appropriate access to your AWS account.
  2. AWS CDK
  3. Node.js
  4. Git
  5. pnpm: Version 9.1.4 or later.

To set up and deploy the AutoAlarm project, follow these steps:

  1. Clone the Repository

    Start by cloning the project repository to your local machine:

    git clone https://github.com/truemark/autoalarm.git
    cd autoalarm
  2. Install Dependencies
    pnpm install
  3. Configure Region
    export AWS_REGION=<region>
  4. Configure Keys and Session Token
    export AWS_ACCESS_KEY_ID="<access-key-id>"
    export AWS_SECRET_ACCESS_KEY="<secret-access-key>"
    export AWS_SESSION_TOKEN="<aws-session-token>"
  5. Bootstrap the CDK
    cdk bootstrap
  6. Build the Project
    pnpm build
  7. Deploy the Stack
    cd cdk && cdk deploy AutoAlarm
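
A deployment usually fails late if one of the exported variables is missing, so it can help to check them first. The following is a minimal sketch, not part of AutoAlarm: `missingDeployVars` and `REQUIRED_DEPLOY_VARS` are hypothetical names, and the environment is passed in as a plain object (for example Node's `process.env`) so the helper stays self-contained.

```typescript
// Hypothetical preflight helper (not part of AutoAlarm): reports which of the
// variables exported in the steps above are missing or empty.
// AWS_SESSION_TOKEN is only needed with temporary credentials; remove it from
// the list if you deploy with long-lived access keys.
const REQUIRED_DEPLOY_VARS = [
  'AWS_REGION',
  'AWS_ACCESS_KEY_ID',
  'AWS_SECRET_ACCESS_KEY',
  'AWS_SESSION_TOKEN',
] as const;

type Env = Record<string, string | undefined>;

function missingDeployVars(env: Env): string[] {
  return REQUIRED_DEPLOY_VARS.filter((name) => {
    const value = env[name];
    // Treat unset and whitespace-only values the same way.
    return value === undefined || value.trim() === '';
  });
}
```

Calling `missingDeployVars(process.env)` before `cdk bootstrap` returns an empty array when everything is set, and the names of the missing variables otherwise.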

Features

AWS Services Used

1. Amazon CloudWatch

Amazon CloudWatch is utilized for monitoring and alerting. CloudWatch alarms are created, updated, or deleted by the Lambda function to track various metrics such as CPU utilization, memory usage, storage usage, ALB metrics, and Target Group metrics. CloudWatch Logs are also used to store log data generated by the Lambda function for debugging and auditing purposes.

2. Amazon EC2

Amazon EC2 is the primary service monitored by AutoAlarm. The Lambda function responds to state change notifications and tag change events for EC2 instances, creating or updating alarms based on the instance's state and tags.

3. Amazon EventBridge

Amazon EventBridge is used to route events to the Lambda function. Rules are set up to listen for specific events such as state changes, tag changes, and other resource events. These events trigger the Lambda function to perform the necessary alarm management actions. Additionally, when using TrueMark's Enterprise Operation Center, EventBridge is monitored for triggered alarms, eliminating the need to send them to a specific SNS topic.
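
To make the routing concrete, here is a small sketch of how a handler might tell the two main event types apart. The `source` and `detail-type` values are the ones AWS publishes for EC2 state changes and resource tag changes; the `classifyEvent` function, the `AutoAlarmTrigger` type, and the decision to ignore tag changes that do not touch an `autoalarm:` key are assumptions for illustration and may not match the rules AutoAlarm actually deploys.

```typescript
// Minimal shape of an EventBridge event; only the fields used below.
interface EventBridgeEvent {
  source: string;
  'detail-type': string;
  detail: Record<string, unknown>;
}

type AutoAlarmTrigger = 'ec2-state-change' | 'tag-change' | 'ignored';

// Hypothetical classifier, not AutoAlarm's actual handler logic.
function classifyEvent(event: EventBridgeEvent): AutoAlarmTrigger {
  if (
    event.source === 'aws.ec2' &&
    event['detail-type'] === 'EC2 Instance State-change Notification'
  ) {
    return 'ec2-state-change';
  }
  if (event.source === 'aws.tag' && event['detail-type'] === 'Tag Change on Resource') {
    // Tag change events list the keys that were added, changed, or removed.
    const changed = event.detail['changed-tag-keys'];
    const keys: unknown[] = Array.isArray(changed) ? changed : [];
    // Assumption: only keys with the autoalarm: prefix are relevant.
    const relevant = keys.some((k) => typeof k === 'string' && /^autoalarm:/.test(k));
    return relevant ? 'tag-change' : 'ignored';
  }
  return 'ignored';
}
```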

4. Amazon Simple Queue Service (SQS)

Amazon SQS is used as a dead-letter queue for the Lambda function. If the Lambda function fails to process an event, the event is sent to an SQS queue for further investigation and retry.

5. Elastic Load Balancing (ELB) and Target Groups

ELB is monitored by AutoAlarm for events related to Application Load Balancers (ALBs) and Target Groups. The Lambda function creates, updates, or deletes alarms for ALB metrics and target group metrics based on events and tags.

6. Amazon OpenSearch Service (OS)

OS is monitored by AutoAlarm for events related to OpenSearch Service. The Lambda function creates, updates, or deletes alarms for OS metrics based on events and tags.

7. AWS Identity and Access Management (IAM)

IAM is used to define roles and policies that grant the necessary permissions to the Lambda function. These roles allow the function to interact with other AWS services such as CloudWatch, EC2, Amazon Managed Service for Prometheus (AMP), SQS, and EventBridge.

8. AWS Lambda

AWS Lambda is used to run the main AutoAlarm function, which processes service and tag events in addition to managing alarms. The Lambda function is responsible for handling the logic to create, update, or delete CloudWatch alarms and Prometheus rules based on tags and state changes.

Usage

The system is event-driven, responding to EC2 state change notifications and tag modification events. To manage alarms, ensure your supported resources are tagged according to the schema defined below.

Tag Values and Behavior

Overview

Tags are used to customize CloudWatch alarms for various AWS services managed by AutoAlarm. By applying specific tags to resources such as EC2 instances, ALBs, Target Groups, SQS, and OpenSearch, you can define custom thresholds, evaluation periods, and other parameters for both static threshold and anomaly detection alarms. The following sections outline the default configurations and explain how you can modify them using these tags.

Default Values

AutoAlarm comes with predefined default values for various alarms. These defaults are designed to provide general monitoring out-of-the-box. However, it is crucial that any enabled alarms are reviewed to ensure they align with the specific needs of your application and environment. Default alarms can be created by setting the autoalarm:enabled tag to true on the resource.

Customizing Alarms with Tags

When setting up non-default alarms with tags, you must provide at least the first two values (warning and critical thresholds) for the tag to function correctly. If these thresholds are not supplied, the alarm will not be created unless defaults are defined.

The following schema is used to define tag values for all tags:

Warning Threshold / Critical Threshold / Period / Evaluation Periods / Statistic / Datapoints to Alarm / ComparisonOperator / Missing Data Treatment

Example:

autoalarm:cpu=80/95/60/5/Maximum/5/GreaterThanThreshold/ignore
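
The schema is positional, so a tag value can be split on `/` and read field by field. The sketch below shows one way to do that; `parseAlarmTag` and `AlarmTagConfig` are hypothetical names rather than AutoAlarm's own code, and the sketch only accepts the full eight-field form shown in the example. It treats `-` as "no threshold" and returns `null` for it.

```typescript
// Hypothetical parsed form of one autoalarm tag value.
interface AlarmTagConfig {
  warningThreshold: number | null; // null when the field is "-"
  criticalThreshold: number | null;
  period: number; // seconds
  evaluationPeriods: number;
  statistic: string;
  datapointsToAlarm: number;
  comparisonOperator: string;
  missingDataTreatment: string;
}

function parseThreshold(raw: string): number | null {
  const text = raw.trim();
  if (text === '-') return null;
  const value = Number(text);
  if (text === '' || isNaN(value)) throw new Error('Invalid threshold: "' + raw + '"');
  return value;
}

function parsePositiveInt(raw: string, field: string): number {
  const value = Number(raw.trim());
  // NaN and Infinity both fail the modulo check.
  if (value % 1 !== 0 || value <= 0) {
    throw new Error(field + ' must be a positive integer, got "' + raw + '"');
  }
  return value;
}

function parseAlarmTag(tagValue: string): AlarmTagConfig {
  const parts = tagValue.split('/');
  if (parts.length !== 8) {
    throw new Error('Expected 8 fields separated by "/", got ' + parts.length);
  }
  const [warning = '', critical = '', period = '', evalPeriods = '',
    statistic = '', datapoints = '', operator = '', missing = ''] = parts;
  return {
    warningThreshold: parseThreshold(warning),
    criticalThreshold: parseThreshold(critical),
    period: parsePositiveInt(period, 'Period'),
    evaluationPeriods: parsePositiveInt(evalPeriods, 'Evaluation Periods'),
    statistic: statistic.trim(),
    datapointsToAlarm: parsePositiveInt(datapoints, 'Datapoints to Alarm'),
    comparisonOperator: operator.trim(),
    missingDataTreatment: missing.trim(),
  };
}
```

For the example tag above, `parseAlarmTag('80/95/60/5/Maximum/5/GreaterThanThreshold/ignore')` yields a warning threshold of 80, a critical threshold of 95, a 60-second period, and 5 evaluation periods with 5 datapoints to alarm.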

Static Threshold vs Anomaly Detection Alarms

All anomaly detection alarm tags contain 'anomaly' in the tag name.
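
Because the alarm type is encoded in the tag name, a tag key alone is enough to pick the right family of comparison operators. The sketch below illustrates that check. The two operator lists are the CloudWatch `ComparisonOperator` values for threshold alarms and for anomaly detection alarms; whether AutoAlarm accepts all of them is an assumption, and `alarmKindForTag` and `operatorMatchesTag` are hypothetical helper names.

```typescript
type AlarmKind = 'static' | 'anomaly';

// CloudWatch operators for alarms that compare a metric to a fixed threshold.
const STATIC_OPERATORS: string[] = [
  'GreaterThanOrEqualToThreshold',
  'GreaterThanThreshold',
  'LessThanThreshold',
  'LessThanOrEqualToThreshold',
];

// CloudWatch operators that are only valid for anomaly detection alarms.
const ANOMALY_OPERATORS: string[] = [
  'GreaterThanUpperThreshold',
  'LessThanLowerThreshold',
  'LessThanLowerOrGreaterThanUpperThreshold',
];

// A tag such as autoalarm:cpu-anomaly is an anomaly detection alarm.
function alarmKindForTag(tagKey: string): AlarmKind {
  return /anomaly/.test(tagKey) ? 'anomaly' : 'static';
}

// Hypothetical validation: the operator must belong to the tag's alarm kind.
function operatorMatchesTag(tagKey: string, operator: string): boolean {
  const allowed =
    alarmKindForTag(tagKey) === 'anomaly' ? ANOMALY_OPERATORS : STATIC_OPERATORS;
  return allowed.indexOf(operator) !== -1;
}
```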

Static Threshold Alarms:

Anomaly Detection Alarms:

Supported Tag Values

Warning and Critical Thresholds:

Period:

Data Points to Alarm:

Number of Periods:

Statistic:

You can use any of the statistics defined in the CloudWatch documentation: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Statistics-definitions.html.
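
One practical detail: the CloudWatch `PutMetricAlarm` API takes the five basic statistics in its `Statistic` field and percentiles such as `p90` in a separate `ExtendedStatistic` field. The sketch below shows how a statistic string from a tag could be mapped onto the right field. `statisticFields` is a hypothetical helper, and it covers only the basic statistics and percentiles; other extended forms (for example trimmed mean) are left out and are rejected by this sketch.

```typescript
// Exactly one of these two fields is set on a PutMetricAlarm request.
type StatisticFields = {Statistic: string} | {ExtendedStatistic: string};

const STANDARD_STATISTICS: string[] = [
  'SampleCount',
  'Average',
  'Sum',
  'Minimum',
  'Maximum',
];

function statisticFields(statistic: string): StatisticFields {
  const name = statistic.trim();
  if (STANDARD_STATISTICS.indexOf(name) !== -1) {
    return {Statistic: name};
  }
  // Percentiles from p0.0 to p100, for example p90 or p99.9.
  if (/^p(100|\d{1,2}(\.\d+)?)$/.test(name)) {
    return {ExtendedStatistic: name};
  }
  throw new Error('Unsupported statistic in this sketch: "' + statistic + '"');
}
```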

Missing Data Treatment

Valid Comparison Operators

Static Threshold Alarms

Anomaly Detection Alarms

Tag Configuration for Supported Resources

A threshold shown as '-' in the tables below is undefined. If you do not provide the warning and critical threshold values in the tag value when you set the tag on the resource, the corresponding alarm is not created.
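
As an illustration of the '-' rule, the sketch below reads the first two fields of a tag value and reports which alarms would have a usable threshold. It assumes that the warning and critical thresholds are evaluated independently, so a value such as `-/1/...` produces only a critical alarm; that reading is an assumption based on the defaults listed in the tables, and `severitiesToCreate` is a hypothetical name.

```typescript
type Severity = 'warning' | 'critical';

// A threshold is usable when it is present and is not the "-" placeholder.
function isDefinedThreshold(raw: string | undefined): boolean {
  if (raw === undefined) return false;
  const text = raw.trim();
  return text !== '' && text !== '-';
}

// Hypothetical helper: which alarms have a threshold to alarm on.
function severitiesToCreate(tagValue: string): Severity[] {
  const parts = tagValue.split('/');
  const result: Severity[] = [];
  if (isDefinedThreshold(parts[0])) result.push('warning');
  if (isDefinedThreshold(parts[1])) result.push('critical');
  return result;
}
```

With this reading, `severitiesToCreate('-/-/60/2/Sum/2/GreaterThanThreshold/ignore')` returns an empty list, so no alarm is created until thresholds are supplied in the tag.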

Application Load Balancer (ALB)

| Tag | Default Value | Enabled By Default | Standard CloudWatch Metrics |
| --- | --- | --- | --- |
| autoalarm:4xx-count | "-/-/60/2/Sum/2/GreaterThanThreshold/ignore" | No | Yes |
| autoalarm:4xx-count-anomaly | "2/5/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
| autoalarm:5xx-count | "-/-/60/2/Sum/2/GreaterThanThreshold/ignore" | No | Yes |
| autoalarm:5xx-count-anomaly | "2/5/300/2/Average/2/GreaterThanUpperThreshold/ignore" | Yes | Yes |
| autoalarm:request-count | "-/-/60/2/Sum/2/GreaterThanThreshold/ignore" | No | Yes |
| autoalarm:request-count-anomaly | "3/5/300/2/Average/2/GreaterThanUpperThreshold/ignore" | No | Yes |

EC2

Some metrics require the CloudWatch Agent to be installed on the host.

| Tag | Default Value | Enabled By Default | Standard CloudWatch Metrics |
| --- | --- | --- | --- |
| autoalarm:cpu | "95/98/60/5/Maximum/5/GreaterThanThreshold/ignore" | Yes | Yes |
| autoalarm:cpu-anomaly | "2/5/60/5/Average/5/GreaterThanUpperThreshold/ignore" | No | Yes |
| autoalarm:memory | "95/98/60/10/Maximum/10/GreaterThanThreshold/ignore" | Yes | No (Requires CloudWatch Agent Install on Host) |
| autoalarm:memory-anomaly | "2/5/300/2/Average/2/GreaterThanUpperThreshold/ignore" | No | No (Requires CloudWatch Agent Install on Host) |
| autoalarm:storage | "90/95/60/2/Maximum/1/GreaterThanThreshold/ignore" | Yes | No (Requires CloudWatch Agent Install on Host) |
| autoalarm:storage-anomaly | "2/3/60/2/Average/1/GreaterThanUpperThreshold/ignore" | No | No (Requires CloudWatch Agent Install on Host) |

OpenSearch

| Tag | Default Value | Enabled By Default | Standard CloudWatch Metrics |
| --- | --- | --- | --- |
| autoalarm:4xx-errors | "100/300/300/1/Sum/1/GreaterThanThreshold/ignore" | No | Yes |
| autoalarm:4xx-errors-anomaly | "-/-/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
| autoalarm:5xx-errors | "10/50/300/1/Sum/1/GreaterThanThreshold/ignore" | Yes | Yes |
| autoalarm:5xx-errors-anomaly | "-/-/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
| autoalarm:cpu | "98/98/300/1/Maximum/1/GreaterThanThreshold/ignore" | Yes | Yes |
| autoalarm:cpu-anomaly | "2/2/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
| autoalarm:iops-throttle | "5/10/300/1/Sum/1/GreaterThanThreshold/ignore" | Yes | Yes |
| autoalarm:iops-throttle-anomaly | "-/-/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
| autoalarm:jvm-memory | "85/92/300/1/Maximum/1/GreaterThanThreshold/ignore" | Yes | Yes |
| autoalarm:jvm-memory-anomaly | "-/-/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
| autoalarm:read-latency | "0.03/0.08/60/2/Maximum/2/GreaterThanThreshold/ignore" | Yes | Yes |
| autoalarm:read-latency-anomaly | "2/6/300/2/Average/2/GreaterThanUpperThreshold/ignore" | No | Yes |
| autoalarm:search-latency | "1/2/300/2/Average/2/GreaterThanThreshold/ignore" | Yes | Yes |
| autoalarm:search-latency-anomaly | "-/-/300/2/Average/2/GreaterThanUpperThreshold/ignore" | Yes | Yes |
| autoalarm:snapshot-failure | "-/1/300/1/Sum/1/GreaterThanOrEqualToThreshold/ignore" | Yes | Yes |
| autoalarm:storage | "10000/5000/300/2/Average/2/LessThanThreshold/ignore" | Yes | Yes |
| autoalarm:storage-anomaly | "2/3/300/2/Average/2/GreaterThanUpperThreshold/ignore" | Yes | Yes |
| autoalarm:throughput-throttle | "40/60/60/2/Sum/2/GreaterThanThreshold/ignore" | No | Yes |
| autoalarm:throughput-throttle-anomaly | "3/5/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
| autoalarm:write-latency | "84/100/60/2/Maximum/2/GreaterThanThreshold/ignore" | Yes | Yes |
| autoalarm:write-latency-anomaly | "-/-/60/2/Average/2/GreaterThanUpperThreshold/ignore" | No | Yes |
| autoalarm:yellow-cluster | "-/1/300/1/Maximum/1/GreaterThanThreshold/ignore" | Yes | Yes |
| autoalarm:red-cluster | "-/1/60/1/Maximum/1/GreaterThanThreshold/ignore" | Yes | Yes |

SQS

| Tag | Default Value | Enabled By Default | Standard CloudWatch Metrics |
| --- | --- | --- | --- |
| autoalarm:age-of-oldest-message | "-/-/300/1/Maximum/1/GreaterThanThreshold/ignore" | No | Yes |
| autoalarm:age-of-oldest-message-anomaly | "-/-/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
| autoalarm:empty-receives | "-/-/300/1/Sum/1/GreaterThanThreshold/ignore" | No | Yes |
| autoalarm:empty-receives-anomaly | "-/-/300/1/Sum/1/GreaterThanUpperThreshold/ignore" | No | Yes |
| autoalarm:messages-deleted | "-/-/300/1/Sum/1/GreaterThanThreshold/ignore" | No | Yes |
| autoalarm:messages-deleted-anomaly | "-/-/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
| autoalarm:messages-not-visible | "-/-/300/1/Maximum/1/GreaterThanThreshold/ignore" | No | Yes |
| autoalarm:messages-not-visible-anomaly | "-/-/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
| autoalarm:messages-received | "-/-/300/1/Sum/1/GreaterThanThreshold/ignore" | No | Yes |
| autoalarm:messages-received-anomaly | "-/-/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
| autoalarm:messages-sent | "-/-/300/1/Sum/1/GreaterThanThreshold/ignore" | No | Yes |
| autoalarm:messages-sent-anomaly | "1/1/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
| autoalarm:messages-visible | "-/-/300/1/Maximum/1/GreaterThanThreshold/ignore" | No | Yes |
| autoalarm:messages-visible-anomaly | "-/-/300/1/Average/1/GreaterThanUpperThreshold/ignore" | Yes | Yes |
| autoalarm:sent-message-size | "-/-/300/1/Average/1/GreaterThanThreshold/ignore" | No | Yes |
| autoalarm:sent-message-size-anomaly | "-/-/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |

Target Groups (TG)

| Tag | Default Value | Enabled By Default | Standard CloudWatch Metrics |
| --- | --- | --- | --- |
| autoalarm:4xx-count | "-/-/60/2/Sum/1/GreaterThanThreshold/ignore" | No | Yes |
| autoalarm:4xx-count-anomaly | "-/-/60/2/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
| autoalarm:5xx-count | "-/-/60/2/Sum/1/GreaterThanThreshold/ignore" | No | Yes |
| autoalarm:5xx-count-anomaly | "3/6/60/2/Average/1/GreaterThanUpperThreshold/ignore" | Yes | Yes |
| autoalarm:response-time | "3/5/60/2/p90/2/GreaterThanThreshold/ignore" | No | Yes |
| autoalarm:response-time-anomaly | "2/5/300/2/Average/2/GreaterThanUpperThreshold/ignore" | No | Yes |
| autoalarm:unhealthy-host-count | "-/1/60/2/Maximum/2/GreaterThanThreshold/ignore" | Yes | Yes |

Default Alarm Behavior

AutoAlarm comes with default alarm configurations for various metrics. These default alarms are created when the corresponding tags are not present on the resources. The default alarms are designed to provide basic monitoring out-of-the-box. However, it is recommended to customize the alarms based on your specific requirements.
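
The relationship between a default value and a tag on the resource can be pictured as a positional merge: each field set in the tag replaces the default in the same position, and anything left out keeps the default. The sketch below illustrates that idea only. That AutoAlarm fills omitted trailing fields from the default is an assumption, and `resolveTagValue` is a hypothetical helper, not the project's implementation.

```typescript
// Hypothetical merge of a tag value over a default value, field by field.
// Both strings use the "/"-separated schema described earlier in this README.
function resolveTagValue(defaultValue: string, tagValue?: string): string {
  const defaults = defaultValue.split('/');
  // A missing tag behaves like an empty override, so every default is kept.
  const overrides = (tagValue ?? '').split('/');
  return defaults
    .map((fallback, index) => {
      const override = (overrides[index] ?? '').trim();
      return override !== '' ? override : fallback;
    })
    .join('/');
}
```

For example, with the EC2 CPU default `95/98/60/5/Maximum/5/GreaterThanThreshold/ignore`, a tag value of `80/90` resolves to `80/90/60/5/Maximum/5/GreaterThanThreshold/ignore`, while a resource without the tag keeps the default unchanged.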

IAM Role and Permissions

The Lambda execution role requires specific permissions to interact with AWS services. These permissions are created when the CDK project is deployed.

Limitations

Please refer to the code files provided for more detailed information on the implementation and usage of the AutoAlarm system.