AutoAlarm is an AWS Lambda-based automation tool designed to dynamically manage CloudWatch alarms for ALBs, EC2 Instances, OpenSearch, SQS, and Target Groups based on instance states and specific tag values. The project uses AWS SDK for JavaScript v3, the AWS CDK for infrastructure deployment, and is integrated with AWS Lambda and CloudWatch for automated cloud observability.
The AutoAlarm project is designed to be deployed with minimal configuration, creating all necessary AWS resources for full functionality. Upon deployment, the project automatically provisions the following components:
This architecture ensures that AutoAlarm can monitor and manage resources out-of-the-box, including ALBs, EC2 instances, OpenSearch domains, SQS queues, and Target Groups. The system is fully event-driven, dynamically responding to state and tag changes across these resources.
Before you begin, ensure you have the following:
To set up and deploy the AutoAlarm project, follow these steps:
Clone the Repository
Start by cloning the project repository to your local machine:
git clone https://github.com/truemark/autoalarm.git
cd autoalarm
pnpm install
export AWS_REGION=<region>
export AWS_ACCESS_KEY_ID="<access-key-id>"
export AWS_SECRET_ACCESS_KEY="<secret-access-key>"
export AWS_SESSION_TOKEN="<aws-session-token>"
cdk bootstrap
pnpm build
cd cdk ; cdk deploy AutoAlarm
Amazon CloudWatch is utilized for monitoring and alerting. CloudWatch alarms are created, updated, or deleted by the Lambda function to track various metrics such as CPU utilization, memory usage, storage usage, ALB metrics, and Target Group metrics. CloudWatch Logs are also used to store log data generated by the Lambda function for debugging and auditing purposes.
Amazon EC2 is the primary service monitored by AutoAlarm. The Lambda function responds to state change notifications and tag change events for EC2 instances, creating or updating alarms based on the instance's state and tags.
Amazon EventBridge is used to route events to the Lambda function. Rules are set up to listen for specific events such as state changes, tag changes, and other resource events. These events trigger the Lambda function to perform the necessary alarm management actions. Additionally, when using TrueMark's Enterprise Operation Center, EventBridge is monitored for triggered alarms, eliminating the need to send them to a specific SNS topic.
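As an illustration of the routing described above, the following sketch writes two EventBridge event patterns as plain objects and checks sample events against them. The `source` and `detail-type` strings are the ones AWS emits for EC2 state changes and tag changes. The choice of states (`running`, `terminated`) and the `matches` helper are assumptions made for this example and are not taken from the AutoAlarm source. The helper covers only exact string matching, not the full EventBridge pattern syntax.

```typescript
// Event patterns in EventBridge's JSON pattern shape, written as plain objects.
interface Pattern {
  [key: string]: string[] | Pattern;
}

// Assumption: the states listed here are illustrative, not AutoAlarm's actual rule.
const ec2StateChange: Pattern = {
  source: ['aws.ec2'],
  'detail-type': ['EC2 Instance State-change Notification'],
  detail: { state: ['running', 'terminated'] },
};

const ec2TagChange: Pattern = {
  source: ['aws.tag'],
  'detail-type': ['Tag Change on Resource'],
  detail: { service: ['ec2'], 'resource-type': ['instance'] },
};

// Hypothetical helper: exact-match subset of pattern matching.
// An array lists the allowed string values; a nested object recurses.
function matches(pattern: Pattern, event: unknown): boolean {
  if (typeof event !== 'object' || event === null) return false;
  const record = event as Record<string, unknown>;
  return Object.keys(pattern).every((key) => {
    const expected = pattern[key];
    const actual = record[key];
    if (expected === undefined) return true;
    if (Array.isArray(expected)) {
      return typeof actual === 'string' && expected.indexOf(actual) !== -1;
    }
    return matches(expected, actual);
  });
}

const sampleEvent = {
  source: 'aws.ec2',
  'detail-type': 'EC2 Instance State-change Notification',
  detail: { 'instance-id': 'i-0123456789abcdef0', state: 'running' },
};

console.log(matches(ec2StateChange, sampleEvent));
console.log(matches(ec2TagChange, sampleEvent));
```

In a deployed stack, EventBridge evaluates these patterns itself. The local matcher is only there to make the pattern semantics easy to experiment with.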
Amazon SQS is used as a dead-letter queue for the Lambda function. If the Lambda function fails to process an event, the event is sent to an SQS queue for further investigation and retry.
Elastic Load Balancing (ELB) is monitored by AutoAlarm for events related to Application Load Balancers (ALBs) and Target Groups. The Lambda function creates, updates, or deletes alarms for ALB metrics and target group metrics based on events and tags.
Amazon OpenSearch Service is monitored by AutoAlarm for events related to OpenSearch domains. The Lambda function creates, updates, or deletes alarms for OpenSearch metrics based on events and tags.
IAM is used to define roles and policies that grant the necessary permissions to the Lambda function. These roles allow the function to interact with other AWS services such as CloudWatch, EC2, AMP, SQS, and EventBridge.
AWS Lambda is used to run the main AutoAlarm function, which processes service and tag events in addition to managing alarms. The Lambda function is responsible for handling the logic to create, update, or delete CloudWatch alarms and Prometheus rules based on tags and state changes.
The system is event-driven, responding to EC2 state change notifications and tag modification events. To manage alarms, ensure your supported resources are tagged according to the schema defined below.
Tags are used to customize CloudWatch alarms for various AWS services managed by AutoAlarm. By applying specific tags to resources such as EC2 instances, ALBs, Target Groups, SQS, and OpenSearch, you can define custom thresholds, evaluation periods, and other parameters for both static threshold and anomaly detection alarms. The following sections outline the default configurations and explain how you can modify them using these tags.
AutoAlarm comes with predefined default values for various alarms. These defaults are designed to provide general monitoring out-of-the-box. However, it is crucial that any enabled alarms are reviewed to ensure they align with the specific needs of your application and environment. Default alarms can be created by setting the autoalarm:enabled tag to true on the resource.
When setting up non-default alarms with tags, you must provide at least the first two values (warning and critical thresholds) for the tag to function correctly. If these thresholds are not supplied, the alarm will not be created unless defaults are defined.
The following schema is used to define tag values for all tags:
Warning Threshold / Critical Threshold / Period / Evaluation Periods / Statistic / Datapoints to Alarm / ComparisonOperator / Missing Data Treatment
Example:
autoalarm:cpu=80/95/60/5/Maximum/5/GreaterThanThreshold/ignore
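A minimal sketch of how a tag value in this schema could be split into named fields and combined with a default value. All names (`AlarmConfig`, `parseThreshold`, `resolveAlarmConfig`) are hypothetical and not taken from the AutoAlarm source. The sketch assumes that fields omitted from the tag fall back to the default in the same position, and that `-` marks an undefined threshold.

```typescript
// Field order follows the schema above. All names here are hypothetical.
interface AlarmConfig {
  warning: number | null; // null means no threshold is defined
  critical: number | null;
  period: number; // seconds
  evaluationPeriods: number;
  statistic: string;
  datapointsToAlarm: number;
  comparisonOperator: string;
  missingDataTreatment: string;
}

// A threshold field of '-' or '' is treated as undefined.
function parseThreshold(field: string): number | null {
  if (field === '' || field === '-') return null;
  const n = Number(field);
  return isFinite(n) ? n : null;
}

// Assumption: any field missing from the tag is taken from the default value.
// Returns null when neither threshold is defined, meaning no alarm is created.
function resolveAlarmConfig(tagValue: string, defaultValue: string): AlarmConfig | null {
  const tag = tagValue.split('/').map((s) => s.trim());
  const def = defaultValue.split('/').map((s) => s.trim());
  const pick = (i: number): string => {
    const fromTag = tag[i];
    if (fromTag !== undefined && fromTag !== '') return fromTag;
    const fromDefault = def[i];
    return fromDefault !== undefined ? fromDefault : '';
  };

  const warning = parseThreshold(pick(0));
  const critical = parseThreshold(pick(1));
  if (warning === null && critical === null) return null;

  return {
    warning,
    critical,
    period: Number(pick(2)),
    evaluationPeriods: Number(pick(3)),
    statistic: pick(4),
    datapointsToAlarm: Number(pick(5)),
    comparisonOperator: pick(6),
    missingDataTreatment: pick(7),
  };
}

const cpuDefault = '95/98/60/5/Maximum/5/GreaterThanThreshold/ignore';
console.log(resolveAlarmConfig('80/95/60/5/Maximum/5/GreaterThanThreshold/ignore', cpuDefault));
console.log(resolveAlarmConfig('80/95', cpuDefault)); // remaining fields come from cpuDefault
```

Returning `null` when both thresholds are undefined keeps the "no thresholds, no alarm" rule in one place instead of spreading it across callers.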
All anomaly alarm tags contain 'anomaly' in the tag name.
Static Threshold Alarms:
Anomaly Detection Alarms:
You can use the following statistics for alarms. For full definitions, see https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Statistics-definitions.html.
Average is the value of Sum/SampleCount during the specified period.
Percentile (p) indicates the relative standing of a value in a dataset. For example, p95 is the 95th percentile and means that 95 percent of the data within the period is lower than this value, and 5 percent of the data is higher than this value. Percentiles help you get a better understanding of the distribution of your metric data.
Trimmed mean (TM) is the mean of all values that fall between two specified boundaries. For example, tm90 calculates the average after removing the 10% of data points with the highest values. TM(2%:98%) calculates the average after removing the 2% lowest data points and the 2% highest data points. TM(150:1000) calculates the average after removing all data points that are lower than or equal to 150, or higher than 1000.
Interquartile mean (IQM) is the trimmed mean of the middle 50% of values. It is equivalent to TM(25%:75%).
Winsorized mean (WM) is similar to trimmed mean, but values outside the boundary are set to the boundary value instead of being removed. For example, wm98 calculates the average while treating the 2% of the highest values to be equal to the value at the 98th percentile. WM(10%:90%) calculates the average while treating the highest 10% of data points to be the value of the 90% boundary, and treating the lowest 10% of data points to be the value of the 10% boundary.
Percentile rank (PR) is the percentage of values that meet a fixed threshold. For example, PR(:300) returns the percentage of data points that have a value of 300 or less. PR(100:2000) returns the percentage of data points that have a value between 100 and 2000. Percentile rank is exclusive on the lower bound and inclusive on the upper bound.
Trimmed count (TC) is the number of data points in the chosen range for a trimmed mean statistic. For example, tc90 returns the number of data points not including any data points that fall in the highest 10% of the values. TC(0.005:0.030) returns the number of data points with values between 0.005 (exclusive) and 0.030 (inclusive).
Trimmed sum (TS) is the sum of the values of data points in a chosen range for a trimmed mean statistic. It is equivalent to (Trimmed Mean) * (Trimmed count). For example, ts90 returns the sum of the data points not including any data points that fall in the highest 10% of the values. TS(80%:) returns the sum of the data point values, not including any data points with values in the lowest 80% of the range of values.
Static Threshold Alarms
GreaterThanOrEqualToThreshold
GreaterThanThreshold
LessThanThreshold
LessThanOrEqualToThreshold
Anomaly Detection Alarms
GreaterThanUpperThreshold
LessThanLowerOrGreaterThanUpperThreshold
LessThanLowerThreshold
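The statistic and comparison operator fields of a tag value can be checked before an alarm is created. The sketch below is one way to do that, assuming (as stated earlier) that anomaly tags are recognized by 'anomaly' in the tag name. The function names and the regular expressions are hypothetical and deliberately loose. They approximate the statistic notations listed above and are not CloudWatch's own validation rules.

```typescript
const STATIC_OPERATORS = [
  'GreaterThanOrEqualToThreshold',
  'GreaterThanThreshold',
  'LessThanThreshold',
  'LessThanOrEqualToThreshold',
];

const ANOMALY_OPERATORS = [
  'GreaterThanUpperThreshold',
  'LessThanLowerOrGreaterThanUpperThreshold',
  'LessThanLowerThreshold',
];

// Anomaly detection tags carry 'anomaly' in the tag name.
function isAnomalyTag(tagName: string): boolean {
  return tagName.indexOf('anomaly') !== -1;
}

// The operator must come from the list that matches the alarm type.
function isValidOperator(tagName: string, operator: string): boolean {
  const allowed = isAnomalyTag(tagName) ? ANOMALY_OPERATORS : STATIC_OPERATORS;
  return allowed.indexOf(operator) !== -1;
}

const BASIC_STATISTICS = ['SampleCount', 'Sum', 'Average', 'Minimum', 'Maximum', 'IQM'];

// Short forms such as p95, tm90, wm98, tc90, ts90 (simplified, hypothetical pattern).
const SHORT_FORM = /^(p|tm|wm|tc|ts)\d{1,2}(\.\d+)?$/;

// Range forms such as TM(2%:98%), PR(:300), TS(80%:) (simplified, hypothetical pattern).
const RANGE_FORM = /^(TM|WM|TC|TS|PR)\((\d+(\.\d+)?%?)?:(\d+(\.\d+)?%?)?\)$/;

function isValidStatistic(statistic: string): boolean {
  if (BASIC_STATISTICS.indexOf(statistic) !== -1) return true;
  if (SHORT_FORM.test(statistic)) return true;
  const range = RANGE_FORM.exec(statistic);
  // At least one of the two bounds must be present.
  return range !== null && (range[2] !== undefined || range[4] !== undefined);
}

console.log(isValidOperator('autoalarm:cpu', 'GreaterThanThreshold'));
console.log(isValidOperator('autoalarm:cpu-anomaly', 'GreaterThanThreshold'));
console.log(isValidStatistic('TM(2%:98%)'));
```

Rejecting a mismatched operator early gives a clearer error than letting the CloudWatch API call fail later.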
A threshold shown as '-' in the tables below is undefined and has no default. If the warning and critical threshold values are not provided in the tag value when setting the tag on the resource, the alarm is not created.
Tag | Default Value | Enabled By Default | Standard CloudWatch Metrics |
---|---|---|---|
autoalarm:4xx-count | "-/-/60/2/Sum/2/GreaterThanThreshold/ignore" | No | Yes |
autoalarm:4xx-count-anomaly | "2/5/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
autoalarm:5xx-count | "-/-/60/2/Sum/2/GreaterThanThreshold/ignore" | No | Yes |
autoalarm:5xx-count-anomaly | "2/5/300/2/Average/2/GreaterThanUpperThreshold/ignore" | Yes | Yes |
autoalarm:request-count | "-/-/60/2/Sum/2/GreaterThanThreshold/ignore" | No | Yes |
autoalarm:request-count-anomaly | "3/5/300/2/Average/2/GreaterThanUpperThreshold/ignore" | No | Yes |
Some metrics require the CloudWatch Agent to be installed on the host.
Tag | Default Value | Enabled By Default | Standard CloudWatch Metrics |
---|---|---|---|
autoalarm:cpu | "95/98/60/5/Maximum/5/GreaterThanThreshold/ignore" | Yes | Yes |
autoalarm:cpu-anomaly | "2/5/60/5/Average/5/GreaterThanUpperThreshold/ignore" | No | Yes |
autoalarm:memory | "95/98/60/10/Maximum/10/GreaterThanThreshold/ignore" | Yes | No (Requires CloudWatch Agent Install on Host) |
autoalarm:memory-anomaly | "2/5/300/2/Average/2/GreaterThanUpperThreshold/ignore" | No | No (Requires CloudWatch Agent Install on Host) |
autoalarm:storage | "90/95/60/2/Maximum/1/GreaterThanThreshold/ignore" | Yes | No (Requires CloudWatch Agent Install on Host) |
autoalarm:storage-anomaly | "2/3/60/2/Average/1/GreaterThanUpperThreshold/ignore" | No | No (Requires CloudWatch Agent Install on Host) |
Tag | Default Value | Enabled By Default | Standard CloudWatch Metrics |
---|---|---|---|
autoalarm:4xx-errors | "100/300/300/1/Sum/1/GreaterThanThreshold/ignore" | No | Yes |
autoalarm:4xx-errors-anomaly | "-/-/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
autoalarm:5xx-errors | "10/50/300/1/Sum/1/GreaterThanThreshold/ignore" | Yes | Yes |
autoalarm:5xx-errors-anomaly | "-/-/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
autoalarm:cpu | "98/98/300/1/Maximum/1/GreaterThanThreshold/ignore" | Yes | Yes |
autoalarm:cpu-anomaly | "2/2/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
autoalarm:iops-throttle | "5/10/300/1/Sum/1/GreaterThanThreshold/ignore" | Yes | Yes |
autoalarm:iops-throttle-anomaly | "-/-/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
autoalarm:jvm-memory | "85/92/300/1/Maximum/1/GreaterThanThreshold/ignore" | Yes | Yes |
autoalarm:jvm-memory-anomaly | "-/-/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
autoalarm:read-latency | "0.03/0.08/60/2/Maximum/2/GreaterThanThreshold/ignore" | Yes | Yes |
autoalarm:read-latency-anomaly | "2/6/300/2/Average/2/GreaterThanUpperThreshold/ignore" | No | Yes |
autoalarm:search-latency | "1/2/300/2/Average/2/GreaterThanThreshold/ignore" | Yes | Yes |
autoalarm:search-latency-anomaly | "-/-/300/2/Average/2/GreaterThanUpperThreshold/ignore" | Yes | Yes |
autoalarm:snapshot-failure | "-/1/300/1/Sum/1/GreaterThanOrEqualToThreshold/ignore" | Yes | Yes |
autoalarm:storage | "10000/5000/300/2/Average/2/LessThanThreshold/ignore" | Yes | Yes |
autoalarm:storage-anomaly | "2/3/300/2/Average/2/GreaterThanUpperThreshold/ignore" | Yes | Yes |
autoalarm:throughput-throttle | "40/60/60/2/Sum/2/GreaterThanThreshold/ignore" | No | Yes |
autoalarm:throughput-throttle-anomaly | "3/5/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
autoalarm:write-latency | "84/100/60/2/Maximum/2/GreaterThanThreshold/ignore" | Yes | Yes |
autoalarm:write-latency-anomaly | "-/-/60/2/Average/2/GreaterThanUpperThreshold/ignore" | No | Yes |
autoalarm:yellow-cluster | "-/1/300/1/Maximum/1/GreaterThanThreshold/ignore" | Yes | Yes |
autoalarm:red-cluster | "-/1/60/1/Maximum/1/GreaterThanThreshold/ignore" | Yes | Yes |
Tag | Default Value | Enabled By Default | Standard CloudWatch Metrics |
---|---|---|---|
autoalarm:age-of-oldest-message | "-/-/300/1/Maximum/1/GreaterThanThreshold/ignore" | No | Yes |
autoalarm:age-of-oldest-message-anomaly | "-/-/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
autoalarm:empty-receives | "-/-/300/1/Sum/1/GreaterThanThreshold/ignore" | No | Yes |
autoalarm:empty-receives-anomaly | "-/-/300/1/Sum/1/GreaterThanUpperThreshold/ignore" | No | Yes |
autoalarm:messages-deleted | "-/-/300/1/Sum/1/GreaterThanThreshold/ignore" | No | Yes |
autoalarm:messages-deleted-anomaly | "-/-/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
autoalarm:messages-not-visible | "-/-/300/1/Maximum/1/GreaterThanThreshold/ignore" | No | Yes |
autoalarm:messages-not-visible-anomaly | "-/-/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
autoalarm:messages-received | "-/-/300/1/Sum/1/GreaterThanThreshold/ignore" | No | Yes |
autoalarm:messages-received-anomaly | "-/-/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
autoalarm:messages-sent | "-/-/300/1/Sum/1/GreaterThanThreshold/ignore" | No | Yes |
autoalarm:messages-sent-anomaly | "1/1/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
autoalarm:messages-visible | "-/-/300/1/Maximum/1/GreaterThanThreshold/ignore" | No | Yes |
autoalarm:messages-visible-anomaly | "-/-/300/1/Average/1/GreaterThanUpperThreshold/ignore" | Yes | Yes |
autoalarm:sent-message-size | "-/-/300/1/Average/1/GreaterThanThreshold/ignore" | No | Yes |
autoalarm:sent-message-size-anomaly | "-/-/300/1/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
Tag | Default Value | Enabled By Default | Standard CloudWatch Metrics |
---|---|---|---|
autoalarm:4xx-count | "-/-/60/2/Sum/1/GreaterThanThreshold/ignore" | No | Yes |
autoalarm:4xx-count-anomaly | "-/-/60/2/Average/1/GreaterThanUpperThreshold/ignore" | No | Yes |
autoalarm:5xx-count | "-/-/60/2/Sum/1/GreaterThanThreshold/ignore" | No | Yes |
autoalarm:5xx-count-anomaly | "3/6/60/2/Average/1/GreaterThanUpperThreshold/ignore" | Yes | Yes |
autoalarm:response-time | "3/5/60/2/p90/2/GreaterThanThreshold/ignore" | No | Yes |
autoalarm:response-time-anomaly | "2/5/300/2/Average/2/GreaterThanUpperThreshold/ignore" | No | Yes |
autoalarm:unhealthy-host-count | "-/1/60/2/Maximum/2/GreaterThanThreshold/ignore" | Yes | Yes |
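The "Enabled By Default" column can be read as a selection rule. The sketch below applies it to four rows copied from the table directly above. It assumes that an alarm is managed when the resource has autoalarm:enabled set to true and either the alarm is enabled by default or its tag is present on the resource. The names `DefaultEntry`, `TABLE_DEFAULTS`, and `selectAlarmTags` are hypothetical, and the real selection logic in AutoAlarm may differ.

```typescript
interface DefaultEntry {
  value: string;
  enabledByDefault: boolean;
}

// Four rows copied from the table directly above (a subset, for brevity).
const TABLE_DEFAULTS: Record<string, DefaultEntry> = {
  'autoalarm:5xx-count': {
    value: '-/-/60/2/Sum/1/GreaterThanThreshold/ignore',
    enabledByDefault: false,
  },
  'autoalarm:5xx-count-anomaly': {
    value: '3/6/60/2/Average/1/GreaterThanUpperThreshold/ignore',
    enabledByDefault: true,
  },
  'autoalarm:response-time': {
    value: '3/5/60/2/p90/2/GreaterThanThreshold/ignore',
    enabledByDefault: false,
  },
  'autoalarm:unhealthy-host-count': {
    value: '-/1/60/2/Maximum/2/GreaterThanThreshold/ignore',
    enabledByDefault: true,
  },
};

// Assumption: nothing is managed unless autoalarm:enabled is 'true'.
// An alarm is selected when it is enabled by default or explicitly tagged.
function selectAlarmTags(
  resourceTags: Record<string, string>,
  defaults: Record<string, DefaultEntry>,
): string[] {
  if (resourceTags['autoalarm:enabled'] !== 'true') return [];
  return Object.keys(defaults).filter((name) => {
    const entry = defaults[name];
    const explicitlyTagged = resourceTags[name] !== undefined;
    return explicitlyTagged || (entry !== undefined && entry.enabledByDefault);
  });
}

console.log(selectAlarmTags({ 'autoalarm:enabled': 'true' }, TABLE_DEFAULTS));
console.log(
  selectAlarmTags(
    { 'autoalarm:enabled': 'true', 'autoalarm:response-time': '1/2' },
    TABLE_DEFAULTS,
  ),
);
```

Keeping the table as data rather than branching logic makes it straightforward to review defaults against the documentation.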
AutoAlarm comes with default alarm configurations for various metrics. These default alarms are created when the corresponding tags are not present on the resources. The default alarms are designed to provide basic monitoring out-of-the-box. However, it is recommended to customize the alarms based on your specific requirements.
To enable default alarms, ensure the autoalarm:enabled tag is set to true on the resource.
To disable default alarms, set the autoalarm:enabled tag to false on the resource.
The Lambda execution role requires specific permissions to interact with AWS services. These are created when the CDK project is deployed:
EC2 and CloudWatch:
Actions: ec2:DescribeInstances, ec2:DescribeTags, cloudwatch:PutMetricAlarm, cloudwatch:DeleteAlarms, cloudwatch:DescribeAlarms, cloudwatch:ListMetrics
Resources: *
CloudWatch Logs:
Actions: logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents
Resources: *
Elastic Load Balancing:
Actions: elasticloadbalancing:DescribeLoadBalancers, elasticloadbalancing:DescribeTargetGroups, elasticloadbalancing:DescribeTags, elasticloadbalancing:DescribeTargetHealth
Resources: *
SQS:
Actions: sqs:GetQueueAttributes, sqs:ListQueues, sqs:ListQueueTags, sqs:TagQueue
Resources: *
OpenSearch:
Actions: es:DescribeElasticsearchDomain, es:ListTags, es:ListDomainNames
Resources: *
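For reference, the permission groups above can be expressed as an IAM policy document. The sketch below builds that document as a plain object, assuming the `*` listed under each group denotes the resource scope. The `PolicyStatement` type and variable names are hypothetical. The CDK project creates the real policy on deployment, so this is only an illustration of its likely shape.

```typescript
interface PolicyStatement {
  Effect: 'Allow';
  Action: string[];
  Resource: string;
}

// One statement per permission group listed above.
const statements: PolicyStatement[] = [
  {
    Effect: 'Allow',
    Action: [
      'ec2:DescribeInstances',
      'ec2:DescribeTags',
      'cloudwatch:PutMetricAlarm',
      'cloudwatch:DeleteAlarms',
      'cloudwatch:DescribeAlarms',
      'cloudwatch:ListMetrics',
    ],
    Resource: '*',
  },
  {
    Effect: 'Allow',
    Action: ['logs:CreateLogGroup', 'logs:CreateLogStream', 'logs:PutLogEvents'],
    Resource: '*',
  },
  {
    Effect: 'Allow',
    Action: [
      'elasticloadbalancing:DescribeLoadBalancers',
      'elasticloadbalancing:DescribeTargetGroups',
      'elasticloadbalancing:DescribeTags',
      'elasticloadbalancing:DescribeTargetHealth',
    ],
    Resource: '*',
  },
  {
    Effect: 'Allow',
    Action: ['sqs:GetQueueAttributes', 'sqs:ListQueues', 'sqs:ListQueueTags', 'sqs:TagQueue'],
    Resource: '*',
  },
  {
    Effect: 'Allow',
    Action: ['es:DescribeElasticsearchDomain', 'es:ListTags', 'es:ListDomainNames'],
    Resource: '*',
  },
];

const policyDocument = { Version: '2012-10-17', Statement: statements };

// Print the policy as JSON, for example to compare against the deployed role.
console.log(JSON.stringify(policyDocument, null, 2));
```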
Please refer to the code files provided for more detailed information on the implementation and usage of the AutoAlarm system.