opentelekomcloud / terraform-provider-opentelekomcloud

Terraform OpenTelekomCloud provider
https://registry.terraform.io/providers/opentelekomcloud/opentelekomcloud/latest
Mozilla Public License 2.0

opentelekomcloud_ces_alarmrule: alarm does not trigger #2576

Closed deem1978 closed 2 months ago

deem1978 commented 3 months ago

Hello, I really hope that someone can help me and tell me what I am doing wrong, because I really do not understand it. This is my file structure:

.
├── README.md
├── helpers
│   ├── Makefile_terraform
│   └── mv_state
├── modules
│   ├── cloud-eye
│   │   ├── main.tf
│   │   ├── terraform.tf
│   │   └── variables.tf
│   ├── terraform-aws-route53-record
│   │   ├── README.md
│   │   ├── main.tf
│   │   ├── terraform.tf
│   │   └── variables.tf
│   └── terraform-otc-rds
│       ├── README.md
│       ├── data.tf
│       ├── main.tf
│       ├── outputs.tf
│       ├── terraform.tf
│       └── variables.tf
└── projects
    ├── ems_core
    │   ├── environments
    │   │   └── dev
    │   │       ├── Makefile
    │   │       ├── main.tf
    │   │       └── terraform.tf
    │   └── modules
    │       ├── cloud_eye
    │       │   ├── main.tf
    │       │   └── variables.tf
    │       └── rds
    │           ├── main.tf
    │           ├── outputs.tf
    │           └── variables.tf

So I have the module terraform-otc-rds, which deploys RDS instances; some general parameters are defined there. At the same directory level I have the module cloud-eye, where I define some general parameters for alerting too. Then I have a directory projects. Inside each project there is a subdirectory environments (dev, stage, prod), where I define environment-specific values for RDS and Cloud Eye, and next to it another subdirectory modules, where I define project-specific values for RDS and Cloud Eye.

This is modules/cloud-eye/main.tf:

locals {
  # MS teams email endpoint for alerts
  alerts_endpoint_email = {
    dev   = "9d8fb980.office.onmicrosoft.com@emea.teams.ms"
  }
  # MS teams email endpoint for notifications
  notifications_endpoint_email = {
    dev   = "26053a02.office.onmicrosoft.com@emea.teams.ms",
  }

  smn_topics = {
    alerts        = opentelekomcloud_smn_topic_v2.alerts.id
    notifications = opentelekomcloud_smn_topic_v2.notifications.id
  }

  dimension_name = {
    mysql      = "rds_instance_id"
    postgresql = "postgresql_instance_id"
  }

  module_resource_tags = {
    environment = replace(basename(abspath(path.root)), "_", "-")
    owner       = "aa"
    terraform   = "true"
  }
  resource_tags = local.module_resource_tags
}

resource "opentelekomcloud_ces_alarmrule" "high_cpu_usage_test" {
  alarm_name           = "${var.instance_name}_high_cpu_usage"
  alarm_level          = var.alarm_level
  alarm_enabled        = var.alarm_enabled
  alarm_action_enabled = var.alarm_action_enabled

  metric {
    namespace   = "SYS.RDS"
    metric_name = "rds001_cpu_util"

    dimensions {
      name  = "rds_instance_id"
      value = var.instance_id
    }
  }

  #metric {
  #  namespace   = "SYS.RDS"
  #  metric_name = "rds001_cpu_util"
    # create multiple dimensions for multi-az deployments
  #  dynamic "dimensions" {
      # FIXME: see above, enforce single, until OTC fixes multi-dimensions
      # for_each = local.is_multiaz ? [0, 1] : [0]
  #    for_each = local.is_multiaz ? [0] : [0]
  #    content {
  #      name  = local.dimension_name[lower(var.db_type)]
  #      value = opentelekomcloud_rds_instance_v3.main.nodes[dimensions.value].id
  #    }
  #  }

  condition {
    period              = 300
    filter              = "max"
    comparison_operator = ">"
    value               = var.high_cpu_usage_threshold
    unit                = "%"
    count               = 1
  }
  alarm_actions {
    type              = "notification"
    notification_list = [opentelekomcloud_smn_topic_v2.alerts.id]
  }
  ok_actions {
    type              = "notification"
    notification_list = [opentelekomcloud_smn_topic_v2.alerts.id]
  }
}

# alerting channel
resource "opentelekomcloud_smn_topic_v2" "alerts" {
  name         = "${var.instance_name}_alerts"
  display_name = "Alerting topic for RDS ${var.instance_id}"
  tags         = local.resource_tags
}

resource "opentelekomcloud_smn_subscription_v2" "alerts" {
  topic_urn = opentelekomcloud_smn_topic_v2.alerts.id
  endpoint  = local.alerts_endpoint_email.dev #[split("-", local.environment)[0]] # allow for env names like dev-single
  protocol  = "email"
}

# notifications channel
resource "opentelekomcloud_smn_topic_v2" "notifications" {
  name         = "${var.instance_name}_notifications"
  display_name = "Notifications topic for RDS ${var.instance_id}"
  tags         = local.resource_tags
}

resource "opentelekomcloud_smn_subscription_v2" "notifications" {
  topic_urn = opentelekomcloud_smn_topic_v2.notifications.id
  endpoint  = local.notifications_endpoint_email.dev # [split("-", local.environment)[0]] # allow for env names like dev-single
  protocol  = "email"
}

modules/cloud-eye/variables.tf:

variable "alarm_level" {
  type = number
  description = "alarm_level: 1-4, represents: critical, major, minor, informational"
}

variable "high_cpu_usage_threshold" {
  type = number
  description = "high_cpu_usage_threshold: threshold for high cpu usage"
}

variable "alarm_enabled" {
  type = bool
  default = true
}

variable "alarm_action_enabled" {
  type = bool
  default = true
}

variable "instance_id" {
  description = "The ID of the instance to monitor"
  type        = string
}

variable "instance_name" {
  description = "The name of the instance to monitor"
  type        = string 
}

projects/ems_core/modules/cloud_eye/main.tf:

module "name" {
  source = "../../../../modules/cloud-eye"

  alarm_level              = var.alarm_level
  high_cpu_usage_threshold = var.high_cpu_usage_threshold
  alarm_enabled            = var.alarm_enabled
  alarm_action_enabled     = var.alarm_action_enabled
  instance_id              = var.instance_id
  instance_name            = var.instance_name
}

projects/ems_core/modules/cloud_eye/variables.tf:

variable "alarm_level" {
  type = number
  description = "alarm_level: 1-4, represents: critical, major, minor, informational"
}

variable "high_cpu_usage_threshold" {
  type = number
  description = "high_cpu_usage_threshold: threshold for high cpu usage"
}

variable "alarm_enabled" {
  type = bool
  default = true
}

variable "alarm_action_enabled" {
  type = bool
  default = true
}

variable "instance_id" {
  description = "The ID of the instance to monitor"
  type        = string
}

variable "instance_name" {
  description = "The name of the instance to monitor"
  type        = string 
}

projects/ems_core/environments/dev/main.tf:

module "rds" {
  source = "../../modules/rds"

  # configuration for service module
  flavor                   = "rds.mysql.s1.medium"
  db_version               = "8.0"
  db_parameters            = {} # optional: set env specific db parameters
  #alarm_level              = 2
  #high_cpu_usage_threshold = 2
  #alarm_enabled            = true
  #alarm_action_enabled     = true

  volume_type = "COMMON"

  allow_vpn_and_office_ingress = true
  custom_allowed_ingress = {
    private_subnet_1 = "10.225.43.0/24"
    private_subnet_2 = "10.225.44.0/24"
    private_subnet_3 = "10.225.45.0/24"
  }
}

module "cloud_eye" {
  source = "../../modules/cloud_eye"

  # configuration for service module
  alarm_level              = 2
  high_cpu_usage_threshold = 2
  alarm_enabled            = true
  alarm_action_enabled     = true
  instance_id              = module.rds.instance_id
  instance_name            = module.rds.instance_name
}

# outputs
output "rds_instance_name" {
  value = module.rds.instance_name
}

output "rds_instance_id" {
  value = module.rds.instance_id
}

output "rds_instance_ip" {
  value = module.rds.instance_ip
}
output "rds_instance_az" {
  description = "rds instance az"
  value       = module.rds.instance_az
}

output "db_user_name" {
  description = "default db username"
  value       = module.rds.db_user_name
}

output "mysql_helper" {
  description = "mysql connect string with initial password"
  value       = module.rds.mysql_helper
  sensitive   = true
}
output "op_cli_create" {
  description = "op cli command to create 1password item, show with: tf output op_cli_create"
  value       = module.rds.op_cli_create
  sensitive   = true
}

The alarm rule, the topic and the subscription are created, but the alarm never triggers. I manually created an alarm with the same values (which does trigger), and as far as I can see there are some differences.

Alarm created with Terraform: the ID is correct, but the instance name and IP are missing: Screenshot 2024-07-02 at 08 23 13

Under "Monitored Object => Add Object" no instance is listed: Screenshot 2024-07-02 at 08 23 27

The same alarm created manually (this one works): Screenshot 2024-07-02 at 08 23 50

Screenshot 2024-07-02 at 08 24 00

Could someone be so kind as to look into my issue and explain where I went wrong? Thanks in advance, David

deem1978 commented 3 months ago

I noticed something strange when I change the condition block to apply the average filter:

locals {
  module_resource_tags = {
    environment = replace(basename(abspath(path.root)), "_", "-")
    owner       = "aa"
    terraform   = "true"
  }
  resource_tags = local.module_resource_tags
}

resource "opentelekomcloud_ces_alarmrule" "high_cpu_usage_test" {
  alarm_name           = "${var.instance_name}_high_cpu_usage"
  alarm_level          = var.alarm_level
  alarm_enabled        = var.alarm_enabled
  alarm_action_enabled = var.alarm_action_enabled

  metric {
    namespace   = var.metric_namespace
    metric_name = var.metric_name

    dimensions {
      name  = var.dimension_name
      value = var.instance_id
    }
  }

  condition {
    period              = 1
    filter              = "average"
    comparison_operator = ">"
    value               = var.high_cpu_usage_threshold
    unit                = "%"
    count               = 5
    alarm_frequency     = 300
  }
  alarm_actions {
    type              = "notification"
    notification_list = [opentelekomcloud_smn_topic_v2.alerts.id]
  }
  ok_actions {
    type              = "notification"
    notification_list = [opentelekomcloud_smn_topic_v2.alerts.id]
  }
}

# alerting channel
resource "opentelekomcloud_smn_topic_v2" "alerts" {
  name         = "${var.instance_name}_alerts"
  display_name = "Alerting topic for Team A&A RDS ${var.instance_name}"
  tags         = local.resource_tags
}

resource "opentelekomcloud_smn_subscription_v2" "alerts" {
  topic_urn = opentelekomcloud_smn_topic_v2.alerts.id
  endpoint  = var.alerts_endpoint_email
  protocol  = "email"
}

then the metric name is completely absent, and instead of "average" it displays "raw data": Screenshot 2024-07-03 at 10 49 30
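
One possible reading of the "raw data" display, offered as an assumption and not confirmed anywhere in this thread: the Cloud Eye documentation describes period = 1 as evaluating raw metric data, in which case an aggregation filter such as "average" may simply not apply. A minimal sketch of a rule that averages over a five-minute window instead; the resource name, alarm name, and threshold are hypothetical, and a configured opentelekomcloud provider is assumed:

```hcl
# Hypothetical inputs, mirroring the variables used in the post above.
variable "instance_id" {
  description = "ID of the monitored RDS instance"
  type        = string
}

variable "dimension_name" {
  description = "Dimension name expected by Cloud Eye for this metric"
  type        = string
}

resource "opentelekomcloud_ces_alarmrule" "avg_cpu_sketch" {
  alarm_name = "avg_cpu_sketch" # placeholder name

  metric {
    namespace   = "SYS.RDS"
    metric_name = "rds001_cpu_util"

    dimensions {
      name  = var.dimension_name
      value = var.instance_id
    }
  }

  condition {
    # Assumption: a 300-second period lets the "average" filter aggregate
    # over a window, whereas period = 1 is documented as raw data.
    period              = 300
    filter              = "average"
    comparison_operator = ">"
    value               = 80 # placeholder threshold in percent
    unit                = "%"
    count               = 1
  }
}
```

Whether this also restores the missing metric name in the console is not something the thread establishes.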

anton-sidelnikov commented 3 months ago

Hi @deem1978, I will create a bug in our internal Jira. Sorry, but the CES API is really raw; some parts described in the docs do not work at all. I can't help at this time. v2.0 will probably be released soon with the necessary fixes.

anton-sidelnikov commented 3 months ago

Internal issue: https://jira.tsi-dev.otc-service.com/browse/BM-5382

deem1978 commented 3 months ago

Hi @anton-sidelnikov, and thank you for the reply. But do you see any configuration, definition or design error in my config? If not, then OK, I agree with you that CES is really buggy, but (IMHO) it should not be released as a "usable" resource if it is impossible to use it...

deem1978 commented 2 months ago

I just found out today that the problem was the dimension name: after setting dimension_name = "rds_cluster_id", everything works as it should. I found this out by importing the manually created alarm into my state; running terraform plan then showed me that this parameter was different: in my state it was "rds_instance_id", but on the real alarm in the infrastructure it was "rds_cluster_id". So I changed the value for all my alarms and everything started to work. As far as I have seen, this value is not mentioned anywhere in the documentation.
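
The diagnosis described above can be sketched roughly as follows. This is an illustration under assumptions, not the exact commands used in the thread: it relies on the import block available in Terraform 1.5 and later, the resource name manual_reference and the alarm ID are placeholders, and a configured opentelekomcloud provider is assumed. Importing the working, manually created alarm and then running terraform plan shows which arguments differ from the written configuration.

```hcl
variable "instance_id" {
  description = "ID of the monitored RDS instance"
  type        = string
}

# Bring the manually created (working) alarm under Terraform management.
import {
  to = opentelekomcloud_ces_alarmrule.manual_reference # hypothetical address
  id = "REPLACE_WITH_ALARM_RULE_ID"                    # placeholder, not a real ID
}

resource "opentelekomcloud_ces_alarmrule" "manual_reference" {
  alarm_name = "manual_high_cpu_usage" # placeholder; use the real alarm's name

  metric {
    namespace   = "SYS.RDS"
    metric_name = "rds001_cpu_util"

    dimensions {
      # The value reported in this thread to work for RDS alarms.
      name  = "rds_cluster_id"
      value = var.instance_id
    }
  }

  condition {
    period              = 300
    filter              = "max"
    comparison_operator = ">"
    value               = 2 # the threshold used in the dev environment above
    unit                = "%"
    count               = 1
  }
}
```

After terraform plan, any attribute that still differs from the real alarm appears in the proposed changes, which is how the "rds_instance_id" versus "rds_cluster_id" mismatch surfaced.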

anton-sidelnikov commented 2 months ago

Hi @deem1978, hm, great, thanks for the investigation. I hope the QA guys will fix the docs soon.

I didn't find any errors in your configuration; it looks good.