grafana / terraform-provider-grafana

Terraform Grafana provider
https://www.terraform.io/docs/providers/grafana/
Mozilla Public License 2.0

Easy to comprehend template for "Alert rule group" #669

Open sammit20 opened 2 years ago

sammit20 commented 2 years ago

Hello Team,

It would be great to have easy-to-comprehend templates for creating alert rules, something similar to the YAML contents at https://registry.terraform.io/providers/inuits/cortex/latest/docs/resources/rules. Or maybe something like this:

That would mean a little less overhead in understanding what an alert rule does from the manifest itself.

alexweav commented 2 years ago

This is a wider effort that we are tracking internally, and it has existed for some time. It isn't purely a Terraform thing - the same goes for the .yaml provisioning, and the API itself.

Really, it's just that Grafana's representation of an alert rule is very large. This is the result of a trade-off: it grows in size because of its flexibility, since it can query any arbitrary datasource.

A single model has difficulty covering all cases, as not every datasource is built around a query string. Consider the CloudWatch/Stackdriver datasources: there isn't a single query field, but rather the result of a number of drop-downs.

What we are looking into is how we can have very simple, targeted rule definitions, but specific to some common datasource types. Users who need the flexibility can then fall back to the generic struct we have now. But, this effort spans a few different systems including Terraform, so it's not quite there yet.

eraac commented 1 year ago

We face the same issue: we are starting to use Terraform to manage our alerts now, and we have multiple datasources (GCP/AWS/BigQuery/Prometheus) with more than 150 alerts.

Coding the alerts directly with the grafana_rule_group resource was impossible (nearly 200 lines per alert), so we created multiple modules to simplify alert creation. We have one module per model of datasource query, plus one module for the Grafana expression, plus one module per datasource (which aggregates the other modules).

It was difficult to write and there is a lot of complexity (because of the multiple modules), but at least the usage is simple and reduced to the strict minimum.

Maybe the first thing to implement to help Grafana users with Terraform would be to provide the model for the query of each datasource... it is actually a pain to discover the model and understand it, because nothing is documented in Grafana.
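
The layering described above could be sketched roughly like this - a minimal, hypothetical example (the module name, path, and inputs are illustrative, not the actual modules from our codebase):

```hcl
# Top-level usage: a per-datasource module that internally composes a
# query-model module and an expression module. All names here are hypothetical.
module "cloudwatch_alert" {
  source = "./modules/cloudwatch-alert" # hypothetical aggregating module

  name        = "high-cpu"
  namespace   = "AWS/EC2"
  metric_name = "CPUUtilization"
  threshold   = 80
}
```

The aggregating module hides the ~200 lines of rule/query/expression plumbing behind a handful of inputs.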

obounaim commented 1 year ago

@Eraac Could you please share an example of how to use grafana_rule_group with the CloudWatch datasource? Thanks

eraac commented 1 year ago

Sure @obounaim, the model for the CloudWatch query looks like this; it can be used inside the model attribute: https://registry.terraform.io/providers/grafana/grafana/latest/docs/resources/rule_group#model

model.tf

```hcl
locals {
  model = {
    refId         = var.ref_id
    intervalMs    = coalesce(var.interval_milliseconds, 1000)
    maxDataPoints = coalesce(var.max_data_points, 43200)

    alias            = var.alias
    dimensions       = var.dimensions
    expression       = var.expression
    id               = var.id
    matchExact       = coalesce(var.match_exact, true)
    metricName       = var.metric_name
    namespace        = var.namespace
    period           = var.period
    region           = var.region
    statistic        = coalesce(var.statistic, "Average")
    logGroupNames    = var.log_group_names
    metricEditorMode = var.metric_editor_mode
    metricQueryType  = var.metric_query_type
    queryMode        = coalesce(var.query_mode, "Metrics")
    sql              = var.sql
    sqlExpression    = var.sql_expression
    # statsGroups :shrug: -> can figure out the usage from the interface
  }
}
```
variable.tf

```hcl
variable "ref_id" {
  description = "Reference name for the query"
  type        = string
  default     = "A"
}

variable "interval_milliseconds" {
  description = "Number of milliseconds between each point of the time series. Use period instead; this attribute is only here because Grafana sets it"
  type        = number
  default     = null
}

variable "max_data_points" {
  description = "Maximum number of points for the time series"
  type        = number
  default     = null
}

variable "alias" {
  # https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/graph-dynamic-labels.html
  description = "Change the time series legend name using dynamic labels. See documentation for details"
  type        = string
  default     = null
}

variable "dimensions" {
  description = "A dimension is a name/value pair that is part of the identity of a metric"
  type        = map(string)
  default     = {}
}

variable "expression" {
  # https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-search-expressions.html
  description = "Search expressions are a type of math expression that you can add to CloudWatch graphs. Used by search metrics or logs"
  type        = string
  default     = null
}

variable "id" {
  description = "ID can be used to reference other queries in math expressions"
  type        = string
  default     = null
}

variable "match_exact" {
  description = "Only show metrics that exactly match all defined dimension names"
  type        = bool
  default     = null
}

variable "metric_name" {
  # https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/viewing_metrics_with_cloudwatch.html
  description = "Name of the metric to retrieve"
  type        = string
  default     = null
}

variable "namespace" {
  description = "A namespace is a container for CloudWatch metrics. Metrics in different namespaces are isolated from each other, so that metrics from different applications are not mistakenly aggregated into the same statistics"
  type        = string
  default     = null
}

variable "period" {
  description = "Minimal interval between two points, in seconds"
  type        = string
  default     = null
}

variable "region" {
  description = "Region to call for CloudWatch"
  type        = string
  default     = null
}

variable "statistic" {
  # https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html#Statistic
  # https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Statistics-definitions.html
  description = "Statistics are metric data aggregations over specified periods of time"
  type        = string # Average, Sum, Minimum, Maximum, SampleCount
  default     = null
}

variable "metric_editor_mode" {
  description = "Determine the editor mode (builder or code)"
  type        = number
  default     = null # values: 0 -> builder | 1 -> code
}

variable "sql_expression" {
  description = "Raw SQL expression to pass to CloudWatch to retrieve the time series. Don't forget to set 'metric_editor_mode' to 1"
  type        = string
  default     = null
}

variable "metric_query_type" {
  # https://grafana.com/docs/grafana/latest/datasources/aws-cloudwatch/#metrics-query-editor
  description = "The type of query to build"
  type        = number # Metrics Query in the CloudWatch plugin is what is referred to as Metric Insights in the AWS console
  default     = null   # values: 0 -> metric search | 1 -> metric query
}

variable "query_mode" {
  description = "Determine whether we query CloudWatch metrics or CloudWatch logs"
  type        = string
  default     = null # values: "Metrics", "Logs"
}

variable "log_group_names" {
  description = "Names of the log groups to read from"
  type        = list(string)
  default     = null
}

variable "sql" {
  description = "Same as sql_expression, but for the builder. Use sql_expression instead"
  type        = object({}) # structure is too difficult
  default     = null
}
```
output.tf

```hcl
output "model" {
  value = jsonencode(local.model)
}

output "ref_id" {
  value = var.ref_id
}
```
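
To illustrate how such a module's outputs might feed grafana_rule_group, here is a sketch; the module path, the folder resource, the `condition` ref_id, and the datasource UID are assumptions for the example, not values from the module above:

```hcl
module "cpu_query" {
  source      = "./modules/cloudwatch-query" # hypothetical path to the module above
  namespace   = "AWS/EC2"
  metric_name = "CPUUtilization"
  region      = "eu-west-1"
  period      = "300"
}

resource "grafana_rule_group" "cpu" {
  name             = "cpu-alerts"
  folder_uid       = grafana_folder.alerts.uid # assumes an existing folder resource
  interval_seconds = 60

  rule {
    name      = "High CPU"
    condition = "B" # ref_id of the expression query that decides whether the alert fires

    data {
      ref_id         = module.cpu_query.ref_id
      datasource_uid = "my-cloudwatch-uid" # assumed CloudWatch datasource UID
      model          = module.cpu_query.model

      relative_time_range {
        from = 600
        to   = 0
      }
    }
  }
}
```

A complete rule would also need a second data block for the "B" expression, which is exactly the gap discussed below.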
obounaim commented 1 year ago

Thanks @Eraac It seems to be working; however, it seems that the condition of type "expression" is missing. I tried to find the JSON syntax for it, but I was not able to in the Grafana documentation.

eraac commented 1 year ago

@obounaim indeed, here is the module we made for handling the expression model:

model.tf

```hcl
locals {
  model = {
    type          = coalesce(var.type, "classic_conditions")
    refId         = var.ref_id
    intervalMs    = coalesce(var.interval_milliseconds, 1000)
    maxDataPoints = coalesce(var.max_data_points, 43200)

    # math, reduce, resample
    expression = var.expression

    # reduce
    reducer = var.reducer
    settings = var.type != "reduce" ? null : {
      # for strict mode, the mode is an empty string
      mode = coalesce(var.reduce_mode, "strict") == "strict" ? "" : var.reduce_mode
    }

    # resample
    downsampler = var.down_sampler
    upsampler   = var.up_sampler
    window      = var.window

    # classic_conditions
    conditions = [
      for v in var.conditions : {
        evaluator = {
          params = v.evaluator_params
          type   = v.evaluator_type
        }
        operator = {
          type = v.operator_type
        }
        query = {
          params = [v.query_ref_id_target]
        }
        reducer = {
          type = v.reducer_type
        }
      }
    ]
  }
}
```
variable.tf

```hcl
variable "type" {
  description = "Type of the query (classic_conditions, math, reduce, resample)"
  type        = string
  default     = "classic_conditions"
}

variable "ref_id" {
  description = "Name of the query"
  type        = string
  default     = "Z"
}

variable "interval_milliseconds" {
  description = "Number of milliseconds between each point of the time series. Use period instead; this attribute is only here because Grafana sets it"
  type        = number
  default     = null
}

variable "max_data_points" {
  description = "Maximum number of points for the time series"
  type        = number
  default     = null
}

variable "expression" {
  description = "Must be the ref_id of the input for reduce and resample; for math it is the formula"
  type        = string
  default     = null
}

variable "reducer" {
  description = "The function to apply on a time series to reduce it"
  type        = string
  default     = null # values: "mean", "min", "max", "sum", "count", "last"
}

variable "reduce_mode" {
  description = "strict: result can be NaN if the series contains non-numeric data | dropNN: drop NaN, +/-Inf and null from the input series before reducing | replaceNN: replace NaN, +/-Inf and null with a constant before reducing (variable 'reduce_replace_with')"
  type        = string
  default     = null # values: "dropNN", "replaceNN", "strict"
}

variable "reduce_replace_with" {
  description = "When reduce_mode is 'replaceNN', use this value to replace all NaN, +/-Inf and null values"
  type        = number
  default     = null
}

variable "down_sampler" {
  description = "The reduction function to use when there is more than one data point per window sample"
  type        = string
  default     = null # values: min, max, mean, sum
}

variable "up_sampler" {
  description = "The method used to fill a window sample that has no data points. pad: fill with the last known value | backfill: fill with the next known value | fillna: fill empty sample windows with NaNs"
  type        = string
  default     = null # values: pad, backfilling, fillna
}

variable "window" {
  description = "The duration of time to resample to, for example 10s. Units may be s for seconds, m for minutes, h for hours, d for days, w for weeks, and y for years"
  type        = string
  default     = null
}

variable "conditions" {
  description = "List of conditions to fire the alert"
  type = list(object({
    evaluator_params    = list(number)            # 1 param for lt/gt, 2 params for outside_range/within_range, 0 for no_value
    evaluator_type      = string                  # gt, lt, outside_range, within_range, no_value
    operator_type       = optional(string, "and") # for multiple conditions
    query_ref_id_target = optional(string, "A")
    reducer_type        = string # sum, min, max, count, last, median, avg, count_non_null, diff, diff_abs, percent_diff, percent_diff_abs
  }))
  default = null
}
```
output.tf

```hcl
output "model" {
  value = jsonencode(local.model)
}

output "ref_id" {
  value = var.ref_id
}
```
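
Wiring this expression module into a rule works like any other query, except the data block points at Grafana's built-in expression pseudo-datasource (UID `__expr__` in recent Grafana versions). A sketch of the two data blocks inside a rule; the module instances and the CloudWatch UID are assumptions:

```hcl
rule {
  name      = "High CPU"
  condition = module.threshold.ref_id # the expression decides whether the alert fires

  # Datasource query (e.g. from the CloudWatch module shared earlier)
  data {
    ref_id         = module.cpu_query.ref_id
    datasource_uid = "my-cloudwatch-uid" # assumed datasource UID
    model          = module.cpu_query.model

    relative_time_range {
      from = 600
      to   = 0
    }
  }

  # Server-side expression evaluating the query result
  data {
    ref_id         = module.threshold.ref_id
    datasource_uid = "__expr__" # Grafana's expression pseudo-datasource
    model          = module.threshold.model

    relative_time_range {
      from = 0
      to   = 0
    }
  }
}
```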
obounaim commented 1 year ago

Thanks @Eraac, it works great. One more question that is maybe out of the scope of this issue.

Is there a way to create the "rule" argument in the "grafana_rule_group" resource automatically, using a loop like for_each? I am aware that the meta-argument for_each applies to resources; are you aware of something similar that can be used for an argument?

Example:

```hcl
resource "grafana_rule_group" "my_alert_rule" {
  name             = "My Rule Group"
  folder_uid       = grafana_folder.rule_folder.uid
  interval_seconds = 240
  org_id           = 1

  for_each = toset(["rule1", "rule2", "rule3", "rule4"])
  rule {
    name = each.key
  }
}
```
eraac commented 1 year ago

@obounaim https://developer.hashicorp.com/terraform/language/expressions/dynamic-blocks
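
In other words, a `dynamic` block can generate one `rule` block per element of a collection. A minimal sketch (the rule names and the rest of the resource are carried over from the question above; each rule would still need its own `condition` and `data` blocks):

```hcl
resource "grafana_rule_group" "my_alert_rule" {
  name             = "My Rule Group"
  folder_uid       = grafana_folder.rule_folder.uid
  interval_seconds = 240
  org_id           = 1

  dynamic "rule" {
    for_each = toset(["rule1", "rule2", "rule3", "rule4"])

    content {
      name = rule.value
      # condition and data blocks go here, possibly via a nested
      # dynamic "data" block
    }
  }
}
```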

greatvovan commented 1 year ago

I understand the trade-offs mentioned by @alexweav. It is really hard to maintain Terraform-native definitions for every single supported data source, each of which may change at its own evolution pace.

In practice though, you do not need those tons of supported data sources; you use just a few. It seems totally possible to create a parameters.tf file with a summary of what makes every alert rule unique, while templating it into the proper format (including JSON) at the final stage. The structure can be as @sammit20 suggested, or simpler/more complex depending on your needs. You write this thing once and then reuse it for every alert rule. It is impossible, though, to create an ideal solution for everyone, and every user must do it on their own based on what makes sense for them.
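
The parameters-plus-templating approach could be sketched like this; everything here (the alert map, its fields, the folder resource) is a hypothetical illustration of the idea, not a working ruleset:

```hcl
# Hypothetical "parameters" section: each entry captures only what makes
# an alert unique; everything else is templated below.
locals {
  alerts = {
    high_cpu = { metric = "CPUUtilization", threshold = 80 }
    low_disk = { metric = "FreeStorageSpace", threshold = 10 }
  }
}

resource "grafana_rule_group" "generated" {
  name             = "generated-alerts"
  folder_uid       = grafana_folder.alerts.uid # assumes an existing folder resource
  interval_seconds = 60

  dynamic "rule" {
    for_each = local.alerts

    content {
      name      = rule.key
      condition = "B"
      # data blocks would be built from rule.value.metric and
      # rule.value.threshold, e.g. via jsonencode()d models like the
      # modules shared earlier in this thread
    }
  }
}
```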