grafana / terraform-provider-grafana

Terraform Grafana provider
https://www.terraform.io/docs/providers/grafana/
Mozilla Public License 2.0

Contact point can't be deleted because it is currently used by a notification policy, 500 error on TF apply #769

Open hkailantzis opened 1 year ago

hkailantzis commented 1 year ago

Terraform Configuration Files

resource "grafana_contact_point" "contact_point" {

  for_each = nonsensitive(var.contact_points)

  name = each.value.name

  dynamic "opsgenie" {
    for_each = each.value.opsgenies == null ? tolist([]) : each.value.opsgenies

    content {
      api_key           = var.opsgenie_api_key
      url               = opsgenie.value.url
      override_priority = true
      send_tags_as      = "tags"
      auto_close        = true
    }
  }

  dynamic "email" {
    for_each = each.value.emails == null ? tolist([]) : each.value.emails

    content {
      addresses    = email.value.email_addresses
      single_email = email.value.single_email
    }
  }
}

resource "grafana_notification_policy" "notification_policy" {
  group_by      = ["alertname", "datasource"]
  contact_point = grafana_contact_point.contact_point["opsgenie"].name

  dynamic "policy" {
    for_each = grafana_contact_point.contact_point

    content {
      contact_point = policy.value.name
      continue      = true
      group_by      = []
    }
  }
}

test.tfvars

contact_points = {
  opsgenie = {
    name = "OpsgenieTEST"
    opsgenies = [
      {
        url = "opsgenie_url"
      }
    ]
  }
  email = {
    name = "team-email"
    emails = [
      {
        email_addresses = ["team-email@example.com"]
        single_email    = true
      }
    ]
  }
}

When trying to remove the email contact point (omitting the email entry from contact_points), TF plan shows the following:

- resource "grafana_contact_point" "contact_point" {
      - id   = "xxxxxxx" -> null
      - name = "team-email" -> null
      - email {
          - addresses               = [
              - "team-email@example.com",
            ] -> null
          - disable_resolve_message = false -> null
          - settings                = (sensitive value)
          - single_email            = true -> null
          - uid                     = "xxxxxxx" -> null
        }
    }
  # module.generic_contact_points.grafana_notification_policy.notification_policy will be updated in-place
  ~ resource "grafana_notification_policy" "notification_policy" {
        id            = "policy"
        # (2 unchanged attributes hidden)
      ~ policy {
          ~ contact_point = "team-email" -> "OpsgenieTEST"
            # (3 unchanged attributes hidden)
        }
      - policy {
          - contact_point = "OpsgenieTEST" -> null
          - continue      = true -> null
          - group_by      = [] -> null
          - mute_timings  = [] -> null
        }
    }
Plan: 0 to add, 1 to change, 1 to destroy.

TF apply fails because TF tries to destroy the contact_point first and only then modify the policy. Adding depends_on doesn't make any difference. Any hints/ideas would be great. Thanks in advance!
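
For reference, a sketch of where such a depends_on hint would go (illustrative; the exact placement tried isn't shown above):

resource "grafana_notification_policy" "notification_policy" {
  group_by      = ["alertname", "datasource"]
  contact_point = grafana_contact_point.contact_point["opsgenie"].name

  dynamic "policy" {
    for_each = grafana_contact_point.contact_point

    content {
      contact_point = policy.value.name
      continue      = true
      group_by      = []
    }
  }

  # Explicit hint on top of the implicit references above; as described,
  # it does not change the destroy-then-modify ordering in the plan.
  depends_on = [grafana_contact_point.contact_point]
}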

Actual Behavior

module.generic_contact_points.grafana_contact_point.contact_point["email"]: Destroying... [id=xxxxxxx]
module.generic_contact_points.grafana_contact_point.contact_point["email"]: Still destroying... [id=xxxxxxx, 10s elapsed]
╷
│ Error: status: 500, body: {"message":"contact point 'team-email' is currently used by a notification policy","traceID":""}
│ 
│ 
╵

Steps to Reproduce

  1. terraform apply

jannisiking commented 1 year ago

Any updates on this issue? Our pipeline keeps failing because of this.

Hronom commented 1 year ago

Hello, any progress on this? This is critical for us; it's hard to manage team notifications.

I would also like to ask whether it is possible to edit each policy independently, because right now the entire tree of notification policies for all teams is uploaded at once. We could still build the tree based on a so-called parent_uid (introduced if it is not already present).

This creates a circular dependency: Grafana does not allow removing a contact point while it is used by a notification policy, and likewise you cannot create a notification policy that references a contact point that does not exist yet.

Also keep in mind that Terraform does not handle such situations very well, where creation requires one ordering of operations and deletion of a notification policy requires another.

alexweav commented 1 year ago

Adding depends_on doesn't make any difference.

Could you please provide an example of how depends_on is being used? This is almost exactly the same problem as this one in the AWS provider; a user there mentioned that depends_on seemed to solve it: https://github.com/hashicorp/terraform/issues/20196

Hronom commented 12 months ago

@alexweav it's easy to reproduce. Set up, through Terraform, two contact points (grafana_contact_point) that are referenced by a grafana_notification_policy and apply. Then try to remove one of them: with or without depends_on, you will hit the issue either when adding a new grafana_contact_point or when removing one; you cannot make it work for both the add and the remove case. There is a circular dependency in what Grafana has now.
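
A minimal standalone sketch of this reproduction (resource names and contact point details are illustrative, not taken from the thread):

terraform {
  required_providers {
    grafana = {
      source = "grafana/grafana"
    }
  }
}

resource "grafana_contact_point" "team_a" {
  name = "team-a-email"

  email {
    addresses = ["team-a@example.com"]
  }
}

resource "grafana_contact_point" "team_b" {
  name = "team-b-email"

  email {
    addresses = ["team-b@example.com"]
  }
}

resource "grafana_notification_policy" "root" {
  group_by      = ["alertname"]
  contact_point = grafana_contact_point.team_a.name

  policy {
    contact_point = grafana_contact_point.team_b.name
    continue      = true
    group_by      = []
  }
}

# Apply once, then remove grafana_contact_point.team_b together with its
# policy block and apply again: the provider attempts to delete the contact
# point while the notification policy still references it, which produces
# the 500 error shown above.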

GologicGClaudel commented 11 months ago

Hey, I had this issue yesterday and I'm stuck with my contact point now; it doesn't want to be destroyed. Any news on this?

yuri-tceretian commented 11 months ago

This problem is not unique to our case. The provider cannot control the dependency graph that Terraform builds. The only way to help it is to provide hints via the depends_on meta-argument available on every resource. However, this does not work when the operation is a delete + modify. There are issues in the Terraform repo (https://github.com/hashicorp/terraform/issues/20196, https://github.com/hashicorp/terraform/issues/23169, and many others). Those issues seem to have been closed via https://github.com/hashicorp/terraform/pull/23252, which was merged back in 2019, but it does not appear to fix the problem after all.

Therefore, I do not think this can be solved on the provider side. I am not an expert in the Terraform provider SDK, though, so if anyone in the community could advise on a way forward, I would appreciate it and could work on a fix.

On the server side this problem is also tricky, because it requires the mutations to the configuration to happen in the correct order. We can add some fuzziness to the server.

GologicGClaudel commented 11 months ago

@yuri-tceretian Thank you! I removed the grafana_oncall_route resource linked to the contact_point I was trying to remove and it worked like a charm!

Hronom commented 10 months ago

@yuri-tceretian it's still not possible to work correctly with notification policies in Grafana. Management of notification policies is something that needs to be re-architected in Grafana; there is actually no issue with Terraform here, it works as expected.

This is what needs to be done in Grafana: instead of having one JSON document for the whole policy graph, allow each node to be created separately, so that each leaf can be created/updated/deleted independently.

Example

I need to build a graph like this:

+ grafana-default-email
++ team-a-alerts-slack
++ team-b-alerts-slack
++ team-c-alerts-slack

Current

This is how it looks with the current configuration for notification policies:

{
  "apiVersion": 1,
  "policies": [
    {
      "orgId": 1,
      "Policy": {
        "receiver": "grafana-default-email",
        "group_by": [
          "..."
        ],
        "routes": [
          {
            "receiver": "team-a-alerts-slack",
            "group_by": [
              "grafana_folder",
              "alertname"
            ],
            "object_matchers": [
              [
                "severity",
                "=",
                "warning"
              ],
              [
                "teamName",
                "=",
                "Team A"
              ]
            ],
            "group_wait": "30s",
            "group_interval": "5m",
            "repeat_interval": "1h"
          },
          {
            "receiver": "team-b-alerts-slack",
            "group_by": [
              "grafana_folder",
              "alertname"
            ],
            "object_matchers": [
              [
                "severity",
                "=",
                "warning"
              ],
              [
                "teamName",
                "=",
                "Team B"
              ]
            ],
            "group_wait": "30s",
            "group_interval": "5m",
            "repeat_interval": "1h"
          },
          {
            "receiver": "team-c-alerts-slack",
            "group_by": [
              "grafana_folder",
              "alertname"
            ],
            "object_matchers": [
              [
                "severity",
                "=",
                "warning"
              ],
              [
                "teamName",
                "=",
                "Team C"
              ]
            ],
            "group_wait": "30s",
            "group_interval": "5m",
            "repeat_interval": "1h"
          }
        ],
        "group_wait": "30s",
        "group_interval": "5m",
        "repeat_interval": "1h"
      }
    }
  ]
}

New

This is how it could look after being re-architected:

    {
      "apiVersion": 1,
      "orgId": 1,
      "Policy": {
        "id": 1,
        "receiver": "grafana-default-email",
        "group_by": [
          "..."
        ],
        "group_wait": "30s",
        "group_interval": "5m",
        "repeat_interval": "1h"
      }
    }
    {
      "apiVersion": 1,
      "orgId": 1,
      "Policy": {
        "id": 2,
        "parentPolicyId": 1,
        "receiver": "team-a-alerts-slack",
        "group_by": [
          "grafana_folder",
          "alertname"
        ],
        "object_matchers": [
          [
            "severity",
            "=",
            "warning"
          ],
          [
            "teamName",
            "=",
            "Team A"
          ]
        ],
        "group_wait": "30s",
        "group_interval": "5m",
        "repeat_interval": "1h"
      }
    }
    {
      "apiVersion": 1,
      "orgId": 1,
      "Policy": {
        "id": 3,
        "parentPolicyId": 1,
        "receiver": "team-b-alerts-slack",
        "group_by": [
          "grafana_folder",
          "alertname"
        ],
        "object_matchers": [
          [
            "severity",
            "=",
            "warning"
          ],
          [
            "teamName",
            "=",
            "Team B"
          ]
        ],
        "group_wait": "30s",
        "group_interval": "5m",
        "repeat_interval": "1h"
      }
    }
    {
      "apiVersion": 1,
      "orgId": 1,
      "Policy": {
        "id": 4,
        "parentPolicyId": 1,
        "receiver": "team-c-alerts-slack",
        "group_by": [
          "grafana_folder",
          "alertname"
        ],
        "object_matchers": [
          [
            "severity",
            "=",
            "warning"
          ],
          [
            "teamName",
            "=",
            "Team C"
          ]
        ],
        "group_wait": "30s",
        "group_interval": "5m",
        "repeat_interval": "1h"
      }
    }
hkailantzis commented 10 months ago

For anyone suffering from this: since we were already deploying Grafana via its Helm chart, we decided to switch to provisioning the alert-related resources through the upstream Helm chart itself, via file provisioning and sidecars. It is easier to maintain, everything lives in one repo behind a single deployment pipeline, and there are no dependencies between the Helm chart and Terraform workflows. Alert rules, contact points, notification policies and notification templates are supported by the upstream Helm chart anyway.

More info about it here: https://grafana.com/docs/grafana/latest/alerting/set-up/provision-alerting-resources/file-provisioning/

The only resources I would like to provision this way, but which are currently missing, are teams/users. I ended up using a sidecar (https://github.com/kiwigrid/k8s-sidecar) to make an HTTP call for user provisioning via the Grafana API.

jonnydh commented 6 months ago

This is intended behaviour: Terraform will always destroy the resource before updating its dependents, regardless of the inferred dependency graph.

However, this can be solved by adding the following block to your contact point resource:

resource "grafana_contact_point" "contact_points" {
....
  lifecycle {
    create_before_destroy = true
  }
....
}

This works because it forces dependent resources to be updated before attempting to destroy the contact point. This means that the notification policy is modified first, and the contact point is destroyed afterwards (see the attached screenshot).
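
Applied to the contact point resource from the original report, the workaround would look roughly like this (a sketch; only the lifecycle block is new, the dynamic blocks are unchanged):

resource "grafana_contact_point" "contact_point" {
  for_each = nonsensitive(var.contact_points)

  name = each.value.name

  # ... dynamic "opsgenie" and "email" blocks as in the original configuration ...

  lifecycle {
    create_before_destroy = true
  }
}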


Gotcha: in order for this to work, you need to have successfully run an apply before adding the lifecycle argument. Otherwise you get a chicken-and-egg scenario: the apply that would add the lifecycle argument fails, because it still tries to delete the contact point first. To get around this, we manually went onto our Grafana alerting instance and used the Grafana API to delete the current notification policy, then immediately ran an apply.

Source: See the explanation from @jbardin in this issue in the Terraform Provider

Hronom commented 1 month ago

Management of grafana_notification_policy and grafana_contact_point should be revisited.

The solution above does not work properly when you manage grafana_contact_point inside a module but grafana_notification_policy outside of that module.
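
For illustration, a rough sketch of the layout being described; the module structure, variable shapes and names here are hypothetical:

# modules/contact_points/main.tf
variable "contact_points" {
  type = map(object({
    name      = string
    addresses = list(string)
  }))
}

resource "grafana_contact_point" "contact_point" {
  for_each = var.contact_points

  name = each.value.name

  email {
    addresses = each.value.addresses
  }

  lifecycle {
    create_before_destroy = true
  }
}

output "names" {
  value = { for k, cp in grafana_contact_point.contact_point : k => cp.name }
}

# root main.tf
module "contact_points" {
  source = "./modules/contact_points"

  contact_points = {
    default = { name = "team-email", addresses = ["team@example.com"] }
  }
}

resource "grafana_notification_policy" "root" {
  group_by      = ["alertname"]
  contact_point = module.contact_points.names["default"]

  dynamic "policy" {
    for_each = module.contact_points.names

    content {
      contact_point = policy.value
      continue      = true
      group_by      = []
    }
  }
}

# Per the comment above, removing an entry from the contact_points map can
# still fail in this layout, even with create_before_destroy set inside the
# module, because the policy that references the contact point is managed
# outside of it.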