hashicorp / terraform-provider-nomad

Terraform Nomad provider
https://registry.terraform.io/providers/hashicorp/nomad/latest
Mozilla Public License 2.0

Terraform deletes the nomad volume (unchanged) to recreate. #364

Closed. blmhemu closed this issue 1 year ago.

blmhemu commented 1 year ago

Provider: 2.0.0-beta.1
Nomad: 1.6.1
Terraform: 1.5.3

resource "nomad_csi_volume" "alpine-caddy" {
  depends_on   = [data.nomad_plugin.nfs]
  plugin_id    = data.nomad_plugin.nfs.plugin_id
  name         = "alpine-caddy"
  volume_id    = "alpine-caddy"
  capacity_min = "0GiB"
  capacity_max = "0GiB"
  capability {
    access_mode     = "multi-node-multi-writer"
    attachment_mode = "file-system"
  }
}

Bug 1: When I reapply, tf tries to delete the volume and create a new one. This is a bug because tf should not destroy an existing, unchanged resource.


Bug 2: The recreate does not even succeed: a job is using the volume, so tf fails to delete it and the apply hangs.


lgfa29 commented 1 year ago

Hi @blmhemu 👋

Thanks for the report and giving the beta version a try 🙂

Bug 1: When I reapply, tf tries to delete the volume and create a new one.

Does it tell you which field caused the recreate to happen?

Bug 2: But since jobs are using it, it does not seem to work. A job is using the volume and hence tf fails to delete the volume.

Hum... yeah, this is an interesting problem, but I'm not sure I would consider it a bug. If you try to delete a volume that is being used by a job, the delete should fail. If you have the job in TF, and it references the volume, then TF should delete the job first.

blmhemu commented 1 year ago

Does it tell you which field caused the recreate to happen?

I am so sorry 😣, I forgot to check that. I have since upgraded to rc1 (was awaiting it eagerly 😊) and can no longer repro the issue. Please feel free to close the issue as not having enough info (I will re-open it if I come across this again).

Hum... yeah, this is an interesting problem, but I'm not sure I would consider it a bug. If you try to delete a volume that is being used by a job, the delete should fail.

I modified the volume a bit (added capacity_min = "0") and tried terraform apply. It just waits for the destroy to complete, which never happens while the job is running. I am not sure whether Nomad provides an API to update a volume in-flight, but I think not.

If you have the job in TF, and it references the volume, then TF should delete the job first.

This does not happen - the job depends on the volume.

# Caddy
resource "nomad_csi_volume" "caddy" {
  depends_on = [data.nomad_plugin.nfs]
  plugin_id  = data.nomad_plugin.nfs.plugin_id
  name       = "caddy"
  volume_id  = "caddy"
  capability {
    access_mode     = "multi-node-multi-writer"
    attachment_mode = "file-system"
  }
}

resource "nomad_job" "caddy" {
  depends_on = [nomad_csi_volume.caddy]
  jobspec    = file("${path.module}/jobs/caddy.hcl")
}
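The jobs/caddy.hcl jobspec itself is not shown in the thread, but because it is loaded via file(), Terraform cannot see the volume reference inside it; the claim lives in Nomad's jobspec, not in the TF graph. A minimal sketch of how such a job might claim the CSI volume (names, driver, and mount path are illustrative, not from the original report):

```hcl
job "caddy" {
  group "caddy" {
    # Claim the CSI volume; "source" must match the volume_id
    # registered with Nomad.
    volume "data" {
      type            = "csi"
      source          = "caddy"
      access_mode     = "multi-node-multi-writer"
      attachment_mode = "file-system"
    }

    task "caddy" {
      driver = "docker"

      # Mount the claimed volume into the task's filesystem.
      volume_mount {
        volume      = "data"
        destination = "/data"
      }
    }
  }
}
```

Since this reference is opaque to Terraform, only the explicit depends_on above ties the two resources together, which is why the destroy ordering discussed below matters.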

lgfa29 commented 1 year ago

I have since upgraded to the rc1 (was awaiting eagerly 😊) and could no longer repro the issue. Please feel free to close the issue with not enough info (I will re-open if I come across this again)

Ah ok. There may be a problem with state migration from v1.x to v2.x then. Did the volume exist as a nomad_external_volume before the update?

This does not happen - the job depends on the volume.

Hum... I see. I usually expect the destroy order to be the inverse of create, but maybe that's different for destructive updates, where a resource is destroyed and then recreated. The job didn't change, so TF may have left it alone.

icyleaf commented 1 year ago

I ran an apply and all my old volumes were destroyed and recreated; all the data is gone....

blmhemu commented 1 year ago

Ah ok. There may be a problem with state migration from v1.x to v2.x then. Did the volume exist as a nomad_external_volume before the update?

Nope, I upgraded from beta1 to rc1, and the resource is the same nomad_csi_volume.

lgfa29 commented 1 year ago

Nope, I upgraded from beta1 to rc1, and the resource is the same nomad_csi_volume.

Ah OK, I think I see what happened. Due to https://github.com/hashicorp/terraform-provider-nomad/issues/362, you likely got an error when trying to create the volume. Since the apply resulted in an error, Terraform marked the resource as tainted, so it tries to recreate it on every apply:

$ terraform plan
nomad_csi_volume.a: Refreshing state... [id=test-123]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # nomad_csi_volume.a is tainted, so must be replaced
-/+ resource "nomad_csi_volume" "a" {
      ~ controller_required     = true -> (known after apply)
      ~ controllers_expected    = 1 -> (known after apply)
      ~ controllers_healthy     = 1 -> (known after apply)
      ~ id                      = "test-123" -> (known after apply)
        name                    = "test-123"
      ~ nodes_expected          = 1 -> (known after apply)
      ~ nodes_healthy           = 1 -> (known after apply)
      ~ plugin_provider         = "csi-hostpath" -> (known after apply)
      ~ plugin_provider_version = "v1.11.0" -> (known after apply)
      ~ schedulable             = true -> (known after apply)
      ~ topologies              = [
          - {
              - segments = {
                  - "topology.hostpath.csi/node" = "node-0"
                }
            },
        ] -> (known after apply)
        # (5 unchanged attributes hidden)

        # (1 unchanged block hidden)
    }

Plan: 1 to add, 0 to change, 1 to destroy.

The key line is this:

nomad_csi_volume.a is tainted, so must be replaced

One way to get around this problem is to run the terraform untaint command on the resource after upgrading to RC1:

$ terraform untaint nomad_csi_volume.a
Resource instance nomad_csi_volume.a has been successfully untainted.

$ terraform plan
nomad_csi_volume.a: Refreshing state... [id=test-123]

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration and found no differences, so no changes are needed.

I ran an apply and all my old volumes were destroyed and recreated; all the data is gone....

I'm sorry that happened to you. Could you provide more details on the steps you took? Did you upgrade the plugin? What were the previous and new versions? Do you have the plan output?

blmhemu commented 1 year ago

Ah OK, I think I see what happened. Due to https://github.com/hashicorp/terraform-provider-nomad/issues/362, you likely got an error when trying to create the volume. Since the apply resulted in an error, Terraform marked the resource as tainted, so it tries to recreate it on every apply:

That makes a lot of sense. I vaguely remember the resource being tainted.

icyleaf commented 1 year ago

I'm sorry that happened to you. Could you provide more details on the steps you took? Did you upgrade the plugin? What were the previous and new versions? Do you have the plan output?

I upgraded the terraform-nomad provider from 1.4.20 to 2.0.0-beta.1 and executed my alias tfa! (terraform apply -auto-approve), and something catastrophic happened.

Yes, I saw that many volumes would be destroyed and recreated.

lgfa29 commented 1 year ago

I upgraded the terraform-nomad provider from 1.4.20 to 2.0.0-beta.1 and executed my alias tfa! (terraform apply -auto-approve), and something catastrophic happened.

I would strongly advise you to stop using that alias and to carefully inspect the plan output before applying, especially considering this was a beta release of a major version bump. Additionally, you can use the prevent_destroy lifecycle feature to prevent the destruction of stateful resources. This is something that may be worth highlighting in our documentation.
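For reference, a sketch of the prevent_destroy suggestion applied to the volume resource from earlier in the thread (reusing the same illustrative names):

```hcl
resource "nomad_csi_volume" "caddy" {
  depends_on = [data.nomad_plugin.nfs]
  plugin_id  = data.nomad_plugin.nfs.plugin_id
  name       = "caddy"
  volume_id  = "caddy"

  capability {
    access_mode     = "multi-node-multi-writer"
    attachment_mode = "file-system"
  }

  lifecycle {
    # Any plan that would destroy this volume now fails with an error
    # instead of silently deleting the underlying data.
    prevent_destroy = true
  }
}
```

With this in place, even a tainted resource cannot be replaced until the lifecycle block is removed or the setting is flipped off, which forces a deliberate decision before data loss.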

Yes, I saw that many volumes would be destroyed and recreated.

But do you still have the plan output? I want to make sure the reason for the recreation was the tainted resource.

lgfa29 commented 1 year ago

https://github.com/hashicorp/terraform-provider-nomad/pull/366 adds a warning in our documentation about data loss and the use of prevent_destroy.

Since the root cause seems to have been found and fixed in 2.0.0-rc.1, I'm going to close this issue, but feel free to reach out if there are any other problems.