hashicorp / terraform-provider-nomad

Terraform Nomad provider
https://registry.terraform.io/providers/hashicorp/nomad/latest
Mozilla Public License 2.0

nomad_job : provider updates state on failure #385

Closed (the-nando closed this issue 10 months ago)

the-nando commented 1 year ago

Nomad Version

1.6.2+ent

Provider Configuration

Which values are you setting in the provider configuration?

provider "nomad" {
  alias   = "ap-northeast-1"
  address = "https://<addr>"
  region  = "ap-northeast-1"
  ignore_env_vars = {
    NOMAD_NAMESPACE = true
    NOMAD_REGION    = true
  }
}

Affected Resource(s)

nomad_job

Terraform Configuration Files

resource "nomad_job" "test-job" {
  jobspec = file("${path.module}/test-job.hcl")
}
job "test-job" {
  datacenters = ["ap-northeast-1a"]

  namespace = "system"

  group "test" {

    task "test" {
      driver = "docker"

      config {
        image = "alpine:3.18"
        args  = ["tail", "-f", "/dev/null"]
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}

After applying the job with a Nomad token that has sufficient permissions, I edit the job's spec and re-run Terraform with a token that lacks them:

module.test.nomad_job.test-job: Refreshing state... [id=test-job]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # module.test.nomad_job.test-job will be updated in-place
  ~ resource "nomad_job" "test-job" {
      ~ allocation_ids          = [] -> (known after apply)
      ~ datacenters             = [
          - "ap-northeast-1b",
            # (1 unchanged element hidden)
        ]
        id                      = "test-job"
      ~ jobspec                 = <<-EOT
            job "test-job" {
          -   datacenters = ["ap-northeast-1a", "ap-northeast-1b"]
          +   datacenters = ["ap-northeast-1a"]

              namespace = "system"

              group "test" {

                task "test" {
                  driver = "docker"

                  config {
                    image = "alpine:3.18"
                    args  = ["tail", "-f", "/dev/null"]
                  }

                  resources {
                    cpu    = 100
                    memory = 64
                  }
                }
              }
            }
        EOT
      ~ modify_index            = "1172901" -> (known after apply)
        name                    = "test-job"
      ~ region                  = "ap-northeast-1" -> (known after apply)
      ~ type                    = "service" -> (known after apply)
        # (8 unchanged attributes hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

module.test.nomad_job.test-job: Modifying... [id=test-job]
╷
│ Error: error applying jobspec: Unexpected response code: 403 (Permission denied)
│
│   with module.test.nomad_job.test-job,
│   on ../../../modules/test/main.tf line 78, in resource "nomad_job" "test-job":
│   78: resource "nomad_job" "test-job" {

A subsequent plan runs clean because the state was updated anyway, even though the job itself is still running with the old spec:

~ NOMAD_REGION=ap-northeast-1 nomad job inspect -namespace='system' test-job | jq -r .Job.Datacenters
[
  "ap-northeast-1a",
  "ap-northeast-1b"
]
~  terraform state show module.test.nomad_job.test-job
# module.test.nomad_job.test-job:
resource "nomad_job" "test-job" {
    datacenters             = [
        "ap-northeast-1a",
    ]
    deregister_on_destroy   = true
    deregister_on_id_change = true
    detach                  = true
    hcl1                    = false
    id                      = "test-job"
    jobspec                 = <<-EOT
        job "test-job" {
          datacenters = ["ap-northeast-1a"]

          namespace = "system"

          group "test" {

            task "test" {
              driver = "docker"

              config {
                image = "alpine:3.18"
                args  = ["tail", "-f", "/dev/null"]
              }

              resources {
                cpu    = 100
                memory = 64
              }
            }
          }
        }
    EOT
    modify_index            = "1172901"
    name                    = "test-job"
    namespace               = "system"
    policy_override         = true
    read_allocation_ids     = false
    region                  = "ap-northeast-1"
    task_groups             = [
        {
            count   = 1
            meta    = {}
            name    = "test"
            task    = [
                {
                    driver        = "docker"
                    meta          = {}
                    name          = "test"
                    volume_mounts = []
                },
            ]
            volumes = []
        },
    ]
    type                    = "service"
}

Expected Behavior

The state is not updated when the apply fails.

Actual Behavior

The state is updated despite the failed apply and diverges from the job actually running in Nomad.

lgfa29 commented 10 months ago

Hi @the-nando 👋

This is an interesting problem 🤔

I imagine the token you used the second time had read permission but not write, so Terraform was able to refresh the state but not apply the change. And since the state has already been updated, Terraform thinks there's nothing left to do the second time.

Unfortunately the provider doesn't have any control over this flow. Terraform calls the Read and Update methods when it needs to.

One possibility would be to verify in the Read method that the token has the submit-job capability; that way the state refresh itself would fail.
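
Roughly something like this in the Read path (a hypothetical sketch against github.com/hashicorp/nomad/api, not actual provider code; it uses a job plan as the capability probe, since planning also requires submit-job):

package nomad

import (
	"fmt"

	"github.com/hashicorp/nomad/api"
)

// checkSubmitJob is a hypothetical pre-flight check for the Read path.
// The job plan endpoint requires the submit-job capability, so a 403
// here surfaces the missing permission at refresh time, before apply.
func checkSubmitJob(client *api.Client, jobID, ns string) error {
	job, _, err := client.Jobs().Info(jobID, &api.QueryOptions{Namespace: ns})
	if err != nil {
		return fmt.Errorf("failed to read job %q: %w", jobID, err)
	}
	if _, _, err := client.Jobs().Plan(job, false, &api.WriteOptions{Namespace: ns}); err != nil {
		return fmt.Errorf("token cannot submit job %q: %w", jobID, err)
	}
	return nil
}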

I will check with the Terraform team if they've seen something like this before and if there's a better way forward.

the-nando commented 10 months ago

hey @lgfa29 😄

Yes indeed, the Nomad token used in the second run lacked submit-job, hence the 403 permission denied error returned by Nomad.
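
For reference, a namespace policy along these lines produces a token that can refresh the job but not submit it (an illustrative policy, not the exact one from our cluster):

namespace "system" {
  # read-job lets Terraform refresh state; without submit-job the
  # provider cannot register or update the job, hence the 403.
  capabilities = ["read-job", "list-jobs"]
}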

Other than doing pre-flight checks in the Create or Read methods, shouldn't the Update return the correct (partial) response? If I understand correctly, the Update method should update the response and the diagnostic error accordingly on failure. The response gets a state pre-populated from the plan, which then needs to be adjusted to reflect the possibly partial update. Per https://developer.hashicorp.com/terraform/plugin/framework/resources/update#recommendations, only successfully modified parts of the resource should be returned as updated data in the state response.

Thanks for checking with the Terraform team, I'm curious to hear if they have any suggestion to better handle this.

the-nando commented 10 months ago

I've just realised that the provider isn't using the plugin framework but the SDK, so what I wrote above unfortunately doesn't apply. I've found this issue, which describes the same problem: https://github.com/hashicorp/terraform-plugin-sdk/issues/476

lgfa29 commented 10 months ago

The Terraform team suggested marking the job update as partial, meaning the SDK will roll back state changes on error: https://github.com/hashicorp/terraform-provider-nomad/pull/412
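
For anyone who lands here, the SDKv2 pattern behind that fix looks roughly like this (a minimal sketch with a hypothetical registerJob helper, not the actual diff from the PR):

package nomad

import (
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// resourceJobUpdate sketches the partial-mode pattern.
func resourceJobUpdate(d *schema.ResourceData, meta interface{}) error {
	// With partial mode on, an error returned from this function makes
	// the SDK discard the planned changes instead of persisting them.
	d.Partial(true)

	if err := registerJob(d, meta); err != nil {
		return err // prior state is preserved because partial mode is on
	}

	// The update succeeded, so allow the new state to be persisted.
	d.Partial(false)
	return nil
}

// registerJob is a hypothetical stand-in for the provider's real
// job registration logic.
func registerJob(d *schema.ResourceData, meta interface{}) error {
	// ... call the Nomad API to register the updated jobspec ...
	return nil
}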