hashicorp / terraform-provider-nomad

Terraform Nomad provider
https://registry.terraform.io/providers/hashicorp/nomad/latest
Mozilla Public License 2.0
142 stars 102 forks source link

Terraform fails to notice when a Nomad job has changed #1

Open hashibot opened 7 years ago

hashibot commented 7 years ago

This issue was originally opened by @blalor as hashicorp/terraform#14038. It was migrated here as part of the provider split. The original body of the issue is below.


Terraform Version

Terraform v0.9.3

Affected Resource(s)

Terraform Configuration Files

provider "nomad" {
    address = "http://localhost:4646"
}

resource "nomad_job" "example" {
    jobspec = "${file("${path.module}/example.nomad")}"
}

Debug Output

Log for 2nd terraform apply: apply.log.txt

Console output:

$ TF_LOG=TRACE TF_LOG_PATH=apply.log.txt terraform apply
nomad_job.example: Refreshing state... (ID: example)

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

Expected Behavior

Terraform should have noticed that the job had changed in the cluster and updated it.

Actual Behavior

Nuttin', honey.

Steps to Reproduce

  1. nomad agent -dev

  2. terraform apply

  3. nomad status example shows one allocation for task group grp

  4. change task group count:

    curl -sfS localhost:4646/v1/job/example | \
        jq '.TaskGroups[0].Count = 2 | {"Job": .}' | \
        curl -is -X PUT -d@- localhost:4646/v1/job/example
  5. nomad status example shows two allocations for example.grp

  6. terraform plan shows no change

  7. terraform apply makes no change

  8. nomad status example still shows two allocations

ketzacoatl commented 6 years ago

@paddycarver / @grubernaut / @radeksimko, is this a verified issue/bug? I'm seeing a slightly different issue, but not sure if it's the same as this. If a job has been stopped, or is not running (but the spec has not changed), then the TF nomad provider should update nomad. It might be best to always have the job sent to nomad, and let nomad work out issues with differences in the spec? This issue makes the provider unusable for the most basic use cases.

ketzacoatl commented 6 years ago

The underlying issue here also produces the following behavior:

paddycarver commented 6 years ago

Hi @ketzacoatl! I can't say for sure whether this is a verified bug, nor can I explain the behaviour. I'll try to look into this soon and come back with a bit more information, and a fix if necessary. Apologies for the delays on this.

ketzacoatl commented 6 years ago

hi @paddycarver, thanks for taking a look! were you able to confirm the behavior we see in practice?

shantanugadgil commented 6 years ago

My observation is the same as @ketzacoatl.

It would be just more operator friendly if users could interact via Terraform, rather than nomad status

My versions are Terraform 0.11.5 Nomad (built from master) (0.8.0-dev)

ketzacoatl commented 6 years ago

Any updates on this @paddycarver?

Also, a question: Does the Terraform provider always submit the job to nomad, or does it decide whether or not it should?

kerscher commented 6 years ago

@paddycarver I took a quick look inside the codebase, and it seems the culprit is:

15 by @apparentlymart adds more metadata, but modify_index wouldn’t pick up changes from the provider. I think “Solution 1” below is a nice way of dealing with it.

Solution 1: add status computed metadata to the schema, plus #15 other items

Pros

Cons

Solution 2: Always de-register and re-register

Pros

Cons

ketzacoatl commented 6 years ago

@paddycarver, I'd love to help resolve this problem, as IMO it prevents serious use of Terraform to manage nomad, do you have any guidance or recommendations on how you would like to address the problem?

cc @katbyte / @radeksimko

ketzacoatl commented 6 years ago

@mitchellh I'd be very grateful for your feedback/guidance here.

ketzacoatl commented 5 years ago

@cgbaker, With renewed development efforts here, what are your thoughts on this issue?

cgbaker commented 5 years ago

Hi @ketzacoatl, I wholeheartedly agree that this is something that must be addressed. This shortcoming makes the Nomad provider just about useless for Day 2 operations.

As noted in #15 , there are a few different options for addressing this. The solution partially hinges on whether we want to try to find a general solution that doesn't require modifying and re-releasing the provider every time Nomad releases a new version. On the other hand, with the Nomad product team taking ownership of this TF provider, we can potentially address such a workflow a little better than before. And it may be prudent to find a temporary solution to this issue while we work on a generic/unversioned provider.

The Nomad team is committed to addressing this; I will post an update here soon giving an idea as to the timetable, but my intention is to either resolve this issue as part of the upcoming 1.3.0 version of the provider (targeting Nomad 0.8.x) or the 1.4.0 version (targeting Nomad 0.9.x).

Thank you for your patience and persistence on this issue.

ketzacoatl commented 5 years ago

And it may be prudent to find a temporary solution to this issue while we work on a generic/unversioned provider.

That would be great, WRT being able to continue using the provider while future improvements get worked out.

cgbaker commented 5 years ago

Nomad 0.8.x API support is available in the Nomad v1.3.0 that released today. I will look at finding a longer-term solution for this in upcoming versions; having said that, even if we continue version the Nomad providers, we pledge to be much more responsive in updating the Nomad provider going forward.

ketzacoatl commented 5 years ago

The Nomad team is committed to addressing this; I will post an update here soon giving an idea as to the timetable, but my intention is to either resolve this issue as part of the upcoming 1.3.0 version of the provider (targeting Nomad 0.8.x) or the 1.4.0 version (targeting Nomad 0.9.x).

Hello, just checking back up on this - have there been any improvements related to this?

cgbaker commented 5 years ago

No update as of now, we've been focusing on core Nomad work, finishing the 0.9.0 release.

On Thu, Apr 4, 2019 at 1:02 AM ketzacoatl notifications@github.com wrote:

The Nomad team is committed to addressing this; I will post an update here soon giving an idea as to the timetable, but my intention is to either resolve this issue as part of the upcoming 1.3.0 version of the provider (targeting Nomad 0.8.x) or the 1.4.0 version (targeting Nomad 0.9.x).

Hello, just checking back up on this - have there been any improvements related to this?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/terraform-providers/terraform-provider-nomad/issues/1#issuecomment-479750800, or mute the thread https://github.com/notifications/unsubscribe-auth/ABmPT9wkf1S_J33gI6wrFz0StYLsaQTuks5vdYdQgaJpZM4N5Ctd .

ketzacoatl commented 5 years ago

Fair enough, thanks for the update!

cgbaker commented 5 years ago

Update: this was not resolved in the 1.4.0 release of the provider, but we're still tracking this.

ketzacoatl commented 4 years ago

I am stoked to see this on the 1.5 milestone, rock on!

liemle3893 commented 4 years ago

It have been a year since the last comment. Any update on this?

ketzacoatl commented 4 years ago

@cgbaker WRT the roadmap, are there refactors or internal changes that block fixing this properly?

cgbaker commented 4 years ago

we were waiting for the update 2.0 plugin SDK to see if it helped deal with this. we're actively looking at it now; we want this issue dealt with in the next few months as part of the Nomad 1.0 milestone.

lgfa29 commented 4 years ago

Hi everyone,

This issue has been going for a while, but we are still actively tracking it. I also think we are overdue for a more thorough update, so here is a description of the problem and what we've tried so far (and why it didn't work).

Why this is happening

This issue happens because the job resource doesn’t have separate attributes for each possible field in the jobspec. Instead, it receives a jobspec as a string (or a path to file that is read into a string), and parses some of the fields into computed attributes.

Since all attributes are computed, Terraform is not able to automatically detect changes, so the provider needs a customized diff function to properly communicate to Terraform if anything has changed. Unfortunately Terraform doesn’t provide the current value stored in the state for computed fields to the diff function, so it can't actually detect changes.

What we've tried so far

Here are some of approaches we've tried and why they didn't work.

Declare all jobspec fields as attributes

Having the jobspec as the single attribute forces us to do Terraform’s job of computing the diff, which, as described above, we can’t do properly. Having the full jobspec described as the resource schema would be more in-line with the way Terraform works.

But even though both Nomad and Terraform use HCL, the structures supported by each of them are different, thus creating a new syntax for defining jobs that is not compatible with the nomad job run command nor the Nomad API.

For example, in Nomad a group is defined as:

job "example" {
  group "cache" {
    ...
  }
}

In Terraform this job would be declared as:

resource "nomad_job_new" "example" {
  name = "example"

  group {
    name = "cache"
  }
}

While having the entire jobspec as the single input for the resource is not ideal, it does provide a nice user experience, where the job is defined only once and can be reused in many different places.

You can check the code for this attempt here.

Having jobspec fields defined as optional attributes

If we keep the jobspec attribute but also have the other fields allowed in the job as optional attributes, we could keep the nice user experience of being able to reuse job files but also define some of the values as proper Terraform attributes.

Unfortunately this doesn’t work well because if an attribute is not marked as computed the provider is not able to set a value for it, so anything that is defined in the jobspec but not in the Terraform resource would not be tracked.

Having jobspec fields defined as optional and computed attributes

Terraform does allow attributes to be optional and computed at the same time. But this falls back to the original issue where the current state value for computed attributes are not available to customized diff functions.

I hope this explanation was helpful, and we will provide more updates as we go. In the meantime feel free to sends us any questions, comments or ideas.

Legogris commented 4 years ago

Thanks for the update @lgfa29, it certainly gives clarity on why this is not trivial.

Just curious if you have a current working idea of how this should be addressed or if it's still being explored?

lgfa29 commented 4 years ago

Yes, we are still actively looking for a solution. I will make sure to post new updates, even for failed attempts.

remilapeyre commented 4 years ago

Hi, I've not yet tried it but why not leverage the diff abilities from Nomad to make the diff? Either by exposing it in the Nomad client or adding an endpoint that would return the diff between the new job and the current one (but not actually submitting it).

cgbaker commented 4 years ago

hi @remilapeyre , we already do that for CustomizeDiff: https://github.com/terraform-providers/terraform-provider-nomad/blob/master/nomad/resource_job.go#L518

the issue is that the current Terraform plugin SDK doesn't care about all those computed fields that we get from the Nomad API... we would need to update the jobspec. there are some options there, where we use it as a sentinel, which we're looking into. and we've had some discussion with the Terraform plugin SDK team. We are pursuing a few different options, one of which may just be a new, structured nomad_job resource which doesn't have these problems.

chuckyz commented 4 years ago

@lgfa29 I like the progress in that PR, and I would love to see that. I have zero qualms as a user with cutting off a major version and breaking a lot of compatibility. We already do this today with the Vault provider where some of our modules depend on older versions and others on newer versions.

I am a BIG! +1 on

resource "nomad_job_new" "example" {
  name = "example"

  group {
    name = "cache"
  }
}

I can see how I could modularize this already and produce a terraform module to be consumed upstream from my team with a ton of defaults (e.g.: add sidecars automatically into folks tasks)

Thank you @cgbaker and @lgfa29 for the update. This explains a lot, I had wondered why Nomad's diff features weren't being used here and I did not understand the code itself super well.

lgfa29 commented 4 years ago

Thanks for the feedback @chuckyz. Creating a new resource seems to be the best approach we have right now, so we will move forward with that (but with a better name than nomad_job_new 😄).

Legogris commented 3 years ago

Thanks for the feedback @chuckyz. Creating a new resource seems to be the best approach we have right now, so we will move forward with that (but with a better name than nomad_job_new smile).

Would it make sense to incorporate it in #149?

lgfa29 commented 3 years ago

Thanks for the feedback @chuckyz. Creating a new resource seems to be the best approach we have right now, so we will move forward with that (but with a better name than nomad_job_new smile).

Would it make sense to incorporate it in #149?

Yup. #149 is being developed exactly to solve this issue 👍

eliburke commented 2 years ago

Is there any plan to release v1.5 soon? Is there more work to finish out this feature? #149 hasn't been touched in almost 2 years.

lgfa29 commented 2 years ago

Hi @eliburke 👋

No plans to get this fixed unfortunately. The work on #149 was a brave attempt to try and map all possible jobspec fields into a Terraform resource schema, but that was not a sustainable approach. As Nomad evolves it becomes almost impossible to manually keep up.

We think that a better approach would be to leverage the Nomad OpenAPI project that is able to auto-generate a Nomad job spec, and the Terraform Provider Framework which is a new and more flexible way to create providers.

But this will take a significant effort that is not in our roadmap yet.

shantanugadgil commented 1 year ago

suggestion: to have a quick workaround for this problem, what we have been doing for quite some time is the we render out the entire Nomad job file and use it via the terrafrm-nomad provider (in comple rendered form)

What this does is, when you suspect things, and want to check what is up with the job, you can try out nomad job plan myjob.nomad instead of the usual terraform plan.

In my opinion this provides an easy escape hatch mechanism.

tristanmorgan commented 1 year ago

Can the Submission field in the API now be used? enable support for setting original job source

lgfa29 commented 1 year ago

Hi @tristanmorgan,

Unfortunately that's not enough. The key issue here is describing the Nomad job specification as a Terraform resource schema and detecting changes on individual fields. The resource already has the raw jobspec stored, which would be the same as the value returned by the new job submission endpoint.

jorgemarey commented 9 months ago

Just made a comment on https://github.com/hashicorp/terraform-provider-nomad/pull/238#issuecomment-1833877526 about this.

lgfa29 commented 8 months ago

Reading @jorgemarey comment again, I noticed this part:

We detect changes by having a shasum of the rendered job and seeing if it changes with the value stored for the previous resource version.

While not quite related to this issue, it made me realize that we can use the job submission data (if available) to detect changes to jobspec, which should help mitigate drift problems like this when the submission data is available.

ag-TJNII commented 6 months ago

Is this improved at all with v2 of the provider? I need to take over control of previously hand-generated Nomad job templates with Terraform. Usually when I do this workflow I create a resource in the Terraform code, import the existing resource, and reconcile the differences. However, the fact that this provider doesn't detect differences completely breaks that workflow.

I briefly tried upgrading to v2 provider but still got a unexpectedly large diff, how much of that was this bug and how much was v1 to v2 incompatibilities I didn't determine. I dropped back to v1 to not burn time upgrading to a new version that might have the same major bug.

tgross commented 1 month ago

Hey folks! @gulducat and I did a quick re-assessment of this bug and here's what the current situation is:

After some internal discussion I'm marking this for further roadmapping. We'll update again once we know more.

ketzacoatl commented 1 month ago

@tgross awesome update, thank you for all that info. Even with a big lift in the future, smaller/incremental improvements are very welcome!