hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform
https://registry.terraform.io/providers/hashicorp/google/latest/docs
Mozilla Public License 2.0
2.32k stars 1.72k forks source link

Allow switching from `metadata_startup_script` to `metadata.startup-script` gracefully #9459

Open naphta opened 3 years ago

naphta commented 3 years ago

Community Note

Terraform Version

Terraform v1.0.1 on linux_amd64

Affected Resource(s)

Terraform Configuration Files

I can provide configs here but for simplicities sake:

resource "google_compute_instance" "example" {
    ...
    # metadata_startup_script = "..."
    metadata = {
        startup-script = "..."
    }
    lifecycle = {
        prevent_destroy = true
        ignore_changes = [
            metadata_startup_script
        ]
    }
    ...

Debug Output

│ Error: Error parsing metadata: Cannot provide both metadata_startup_script and metadata.startup-script.
│ 
│   with module.elasticsearch-capture["oregon"].google_compute_instance.elasticsearch-coordinator[0],
│   on ../../modules/elasticsearch/cluster.tf line 152, in resource "google_compute_instance" "elasticsearch-coordinator":
│  152: resource "google_compute_instance" "elasticsearch-coordinator" {

Panic Output

Expected Behavior

metadata_startup_script is unset, and metadata.startup-script is set without destroying the instance.

Actual Behavior

A fatal error is received as the provider is seeing that metadata.startup-script and metadata_startup_script are both populated.

Steps to Reproduce

  1. terraform apply

Important Factoids

I've had a quick debug of the provider and can see that resourceInstanceMetadata() is getting both values back the same. I had a quick play with the underlying struct ResourceData with the GetChange method but that too was returning them as the same value. I'm presuming that this is because they end up being the same value GCP end, or the lifecycle settings interferring.

It's very possible that this could be a strange behaviour with asking it to ignore changes to metadata_startup_script, however when that's unset terraform will want to destroy the existing resource rather than change metadata_startup_script to metadata.startup-script.

References

edwardmedia commented 3 years ago

@naphta this is by design that changing metadata_startup_script leads to trigger replacement. Below is what its doc indicates. Does this make sense?

metadata_startup_script - (Optional) An alternative to using the startup-script metadata key, except this one forces the instance to be recreated (thus re-running the script) if it is changed. This replaces the startup-script metadata key on the created instance and thus the two mechanisms are not allowed to be used simultaneously.

naphta commented 3 years ago

@edwardmedia Absolutely I understand that these two things can't coexist at the same time, however I'm switching from one to the other (as such, not simultaneously). It feels like that should be possible, I fully expect that if I were to change metadata_startup_script to a non null value it should destroy/recreate, but in this case I'm unsetting it (or setting to null), and then simply moving that value to metadata.startup-script.

edwardmedia commented 3 years ago

@naphta it is fine with me switching between startup_script and metadata_startup_script by alternating commenting each line in below testing config. However I noticed that you had prevent_destroy = true. Is this still the case?

resource "google_compute_instance" "default" {
 ...
  metadata = {
    foo = "bar"
    startup_script = "echo hi > /test.txt"
  }

  //metadata_startup_script = "echo hi > /test.txt"
...
}
naphta commented 3 years ago

@edwardmedia I did try that but terraform was still wanting to do the destroy which is undesirable. I'm wanting to swap from one to the other without the instance being destroyed. Is that the case from your example too?

edit: Having just ran it with ignore_changes removed and after prevent_destroy removed it does indeed want to destroy my instances which is a problem.

example diff:

# module.elasticsearch-capture["oregon"].google_compute_instance.elasticsearch-master[2] must be replaced
+/- resource "google_compute_instance" "elasticsearch-master" {
      - allow_stopping_for_update = true -> null
      ~ cpu_platform              = "Intel Broadwell" -> (known after apply)
      ~ current_status            = "RUNNING" -> (known after apply)
      - enable_display            = false -> null
      ~ guest_accelerator         = [] -> (known after apply)
      ~ id                        = "....." -> (known after apply)
      ~ instance_id               = "....." -> (known after apply)
      ~ label_fingerprint         = "....." -> (known after apply)
      ~ metadata                  = {
          + "startup-script" = <<-EOT
                .......
            EOT
        }
      ~ metadata_fingerprint      = "....." -> (known after apply)
      - metadata_startup_script   = <<-EOT
            .....
        EOT -> null # forces replacement
      + min_cpu_platform          = (known after apply)
        name                      = "....."
      ~ project                   = "....." -> (known after apply)
      - resource_policies         = [] -> null
      ~ self_link                 = "....." -> (known after apply)
      ~ tags                      = [
          + "capture",
          + "elasticsearch",
          + "production",
            # (1 unchanged element hidden)
        ]
      ~ tags_fingerprint          = "....." -> (known after apply)
        # (5 unchanged attributes hidden)

      ~ boot_disk {
          ~ device_name                = "persistent-disk-0" -> (known after apply)
          + disk_encryption_key_sha256 = (known after apply)
          + kms_key_self_link          = (known after apply)
          ~ source                     = "....." -> (known after apply)
            # (2 unchanged attributes hidden)

            # (1 unchanged block hidden)
        }

      + confidential_instance_config {
          + enable_confidential_compute = (known after apply)
        }

      ~ network_interface {
          ~ name               = "nic0" -> (known after apply)
          ~ network_ip         = "..... -> (known after apply)
          ~ subnetwork         = "....." -> "platform-oregon"
          ~ subnetwork_project = "....." -> (known after apply)
            # (1 unchanged attribute hidden)
        }

      ~ scheduling {
          ~ automatic_restart   = true -> (known after apply)
          ~ min_node_cpus       = 0 -> (known after apply)
          ~ on_host_maintenance = "MIGRATE" -> (known after apply)
          ~ preemptible         = false -> (known after apply)

          + node_affinities {
              + key      = (known after apply)
              + operator = (known after apply)
              + values   = (known after apply)
            }
        }

      - shielded_instance_config {
          - enable_integrity_monitoring = true -> null
          - enable_secure_boot          = false -> null
          - enable_vtpm                 = true -> null
        }
        # (1 unchanged block hidden)
    }
edwardmedia commented 3 years ago

@naphta this is what I mentioned earlier. When you alter metadata_startup_script, it triggers recreation. This is by design. You can't avoid it via manipulating the config without changing the design. Does it make sense?

I did try that but terraform was still wanting to do the destroy which is undesirable. I'm wanting to swap from one to the other without the instance being destroyed.

naphta commented 3 years ago

So how do you suggest I get myself out of this hole that has been dug?

My expectation is that whenever I changed metadata_startup_script to a non null value that would be the case, but this is a bit of a different scenario since I'm trying to essentially reverse, or back out from using it.

Also the code I took a look at seemed to suggest the behaviour I'm looking for was intentional as the error thrown when both are set is checking that the value is not empty.

I've tried modifying my state (which I'm not a huge fan of), but really it seems crazy to me that I'm resigned to destroying my instance and all of its data to achieve this.

The key part here for me is doing it gracefully.

edwardmedia commented 3 years ago

@naphta i am not sure if there is a good way to achieve it gracefully. @melinath what do you think?

melinath commented 3 years ago

I think we could achieve this with a customize diff func; I'll check with folks whether that's something we would want to do.

naphta commented 3 years ago

I think we could achieve this with a customize diff func; I'll check with folks whether that's something we would want to do.

I'm happy to run the provider locally with a bit of guidance of where to make that change if you're not willing to push it upstream.

melinath commented 3 years ago

I've talked with folks and we're not considering this a bug; the two fields are intentionally meant to be different ways to manage the startup script, and part of that is that switching from metadata_startup_script to metadata.startup_script triggers a recreation.

@naphta If you wanted to make this change, you might be able to do it with a CustomizeDiff function. I believe this would be on the google_compute_instance resource (in google/resource_google_compute_instance.go). There are other examples of CustomizeDiff funcs in the codebase.

But it's not guaranteed that we would accept this change. It would be useful to know more about your use case. Could you elaborate a bit on why recreating the instance would cause problems for you, and why continuing to use metadata_startup_script is not an option?

naphta commented 3 years ago

@melinath It's really a bit of a lack of understanding/poor design our end in writing our original modules. The start-up scripts were doing a fair bit of provisioning which we're moving over to Ansible as for this use case that makes a lot more sense as we can't move the state away from the instances (using scratch disks for performance).

The issue is that the instances created with the start-up scripts exist and are in production with data on them, so there's not a huge desire to go through the process of recreating them (syncing up 5TB of data 3 times, plus moving around production traffic) when really the only reason is to shift some metadata around and put us into a situation in the future where this wouldn't be an issue going forward.

We could continue to use metadata_startup_script, but in this scenario where accidental destruction would be a disaster we're thinking it's best not to use the option where that can happen.

I personally just went in with the expectation that you'd be able to swap between the two options (obviously if you went from metadata.startup-script to metadata_startup_script with changes I'd expect that to trigger a recreation) with low friction. Seeing the line of code where it only throws the error when they both exist but only when one isn't empty/null is what made me consider this a bug.

The only other option I've thought might get me to the state I'd like is by running two applies, one with ignore changes on metadata_startup_script whilst setting it to a null/unset value and then a second apply where the script is placed back but into metadata object directly.

melinath commented 3 years ago

Thanks for the additional details!

The only other option I've thought might get me to the state I'd like is by running two applies, one with ignore changes on metadata_startup_script whilst setting it to a null/unset value and then a second apply where the script is placed back but into metadata object directly.

I was also wondering if that might work. I think it should?

dantheperson commented 3 years ago

I couldn't get it to work with a two stage apply. removing metadata_startup_script while it is still in ignore_changes never removes it from the state. removing it without ignore_changes always want to recreate.

My workaround to this issue is 1) update TF file from metadata_startup_script to metadata.startup-script 2) terraform state rm google_compute_instance.my_vm 3) via GCP console delete the startup-script from the instance 4) terraform import google_compute_instance.my_vm vm-name 5) terraform apply to add the startup script back to the VM.

psyke089 commented 2 years ago

I couldn't get it to work with a two state apply. removing metadata_startup_script while it is still in ignore_changes never removes it from the state. removing it without ignore_changes always want to recreate.

My workaround to this issue is

  1. update TF file from metadata_startup_script to metadata.startup-script
  2. terraform state rm google_compute_instance.my_vm
  3. via GCP console delete the startup-script from the instance
  4. terraform import google_compute_instance.my_vm vm-name
  5. terraform apply to add the startup script back to the VM.

Thanks, worked for me. As a matter of fact step 2 and 4 weren't even needed :) Just had to:

  1. update TF file from metadata_startup_script to metadata.startup-script
  2. via GCP console delete the startup-script from the instance
  3. terraform apply to add the startup script back to the VM.
jbfbell commented 2 years ago

Wanted to add some additional context to this ticket: I ran into an issue today where we created an VM from an instance template, but later wanted to update the metadata to include additional values. It appears that the metadata_startup_script value was set by the instance template despite the template using the metadata block

` metadata = { startup-script = data.local_file.startup-script.content

...

`

Later when someone tried add additional keys to the instance created from template they encountered the error message in the original description.

resource "google_compute_instance_from_template" "vm" {
# ...
  metadata = merge(
    module.vm_template.template.metadata,
    {
      foo = "bar"
    }
  )
}

This change, which is totally unrelated to the startup script, causes the startup-script value of the metadata to be picked up as a change, even though it should have always been there.

Lockszmith-GH commented 2 years ago

An old thread, but I had an issue similar to this, following the discussion here and other part of the web, I ended up modifying the state and only then applying the new metadata.startup-script, this is what I run on the command line:

terraform state pull \
  | jq '
      (.resources[]
        | select (.module == "module.<module name>")
        | .instances[0].attributes).metadata_startup_script = null
    | .serial = .serial+1
    ' \
  | terraform state push -

This modified the state, removing the metadata_start_script attribute, increments the serial, and pushes the new state. I identify by the .module type/name, but you can obviously use any other attribute you want (.name is probably a good choice :monocle_face:)

Next I plan, and apply - and tarraform doesn't destroy my precious deployment.

claudio-vellage commented 4 months ago

For me it's also really unfortunate, because I've tested this before deploying to production, but now after moving it into production with prevent_destroy = true, I'm not able to easily change it. I will probably end up modifying the state as suggested above, but it would be really appreciated if we could allow a smooth transition between those methods. Please close the ticket, if the decision that this will not be changed is final, so I know that I don't have to wait for an update and have to use a workaround.

ggtisc commented 2 months ago

Currently there are both options available as terraform registry explains:

Depending on the OS you choose for your instance, some metadata keys have special functionality. Most linux-based images will run the content of metadata.startup-script in a shell on every boot. At a minimum, Debian, CentOS, RHEL, SLES, Container-Optimized OS, and Ubuntu images support this key. Windows instances require other keys depending on the format of the script and the time you would like it to run - see [this table](https://cloud.google.com/compute/docs/startupscript#providing_a_startup_script_for_windows_instances). For the convenience of the users of metadata.startup-script, we provide a special attribute, metadata_startup_script, which is documented below.