hashicorp / packer-plugin-googlecompute

Packer plugin for Google Compute Builder
https://www.packer.io/docs/builders/googlecompute
Mozilla Public License 2.0

Startup script solution gets stuck in infinite loop #215

Open tpdownes opened 7 months ago

tpdownes commented 7 months ago

Overview of the Issue

If the Packer VM is:

then the Packer process loops indefinitely while waiting for the startup script status, and the message shown to the user does not explain why. My thoughts:

  1. Modify retry.Config to limit the number of Tries or to set a StartTimeout
  2. Improve the guidance to the user at "Error getting startup script status" so they understand that the service account probably needs permission to modify its own instance metadata
  3. Give whatever process attempts to update the instance metadata a retry mechanism

These could be done separately. Items 1 and 2 are probably obvious; the reasoning behind item 3 may not be. When you create a service account on Google Cloud and assign it IAM roles, those roles do not take effect immediately but are subject to a known propagation delay. An automation pipeline might therefore create the service account, grant it adequate permissions, and still see Packer fail.

Each timeout might reasonably be 10 minutes to account for worst-case propagation delay.
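
As a rough illustration of items 1 and 3, the sketch below shows a bounded retry in Bash: it re-runs a command until it succeeds or a deadline passes, instead of looping forever. This is not the plugin's code. The names retry_until and update_status are hypothetical, and the 600-second deadline only mirrors the 10-minute figure suggested above.

```shell
#!/bin/bash

# retry_until DEADLINE_SECONDS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or DEADLINE_SECONDS have elapsed.
# Returns 0 on success and 1 if the deadline passes first.
retry_until() {
  local deadline_seconds="$1" delay_seconds="$2"
  shift 2
  local start=$SECONDS   # Bash built-in: seconds since the shell started

  until "$@"; do
    if (( SECONDS - start >= deadline_seconds )); then
      echo "giving up after ${deadline_seconds}s: $*" >&2
      return 1
    fi
    sleep "$delay_seconds"
  done
}

# Hypothetical usage: retry a metadata update for up to 10 minutes,
# pausing 10 seconds between attempts.
# retry_until 600 10 update_status "done"
```

The deadline bounds the total wait, so a permission that never arrives produces an error the caller can report rather than an endless "Waiting..." loop.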

Reproduction Steps

Begin by creating a service account without any IAM roles:

gcloud iam service-accounts create failure \
                                   --description="SA" \
                                   --display-name="failure"

Then supply that project_id and service account to the template below.

Plugin and Packer version

Simplified Packer Buildfile

source "googlecompute" "toolkit_image" {
  project_id            = var.project_id
  communicator          = "none"
  image_name            = "repro-fail"
  machine_type          = "n2-standard-8"
  disk_size             = 32
  disk_type             = "pd-balanced"
  omit_external_ip      = true
  use_internal_ip       = true
  subnetwork            = "default"
  zone                  = "us-central1-c"
  service_account_email = var.service_account_email
  scopes                = ["https://www.googleapis.com/auth/cloud-platform"]
  source_image_family   = "debian-12"
  metadata = {
    startup-script = <<-EOD
      #!/bin/bash
      /bin/true
      EOD
  }
}

build {
  name    = "test"
  sources = ["sources.googlecompute.toolkit_image"]
}

variable "project_id" {
  description = "Project in which to create VM and image"
  type        = string
}

variable "service_account_email" {
  description = "Service account email address"
  type        = string
}

packer {
  required_version = ">= 1.7.9, < 2.0.0"

  # packer plugin 1.0.16 and above includes HPC VM Image
  required_plugins {
    googlecompute = {
      version = "~> 1.1.0"
      source  = "github.com/hashicorp/googlecompute"
    }
  }
}

Log Fragments and crash.log files

tpdownes@poreef ~/repro> packer build -var project_id=my-project -var service_account_email=failure@my-project.iam.gserviceaccount.com .
test.googlecompute.toolkit_image: output will be in this color.

==> test.googlecompute.toolkit_image: Checking image does not exist...
==> test.googlecompute.toolkit_image: Creating temporary RSA SSH key for instance...
==> test.googlecompute.toolkit_image: no persistent disk to create
==> test.googlecompute.toolkit_image: Using image: debian-12-bookworm-v20240312
==> test.googlecompute.toolkit_image: Creating instance...
    test.googlecompute.toolkit_image: Loading zone: us-central1-c
    test.googlecompute.toolkit_image: Loading machine type: n2-standard-8
    test.googlecompute.toolkit_image: Requesting instance creation...
    test.googlecompute.toolkit_image: Waiting for creation operation to complete...
    test.googlecompute.toolkit_image: Instance has been created!
==> test.googlecompute.toolkit_image: Waiting for the instance to become running...
    test.googlecompute.toolkit_image: IP: 10.128.0.10
==> test.googlecompute.toolkit_image: Waiting for any running startup script to finish...
    test.googlecompute.toolkit_image: Metadata startup-script-status on instance packer-6615be2c-4509-e09b-a563-a2a3fcc15cf6 not available. Waiting...
    test.googlecompute.toolkit_image: Metadata startup-script-status on instance packer-6615be2c-4509-e09b-a563-a2a3fcc15cf6 not available. Waiting...
    test.googlecompute.toolkit_image: Metadata startup-script-status on instance packer-6615be2c-4509-e09b-a563-a2a3fcc15cf6 not available. Waiting...
    test.googlecompute.toolkit_image: Metadata startup-script-status on instance packer-6615be2c-4509-e09b-a563-a2a3fcc15cf6 not available. Waiting...
    test.googlecompute.toolkit_image: Metadata startup-script-status on instance packer-6615be2c-4509-e09b-a563-a2a3fcc15cf6 not available. Waiting...
    test.googlecompute.toolkit_image: Metadata startup-script-status on instance packer-6615be2c-4509-e09b-a563-a2a3fcc15cf6 not available. Waiting...
    test.googlecompute.toolkit_image: Metadata startup-script-status on instance packer-6615be2c-4509-e09b-a563-a2a3fcc15cf6 not available. Waiting...
    test.googlecompute.toolkit_image: Metadata startup-script-status on instance packer-6615be2c-4509-e09b-a563-a2a3fcc15cf6 not available. Waiting...
    test.googlecompute.toolkit_image: Metadata startup-script-status on instance packer-6615be2c-4509-e09b-a563-a2a3fcc15cf6 not available. Waiting...
    test.googlecompute.toolkit_image: Metadata startup-script-status on instance packer-6615be2c-4509-e09b-a563-a2a3fcc15cf6 not available. Waiting...
    test.googlecompute.toolkit_image: Metadata startup-script-status on instance packer-6615be2c-4509-e09b-a563-a2a3fcc15cf6 not available. Waiting...
    test.googlecompute.toolkit_image: Metadata startup-script-status on instance packer-6615be2c-4509-e09b-a563-a2a3fcc15cf6 not available. Waiting...
    test.googlecompute.toolkit_image: Metadata startup-script-status on instance packer-6615be2c-4509-e09b-a563-a2a3fcc15cf6 not available. Waiting...
    test.googlecompute.toolkit_image: Metadata startup-script-status on instance packer-6615be2c-4509-e09b-a563-a2a3fcc15cf6 not available. Waiting...
    test.googlecompute.toolkit_image: Metadata startup-script-status on instance packer-6615be2c-4509-e09b-a563-a2a3fcc15cf6 not available. Waiting...
    test.googlecompute.toolkit_image: Metadata startup-script-status on instance packer-6615be2c-4509-e09b-a563-a2a3fcc15cf6 not available. Waiting...
Cancelling build after receiving interrupt
    test.googlecompute.toolkit_image: Metadata startup-script-status on instance packer-6615be2c-4509-e09b-a563-a2a3fcc15cf6 not available. Waiting...
==> test.googlecompute.toolkit_image: Error waiting for startup script to finish: Error getting startup script status: Instance metadata key, startup-script-status, not found.

tpdownes commented 4 days ago

Another thought: I believe you can eliminate the need for IAM permissions entirely by modifying and polling VM guest attributes rather than instance metadata.

https://cloud.google.com/compute/docs/metadata/manage-guest-attributes
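
A minimal sketch of the guest-attribute approach, assuming the VM was created with the enable-guest-attributes metadata key set to TRUE and that the script runs inside the VM. The set_guest_attribute helper and the "packer" namespace are hypothetical; only the metadata server path and the Metadata-Flavor header follow the documentation linked above. The write goes through the VM's local metadata server, which, as suggested in the comment above, should not depend on the service account's IAM roles.

```shell
#!/bin/bash

# Base URL of the metadata server as seen from inside the VM.
# Overridable so the helper can be exercised outside Google Cloud.
METADATA_URL="${METADATA_URL:-http://metadata.google.internal/computeMetadata/v1}"

# set_guest_attribute NAMESPACE KEY VALUE
# Writes VALUE to the guest attribute NAMESPACE/KEY of the current instance.
set_guest_attribute() {
  local namespace="$1" key="$2" value="$3"
  curl --silent --show-error --fail \
    --request PUT \
    --header "Metadata-Flavor: Google" \
    --data "$value" \
    "${METADATA_URL}/instance/guest-attributes/${namespace}/${key}"
}

# Hypothetical usage at the end of a startup script wrapper:
# set_guest_attribute packer startup-script-status done
```

Packer would then poll the guest attribute with its own credentials instead of reading the startup-script-status metadata key.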